| 1 | +#lang scribble/manual |
| 2 | + |
| 3 | +@(require (for-label (except-in racket compile ... struct?) a86)) |
| 4 | +@(require redex/pict |
| 5 | + racket/runtime-path |
| 6 | + scribble/examples |
| 7 | + "utils.rkt" |
| 8 | + "ev.rkt" |
| 9 | + "../fancyverb.rkt" |
| 10 | + "../utils.rkt") |
| 11 | + |
| 12 | +@(define codeblock-include (make-codeblock-include #'h)) |
| 13 | + |
| 14 | +@(define (shellbox . s) |
| 15 | + (parameterize ([current-directory (build-path notes "outlaw")]) |
| 16 | + (filebox (emph "shell") |
| 17 | + (fancyverbatim "fish" (apply shell s))))) |
| 18 | + |
| 19 | +@(require (for-syntax "../utils.rkt" racket/base "utils.rkt")) |
| 20 | +@(define-syntax (shell-expand stx) |
| 21 | + (syntax-case stx () |
| 22 | + [(_ s ...) |
| 23 | + (parameterize ([current-directory (build-path notes "outlaw")]) |
| 24 | + (begin (apply shell (syntax->datum #'(s ...))) |
| 25 | + #'(void)))])) |
| 26 | + |
| 27 | +@;{ Have to generate a-whole.rkt before listing it below.} |
| 28 | +@(shell-expand "racket -t combine.rkt -m a.rkt > a-whole.rkt") |
| 29 | + |
| 30 | +@(ev '(require rackunit a86)) |
| 31 | +@(ev `(current-directory ,(path->string (build-path notes "outlaw")))) |
| 32 | +@(void (ev '(with-output-to-string (thunk (system "make runtime.o"))))) |
| 33 | +@(void (ev '(current-objs '("runtime.o")))) |
| 34 | +@(for-each (λ (f) (ev `(require (file ,f)))) |
| 35 | + '(#;"interp.rkt" "compile.rkt" "compile-expr.rkt" "compile-literals.rkt" "compile-datum.rkt" "utils.rkt" "ast.rkt" "parse.rkt" "types.rkt" "unload-bits-asm.rkt")) |
| 36 | + |
| 37 | +@(define this-lang "Outlaw") |
| 38 | + |
| 39 | +@title[#:tag this-lang]{@|this-lang|: self-hosting} |
| 40 | + |
| 41 | +@src-code[this-lang] |
| 42 | + |
| 43 | +@emph{The king is dead, long live the king!} |
| 44 | + |
| 45 | +@table-of-contents[] |
| 46 | + |
| 47 | +@section[#:tag-prefix "neerdowell"]{Bootstrapping the compiler} |
| 48 | + |
| 49 | +Take stock for a moment of the various language features we've built |
| 50 | +over the course of these notes and assignments: we've built a |
| 51 | +high-level language with built-in data types like booleans, integers, |
| 52 | +characters, pairs, lists, strings, symbols, vectors, boxes. Users can |
| 53 | +define functions, including recursive functions. Functions are |
| 54 | +themselves values and can be constructed anonymously with |
| 55 | +@racket[lambda]. We added basic I/O facilities. We added the ability |
| 56 | +to overload functions based on the number of arguments received using |
| 57 | +@racket[case-lambda], the ability to define variable arity functions |
| 58 | +using rest arguments, and the ability to call functions with arguments |
| 59 | +from a list using @racket[apply]. Users can define their own |
| 60 | +structure types and use pattern matching to destructure values. |
| 61 | +Memory management is done automatically by the run-time system. |
| 62 | + |
| 63 | +It's a pretty full-featured language and there are lots of interesting |
| 64 | +programs we could write in our language. One of the programs we could |
| 65 | +@emph{almost} write is actually the compiler itself. In this section, |
| 66 | +let's bridge the gap between the features of Racket our compiler uses |
| 67 | +and those that our compiler implements and then explore some of the |
| 68 | +consequences. |
| 69 | + |
| 70 | + |
| 71 | +We'll call it @bold{Outlaw}. |
| 72 | + |
| 73 | +@section[#:tag-prefix "outlaw"]{Features used by the Compiler} |
| 74 | + |
| 75 | +Let's take a moment to consider all of the language features we |
| 76 | +@emph{use} in our compiler source code but haven't yet |
| 77 | +implemented. Open up the source code for, e.g., @secref{Neerdowell}, |
| 78 | +and see what you notice: |
| 79 | + |
| 80 | +@itemlist[ |
| 81 | + |
| 82 | +@item{Modules: programs are not monolithic; they are broken into |
| 83 | +@bold{modules} in separate files like @tt{compile-stdin.rkt}, |
| 84 | +@tt{parse.rkt}, @tt{compile.rkt}, etc.} |
| 85 | + |
| 86 | +@item{a86: our compiler relies heavily on the @secref{a86} library |
| 87 | +that provides all of the constructors for a86 instructions and |
| 88 | +functions like @racket[asm-display] for printing a86 instructions |
| 89 | +using NASM syntax.} |
| 90 | + |
| 91 | +@item{Higher-level I/O: at the heart of the front-end of our compiler |
| 92 | +is the use of Racket's @racket[read] function, which reads in an |
| 93 | +s-expression. We also use things like @racket[read-line] which reads |
| 94 | +in a line of text and returns it as a string.} |
| 95 | + |
| 96 | +@item{Lots and lots of Racket functions: our compiler makes use of |
| 97 | +lots of built-in Racket functions that we haven't implemented. These |
| 98 | +are things like @racket[length], @racket[map], @racket[foldr], |
| 99 | +@racket[filter], etc. Even some of the functions we have implemented |
| 100 | +have more featureful counterparts in Racket which we use. For |
| 101 | +example, our @racket[+] primitive takes two arguments, while Racket's |
| 102 | +@racket[+] function can take any number of arguments.} |
| 103 | + |
| 104 | +@item{Primitives as functions: the previous item brings up an |
| 105 | +important distinction between our language and Racket. For us, |
| 106 | +things like @racket[+] are @bold{primitives}. Primitives are |
| 107 | +@emph{not} values. You can't return a primitive from a function. You |
| 108 | +can't make a list of primitives. This means even if we had a |
| 109 | +@racket[map] function, you couldn't pass @racket[add1] as an argument, |
| 110 | +since @racket[add1] is not a value. In Racket, there's really no such |
| 111 | +thing as a primitive; things like @racket[add1], @racket[+], |
| 112 | +@racket[cons?], etc. are all just functions.} |
| 113 | + |
| 114 | +] |
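The last item can be made concrete with a small sketch in plain Racket. It shows @racket[add1] and @racket[sub1] being used as ordinary values, and then one possible workaround for a language where @racket[add1] is only a primitive: wrap the primitive application in a @racket[lambda]. The names @tt{add1-fn}, @tt{ops}, and friends are hypothetical and exist only for this illustration; the wrapping trick is an assumption about how one might bridge the gap, not a description of what the compiler does.

```racket
#lang racket

;; In Racket, add1 and sub1 are ordinary function values:
;; they can be passed as arguments and stored in lists.
(define incremented (map add1 '(1 2 3)))
(define ops (list add1 sub1))
(define applied (map (lambda (f) (f 10)) ops))

;; If add1 were only a primitive (usable solely in operator position),
;; a lambda that applies it would still be a first-class value.
;; add1-fn is a hypothetical name for such a wrapper.
(define add1-fn (lambda (x) (add1 x)))
(define incremented/wrapped (map add1-fn '(1 2 3)))
```

Both calls to @racket[map] produce the same list; the difference is that the second never mentions @racket[add1] outside of operator position.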
| 115 | + |
| 116 | +If we want our compiler to be written in the language it implements we |
| 117 | +have to deal with this gap in some way. For each difference between |
| 118 | +what we implement and what we use, we basically only have two ways to |
| 119 | +proceed: |
| 120 | + |
| 121 | +@itemlist[#:style 'ordered |
| 122 | + |
| 123 | + @item{rewrite our compiler source code to @emph{not} use |
| 124 | +that feature, or} |
| 125 | + |
| 126 | + @item{implement it.} |
| 127 | +] |
| 128 | + |
| 129 | +Let's take some of these in turn. |
| 130 | + |
| 131 | +@section[#:tag-prefix "outlaw"]{Punting on Modules} |
| 132 | + |
| 133 | +Our compiler currently works by compiling a whole program, which we |
| 134 | +assume is given all at once as input to the compiler. The compiler |
| 135 | +source code, on the other hand, is sensibly broken into separate |
| 136 | +modules. |
| 137 | + |
| 138 | +We @emph{could} think about designing a module system for our |
| 139 | +language. We'd have to think about how separate compilation of |
| 140 | +modules would work. At a minimum our compiler would have to deal with |
| 141 | +resolving module references made through @racket[require]. |
| 142 | + |
| 143 | +While module systems are a fascinating and worthy topic of study, we |
| 144 | +don't really have the time to do them justice and instead we'll opt to |
| 145 | +punt on the module system. Instead we can rewrite the compiler source |
| 146 | +code as a single monolithic source file. |
| 147 | + |
| 148 | +That's not a very good software engineering practice and it will be a |
| 149 | +bit of a pain to maintain the complete @this-lang source file. As a |
| 150 | +slight improvement, we can write a little utility program that, given a |
| 151 | +file containing a module, will recursively follow all @racket[require]d |
| 152 | +files and print out a single, @racket[require]-free program that |
| 153 | +includes all of the modules that comprise the program. |
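To give a feel for the idea, here is a minimal sketch of such a flattening pass. It is not the actual @tt{combine.rkt}: to stay self-contained it works on an in-memory table mapping file names to lists of top-level forms rather than reading files, it only follows @racket[require] forms whose arguments are all string paths, and it ignores @tt{#lang} lines. The function name @tt{combine} and the example table are assumptions made for illustration.

```racket
#lang racket

;; combine : String (Hash String (Listof S-Expr)) -> (Listof S-Expr)
;; Flatten the module named `file` and everything it transitively
;; requires into one list of forms, dependencies first, each module
;; included at most once, with require and provide forms dropped.
(define (combine file modules)
  (define seen (mutable-set))
  (define (visit f)
    (cond
      [(set-member? seen f) '()]          ; already included
      [else
       (set-add! seen f)
       (define forms (hash-ref modules f))
       ;; string paths named in (require "x.rkt" ...) forms
       (define deps
         (append-map (lambda (form)
                       (match form
                         [(list 'require (? string? fs) ...) fs]
                         [_ '()]))
                     forms))
       ;; everything except require and provide
       (define body
         (filter (lambda (form)
                   (match form
                     [(cons (or 'require 'provide) _) #f]
                     [_ #t]))
                 forms))
       (append (append-map visit deps) body)]))
  (visit file))

;; Hypothetical three-module program, for illustration only.
(define example
  (hash "a.rkt" '((require "b.rkt" "c.rkt") (define (f x) (g (h x))))
        "b.rkt" '((provide g) (require "c.rkt") (define (g x) (h x)))
        "c.rkt" '((provide h) (define (h x) x))))

(define flattened (combine "a.rkt" example))
```

Putting dependencies before the forms that use them keeps definitions ahead of their uses, and the @tt{seen} set prevents a module required from two places from being emitted twice.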
| 154 | + |
| 155 | +Let's see an example of the @tt{combine.rkt} utility in action. |
| 156 | + |
| 157 | +Suppose we have a program that consists of the following files: |
| 158 | + |
| 159 | +@codeblock-include["outlaw/a.rkt"] |
| 160 | +@codeblock-include["outlaw/b.rkt"] |
| 161 | +@codeblock-include["outlaw/c.rkt"] |
| 162 | + |
| 163 | +Then we can combine these files into a single program |
| 164 | +as follows: |
| 165 | + |
| 166 | +@shellbox["racket -t combine.rkt -m a.rkt > a-whole.rkt"] |
| 167 | + |
| 168 | +@codeblock-include["outlaw/a-whole.rkt"] |
| 169 | + |
| 170 | + |
| 171 | +This gives us a rudimentary way of combining modules into a single |
| 172 | +program that can be compiled with our compiler. The idea will be that |
| 173 | +we construct a single source file for our compiler by running |
| 174 | +@tt{combine.rkt} on @tt{compile-stdin.rkt}. The resulting file will |
| 175 | +be self-contained and include everything @tt{compile-stdin.rkt} |
| 176 | +depends upon. |
| 177 | + |
| 178 | +It's worth recognizing that this isn't a realistic alternative to |
| 179 | +having a module system. In particular, combining modules in this way |
| 180 | +breaks the usual abstractions provided by modules. For example, it's |
| 181 | +common for modules to define their own helper functions or stateful |
| 182 | +data that are not exported (via @racket[provide]) outside the module. |
| 183 | +This ensures that clients of the module cannot access potentially |
| 184 | +sensitive data or operations or mess with invariants maintained by a |
| 185 | +module's exports. Our crude combination tool does nothing to enforce |
| 186 | +these abstraction barriers. |
| 187 | + |
| 188 | +That's an OK compromise to make for now. The idea is that |
| 189 | +@tt{combine.rkt} doesn't have to work @emph{in general} for combining |
| 190 | +programs in a meaning-preserving way. It just needs to work for one |
| 191 | +specific program: our compiler. |
| 192 | + |
| 193 | +@section[#:tag-prefix "outlaw"]{Bare-bones a86} |
| 194 | + |
| 195 | +Our compiler makes heavy use of the @secref{a86} library that provides |
| 196 | +all of the constructors for a86 instructions and functions like |
| 197 | +@racket[asm-display] for printing a86 instructions using NASM syntax. |
| 198 | +That library is part of the @tt{langs} package. |
| 199 | + |
| 200 | +The library at its core provides structures for representing a86 |
| 201 | +instructions and some operations that work on instructions. While the |
| 202 | +library has a bunch of functionality that provides for good, early |
| 203 | +error checking when you construct an instruction or a whole a86 |
| 204 | +program, we really only need the structures and functions of the |
| 205 | +library. |
| 206 | + |
| 207 | +To make the compiler self-contained we can build our own bare-bones |
| 208 | +version of the a86 library and include it in the compiler. |
| 209 | + |
| 210 | +For example, here's the module that defines an AST for a86 instructions: |
| 211 | + |
| 212 | +@codeblock-include["outlaw/a86/ast.rkt"] |
| 213 | + |
| 214 | +And here's the module that implements the needed operations for |
| 215 | +writing out instructions in NASM syntax: |
| 216 | + |
| 217 | +@codeblock-include["outlaw/a86/printer.rkt"] |
| 218 | + |
| 219 | +OK, so now we've made a86 a self-contained part of the compiler. |
| 220 | +The code consists of a large AST definition and some functions that |
| 221 | +operate on the a86 AST data type. The printer makes use of some Racket |
| 222 | +functions we haven't used before, like @racket[system-type] and |
| 223 | +@racket[number->string], and also some other high-level I/O functions |
| 224 | +like @racket[write-string]. We'll have to deal with these features, |
| 225 | +so while we crossed one item off our list (a86), we added a few more, |
| 226 | +hopefully smaller, problems to solve. |
| 227 | + |
| 228 | +@section[#:tag-prefix "outlaw"]{Racket functions, more I/O, and primitives} |
| 229 | + |
| 230 | +We identified three more gaps between our compiler's implementation |
| 231 | +language and its implemented language: lots of Racket functions like |
| 232 | +@racket[length], @racket[map], etc., more I/O functions that operate |
| 233 | +at a higher level than our @racket[write-byte] and @racket[read-byte] |
| 234 | +such as @racket[write-string], @racket[read], @racket[read-line], |
| 235 | +etc., and finally the issue that primitives are not values. |
| 236 | + |
| 237 | +There are many ways we could proceed from here. We could, for |
| 238 | +example, spend some time adding new primitives to our compiler |
| 239 | +that implement all the missing functionality like @racket[length], |
| 240 | +@racket[write-string], and others. |
| 241 | + |
| 242 | +Let's consider adding a @racket[length] primitive. It's not terribly |
| 243 | +difficult. We could add a unary operation called @racket['length], |
| 244 | +which would emit the following code: |
| 245 | + |
| 246 | +@#reader scribble/comment-reader |
| 247 | +(racketblock |
| 248 | +;; assume list is in rax |
| 249 | +(let ((done (gensym 'done)) |
| 250 | + (loop (gensym 'loop))) |
| 251 | + (seq (Mov r8 0) ; count = 0 |
| 252 | + (Label loop) |
| 253 | + (Cmp rax (imm->bits '())) ; if empty, done |
| 254 | + (Je done) |
| 255 | + (assert-cons rax) ; otherwise, should be a cons |
| 256 | + (Xor rax type-cons) |
| 257 | + (Mov rax (Offset rax 0)) ; move cdr into rax |
| 258 | + (Add r8 (imm->bits 1)) ; increment count |
| 259 | + (Jmp loop) ; loop |
| 260 | + (Label done) |
| 261 | + (Mov rax r8))) ; return count |
| 262 | +) |
| 263 | + |
| 264 | +We can play around and make sure this assembly code is actually |
| 265 | +computing the length of the list in @racket['rax]: |
| 266 | + |
| 267 | +@(void (ev '(current-objs '()))) |
| 268 | + |
| 269 | +@#reader scribble/comment-reader |
| 270 | +(ex |
| 271 | +(require neerdowell/parse |
| 272 | + neerdowell/compile-datum |
| 273 | + neerdowell/compile-ops |
| 274 | + neerdowell/types) |
| 275 | +(require a86) |
| 276 | + |
| 277 | +;; Datum -> Natural |
| 278 | +;; Computes the length of d in assembly |
| 279 | +(define (length/asm d) |
| 280 | + (bits->value |
| 281 | + (asm-interp |
| 282 | + (seq (Global 'entry) |
| 283 | + (Label 'entry) |
| 284 | + (compile-datum d) |
| 285 | + ; assume list is in rax |
| 286 | + (let ((done (gensym 'done)) |
| 287 | + (loop (gensym 'loop))) |
| 288 | + (seq (Mov r8 0) ; count = 0 |
| 289 | + (Label loop) |
| 290 | + (Cmp rax (imm->bits '())) ; if empty, done |
| 291 | + (Je done) |
| 292 | + (assert-cons rax) ; otherwise, should be a cons |
| 293 | + (Xor rax type-cons) |
| 294 | + (Mov rax (Offset rax 0)) ; move cdr into rax |
| 295 | + (Add r8 (imm->bits 1)) ; increment count |
| 296 | + (Jmp loop) ; loop |
| 297 | + (Label done) |
| 298 | + (Mov rax r8))) ; return count |
| 299 | + (Ret) |
| 300 | + (Label 'raise_error_align) ; dummy version, returns -1 |
| 301 | + (Mov rax -1) |
| 302 | + (Ret))))) |
| 303 | + |
| 304 | +(length/asm '()) |
| 305 | +(length/asm '(1 2 3)) |
| 306 | +(length/asm '(1 2 3 4 5 6)) |
| 307 | +) |
| 308 | + |
| 309 | +Looks good. |
| 310 | + |
| 311 | +Alternatively, instead of a primitive, we could add a @racket[length] |
| 312 | +@emph{function} by creating a static function value and binding it to |
| 313 | +the variable @racket[length]. The code for the function would |
| 314 | +essentially be the same as the primitive above: |
| 315 | + |
| 316 | +@#reader scribble/comment-reader |
| 317 | +(racketblock |
| 318 | +(seq (Data) |
| 319 | + (Label 'length_func) ; the length closure |
| 320 | + (Dq 'length_code) ; points to the length code |
| 321 | + (Text) |
| 322 | + (Label 'length_code) ; code for length |
| 323 | + (Cmp r15 1) ; expects 1 arg |
| 324 | + (Jne 'raise_error_align) |
| 325 | + (Pop rax) |
| 326 | + ; ... length code from above |
| 327 | + (Add rsp 8) ; pop off function |
| 328 | + (Ret)) |
| 329 | +) |
| 330 | + |
| 331 | + |
| 332 | +The @racket[compile] function could push the binding for |
| 333 | +@racket[length] (and potentially other built-in functions) on the |
| 334 | +stack before executing the instructions of the program compiled in an |
| 335 | +environment that included @racket['length]. This would effectively |
| 336 | +solve the problem for @racket[length]. |
| 337 | + |
| 338 | +We'd have to do something similar for @racket[map], @racket[foldr], |
| 339 | +@racket[memq], and everything else we needed. |
| 340 | + |
| 341 | + |
| 342 | +The @emph{problem} with this approach is that we will be spending a bunch |
| 343 | +of time writing lots and lots of assembly code, an activity we had hoped |
| 344 | +to avoid by building a high-level programming language! Even worse, |
| 345 | +some of the functions we'd like to add, e.g. @racket[map], will be |
| 346 | +much more complicated to write in assembly compared to @racket[length]. |
| 347 | + |
| 348 | +But here's the thing. Consider a Racket definition of @racket[length]: |
| 349 | + |
| 350 | +@#reader scribble/comment-reader |
| 351 | +(racketblock |
| 352 | +(define (length xs) |
| 353 | + (match xs |
| 354 | + ['() 0] |
| 355 | + [(cons _ xs) (add1 (length xs))])) |
| 356 | +) |
| 357 | + |
| 358 | +Note that this definition is within the language we've built. Instead |
| 359 | +of writing the assembly code for @racket[length], we could write a |
| 360 | +definition in @this-lang and simply compile it to obtain assembly code |
| 361 | +that implements a @racket[length] function. |
| 362 | + |
| 363 | +Many of the functions we need in the compiler can be built up this |
| 364 | +way. Instead of spending our time writing and debugging assembly |
| 365 | +code, which is difficult to do, we can simply write some Racket code. |
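As a sketch of what "built up this way" could look like, here are single-list versions of @racket[map], @racket[foldr], and @racket[filter], written in the same style as the @racket[length] definition above: only @racket[match], @racket[cons], @racket[if], and function application. These are illustrative definitions, not necessarily the ones the compiler ends up using, and they deliberately shadow Racket's own bindings of the same names within this module.

```racket
#lang racket

;; (X -> Y) (Listof X) -> (Listof Y)
(define (map f xs)
  (match xs
    ['() '()]
    [(cons x xs) (cons (f x) (map f xs))]))

;; (X Y -> Y) Y (Listof X) -> Y
(define (foldr f b xs)
  (match xs
    ['() b]
    [(cons x xs) (f x (foldr f b xs))]))

;; (X -> Boolean) (Listof X) -> (Listof X)
(define (filter p xs)
  (match xs
    ['() '()]
    [(cons x xs)
     (if (p x)
         (cons x (filter p xs))
         (filter p xs))]))
```

In plain Racket, @racket[(map add1 '(1 2 3))] works with these definitions because @racket[add1] is a function value; in our language the same call depends on first resolving the primitives-are-not-values issue noted earlier.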
| 366 | + |
| 367 | +With this, we will introduce a @bold{standard library}. The idea is that |
| 368 | +the standard library, like the run-time system, is a bundle of code that |
| 369 | +will accompany every executable; it will provide a set of built-in functions |
| 370 | +and the compiler will be updated to compile programs in the environment of |
| 371 | +everything provided by the standard library. |
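One thing a standard library written in the language itself could provide is the "more featureful" version of an operation layered over a simpler primitive. The sketch below defines a variadic addition function using @racket[case-lambda], a rest argument, and @racket[apply], all features we have already built. The names @tt{prim+} and @tt{plus} are hypothetical: @tt{prim+} stands in for a two-argument addition primitive, and nothing here is meant to describe how the actual standard library is organized.

```racket
#lang racket

;; Stand-in for a two-argument addition primitive (hypothetical name).
(define (prim+ x y) (+ x y))

;; plus : Number ... -> Number
;; Variadic addition built from the binary operation.
(define plus
  (case-lambda
    [() 0]                                     ; no arguments: identity
    [(x) x]                                    ; one argument
    [(x y) (prim+ x y)]                        ; exactly two: the primitive
    [(x y . zs) (apply plus (prim+ x y) zs)])) ; more: fold left to right
```

Because @tt{plus} is an ordinary function, it can also be passed to higher-order functions, which is exactly what a bare primitive cannot do.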
| 372 | + |
| 373 | + |
| 374 | +@section[#:tag-prefix "outlaw"]{Building a standard library} |
| 375 | + |
| 376 | +... |
| 377 | + |
| 378 | +@section[#:tag-prefix "outlaw"]{Parsing primitives, revisited} |
| 379 | + |
| 380 | +... |
| 381 | + |
| 382 | +@section[#:tag-prefix "outlaw"]{A few more primitives} |
| 383 | + |
| 384 | +... |
| 385 | + |
| 386 | +@section[#:tag-prefix "outlaw"]{Dealing with I/O} |
| 387 | + |
| 388 | +... |
| 389 | + |
| 390 | +@section[#:tag-prefix "outlaw"]{Putting it all together} |
| 391 | + |
| 392 | +... |