cmsc430.github.io/www/notes/abscond.scrbl at 295bcef229abe4d6b2635f59e0039ea6f129afd1 · cmsc430/cmsc430.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
#lang scribble/manual

@(require (for-label (except-in racket compile) a86/ast a86/printer))
@(require scribble/examples
	  redex/reduction-semantics
          redex/pict
	  (only-in pict scale)
	  (only-in racket system)
	  "../fancyverb.rkt"
	  "../utils.rkt"
	  "utils.rkt"
	  "ev.rkt")

@(define codeblock-include (make-codeblock-include #'here))

@(ev '(require rackunit a86))
@(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "abscond" f))))))
	   '("main.rkt" "correct.rkt"))

@(define (shellbox . s)
   (parameterize ([current-directory (build-path langs "abscond")])
     (filebox (emph "shell")
              (fancyverbatim "fish" (apply shell s)))))

@(require (for-syntax "../utils.rkt" racket/base "utils.rkt"))
@(define-syntax (shell-expand stx)
   (syntax-case stx ()
     [(_ s ...)
      (parameterize ([current-directory (build-path langs "abscond")])
        (begin (apply shell (syntax->datum #'(s ...)))
	       #'(void)))]))

@;{ Have to compile 42.s (at expand time) before listing it }
@(shell-expand "cat 42.rkt | racket -t compile-stdin.rkt -m > 42.s")


@(define this-lang "Abscond")
@(define prefix (string-append this-lang "-"))

@title[#:tag this-lang]{@|this-lang|: a language of numbers}

@src-code[this-lang]

@emph{Let's Make a Programming Language!}

@table-of-contents[]

@section[#:tag-prefix prefix]{Overview}

A compiler is just one (optional!) component of a @emph{programming
language}.  So if you want to make a compiler, you must first settle
on a programming language to compile.

The specification of a programming language consists of two parts: the
syntax, which specifies the @bold{form} of programs, and semantics,
which specifies the @bold{meaning} of programs.

Syntax, while important, is a fairly superficial aspect of
programming languages. The real heart of a programming
language is its semantics and we will spend more time
concerned this aspect.

There are a few common ways a language's meaning is specified:

@itemlist[

@item{By example.}

@item{By informal description.}

@item{By reference to an implementation, often an interpreter.}

@item{By formal (mathematical) definition.}

]

Each approach has its advantages and disadvantages.  Examples are
concise and unambiguous, but incomplete.  Informal (prose)
descriptions can be intuitive, but open to interpretation and
ambiguity.  Reference implementations provide precise, executable
specifications, but may over specify language details by tying them to
implementation artifacts.  Formal definitions balance precision while
allowing for under-specification, but require detailed definitions and
training to understand.

For the purposes of this course, we will use interpreters to specify
the meaning of programs.  The interpreters provide a specification for
the compilers we write and make precise what means for a compiler to
be @emph{correct}.  Any time the compiler produces code that, when
run, produces a different result that the interpreter produces for the
same program, the compiler is broken (or the specification is wrong).
Interpreters are useful for specifying what the compiler should do and
sometimes writing interpreters is also useful for informing @emph{how}
it should do it.


To begin, let's start with a dead simple programming language called
@bold{Abscond}.  The only kind of expression in Abscond are integer
literals.  Running an abscond program just produces that integer.
(Told you it was simple.)

@section[#:tag-prefix prefix]{Concrete syntax for Abscond}

We will simplify matters of syntax by using the Lisp
notation of s-expression for the @bold{concrete} form of
program phrases. The job of a @emph{parser} is to construct
an abstract syntax tree from the textual representation of a
program. We will consider parsing in two phases:

@itemlist[
 @item{the first converts a stream of textual input into an
  s-expression, and}
 @item{the second converts an s-expression into an instance
       of a datatype for representing expressions called an @bold{AST}.}
 ]

For the first phase, we rely on the @racket[read] function
to take care of converting strings to s-expressions. In
order to parse s-expressions into ASTs, we will write fairly
straightforward functions that convert between the
representations.

Abscond, like the other languages studied in this course, is
designed to be a subset of Racket. This has two primary benefits:

@itemlist[
 @item{the Racket interpreter and compiler can be used as a reference implementation of the languages we build, and}
 @item{there are built-in facilities for reading and writing
  data in the parenthezised form that Racket uses, which we can borrow to make parsing easy.}
 ]

The concrete form of an Abscond program will consist of, like Racket, the line of text:

@verbatim{#lang racket}

followed by a (concrete) expression.  The grammar of expressions is
very simple:


@(define-language A
  (e ::= integer))

@centered{@render-language[A]}

So, @racket[0], @racket[120], @racket[-42], etc. are
concrete Abscond expressions and a complete Abscond program
looks like this:

@codeblock-include["abscond/42.rkt"]

Reading Abscond programs from ports, files, strings,
etc. consists of reading (and ignoring) the
@tt{#lang racket}
line and then using the @racket[read] function to
parse the concrete expression as an s-expression.


@section[#:tag-prefix prefix]{Abstract syntax for Abscond}

While not terribly useful for a language as overly simplistic as Abscond, we use
an AST datatype for representing expressions and another syntactic categories.
For each category, we will have an appropriate constructor.  In the case of Abscond
all expressions are integers, so we have a single constructor, @racket[Lit].
A datatype for representing expressions can be defined as:

@codeblock-include["abscond/ast.rkt"]


The parser for Abscond checks that a given s-expression is
an integer and constructs an instance of the AST datatype if
it is, otherwise it signals an error:
@codeblock-include["abscond/parse.rkt"]

@ex[
(parse 5)
(parse 42)
(eval:error (parse #t))]


@section[#:tag-prefix prefix]{Meaning of Abscond programs}

The meaning of an Abscond program is simply the number itself.  So
@racket[(Lit 42)] evaluates to @racket[42].

We can write an ``interpreter'' that consumes an expression and
produces it's meaning:

@codeblock-include["abscond/interp.rkt"]

@#reader scribble/comment-reader
(examples #:eval ev
(interp (Lit 42))
(interp (Lit -8))
)

The @racket[interp] function specifies the meaning of expressions,
i.e. elements of the type @tt{Expr}.  This language is so simple, the
@racket[interp] function really doesn't @emph{do} much of anything,
but this will change as the langauge grows.


We can add a command line wrapper program for interpreting Abscond
programs from stdin:

@codeblock-include["abscond/interp-stdin.rkt"]

The details here aren't important (and you won't be asked to write
this kind of code), but this program @racket[read]s the contents of a
file given on the command line.  If it's an integer, i.e. a
well-formed Abscond program, then it runs the intepreter and displays
the result.

For example, interpreting the program @tt{42.rkt} shown above:
@shellbox["cat 42.rkt | racket -t interp-stdin.rkt -m"]

@;{
Even though the semantics is obvious, we can provide a formal
definition of Abscond using @bold{operational semantics}.

An operational semantics is a mathematical definition that
characterizes the meaning of programs.  We will defined the semantics
of Abscond as a @emph{binary relation} between programs and their
meanings.  So in the setting of Abscond, this binary relation will be
a set of pairs of expressions and integers.  This relation will be
defined inductively using @emph{inference rules}.  For such a simple
language, just a single inference rule suffices:

@(define-judgment-form A
  #:mode (𝑨 I O)
  #:contract (𝑨 e i)
  [----------
   (𝑨 (Lit i) i)])

@(centered (render-judgment-form 𝑨))

Here, we are defining a binary relation, called
@render-term[A 𝑨], and saying every integer literal
expression is paired with the integer itself in the
relation. So @math{((Lit 2),2)} is in @render-term[A 𝑨],
@math{((Lit 5),5)} is in @render-term[A 𝑨], and so on.

The inference rules define the binary relation by defining the
@emph{evidence} for being in the relation.  The rule makes use of
@emph{meta-variables} drawn from the non-terminals of the language
grammar.  A pair is in the relation if you can construct an instance
of the rule (substituting some integer for @math{i}) in the rule.

(This part probably seems opaque at the moment, but it will become
clearer as we work through more examples, so don't worry.)


The operational semantics @emph{defines} the meaning of Abscond
programs.  The intepreter @emph{computes} that meaning.  We can view
the semantics as a specification, and the interpreter as an
implementation.

Characterizing the correctness of the interpreter boils down to the
following statement:


@bold{Interpreter Correctness}: @emph{For all expressions @racket[e]
and integers @racket[i], if (@racket[e],@racket[i]) in @render-term[A
𝑨], then @racket[(interp e)] equals @racket[i].}

We now have a complete (if overly simple) programming language with an
operational semantics and an interpreter, which is (obviously)
correct.  Now let's write a compiler.
}

@section[#:tag-prefix prefix]{Toward a Compiler for Abscond}

A compiler, like an interpreter, is an implementation of a programming
language.  The key difference is that a compiler stages the work of
interpreting a program into two phases.  The first translates the
original program (the source language) into a program in another
programming language (the target language).  The second runs this
program.  These phases, often called @bold{compile-time} and
@bold{run-time}.  The program that does the translation is the
@bold{compiler}.  The program that does the running of the translated
program is the @bold{run-time system}.

So in general, the relationship between an interpreter and compiler is

@verbatim{
(source-interp e) = (target-interp (source-compile e))
}

We can in principle choose any target language we'd like.  For this
class, we will choose the @bold{x86-64} instruction set architecture.

There are several reasons for this choice:

@itemlist[

@item{it is a low-level language, so compiling from a high-level
language to x86-64 will require building everything from scratch,}

@item{it is the programming language at the ``bottom'' of your
computer; it's interpreter is implemented in hardware on your
computer's CPU,}

@item{it is one of the two dominant computing architectures (the other
being ARM) in use today, and}

@item{it is a mature technology with good tools and materials.}
]

So our compiler will emit x86 assembly code.  To make our lives a bit
easier, we will write the run-time system in C.  Let's start with the
Abscond runtime:

@filebox-include[fancy-c abscond "main.c"]

This C program provides the main entry point for running an Abscond
program.  It relies upon a function @tt{print_result} which is defined
as follows:

@filebox-include[fancy-c abscond "print.h"]
@filebox-include[fancy-c abscond "print.c"]

Separating out @tt{print_result}, which at this point is just a simple
@tt{printf} statement, seems like overkill, but it will be useful in
the future as the language and its set of values gets more complicated.

The runtime must be linked against an object file that provides the
definition of @tt{entry}; this is the code our compiler will emit.

The @tt{entry} function computes the result of running the
Abscond code, i.e. an integer.  Here we are taking advantage of the
x86-64 architecture by using 64-bit signed integers by using the
@tt{int64_t} C type.

The runtime system calls the function and prints the result.

We can compile the run-time system to get an object file.  We'll use
@tt{gcc} for compiling C:

@shellbox["gcc -m64 -c -o main.o main.c"]
@shellbox["gcc -m64 -c -o print.o print.c"]

This creates @tt{main.o} and @tt{print.o}; linking these file together
with an object file that contains @tt{entry} will produce an
executable that, when run, will carry out the execution of an Abscond
program.

@section[#:tag-prefix prefix]{An Example}

Before trying to write the Abscond compiler, let's first make an
example of what we would like the compiler to produce for a particular
example.  Let's say the Abscond program is @racket[42].  What should
the assembly code for this program look like?  Here we have to learn a
bit about the x86-64 assembly language.

@filebox-include[fancy-nasm abscond "42.s"]

@margin-note{Note: on macOS, labels must be prepended with @tt{_},
while on Linux they are not; e.g. @tt{_entry} vs @tt{entry}.}

Above is a x86-64 program, written in NASM syntax.  We will be using
@tt{nasm} as our assembler in this class because it is widely used and
available on most platforms.

@itemlist[#:style 'numbered

@item{The first line declares a global label (@tt{entry}), an
entry point in to the code below.}

@item{The next line declares the start of a section of code consisting
of textual instructions.}

@item{The third line contains the @tt{entry} label, i.e. the start of
the @tt{entry} code.  When the run-time systems calls @tt{entry}, it
will jump to this point in the code.}

@item{The fourth line is an instruction to move the integer literal 42
into the @tt{rax} register.  By convention, whatever is in the
@tt{rax} register when code returns control to the caller will hold
the return value.}

@item{The final line is an instruction to return control to the
caller.}
]

To assemble this program into an object file, we can run the @tt{nasm}
assembler:

@shellbox[
  (format "nasm -f ~a -o 42.o 42.s"
          (case (system-type 'os)
            [(macosx) "macho64"]
            [(unix) "elf64"]))]

@margin-note{Note: on macOS, the format option @tt{-f} should be
@tt{macho64}; on Linux it should be @tt{elf64}.}

This creates @tt{42.o}, an object file containing the instructions
above (in binary format).

We can link this file with the run-time to produce an executable file:

@shellbox["gcc main.o print.o 42.o -o 42.run"]

This creates the file @tt{42.run}, an exectuable program:

@shellbox["./42.run"]

We now have a working example.  The remaining work will be to design a
compiler that takes an Abscond program and emits a file like
@tt{42.s}, but with the appropriate integer literal.

@section[#:tag-prefix prefix]{A Compiler for Abscond}

We will now write a compiler for Abscond.  To heart of the compiler
will be a function with the following signature:

@#reader scribble/comment-reader
(racketblock
;; Expr -> Asm
(define (compile e) ...)
)

Where @tt{Asm} is a data type for representing assembly programs,
i.e. it will be the AST of x86-64 assembly.

So the AST representation of our example is:

@racketblock[
(list (Label 'entry)
      (Mov 'rax 42)
      (Ret))
]

Writing the @racket[compile] function is easy:

@codeblock-include["abscond/compile.rkt"]

@#reader scribble/comment-reader
(examples #:eval ev
(compile (Lit 42))
(compile (Lit 38))
)

To convert back to the concrete NASM syntax, we use
@racket[asm-display].

@margin-note{Note: the printer takes care of the macOS vs Linux label
convention by detecting the underlying system and printing
appropriately.}

@#reader scribble/comment-reader
(examples #:eval ev
(asm-display (compile (Lit 42))))

Putting it all together, we can write a command line compiler much
like the command line interpreter before, except now we emit assembly
code:

@codeblock-include["abscond/compile-stdin.rkt"]

Example:

@shellbox["cat 42.rkt | racket -t compile-stdin.rkt -m"]

Using a Makefile, we can capture the whole compilation dependencies as:

@margin-note{Note: the appropriate object file format is detected
based on the operating system.}

@filebox-include[fancy-make abscond "Makefile"]

And now compiling Abscond programs is easy-peasy:

@shellbox["make 42.run" "./42.run"]

It's worth taking stock of what we have at this point, compared to the
interpreter approach.  To run the interpreter requires all of Racket
in the run-time system.

When running a program using the interpreter, we have to parse the
Abscond program, check the syntax of the program (making sure it's an
integer), then run the interpreter and print the result.

When running a program using the compiler, we still have to parse the
Abscond program and check its syntax, but this work happens @emph{at
compile-time}.  When we @emph{run} the program this work will have
already been done.  While the compiler needs Racket to run, at
run-time, Racket does not need to be available.  All the run-time
needs is our (very tiny) object file compiled from C.  Racket doesn't
run at all -- we could delete it from our computer or ship the
executable to any compatible x86-64 machine and run it there.  This
adds up to much more efficient programs.  Just to demonstrate, here's
a single data point measuring the difference between interpreting and
compiling Abscond programs:

@shellbox["cat 42.rkt | time -p racket -t interp-stdin.rkt -m"]

Compiling:

@shellbox["time -p ./42.run"]

Because Abscond is a subset of Racket, we can even compare results
against interpreting the program directly in Racket:

@shellbox["touch 42.rkt # forces interpreter to be used"
          "time -p racket 42.rkt"]

Moreover, we can compare our compiled code to code compiled by Racket:

@shellbox["raco make 42.rkt"
	  "time -p racket 42.rkt"]


@section[#:tag-prefix prefix]{But is it @emph{Correct}?}

At this point, we have a compiler for Abscond.  But is it correct?
What does that even mean, to be correct?

First, let's formulate an alternative implementation of
@racket[interp] that composes our compiler and a86 interpreter to define
a (hopefully!) equivalent function to @racket[interp]:

@codeblock-include["abscond/exec.rkt"]

This function can be used as a drop-in replacement to @racket[interp]:

@ex[
(exec (Lit 42))
(exec (Lit 19))]

It captures the idea of a phase-distinction in that you can first
compile a program into a program in another language---in this case
a86---and can then interpret @emph{that} program to get the result.
If the compiler is correct, the result should be the same:

@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈}
@tt{Expr} and @racket[i] @math{∈} @tt{Integer}, if @racket[(interp e)]
equals @racket[i], then @racket[(exec e)] equals
@racket[i].}

One thing that is nice about specifying our language with an
interpreter is that we can run it. So we can @emph{test} the compiler
against the interpreter.  If the compiler and interpreter agree on all
possible inputs, then the compiler is correct.


This is actually a handy tool to have for experimenting with
compilation within Racket:


@ex[
(exec (Lit 42))
(exec (Lit 37))
(exec (Lit -8))]

This of course agrees with what we will get from the interpreter:

@ex[
(interp (Lit 42))
(interp (Lit 37))
(interp (Lit -8))]

We can turn this in a @bold{property-based test}, i.e. a function that
computes a test expressing a single instance of our compiler
correctness claim:

@codeblock-include["abscond/correct.rkt"]

@ex[
(check-compiler (Lit 42))
(check-compiler (Lit 37))
(check-compiler (Lit -8))]

This is a powerful testing technique when combined with random
generation.  Since our correctness claim should hold for @emph{all}
Abscond programs, we can randomly generate @emph{any} Abscond program
and check that it holds.

@ex[
(check-compiler (Lit (random 100)))

; test 10 random programs
(for ([i (in-range 10)])
  (check-compiler (Lit (random 10000))))
]

The last expression is taking 10 samples from the space of Abscond
programs in @math{[0,10000)} and checking the compiler correctness
claim on them.  If the claim doesn't hold for any of these samples, a
test failure would be reported.

Finding an input to @racket[check-compiler] that fails would
@emph{refute} the compiler correctness claim and mean that we have a
bug.  Such an input is called a @bold{counter-example}.

On the other hand we gain more confidence with each passing test.
While passing tests increase our confidence, we cannot test all
possible inputs this way, so we can't be sure our compiler is correct
by testing alone.  To really be sure, we'd need to write a
@emph{proof}, but that's beyond the scope of this class.

At this point we have not found a counter-example to compiler
correctness.  It's tempting to declare victory.  But... can you think
of a valid input (i.e. some integer) that might refute the correctness
claim?

Think on it.  In the meantime, let's move on.