Building a Tiny Compiler

Building a Tiny Compiler

by Samuel Giddins

Oct 14 2016

We all use compilers every day, but they still can seem like a mysterious black box at times. In this try! Swift talk, Samuel Giddins builds a tiny compiler for his made-up language 100% from scratch to get a feel for the basics of how compilers work. He discusses some of the ways Swift can yield elegant solutions for complex problems such as parsing, lexing, and code generation. At the end, we will have a working implementation of a brand-new programming language.

If you want to follow along, all of the code is on GitHub at segiddins/Sipquick.

Introduction (00:00)

I built a programming language that I have called “Sipquick,” and I built a compiler and a test run for it in Swift.

What does a tiny compiler look like? 310 lines of code. I did not go through and count for the Swift compiler, but I would take a guess it is probably over 100,000. So, pretty tiny by comparison.

Why Write A Compiler? (00:53)

Compilers are something that we interact with every day as developers. I want a better idea of how compilers work. What happens when I run Swift C or when I run Clang?
It is a well-known problem domain. There are many compilers. You can Google “how do I do X in a compiler”, and you will get something a bit better than Stack Overflow answers. You will get blog posts and books and rants on Twitter.
It is doable in any language. It is a project that takes a text file and transforms it into another text file.
I have never written a compiler before. Now I have I can put on my resume: compiler engineer.

What is A Compiler? (02:00)

A compiler is a really simple function:


let compiler: (String) -> Data /* Executable */

It is a thing that takes in a string, what we would call a source file, and prints out an executable. It transforms a program written in one language, usually something that we can read and understand and write and reason about, into another language, a language that is meant for machines to read and understand and reason about and execute.

So, how do you write a compiler? There are a bunch of well-known steps:

Parse source code
Lex it
Do what is called semantic analysis. You say, “these tokens that I have, what do they mean? In Swift, what does the word class mean?”
Optimize it
Generate machine code

Sipquick (03:48)

Compilers, by their nature, are highly functional. I built a language called “Sipquick.” It is a very loud, noisy bird in a book that I like.

It is a very small language that I invented for the purpose of this talk. If I tried to write a compiler for another programming language, I would have to implement more things and not be able to decide how it would work to make things as convenient for me as possible.
It is a language built on S-expressions (this is like Lisp). Everything goes in parentheses, and you use your editor. You have to go into your editor and hope it underlines the matching parentheses you can get things to compile.
It is a dynamic language. There are no types at compile time.
The compiler does not know what a type is and it is functional-ish. It does have side effects in the language, but by its nature, every function takes in arguments, and it returns something.

Get more development news like this


(def hello name (+ "Hello, " name))
(print (hello "world!"))

This would be how you write a “Hello, world!” in Sipquick. def is the keyword to define a function, and it’s named hello. It takes an argument named name and then it concatenates a string. In the second line, we call the function and print it.

Parser (05:41)

Step one was building a parser. I lifted most of the code from a try! Swift Tokyo talk about it. It is a parser combinator - there are fun things in there like monads. I do not like fancy operators - I changed it around to something that I could read at the time I wrote it.

This is the main parsing:


let schemeParser: Parser<String, [Sexp]> = {
  return ignoreSecond(
  sexp().
    then(unichar("\n").maybe()).
    maybeMany().
    fmap { $0.map { $0.0 } }
  .then(dropEndingMetadata().or(empty().andThatsAll())))
}

That is ignoring all of the machinery to set up the parser and some of the definitions for what an s-expression is. But the basic idea is we parse in a bunch of s-expressions, and then we say either there is this five trailing slashes at the end or the end of the file and that is it. We end up with an array of s-expressions.

There are a whole bunch of generics there. The compiler does not like that, but we transform the strings that we get, the tokens, into s-expressions.

Lexer (06:47)


_.fmap { Sexp.single($0) }
_.fmap { Sexp.many($0) }
_.fmap { _ in Sexp.none }

Whenever we parse something, I transform it into something that makes sense for the AST (Abstract Syntax Tree). I say, “these empty parentheses, that is an empty s-expression,” and that keeps things simple. I do not have to reason about what these strings mean in different positions. It happens in that one parser.swift file.

Semantic Analysis (07:45)


let sema: ([Sexp]) -> [Expression]

We have parsed things out, and now we have tokens. But a lot of tokens in many languages have special meanings. How do you define a function? How do you call a function? How do you define variables? Or keywords, are there operators? How do things work between different lines?*

That is what semantic analysis is: it is taking this abstract syntax tree and collapsing it down into something the programming language knows about which are expressions:


struct Expression {
    init(sexp: Sexp) {
        switch exp {
        ...
        }
    }
}

We have one initializer with a massive switch statement. I wrote this in the past week; I would not take this code as an example of production Swift. We switch on what we find, and we turn these s-expressions, the Lispy things, into the expression that the compiler can reason about: is this expression defining a variable, or defining a function, or calling a function.

In most compilers, we would say, “is this program correct? Is it well formed?” In a language like Swift, this would mean, do the types check out well? Are we calling this function with the right type of arguments, with the right number of arguments? Does this function even exist? This language does absolutely none of that. Another thing this compiler does not do is optimization.

Optimization (09:41)

Optimizations are, ironically enough, one of the easiest things to do in a compiler:


let optimizations: Array<[Expression] -> [Expression]>

Optimizations are saying, “I have expressions here, and I turn them into expressions there.” And in most production strength compilers, you will have many optimization passes. This is where you do things like:

Is there unreachable code? Let’s get rid of that.
Is this a pure function that is used twice? We can calculate the value once and use it in both places.

I did not do any of them, but it is a very important part of what compilers do, and it is a part of the compiler writing process that people spend a lot of time on. If I spend 10 hours implementing an optimization that can make your code run 1% faster, each one of you has a lot of users. 1% faster code equals hundreds, thousands, millions of hours and optimizing compilers are also a source of academic research, what transformations can you make safely.

Code Generation (11:31)

Code generation is where we finally say, I understand what this program is. I understand as the compiler how this is supposed to work, but now I have to output something that the computer can run:


let codeGen: ([Expression]) -> Data

We take our expressions, and we transform them into something executable.

What counts as an executable? Machine code is what something Clang and LLVM do. If you have used Babel for Javascript, that will compile some form of JavaScripty thing into another JavaScripty thing; that is still compilation.

I think code generation is really hard. The easiest way that people think of doing it is to import LLVM and spew stuff into LLVM, and it has many optimizations, and it will spew out good machine code. That works great if you have something like C. It turns out, LLVM was built for C and C++ and Objective-C compiler writers.

I was building a language that is very dynamic, untyped where you do not know if a variable is a string or an integer, does this function exist. I tried to fit it into LLVM, and that did not work out well.

On slide 27, that is the idealized “how do modern compilers work.” All the LLVM magic happens at the end, but I could not go that route. I had to do things the hard way.

I compiled my Lispy thing to Ruby. It is a dynamic language, and it has a robust standard library I do not have to write, “how do I access the arguments pass to this program? How do I output something to the screen?”:


extension Expression {
    func asRuby() -> String {
        switch self.type {
            ...
        }
    }
}

Code generation, basically implemented as an extension on Expressions. You say, “I have an expression, and I want it to be represented as Ruby.”

It is recursive. You have expressions that contain other expressions, so think in terms of a function call. You have the top level function call and then maybe the arguments themselves or other function calls or expressions. The asRuby() method handles that, and you output Ruby code which you can then run if you have a working Ruby version on your machine.

The gist of how this works is we have expressions and we literally just map the expressions to $0.asRuby, join them together with new lines, add a shebang at the top and run chmod, so the thing is executable, and we can run it.

How does this step look like?


; fibonacci.spq
(def fib x (condition (<= x 1) x (+ (fib (- x 1)) (fib (- x 2)))))
(print (fib 10))

Turns into:


#!/usr/bin/env ruby
def fib(x)
    if (x.<=(1))
        x
    else
        (fib((x.-(1))).+(fib((x.-(2)))))
    end
end
print(fib(10))

We have a definition of what is the nth Fibonacci number. Lisp is not the most expressive language, there are many parentheses, but it compiles to something that looks like vaguely friendly Ruby. And if you run it, it actually works.

On slide 32, I have the entirety of the code generation step. I know you cannot read that, it’s intentional. It fits into one file, compact, easy to do, and targeting Ruby made that nice.

It basically switches on what expression it is and maybe pull out, “this is a function definition. These are the argument names.”


extension Expression {
    func parenthesize(_ s: String) -> String
    
    var isOperatorCall: Bool { get }
    
    func asRuby(depth: Int = 0) -> String {
        switch kind {
        case .bare: { }
        case .call:
            if isOperatorCall {  }
            else {  }
        case .functionDefinition: {  }
        case .empty: {  }
        case .variableDeclaration: {  }
        case .conditional:
            guard children.count == 3 else {  }
            let conditional = children[0]
            let positive = children[1]
            let negative = children[2]
            {  }
        }
    }
}

But other than that, it is simple once you have done semantic analysis, once you have done optimizations to say, “let’s output to a target that reads the same as the language we are working in.”

If I tried to make this compile to Swift or C, that would have been a bit more difficult because then is this number and integer a floating point number? Is this a string or a character? Stack allocated, heap allocated? That is the mismatch that you can get when trying to think in terms of what does a traditional compiler do? It depends on what language you are compiling, to begin with.

Using The Compiler (17:25)


$ cat fibonacci.spq
(def fib x (condition (<= x 1) x (+ (fib (- x 1)) (fib (- x 2)))))
(puts (fib ([] ARGV 0)))
$ sipquick fibonacci.spq fibonacci
$ ./fibonacci 10
55

I like the .spq extension. You write a file, call it Sipquick. That file you can optionally say what file to output it to then, you can call it like any normal executable and pass in arguments.

Testing The Compiler (17:50)

I only wrote integration tests. They basically compile a file, run it, and verify that the output is exactly what I expected. Only testing this is how things should work, this is how a well-formed program should compile and run, not the other way around. And I wrote zero unit tests.

This is what a test case looked like:


(def print_even x (condition (== (% x 2) 0) (print "even") (print "odd")))
(print_even 17)
(print_even 12)
(print_even -1)
(print_even 1)
(print_even (* 0 1))
/////
it allows branching
0oddevenoddoddeven

It basically had the Sipquick on top, five slashes, followed by the name of the test case, the expected exit code and the expected output to stdout. And the test runner.

It has the idea of this is a test. It has a name, a script that you run, the output that you expect, and an exit code that you expect. Then you just run it and make sure that those things are equal. Again, figure out what is the path to the compiler or what is the path that has all of the spec files, find all of them, instantiate those test cases, run them all and see if they all pass.

This is the last one that I wrote today just this morning:


(def fizzbuzz x
    (condition (<= x 0) (return "")
        (condition (== 0 (% x 15)) (+ (fizzbuzz (- x 1)) "fizzbuzz")
            (condition (== 0 (% x 3)) (+ (fizzbuzz (- x 1)) "fizz")
                (condition (== 0 (% x 5)) (+ (fizzbuzz (- x 1)) "buzz")
                (+ (fizzbuzz (- x 1)) (String x)))))))
(def fetch_arg position default
    (condition (!= nil ([] ARGV position))
    ([] ARGV position)
    (return default)))
(print (fizzbuzz (fetch_arg 0 100)))
/////
it computes fizzbuzz
012fizz4buzzfizz78fizzbuzz11fizz1314fizzbuzz1617fizz19buzzfizz2223fizzbuzz26fizz2 829fizzbuzz3132fizz34buzzfizz3738fizzbuzz41fizz4344fizzbuzz4647fizz49buzzfizz525 3fizzbuzz56fizz5859fizzbuzz6162fizz64buzzfizz6768fizzbuzz71fizz7374fizzbuzz7677f izz79buzzfizz8283fizzbuzz86fizz8889fizzbuzz9192fizz94buzzfizz9798fizzbuzz

I tried to write FizzBuzz, had to fix a few bugs in the compiler to make it work and had to spend a while wondering why things were not working because I had mismatched parentheses because everything only fits on one line but it works. I wrote FizzBuzz in my own language!

To Do (20:00)

I wrote bad Swift code and things are structs that should be enums.
There are no optimizations. Ruby code is not the fastest in the world, and I am sure that my thing that compiles to Ruby is going to be even slower.
Allow defining new variables in function scope. Stuff like how functions work are less than ideal.
There are no error messages. Things just fatal error if we do not produce a proper s-expression. That makes figuring out why things are failing to compile difficult.
There are no semantic analysis errors. You can define functions with the wrong number of arguments, and call functions with the wrong number of arguments and you can define a variable that is actually a function, and it compiles to Ruby now and it probably should not.
It would be really fun to explore compiling to machine code.

Also:

There is no proper standard library. I am piggybacking off of Ruby’s standard library right now.
Comments do not work (I know several friends who would yell at me for building a programming language that actually disallows commenting).
There is no notion of Lambdas, which is really weird if you think of Lisp without Lambdas.
Expressions can only have a single expression in them, so function bodies have to be purely functional right now. There is no way to let’s say print a value and then call another function.

Lessons Learned & Conclusion (22:13)

Number one was: do not write a compiler for a conference talk. It is a lot of work. Writing a compiler is really hard. There is a reason that you have compiler engineers as their job title.

Testing a compiler is really necessary because I did not write a single thing that worked right on the first try.

String parsing in Swift is really hard. Being Unicode aware is great, but it also makes parsing like source files difficult.

Error messages are hard. This seems obvious but it is really hard to report why does this thing fail at parse, what is wrong with it, how do I say, you are missing a closing parenthesis here?

The LLVM API which I wanted to use is meant for typed languages where you can infer that this variable is an int32, or this function has this function signature.

Implementing your own programming language is fun and rewarding. On my GitHub, I have a programming language. Real programming languages are much better than what I wrote in the past week, but they take time and effort and know-how. So next time I want to complain about how Swift C fails to do something, I think I might understand a bit better why. These things are hard.

Resources

Next Up: A Tale of Debugging on Realm's Android Team

About the content

This talk was delivered live in September 2016 at try! Swift NYC. The video was recorded, produced, and transcribed by Realm, and is published here with the permission of the conference organizers.

Samuel Giddins

Sam’s hobby of contributing to open source projects while learning to code professionally soon became his true profession. He’s a core committer to CocoaPods, a maintainer of RestKit, and even a Stripe Open-Source retreat grantee. Currently, Sam works at Realm as an iOS developer. ✋

Twitter

4 design patterns for a RESTless mobile integration »