Writing a C compiler in 500 lines of Python (2023)

Writing a C compiler in 500 lines of Python Posted August 30, 2023

A few months ago, I set myself the challenge of writing a C compiler in 500 lines of Python , after writing my SDF donut post. How hard could it be? The answer was, pretty hard, even when dropping quite a few features. But it was also pretty interesting, and the result is surprisingly functional and not too hard to understand!

There's too much code for me to comprehensively cover in a single blog post , so I'll just give an overview of the decisions I made, things I had to cut, and the general architecture of the compiler, touching on a representative piece of each part. Hopefully after reading this post, the code is more approachable!

Decisions, decisions

The first, and most critical decision, was that this would be a single-pass compiler. 500 lines is too spare to be defining and transforming an abstract syntax tree! What does that mean?

Most compilers: faffing around with syntax trees

Well, most compiler's internals look something like this:

The tokens get lexed, then a parser runs over them and builds pretty little syntax trees:

# hypothetical code, not from anywhere def parse_statement ( lexer ) -> PrettyLittleSyntaxTree: ... if type := lexer.try_next(TYPE_NAME): variable_name = lexer.next(IDENTIFIER) if lexer.try_next( "=" ): initializer = parse_initializer(lexer) else : initializer = None lexer.next(SEMICOLON) return VariableDeclarationNode( type = type , name = variable_name, initializer = initializer, ) ... # much later... def emit_code_for ( node : PrettyLittleSyntaxTree) -> DisgustingMachineCode: ... if isinstance (node, VariableDeclarationNode): slot = reserve_stack_space(node.type.sizeof()) add_to_environment(node.name, slot) if node.initializer is not None : register = emit_code_for(node.initializer) emit( f "mov {register} , [ {slot} ]" ) ...

The important thing here is that there's two passes, first the parsing builds up a syntax tree, then a second pass chews that tree up and turns it into machine code. That's really useful for most compilers! It keeps the parsing and codegen separate, so each can evolve independently. It also means that you can transform the syntax tree before using it to generate code—for example, by applying optimizations to it. In fact, most compilers have multiple levels of "intermediate representations" between the syntax tree and codegen!

... continue reading