TL;DR Writing micro compiler in OCaml
At one point or another every single software developer in the world comes to a realization in his career when the time is ripe and it’s time to write your own super cool programming language.
However the subject of creating your own programming language with an compiler is quite a complex one and can’t be tackled without some pre-research. That’s how I’ve started reading Crafting Compiler in C, an aged but really comprehensive book about developing your own compiler for an Ada-like programming language. Second chapter describes writing a really simple micro language targeting pseudo assembly-like output in order to explain the core concepts of developing your own compiler and writing an LL(1) parser.
Let’s try rewriting this micro compiler in OCaml, a language better suited for writing compilers that is becoming quite popular due to it’s clean syntax and strict evaluation semantics combined with functional and object-oriented programming styles. If you are not familiar with OCaml try reading Real World OCaml first. Instead of outputting pseudo assembly our micro compiler will output a real nasm source code which will be automatically compiled into a binary executable file.
So let’s start by describing simple micro language with an example source code
1 2 3 4 5 6 7 8 begin a := 1; b := a + 1; b := b + 1; write (a,b); read(a,b); write (a+10, b+10); end
As you can see from example source code the program starts with begin keyword and ends with an end keyword. It has only integer variables which must be predefined by assignment operation before using in expressions, and it also has two simple functions read and write.
read takes a list of variable names separated by comma and reads user input from stdin into those variables
write takes a list of expressions and outputs them into stdout
Now in order to create an executable from this source code first we need to parse it. Since LL(1) type parser is enough to parse this kind of language, we’ll need only one character lookahead. Unfortunately OCaml doesn’t have an unread operation like libc’s ungetc so we’ll need to define a simple stream reader which will have a mutable char and we will also count lines of source code read. We’ll also define two utility functions which we will use later which will check if a character is alphanumeric and if it’s a digit.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (* stream *) type stream = { mutable chr : char option ; mutable line_num : int ; chan : in_channel } let open_stream file = { chr = None ; line_num = 1 ; chan = open_in file } let close_stream stm = close_in stm . chan let read_char stm = match stm . chr with None -> let c = input_char stm . chan in if c = '
' then let _ = stm . line_num <- stm . line_num + 1 in c else c | Some c -> stm . chr <- None ; c let unread_char stm c = stm . chr <- Some c (* character *) let is_digit c = let code = Char . code c in code >= Char . code ( '0' ) && code <= Char . code ( '9' ) let is_alpha c = let code = Char . code c in ( code >= Char . code ( 'A' ) && code <= Char . code ( 'Z' )) || ( code >= Char . code ( 'a' ) && code <= Char . code ( 'z' ))
Now for our parser we will be parsing source code one token at a time and we’ll be recursively calling parsing methods and matching tokens as we go. We’ll define some additional utility functions which will be used during parsing including an Syntax_error exception which will be thrown if invalid token is scanned.
scan will scan the stream for next token, that’s where our token recognition logic is in.
skip_blank_chars function will skip any number of blank characters including new line characters
next_token will return last scanned token or will scan stream for next token
match_next will match last scanned token or will scan stream for next token, match it and return it
match_token will match the last scanned token against the specified token in parameter or will scan stream for next token and match against it
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 (* token *) type token = Begin | End | Identifier of string | Read | Write | Literal of int | Assign | LeftParen | RightParen | AddOp | SubOp | Comma | Semicolon type scanner = { mutable last_token : token option ; stm : stream } exception Syntax_error of string let syntax_error s msg = raise ( Syntax_error ( msg ^ " on line " ^ ( string_of_int s . stm . line_num ))) (* skip all blank and new line characters *) let rec skip_blank_chars stm = let c = read_char stm in if c = ' ' || c = '\t' || c = '\r' || c = '
' then skip_blank_chars stm else unread_char stm c ; () (* scan a stream and return next token *) let scan s = let stm = s . stm in let c = read_char stm in let rec scan_iden acc = let nc = read_char stm in if is_alpha nc || is_digit nc || nc = '_' then scan_iden ( acc ^ ( Char . escaped nc )) else let _ = unread_char stm nc in let lc = String . lowercase acc in if lc = "begin" then Begin else if lc = "end" then End else if lc = "read" then Read else if lc = "write" then Write else Identifier acc in let rec scan_lit acc = let nc = read_char stm in if is_digit nc then scan_lit ( acc ^ ( Char . escaped nc )) else let _ = unread_char stm nc in Literal ( int_of_string acc ) in if is_alpha c then scan_iden ( Char . escaped c ) else if is_digit c then scan_lit ( Char . escaped c ) else if c = '+' then AddOp else if c = '-' then SubOp else if c = ',' then Comma else if c = ';' then Semicolon else if c = '(' then LeftParen else if c = ')' then RightParen else if c = ':' && read_char stm = '=' then Assign else syntax_error s "couldn't identify the token" let new_scanner stm = { last_token = None ; stm = stm } let match_next s = match s . last_token with None -> let _ = skip_blank_chars s . stm in scan s | Some tn -> s . last_token <- None ; tn let match_token s t = match_next s = t let next_token s = match s . last_token with None -> ( skip_blank_chars s . stm ; let t = scan s in s . last_token <- Some t ; t ) | Some t -> t
In order to generate an asm output we’ll define an generator type which will contain the output channel and variable locations Hashtbl. Each variable’s location will be defined as an integer offset from esp and since our micro language is simple we won’t have to handle advanced aspects of variable scope and stack handling.
1 2 3 4 5 6 7 8 9 (* code generation *) type generator = { vars : ( string , int ) Hashtbl . t ; file : string ; chan : out_channel } let new_generator file = let fs = ( Filename . chop_extension file ) ^ ".s" in { vars = Hashtbl . create 100 ; file = fs ; chan = open_out fs } let close_generator g = close_out g . chan let gen g v = output_string g . chan v ; output_string g . chan "
"
We’ll also need to distinguish between ordinary variables and an temporary ones. Our temporary variables will be defined automatically and will start from __temp following variable location offset. This way we won’t be storing temporary variables in Hashtbl and will be computing their offset from their name. We’ll also define some additional generator helper functions to output asm code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 let bottom_var _ g = Hashtbl . fold ( fun _ v c -> if v >= c then ( v + 4 ) else c ) g . vars 0 let empty_var s g i = ( bottom_var s g )+( 4 *( i - 1 )) let var_addr s g v = if String . length v > 6 && String . sub v 0 6 = "__temp" then let i = String . sub v 6 (( String . length v ) - 6 ) in "[esp+" ^ i ^ "]" else try "[esp+" ^ string_of_int ( Hashtbl . find g . vars v ) ^ "]" with Not_found -> syntax_error s ( "identifier " ^ v ^ " not defined" ) let var s g v = "dword " ^ ( var_addr s g v ) let temp_var s g i = Identifier ( "__temp" ^ ( string_of_int ( empty_var s g i ))) let is_alloc_var _ g v = Hashtbl . mem g . vars v let alloc_var s g v = if is_alloc_var s g v then var s g v else let _ = Hashtbl . replace g . vars v ( empty_var s g 1 ) in var s g v let token_var s g v = match v with Identifier i -> var s g i | _ -> syntax_error s "identifier expected" let op g opcode a = gen g ( " " ^ opcode ^ " " ^ a ) let op2 g opcode a b = gen g ( " " ^ opcode ^ " " ^ a ^ ", " ^ b ) let push g a = op g "push" a
Now in order to compile a source code we need to create a new generator, open an stream on a source file, parse it, compile the asm source code using nasm and link the generated .o file into an elf executable.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 (* compiling *) let compile file = try let g = new_generator file in let stm = open_stream file in let out = Filename . chop_extension file in parse stm g ; close_stream stm ; close_generator g ; let _ = Sys . command ( "nasm -f elf " ^ g . file ) in let _ = Sys . command ( "gcc -o " ^ out ^ " " ^ out ^ ".o" ) in () with Syntax_error e -> Format . printf "syntax error: %s
" e ; Format . print_flush () | Sys_error _ -> print_string "no such file found
" let help () = print_string "micro
" let () = if Array . length Sys . argv = 1 then help () else let file = Array . get Sys . argv 1 in Format . printf "compiling %s
" file ; Format . print_flush () ; compile file
Our parsing method will be combined with semantics checking and will output asm code using generator functions which we will define later. Program begins from begin keyword, ends with end keyword and has 0 or more statements in between.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 let parse stm g = let s = ( new_scanner stm ) in try program s g with End_of_file -> syntax_error s "program reached end of file before end keyword" let program s g = if match_token s Begin then let _ = generate_begin s g in let _ = statements s g in if match_token s End then let _ = generate_end s g in () else syntax_error s "program should end with end keyword" else syntax_error s "program should start with begin keyword" let rec statements s g = if statement s g then statements s g else ()
Each statement is either an read, write or an assignment to a variable.
1 2 3 4 5 6 7 8 9 10 11 12 let rec statements s g = if statement s g then statements s g else () let statement s g = let t = next_token s in if match t with Read -> read s g | Write -> write s g | Identifier i -> assignment s g | _ -> false then if match_token s Semicolon then true else syntax_error s "statement must end with semicolon" else false
Each assignment statement has an identifier token on it’s left side followed by an assignment token and expression on the right hand side.
1 2 3 4 5 6 7 8 9 10 11 let assignment s g = let id = match_next s in match id with Identifier i -> ( if match_token s Assign then let new_var = if is_alloc_var s g i then 0 else 1 in let id2 = expression s g ( 1 + new_var ) in match id2 with Literal l2 -> let _ = generate_assign s g id id2 in true | Identifier i2 -> let _ = generate_assign s g id id2 in true | _ -> syntax_error s "literal or identifier expected" else syntax_error s "assign symbol expected" ) | _ -> syntax_error s "identifier expected"
Each expression statement is primary optionally followed by an operation token and another primary. Primary might also be an expression in curly brackets.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 let rec expression s g d = let primary s = match ( next_token s ) with LeftParen -> ( let _ = match_token s LeftParen in let e = expression s g ( d + 1 ) in if match_token s RightParen then Some e else syntax_error s "right paren expected in expression" ) | Identifier i -> let _ = match_token s ( Identifier i ) in Some ( Identifier i ) | Literal l -> let _ = match_token s ( Literal l ) in Some ( Literal l ) | _ -> None in let lp = primary s in match lp with Some l -> ( match ( next_token s ) with AddOp -> let _ = match_token s AddOp in addop s g d l ( expression s g ( d + 1 )) | SubOp -> let _ = match_token s SubOp in subop s g d l ( expression s g ( d + 1 )) | _ -> l ) | None -> syntax_error s "literal or identifier expected"
Our micro language supports only two operations on integers, addition and subtraction, but it can be easily extended to support more.
1 2 3 4 5 6 7 8 9 10 let addop s g d l r = match ( l , r ) with ( Literal l1 , Literal l2 ) -> Literal ( l1 + l2 ) | ( Identifier i1 , Literal l2 ) -> generate_add s g d l r | ( Literal l1 , Identifier i2 ) -> generate_add s g d r l | _ -> syntax_error s "expected literal or identifier for add operation" let subop s g d l r = match ( l , r ) with ( Literal l1 , Literal l2 ) -> Literal ( l1 - l2 ) | ( Identifier i1 , Literal l2 ) -> generate_sub s g d l r | ( Literal l1 , Identifier i2 ) -> generate_sub s g d l r | _ -> syntax_error s "expected literal or identifier for sub operation"
write statement is just comma separated list of expressions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 let write s g = let rec expressions c = let e = ( expression s g 1 ) in if match e with Identifier _ -> let _ = generate_write s g e in true | Literal _ -> let _ = generate_write s g e in true | _ -> false then if ( next_token s ) = Comma then let _ = match_token s Comma in expressions ( c + 1 ) else ( c + 1 ) else c in if match_token s Write then if match_token s LeftParen then if expressions 0 > 0 then if match_token s RightParen then true else syntax_error s "right paren expected in write statement" else syntax_error s "write statement expected atleast one expression" else syntax_error s "left paren expected in write statement" else syntax_error s "write statement expected"
read statement is a comma separated list of identifiers
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 let read s g = if match_token s Read then if match_token s LeftParen then let ids = identifiers s in if ids = [] then syntax_error s "read statement expects comma seperated identifier(s)" else if match_token s RightParen then let _ = generate_reads s g ( List . rev ids ) in true else syntax_error s "right paren expected in read statement" else syntax_error s "left paren expected in read statement" else syntax_error s "read statement expected" let identifiers s = let rec idens ids = match ( next_token s ) with Identifier i -> let _ = match_next s in let n = next_token s in if n = Comma then let _ = match_token s Comma in idens ( Identifier i :: ids ) else idens ( Identifier i :: ids ) | _ -> ids in idens []
Now it’s finally time to generate some asm code, let’s start from begin and end of our program. Our program will preallocate 1000 bytes on stack for variables, since all of our variables are static. We’ll also need to define external libc functions scanf and printf.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 let generate_begin _ g = gen g "extern printf extern scanf section .data inf: db '%d', 0 ouf: db '%d', 10, 0 section .text global main main: sub esp, 1000" let generate_end _ g = gen g " add esp, 1000 exit: mov eax, 1 ; sys_exit mov ebx, 0 int 80h"
read and write statements will use libc scanf and printf functions to read integer variables from stdin and output them on stdout.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 let generate_read s g id = match id with Identifier i -> ( op2 g "lea" "eax" ( var_addr s g i ); push g "eax" ; push g "inf" ; op g "call" "scanf" ; op2 g "add " "esp" "8" ) | _ -> syntax_error s "generate read called with invalid argument" let rec generate_reads s g = List . iter ( generate_read s g ) let generate_write s g id = match id with Identifier i -> ( push g ( var s g i ); push g "ouf" ; op g "call" "printf" ; op2 g "add " "esp" "8" ) | _ -> syntax_error s "generate write called with invalid argument"
Assignment statement is just an allocation of a variable followed by a copying variable from one location into another. Our copy function understands literal variables and translates their values directly into asm code.
1 2 3 4 5 6 7 8 9 10 11 let generate_assign s g a b = match a with Identifier i -> let _ = alloc_var s g i in generate_copy s g a b | _ -> syntax_error s "generate assign called with invalid argument" let generate_copy s g a b = match a with Identifier i -> ( match b with Identifier i2 -> ( op2 g "mov " "eax" ( var s g i2 ); op2 g "mov " ( var s g i ) "eax" ) | Literal l -> op2 g "mov " ( var s g i ) ( string_of_int l ) | _ -> syntax_error s "generate copy called with invalid argument" ) | _ -> syntax_error s "generate copy called with invalid argument"
Addition and subtraction operations use temporary variables to add and subtract values from two variables and return result
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 let generate_add s g d id1 id2 = match ( id1 , id2 ) with ( Identifier i1 , Identifier i2 ) -> ( let v = temp_var s g d in let vi = token_var s g v in let _ = generate_copy s g v id1 in let _ = op2 g "add " vi ( var s g i2 ) in v ) | ( Identifier i1 , Literal l2 ) -> ( let v = temp_var s g d in let vi = token_var s g v in let _ = generate_copy s g v id1 in let _ = op2 g "add " vi ( string_of_int l2 ) in v ) | _ -> syntax_error s "generate exp called with invalid argument" let generate_sub s g d id1 id2 = match ( id1 , id2 ) with ( Identifier i1 , Identifier i2 ) -> ( let v = temp_var s g d in let vi = token_var s g v in let _ = generate_copy s g v id1 in let _ = op2 g "sub " vi ( var s g i2 ) in v ) | ( Identifier i1 , Literal l2 ) -> ( let v = temp_var s g d in let vi = token_var s g v in let _ = generate_copy s g v id1 in let _ = op2 g "sub " vi ( string_of_int l2 ) in v ) | ( Literal l1 , Identifier i2 ) -> ( let v = temp_var s g d in let vi = token_var s g v in let _ = generate_copy s g v id1 in let _ = op2 g "sub " vi ( var s g i2 ) in v ) | _ -> syntax_error s "generate exp called with invalid argument"
And that’s all of it! We now have a trivial micro compiler that generates binary executable file. Source code can be found in my github micro repo. Writing an compiler is quite complex and entertaining task but it’s definitely worth the time spend on!