In this blog post I’ll explain strategies I used to make the purple garden lexer really fast.
purple-garden is an s-expr based language I am currently developing for myself. It's my attempt at building a language I like, with a batteries-included approach, while designing it with performance in mind.
This doesn’t mean all approaches are feasible for your use case, architecture and design. I tried to bring receipts for my performance claims, so watch out for these blocks at the end of chapters:
Info - Benchmark
Introduction to Lexing
A lexer (often also referred to as a tokeniser) is the easiest part of any compilation and language pipeline. The idea is to convert a list of characters into a list of tokens in which each token conveys some meaning. This list of tokens can then be used by the parser to generate an abstract syntax tree (AST), which the compiler consumes, converting it to bytecode, which the vm executes.
A compilation pipeline
As an overview:
```text
┌───────────────────┐
│                   │
│       Lexer       │ <- we are here
│                   │
└─────────┬─────────┘
          │
        Tokens        <- and here
          │
┌─────────▼─────────┐
│                   │
│       Parser      │
│                   │
└─────────┬─────────┘
          │
         AST
          │
┌─────────▼─────────┐
│                   │
│      Compiler     │
│                   │
└─────────┬─────────┘
          │
       Bytecode
          │
┌─────────▼─────────┐
│                   │
│  Virtual Machine  │
│                   │
└───────────────────┘
```
For a list of characters, let's say `(@std.fmt.println "my pi is: " 3.1415)`:
Input to the lexer:

```text
(@std.fmt.println "my pi is: " 3.1415)
```

As characters:

```text
(
@
s
t
d
.
f
m
t
// ....
)
```

As pseudo tokens:

```text
(
@std
.
fmt
.
println
"my pi is: "
3.1415
)
```
The above is just an example and I’ll go into detail below:
Defining purple garden’s Tokens
A token is not only the set of characters it can be mapped to, it also holds:
A token type, to easily distinguish between tokens
Positional information: start point, end or length, line number
Purple garden keeps it as minimal as possible and just has a String and a token type:
```c
typedef enum {
  // (
  T_DELIMITOR_LEFT = 1,
  // +
  T_PLUS = 2,
  // -
  T_MINUS = 3,
  // *
  T_ASTERISKS = 4,
  // /
  T_SLASH = 5,
  // =
  T_EQUAL = 6,
  // )
  T_DELIMITOR_RIGHT,
  // [
  T_BRAKET_LEFT,
  // ]
  T_BRAKET_RIGHT,
  // anything between ""
  T_STRING,
  // true
  T_TRUE,
  // false
  T_FALSE,
  // floating point numbers
  T_DOUBLE,
  // whole numbers
  T_INTEGER,
  // builtins in the format @
  T_BUILTIN,
  // any identifier
  T_IDENT,
  // end marker
  T_EOF,
} TokenType;

typedef struct {
  TokenType type;
  // stores all values for T_STRING, T_IDENT, T_INTEGER and T_DOUBLE
  Str string;
} Token;
```
Architecting a minimal Lexer
As explained in A compilation pipeline, the lexer iterates over the characters in the input and attempts to match found characters to sets of characters, like keywords, numbers and symbols. For this we will need to keep some state, even though it's not much:
```c
typedef struct {
  Str input;  // <- String view over the input
  size_t pos; // <- current position in the input
} Lexer;
```
Str as an abstraction is introduced in Operating on zero copy, zero alloc string windows.
A naive approach would be:
```c
#define SINGLE_TOK(t) ((Token){.type = t})

Lexer Lexer_new(Str str) {
  return (Lexer){
      .input = str,
      .pos = 0,
  };
}

Token Lexer_next(Lexer *l) {
  if (l->input.len == 0 || l->pos >= l->input.len) {
    return SINGLE_TOK(T_EOF);
  }

  Token t;
  switch (Str_get(&l->input, l->pos)) {
  case '+':
    t = SINGLE_TOK(T_PLUS);
    break;
  case '-':
    t = SINGLE_TOK(T_MINUS);
    break;
  case '*':
    t = SINGLE_TOK(T_ASTERISKS);
    break;
  case '/':
    t = SINGLE_TOK(T_SLASH);
    break;
  // ... things like strings, quoted symbols, numbers, comments, whitespace :^)
  default:
    t = SINGLE_TOK(T_EOF);
    break;
  }
  l->pos++;
  return t;
}
```
And one could consume the API as follows (even with a little type->string lookup for debugging):
```c
const char *TOKEN_TYPE_MAP[] = {
    [T_PLUS] = "T_PLUS",
    [T_MINUS] = "T_MINUS",
    [T_ASTERISKS] = "T_ASTERISKS",
    [T_SLASH] = "T_SLASH",
    [T_EOF] = "T_EOF",
};

Str s = STRING("+-*/");
Lexer l = Lexer_new(s);
while (l.pos < l.input.len) {
  Token t = Lexer_next(&l);
  puts(TOKEN_TYPE_MAP[t.type]);
}
```
Tip - Designated initializers

Designated initializers (`[] =`) are allowed by ISO C99, while designated initializers with ranges (`[ ... ] =`) are a GNU extension, see 6.2.11 Designated Initializers.
Making it fast
Let's get into some optimisations and strategies for creating a clean, minimal and especially fast lexer.
Computed gotos, or: Threaded Lexing
Threaded lexing is a reference to threaded code, see Wikipedia. The idea is to replace the switch statement with a jump to a block inside of the lexer. The easiest way I could think of implementing this was to:
Define a jump table mapping lexeme start characters to their respective blocks:

```c
static void *jump_table[256] = {
    // first bind all possible bytes to the unknown label, so there are no out
    // of bound reads
    [0 ... 255] = &&unknown,

    // replace all known bytes with correct jump labels
    ['('] = &&delimitor_left,
    [')'] = &&delimitor_right,
    ['['] = &&braket_left,
    [']'] = &&braket_right,
    ['@'] = &&builtin,
    ['+'] = &&plus,
    ['-'] = &&minus,
    ['/'] = &&slash,
    ['*'] = &&asterisks,
    ['='] = &&equal,
    [' '] = &&whitespace,
    ['\t'] = &&whitespace,
    ['\n'] = &&whitespace,
    [';'] = &&comment,
    ['.'] = &&number,
    ['0' ... '9'] = &&number,
    ['a' ... 'z'] = &&ident,
    ['A' ... 'Z'] = &&ident,
    ['\''] = &&quoted,
    ['_'] = &&ident,
    ['"'] = &&string,

    // handle the edge case of being at the end of the input
    [0] = &&end,
};
```

Create a macro for jumping to the label:

```c
#define JUMP_TARGET goto *jump_table[(int8_t)l->input.p[l->pos]]
```

At the start of the lexer, call the macro:

```c
JUMP_TARGET;
```
Putting it all together lets us build the following threaded lexer:
```c
size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
  ASSERT(out != NULL, "Failed to allocate token list");

  // empty input
  if (l->input.len == 0) {
    return 1;
  }

  static void *jump_table[256] = {
      [0 ... 255] = &&unknown,
      ['('] = &&delimitor_left,
      [')'] = &&delimitor_right,
      ['['] = &&braket_left,
      [']'] = &&braket_right,
      ['@'] = &&builtin,
      ['+'] = &&plus,
      ['-'] = &&minus,
      ['/'] = &&slash,
      ['*'] = &&asterisks,
      ['='] = &&equal,
      [' '] = &&whitespace,
      ['\t'] = &&whitespace,
      ['\n'] = &&whitespace,
      [';'] = &&comment,
      ['.'] = &&number,
      ['0' ... '9'] = &&number,
      ['a' ... 'z'] = &&ident,
      ['A' ... 'Z'] = &&ident,
      ['\''] = &&quoted,
      ['_'] = &&ident,
      ['"'] = &&string,
      [0] = &&end,
  };

#define JUMP_TARGET goto *jump_table[(int8_t)l->input.p[l->pos]]

  JUMP_TARGET;

delimitor_left:
  JUMP_TARGET;

delimitor_right:
  JUMP_TARGET;

braket_left:
  JUMP_TARGET;

braket_right:
  JUMP_TARGET;

builtin:
  JUMP_TARGET;

plus:
  JUMP_TARGET;

minus:
  JUMP_TARGET;

slash:
  JUMP_TARGET;

equal:
  JUMP_TARGET;

asterisks:
  JUMP_TARGET;

number:
  JUMP_TARGET;

ident:
  JUMP_TARGET;

quoted:
  JUMP_TARGET;

string:
  JUMP_TARGET;

comment:
  JUMP_TARGET;

whitespace:
  JUMP_TARGET;

unknown:
  return 1;

end:
  return 0;
}
```
This replaces the switch with an array index and an indirect jump, which is significantly faster due to:
Better code density
Fewer cache misses, better branch prediction
Warning

There are two downsides to this approach:

It is not supported by the MSVC toolchain (clang and gcc only)
It makes reading and debugging the lexer far more complicated
Abstracting allocations via the Allocator interface
I want to allow the user of purple-garden to choose between allocation strategies, like my garbage collector and a bump allocator. For this I need an interface to marry several backing implementations into a single API:
```c
#ifndef MEM_H
#define MEM_H

#include <stddef.h>

#ifdef DEBUG
#if DEBUG
#define VERBOSE_ALLOCATOR 1
#endif
#else
#define VERBOSE_ALLOCATOR 0
#endif

// 1MB
#define GC_MIN_HEAP 1024 * 1024

typedef struct {
  size_t current;
  size_t allocated;
} Stats;

// CALL is used to emulate method calls by calling a METHOD on SELF with
// SELF->ctx and __VA_ARGS__, this is useful for interface interaction, such as
// Allocator, which reduces alloc_bump.request(alloc_bump.ctx, 64); to
// CALL(alloc_bump, request, 64), removing the need for passing the context in
// manually
#if VERBOSE_ALLOCATOR
#include <stdio.h>
#define CALL(SELF, METHOD, ...)                                                \
  (fprintf(stderr, "[ALLOCATOR] %s@%s::%d: %s->%s(%s)\n", __FILE__, __func__,  \
           __LINE__, #SELF, #METHOD, #__VA_ARGS__),                            \
   (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__))
#else
#define CALL(SELF, METHOD, ...) (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__)
#endif

// Allocator defines an interface abstracting different allocators, so the
// runtime of the virtual machine does not need to know about implementation
// details, can be used like this:
//
//
//  #define ALLOC_HEAP_SIZE = 1024
//  Allocator alloc_bump = bump_init(ALLOC_HEAP_SIZE);
//
//  size_t some_block_size = 16;
//  void *some_block = alloc_bump.request(alloc_bump.ctx, some_block_size);
//
//  alloc_bump.destroy(alloc_bump.ctx);
//
typedef struct {
  // Allocator::ctx refers to an internal allocator state and owned memory
  // areas, for instance, a bump allocator would attach its meta data (current
  // position, cap, etc) here
  void *ctx;

  // Allocator::stats is expected to return the current statistics of the
  // underlying allocator
  Stats (*stats)(void *ctx);
  // Allocator::request returns a handle to a block of memory of size `size`
  void *(*request)(void *ctx, size_t size);
  // Allocator::destroy cleans state up and deallocates any owned memory areas
  void (*destroy)(void *ctx);
} Allocator;

Allocator *bump_init(size_t size);
Allocator *xcgc_init(size_t size, void *vm);

#endif
```
For instance, this allocator is then passed to the virtual machine, which uses `Allocator->request(Allocator->ctx, sizeof(Value))` to allocate a new runtime value.
I included a CALL macro for cleaning this duplicated context passing up:
```c
Allocator *a = bump_init(1024);

// before:
void *block = a->request(a->ctx, 512);
a->destroy(a->ctx);

// after:
void *block = CALL(a, request, 512);
CALL(a, destroy);
```
This macro also enables verbose allocation insights for debugging purposes:
```text
[ALLOCATOR] vm.c@freelist_preallocate::59: fl->alloc->request(sizeof(Frame))
[ALLOCATOR] main.c@main::245: vm.alloc->stats()
[ALLOCATOR] main.c@main::255: pipeline_allocator->destroy()
[ALLOCATOR] vm.c@Vm_destroy::283: vm->alloc->destroy()
```
The implementation of said interface is straightforward:
```c
// Bump allocator header
typedef struct {
  // points to the start of the allocated block from which Allocator::request
  // will hand out aligned chunks
  void *block;
  // the size of said allocated block
  size_t len;
  // the current amount of bytes in use
  size_t pos;
} BumpCtx;

void *bump_request(void *ctx, size_t size) {
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  size_t align = sizeof(void *);
  b_ctx->pos = (b_ctx->pos + align - 1) & ~(align - 1);
  ASSERT(b_ctx->pos + size <= b_ctx->len, "OOM :( with %zu", b_ctx->len);
  void *block_entry = (char *)b_ctx->block + b_ctx->pos;
  b_ctx->pos += size;
  return block_entry;
}

void bump_destroy(void *ctx) {
  ASSERT(ctx != NULL, "bump_destroy on already destroyed allocator");
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  // madvise(2):
  // The application no longer requires the pages in the range
  // specified by addr and len. The kernel can thus free these
  // pages, but the freeing could be delayed until memory pressure
  // occurs.
  madvise(b_ctx->block, b_ctx->len, MADV_FREE);
  int res = munmap(b_ctx->block, b_ctx->len);
  ASSERT(res == 0, "munmap failed");
  free(ctx);
}

Stats bump_stats(void *ctx) {
  BumpCtx *b_ctx = (BumpCtx *)ctx;
  return (Stats){.allocated = b_ctx->len, .current = b_ctx->pos};
}

Allocator *bump_init(size_t size) {
  long page_size = sysconf(_SC_PAGESIZE);
  size = (size + page_size - 1) & ~(page_size - 1);

  void *b = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  ASSERT(b != MAP_FAILED, "failed to mmap allocator buffer");

  BumpCtx *ctx = malloc(sizeof(BumpCtx));
  ASSERT(ctx != NULL, "failed to malloc bump allocator context");
  ctx->len = size;
  ctx->pos = 0;
  ctx->block = b;

  Allocator *a = malloc(sizeof(Allocator));
  ASSERT(a != NULL, "failed to alloc bump allocator");
  a->ctx = (void *)ctx;
  a->destroy = bump_destroy;
  a->request = bump_request;
  a->stats = bump_stats;

  return a;
}
```
Info - Benchmarks

This reduced the total runtime by around 24ms, making it 1.58x faster, see 4251b6b. The separate parsing stage was due to me experimenting with attaching the parser to the compiler, but this made the whole thing a bit slower than just finishing parsing before compiling.

```text
mem+main+cc: replace dynamic alloc with bump allocator (~1.6x faster)
- preallocated 4MB for bytecode and 256KB for the global pool to improve
  memory management.
- cc now takes an allocator to omit inline memory allocations for global
  pool entries and bytecode creation
- with previous changes shaving off 24ms

Old perf (before b59737a)

[    0.0070ms] main::Args_parse: Parsed arguments
[    0.0080ms] io::IO_read_file_to_string: mmaped input
[   41.2460ms] parser::Parser_run: Transformed source to AST
[   20.7330ms] cc::cc: Flattened AST to byte code
[    0.9640ms] mem::Allocator::destroy: Deallocated AST memory space
[    2.1270ms] vm::Vm_run: Walked and executed byte code
[    0.5560ms] vm::Vm_destroy: Deallocated global pool and bytecode list

New:

[    0.0050ms] main::Args_parse: Parsed arguments
[    0.0120ms] io::IO_read_file_to_string: mmaped input
[   37.8600ms] cc::cc: Flattened AST to byte code
[    2.3540ms] vm::Vm_run: Walked and executed byte code
[    0.7380ms] mem::Allocator::destroy: Deallocated AST memory space

$ hyperfine "./purple_garden examples/bench.garden" "../purple_garden_old/purple_garden examples/bench.garden"
Benchmark 1: ./purple_garden examples/bench.garden
  Time (mean ± σ):      42.2 ms ±   0.7 ms    [User: 28.0 ms, System: 13.8 ms]
  Range (min … max):    41.3 ms …  45.8 ms    70 runs

Benchmark 2: ../purple_garden_old/purple_garden examples/bench.garden
  Time (mean ± σ):      66.6 ms ±   1.1 ms    [User: 45.9 ms, System: 20.2 ms]
  Range (min … max):    64.8 ms …  69.8 ms    43 runs

Summary
  ./purple_garden examples/bench.garden ran
    1.58 ± 0.04 times faster than ../purple_garden_old/purple_garden examples/bench.garden
```

Before that I also made the parser use the block allocator, see 0324051:

```text
parser: allocate with bump alloc (2x faster parse, 26x faster clean)

I replaced the inline allocation logic for the growing of Node.children with
the bump allocator from the allocator abstraction in mem::Allocator
(mem::bump_*). This results in 2x faster parsing and 27x faster AST cleanup:

- [89.4830ms/41.7840ms=02.1417x faster] parser::Parser_run: Transformed
  source to AST
- [22.3900ms/00.8440ms=26.5284x faster] parser::Node_destroy: Deallocated
  AST Nodes renamed to mem::Allocator::destroy

Old:

$ make bench PG=examples/bench.garden
[    0.0050ms] main::Args_parse: Parsed arguments
[    0.0110ms] io::IO_read_file_to_string: mmaped input
[   89.4830ms] parser::Parser_run: Transformed source to AST
[   18.8010ms] cc::cc: Flattened AST to byte code
[   22.3900ms] parser::Node_destroy: Deallocated AST Nodes
[    2.3200ms] vm::Vm_run: Walked and executed byte code
[    0.4670ms] vm::Vm_destroy: Deallocated global pool and bytecode list

New:

$ make bench PG=examples/bench.garden
[    0.0050ms] main::Args_parse: Parsed arguments
[    0.0100ms] io::IO_read_file_to_string: mmaped input
[   41.7840ms] parser::Parser_run: Transformed source to AST
[   21.2160ms] cc::cc: Flattened AST to byte code
[    0.8440ms] mem::Allocator::destroy: Deallocated AST memory space
[    2.2510ms] vm::Vm_run: Walked and executed byte code
[    0.7590ms] vm::Vm_destroy: Deallocated global pool and bytecode list
```
Operating on zero copy, zero alloc string windows
Our lexer doesn’t require mutable strings, C does not come with a string abstraction out of the box, and I wanted to attach metadata to strings, so I came up with a small string abstraction:
```c
// str is a simple stack allocated wrapper around C char arrays, providing
// constant time length access and zero allocation+copy interactions for all
// methods except Str_to
typedef struct {
  // store the pointer to the underlying char
  const uint8_t *p;
  // hash of the input, do not expect it to be filled, has to be computed via
  // Str_hash or inline in the lexer
  uint64_t hash;
  // length of the input without a zero terminator
  size_t len;
} Str;
```
```c
#define STRING(str) ((Str){.len = sizeof(str) - 1, .p = (const uint8_t *)str})
#define STRING_EMPTY ((Str){.len = 0, .p = NULL})
```
Creating a Str from a C-style `const char*` can be done by passing it into the STRING macro; gcc can evaluate all operations inside of it at compile time. Since the view doesn't own its underlying data, it's cheap to copy and to create slices by just pointing new views into the underlying buffer.
I use this struct throughout the whole codebase, but specifically inside of the lexer to create views over the original input that point to the contents of some tokens:
```c
// example inside of handling strings in the lexer:

// dummy values for start and hash
size_t start = 3;
size_t hash = 0xAFFEDEAD;

Token *t = &(Token){};
t->type = T_STRING;
t->string = (Str){
    .p = l->input.p + start,
    .len = l->pos - start,
    .hash = hash,
};
```
Due to the window nature of this struct I had to reimplement some things myself, such as slicing, concatenation, equality checking, hashing and converting to `int64_t` and `double`:
```c
char Str_get(const Str *str, size_t index);
Str Str_from(const char *s);
Str Str_slice(const Str *str, size_t start, size_t end);
Str Str_concat(const Str *a, const Str *b, Allocator *alloc);
bool Str_eq(const Str *a, const Str *b);
void Str_debug(const Str *str);
size_t Str_hash(const Str *str);
int64_t Str_to_int64_t(const Str *str);
double Str_to_double(const Str *str);
```
Let's quickly go over the interesting ones, specifically slicing and converting to other data types. Slicing is super easy, since we just point the new view at the slice's start and adjust its length.
```c
Str Str_slice(const Str *str, size_t start, size_t end) {
  ASSERT(end >= start, "Str_slice: Invalid slice range: end must be >= start");
  ASSERT(end <= str->len, "Str_slice: Slice range exceeds string length");

  return (Str){
      .p = str->p + start,
      .len = end - start,
  };
}
```
Converting to int64 is also fairly uncomplicated, since the lexer is expected to make sure all characters are in the integer set. The algorithm consists of converting the character representation of an integer component into its literal value by subtracting '0' from it. The resulting value is added to the product of the previous iteration and 10, since we are working in the decimal system.
```c
int64_t Str_to_int64_t(const Str *str) {
  int64_t r = 0;
  ASSERT(str->len > 0, "Cant convert empty string into int64_t");

  for (size_t i = 0; i < str->len; i++) {
    int digit = str->p[i] - '0';
    ASSERT(r < (INT64_MAX - digit) / 10,
           "int64_t number space overflow: `%.*s`", (int)str->len, str->p);
    r = r * 10 + digit;
  }

  return r;
}
```
Doubles are represented differently, specifically by their mantissa and exponent, requiring a slightly more sophisticated conversion algorithm. In the same vein as Str_to_int64_t, validation is already done by the lexer, to the extent of only allowing any of `.1234567890`.
```c
double Str_to_double(const Str *str) {
  ASSERT(str->len > 0, "Can't convert empty string into double");

  const char *p = (const char *)str->p;
  size_t len = str->len;

  uint64_t mantissa = 0;
  int exponent = 0;
  bool seen_dot = false;
  bool has_digits = false;

  // we dont check that all chars are numbers here, since the lexer already
  // does that
  for (size_t i = 0; i < len; i++) {
    char c = p[i];

    if (c == '.') {
      seen_dot = true;
      continue;
    }

    has_digits = true;
    short digit = c - '0';
    ASSERT(mantissa <= (UINT64_MAX - digit) / 10, "Mantissa overflow");
    mantissa = mantissa * 10 + digit;
    if (seen_dot) {
      exponent -= 1;
    }
  }

  // if there were no digits after the '.'
  ASSERT(has_digits, "Can't parse `%.*s` into a double", (int)len, p);

  double result = (double)mantissa;
  // skip exponent computation for .0, since these are just the
  // mantissa
  if (exponent != 0) {
    result *= pow(10.0, exponent);
  }

  return result;
}
```
Info - Benchmarks

1.4x speedup by using the above abstraction in a non allocating way, see b19088a:

```text
common: make String methods 0 alloc (~1.4x faster)

- String_slice no longer requires a malloc, now just returns a window into its
  argument
- String_to no longer returns just the underlying pointer (String.p) but
  allocates a new char*
- new String_debug method to print the window
- lexer::num no longer allocates and frees for its string window, instead uses
  a stack buffer
- parser::Node_destroy no longer calls Token_destroy
- Token_destroy no longer needed, since the lexer no longer allocates

Main improvements in runtime while parsing (less allocations and frees for
ident, string and double handling) and in cleanup (no more deallocations for
tokens)

With 250k loc of "hello world":

- Parsing went from 20.9ms to 13.8ms => 1.51x faster
- Cleanup went from 3.6ms to 0.83ms => 4.34x faster

Old:

[BENCH] (T-0.0060ms): parsed arguments
[BENCH] (T-0.0420ms): read file to String
[BENCH] (T-20.8780ms): parsed input
[BENCH] (T-6.8080ms): compiled input
[BENCH] (bc=500002|globals=250001)
[BENCH] (T-0.3440ms): ran vm
[BENCH] (T-3.5960ms): destroyed Nodes, vm and input

New:

[BENCH] (T-0.0060ms): parsed arguments
[BENCH] (T-0.0410ms): read file to String
[BENCH] (T-13.8280ms): parsed input
[BENCH] (T-7.9410ms): compiled input
[BENCH] (bc=500002|globals=250001)
[BENCH] (T-0.3490ms): ran vm
[BENCH] (T-0.8280ms): destroyed Nodes, vm and input
```
Hashing everything
I want to distinguish atoms (strings, numbers, idents) from other atoms for interning purposes and faster comparisons in the pipeline. This can be done via hashes, especially since we already “visit” each byte of said atoms while converting them to tokens. Hashing them is therefore just a matter of a few extra computations while walking the atoms' underlying bytes.
For instance strings: before this, the lexer just advanced until it hit the closing delimiter or EOF:
```c
size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {

  // ...

string: {
  // skip "
  l->pos++;
  size_t start = l->pos;
  for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {}

  if (UNLIKELY(cur(l) != '"')) {
    Str slice = Str_slice(&l->input, l->pos, l->input.len);
    fprintf(stderr, "lex: Unterminated string near: '%.*s'", (int)slice.len,
            slice.p);
    out[count++] = INTERN_EOF;
  } else {
    Token *t = CALL(a, request, sizeof(Token));
    t->type = T_STRING;
    t->string = (Str){
        .p = l->input.p + start,
        .len = l->pos - start,
    };
    out[count++] = t;
    // skip "
    l->pos++;
  }
  JUMP_TARGET;
}

  // ...

}
```
Adding hashing to this is fairly easy:
```diff
diff --git a/lexer.c b/lexer.c
index 316a494..2280a6b 100644
--- a/lexer.c
+++ b/lexer.c
@@ -286,10 +286,7 @@ string: {
   // skip "
   l->pos++;
   size_t start = l->pos;
+  size_t hash = FNV_OFFSET_BASIS;
   for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {
+    hash ^= cc;
+    hash *= FNV_PRIME;
   }
 
   if (UNLIKELY(cur(l) != '"')) {
@@ -303,7 +300,6 @@ string: {
     t->string = (Str){
         .p = l->input.p + start,
         .len = l->pos - start,
+        .hash = hash,
     };
     out[count++] = t;
     // skip "
```
Numbers, identifiers and builtins are also all hashed; I omitted this here for clarity and because we will revisit this topic later in this article.
Interning Tokens
Most of the tokens we will encounter are constant; we know:
their size
their type
We can use this knowledge to statically allocate these tokens and hand out pointers to them, reducing memory pressure:
```c
// we can "intern" these, since all of them are the same, regardless of position
Token *INTERN_DELIMITOR_LEFT = &SINGLE_TOK(T_DELIMITOR_LEFT);
Token *INTERN_DELIMITOR_RIGHT = &SINGLE_TOK(T_DELIMITOR_RIGHT);
Token *INTERN_BRAKET_LEFT = &SINGLE_TOK(T_BRAKET_LEFT);
Token *INTERN_BRAKET_RIGHT = &SINGLE_TOK(T_BRAKET_RIGHT);
Token *INTERN_MINUS = &SINGLE_TOK(T_MINUS);
Token *INTERN_PLUS = &SINGLE_TOK(T_PLUS);
Token *INTERN_ASTERISKS = &SINGLE_TOK(T_ASTERISKS);
Token *INTERN_SLASH = &SINGLE_TOK(T_SLASH);
Token *INTERN_FALSE = &SINGLE_TOK(T_FALSE);
Token *INTERN_TRUE = &SINGLE_TOK(T_TRUE);
Token *INTERN_EQUAL = &SINGLE_TOK(T_EQUAL);
Token *INTERN_EOF = &SINGLE_TOK(T_EOF);

// size_t Lexer_all(Lexer *l, Allocator *a, Token **out)
```
SINGLE_TOK is just:
```c
#define SINGLE_TOK(t) ((Token){.type = t})
```
Prehashing keywords for comparisons
As introduced in the previous chapters, all identifiers are hashed, so we can also hash the known keywords at startup and make comparing them very fast.
Let's take a look at true and false; both are known keywords and we will need to compare found identifiers against them.
```c
ident: {
  size_t start = l->pos;
  size_t hash = FNV_OFFSET_BASIS;
  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
    hash ^= cc;
    hash *= FNV_PRIME;
  }

  size_t len = l->pos - start;
  Token *t;

  // comparing to the keywords is now just a number comparison
  if (hash == true_hash) {
    t = INTERN_TRUE;
  } else if (hash == false_hash) {
    t = INTERN_FALSE;
  } else {
    t = CALL(a, request, sizeof(Token));
    t->type = T_IDENT;
    t->string = (Str){
        .p = l->input.p + start,
        .len = len,
        .hash = hash,
    };
  }
  out[count++] = t;
  JUMP_TARGET;
}
```
Both true_hash and false_hash are computed at startup of Lexer_all :
```c
size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
  ASSERT(out != NULL, "Failed to allocate token list");

  // empty input
  if (l->input.len == 0) {
    out[0] = INTERN_EOF;
    return 1;
  }

  size_t true_hash = Str_hash(&STRING("true"));
  size_t false_hash = Str_hash(&STRING("false"));

  // [...]
}
```
Tip - is_alphanum performance deep dive

I know this function shouldn't be called is_alphanum because it allows [A-Za-z0-9_-].

A naive check of is_alphanum can be:

```c
bool is_alphanum(char cc) {
  return (cc >= 'a' && cc <= 'z') || (cc >= 'A' && cc <= 'Z') ||
         (cc >= '0' && cc <= '9') || cc == '_' || cc == '-';
}
```

We know we can omit the uppercase check by converting the character to its lowercase representation, so let's fold the character, since ASCII upper and lowercase characters only differ by a single bit:

```c
bool is_alphanum(char cc) {
  uint8_t lower = cc | 0x20;
  bool is_alpha = (lower >= 'a' && lower <= 'z');
  bool is_digit = (cc >= '0' && cc <= '9');
  return is_alpha || is_digit || cc == '_' || cc == '-';
}
```

In benchmarks I was able to measure that inlining and the parameter type uint8_t have a reproducible impact of reducing the runtime by 1-5% for identifier heavy inputs, so I marked the function as "private" static inline:

```c
__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  uint8_t lower = cc | 0x20;
  bool is_alpha = (lower >= 'a' && lower <= 'z');
  bool is_digit = (cc >= '0' && cc <= '9');
  return is_alpha || is_digit || cc == '_' || cc == '-';
}
```

There are some other ways that could be more efficient, but I haven't benchmarked these:

A statically allocated lookup table:

```c
static const bool is_alphanum_lookup[128] = {
    ['0' ... '9'] = true,
    ['A' ... 'Z'] = true,
    ['a' ... 'z'] = true,
    ['_'] = true,
    ['-'] = true,
};
__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  return cc < 128 && is_alphanum_lookup[cc];
}
```

Weird bit sets (I don't fully understand this one and it sucks to read, so no thanks):

```c
static const uint64_t table1 = 0x03ff000000000000;
static const uint64_t table2 = 0x07fffffe07fffffe;

__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  if (cc >= 128)
    return false;
  return cc < 64 ? (table1 >> cc) & 1 : (table2 >> (cc - 64)) & 1;
}
```
On demand double and int64_t parsing
Let’s revisit hashing numbers: as introduced before, all atoms are hashed, so I am able to use this hash while interning. This way each duplicated integer and double is converted to its numerical representation only once (in the compiler).
The lexer therefore only needs to store the window of the atom's input, while verifying that all characters in this window are valid arguments for Str_to_double and Str_to_int64_t. In a later chapter I'll show the compiler doing on demand parsing with the token we created here in the lexer.
```c
number: {
  size_t start = l->pos;
  size_t i = start;
  bool is_double = false;
  size_t hash = FNV_OFFSET_BASIS;
  for (; i < l->input.len; i++) {
    char cc = l->input.p[i];
    hash ^= cc;
    hash *= FNV_PRIME;
    if (cc >= '0' && cc <= '9')
      continue;
    if (cc == '.') {
      ASSERT(!is_double, "Two dots in double");
      is_double = true;
      continue;
    }
    break;
  }

  l->pos = i;
  Token *n = CALL(a, request, sizeof(Token));
  n->string = (Str){
      .p = l->input.p + start,
      .len = i - start,
      .hash = hash,
  };
  if (is_double) {
    n->type = T_DOUBLE;
  } else {
    n->type = T_INTEGER;
  }

  out[count++] = n;
  JUMP_TARGET;
}
```
Tip - Distinguishing between integers and doubles

At first, purple garden's runtime represented all numbers as doubles. After benchmarking, I found out that integer math is a lot faster, so it makes a lot more sense to store all non floating point numbers as integers. As general advice: always benchmark, and read the following:

Int VS FP performance (java, but the point stands)
6.2.2 Integer benefits and pitfalls - GNU Astronomy Utilities
Integer operations vs floating point operations on Computational Science Stack Exchange
Extra: memory mapping the input
Normally one would consume a file in C by:
opening a file descriptor: fopen
freading the buffer into a malloced space
zero terminating the buffer
closing the file: fclose
```c
// https://stackoverflow.com/questions/14002954/c-how-to-read-an-entire-file-into-a-buffer
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  FILE *f = fopen("textfile.txt", "rb");
  fseek(f, 0, SEEK_END);
  long fsize = ftell(f);
  fseek(f, 0, SEEK_SET);

  char *string = malloc(fsize + 1);
  fread(string, fsize, 1, f);
  fclose(f);

  string[fsize] = 0;
  free(string);
  return EXIT_SUCCESS;
}
```
However, you can also do the whole thing a lot faster by instructing the kernel to map the whole file into our virtual memory via mmap (just not walking the file twice is already faster).
After opening the file (opening it with O_RDONLY and mapping it with PROT_READ can be faster than making it mutable), we need its type (we don't want to map a directory) and its size (the API wants a mapping block size). fstat helps us by filling a struct with metadata containing exactly the info we need:
C
// taken from https://www.commandlinux.com/man-page/man2/fstat.2.html
struct stat {
  dev_t st_dev;         /* ID of device containing file */
  ino_t st_ino;         /* inode number */
  mode_t st_mode;       /* protection */
  nlink_t st_nlink;     /* number of hard links */
  uid_t st_uid;         /* user ID of owner */
  gid_t st_gid;         /* group ID of owner */
  dev_t st_rdev;        /* device ID (if special file) */
  off_t st_size;        /* total size, in bytes */
  blksize_t st_blksize; /* blocksize for file system I/O */
  blkcnt_t st_blocks;   /* number of 512B blocks allocated */
  time_t st_atime;      /* time of last access */
  time_t st_mtime;      /* time of last modification */
  time_t st_ctime;      /* time of last status change */
};
I use S_ISREG to check that the handle is a regular file. After mmapping with the size stored in stat.st_size I do some bookkeeping, see the combined snippet below:
C
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include "common.h"
#include "io.h"

Str IO_read_file_to_string(char *path) {
  ASSERT(path != NULL, "path was NULL");

  int fd = open(path, O_RDONLY);
  ASSERT(fd != -1, "failed to open input file");

  struct stat s;
  fstat(fd, &s);
  ASSERT(S_ISREG(s.st_mode), "path is not a file");

  long length = s.st_size;
  if (length < 0) {
    close(fd);
    ASSERT(length > 0, "input is empty");
  }

  char *buffer = 0;
  if (length != 0) {
    buffer = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
  }

  ASSERT(close(fd) == 0, "failed to close file");
  ASSERT(buffer != MAP_FAILED, "failed to mmap input");
  return (Str){.len = length, .p = (const uint8_t *)buffer};
}
Info - Benchmarks

In my benchmark this made the stage before even starting lexing 6x-35x faster, see a2cae88. Please excuse the ugly debug info, I have reworked this since then. Also, lexing and parsing were a single stage back then.

TEXT
io: use mmap to make IO_read_file_to_string 6x-35x faster

For 5k lines with 4 atoms each (20k atoms), the initial string read was
reduced from 0.4390ms to 0.0730ms (6x faster):

Old:
[BENCH] (T-0.1620ms): parsed arguments
[BENCH] (T-0.4390ms): read file to String
[BENCH] (T-10.2020ms): parsed input
[BENCH] (T-1.2820ms): compiled input
[BENCH] (bc=40008|globals=20004)
[BENCH] (T-0.1620ms): ran vm
[BENCH] (T-0.6190ms): destroyed Nodes, vm and input

New:
[BENCH] (T-0.1510ms): parsed arguments
[BENCH] (T-0.0730ms): read file to String
[BENCH] (T-10.1350ms): parsed input
[BENCH] (T-1.3210ms): compiled input
[BENCH] (bc=40008|globals=20004)
[BENCH] (T-0.1710ms): ran vm
[BENCH] (T-0.6460ms): destroyed Nodes, vm and input

For larger files, such as 250k lines with 4 atoms each (1mio atoms), the
initial string read was reduced from 3.472ms to 0.0980ms (35x faster):

Old:
[BENCH] (T-0.1430ms): parsed arguments
[BENCH] (T-3.4720ms): read file to String
[BENCH] (T-434.8770ms): parsed input
[BENCH] (T-30.7538ms): compiled input
[BENCH] (bc=2040408|globals=1020204)
[BENCH] (T-7.5610ms): ran vm
[BENCH] (T-37.2170ms): destroyed Nodes, vm and input

New:
[BENCH] (T-0.1490ms): parsed arguments
[BENCH] (T-0.0980ms): read file to String
[BENCH] (T-437.4770ms): parsed input
[BENCH] (T-30.8820ms): compiled input
[BENCH] (bc=2040408|globals=1020204)
[BENCH] (T-7.4540ms): ran vm
[BENCH] (T-36.9500ms): destroyed Nodes, vm and input
Consuming numeric tokens in the compiler
As introduced in On demand double and int64_t parsing, the lexer does not perform string to numerical conversions, but rather stores a hash and a window of said string. The compiler converts any token with this hash only once and points all duplicates at the global pool index of this number.
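The dedup step can be pictured as a tiny hash-to-index table. This is a standalone sketch with hypothetical names (InternTable, intern), not the actual purple-garden compiler code:

```c
#include <stddef.h>
#include <stdint.h>

// Sketch: map a token hash to a global pool slot so the string is
// converted only on first sight. Open addressing, fixed size for brevity.
#define POOL_CAP 1024

typedef struct {
  uint64_t hashes[POOL_CAP]; // 0 marks an empty slot
  size_t values[POOL_CAP];   // index into the global pool
  size_t len;
} InternTable;

// Returns the pool index for `hash`; `fresh` is the index to assign if the
// hash was never seen before (i.e. where the freshly converted value went).
static size_t intern(InternTable *t, uint64_t hash, size_t fresh) {
  size_t i = hash % POOL_CAP;
  while (t->hashes[i] != 0) {
    if (t->hashes[i] == hash)
      return t->values[i]; // duplicate: reuse, no Str_to_* call needed
    i = (i + 1) % POOL_CAP;
  }
  t->hashes[i] = hash;
  t->values[i] = fresh;
  t->len++;
  return fresh;
}
```

The payoff is that for N occurrences of the same literal, only the first one pays for Str_to_int64_t or Str_to_double; the rest are a hash lookup.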
The compiler itself will probably be the topic of a future blog article, but I kept it simple at this time.
token_to_value is called for all unique (not encountered before and thus not interned) atoms:
C
// token_to_value converts tokens, such as strings, idents and numbers to
// runtime values
inline static Value *token_to_value(Token *t, Allocator *a) {
  Value *v = CALL(a, request, sizeof(Value));
  switch (t->type) {
  case T_STRING:
  case T_IDENT:
    v->type = V_STR;
    v->string = t->string;
    break;
  case T_INTEGER:
    v->type = V_INT;
    v->integer = Str_to_int64_t(&t->string);
    break;
  case T_DOUBLE:
    v->type = V_DOUBLE;
    v->floating = Str_to_double(&t->string);
    break;
  default:
    ASSERT(0, "Unsupported value for token_to_value");
    break;
  }
  return v;
}
Note the missing cases for T_FALSE and T_TRUE? They are omitted because there are hard-coded entries 0 and 1 in the global pool ( @None is the same, it's bound to index 2 ).
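Conceptually, the reserved slots look like this (a sketch with made-up enum names; the real global pool layout may differ):

```c
#include <stddef.h>

// Sketch: the first three global pool entries are fixed at startup,
// so T_FALSE, T_TRUE and @None need no conversion or interning at all.
enum {
  GLOBAL_FALSE = 0, // hard-coded entry for `false`
  GLOBAL_TRUE = 1,  // hard-coded entry for `true`
  GLOBAL_NONE = 2,  // hard-coded entry for `@None`
};
```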
Info - Benchmarks

This resulted in a crazy 15ms/64% faster total runtime for number- and duplicate-heavy test inputs, see a55a190.

TEXT
lexer+cc: move Str_to_(double|int64_t) parsing from lexer to cc

This change omits all integer and number parsing from the pipeline but
the first occurence of each unique integer or number by storing a hash
of the string representation of said values. At the interning stage in
the compiler only the first occurence of any hash of a double or integer
is parsed via Str_to_int64_t or Str_to_double, which reduces the
theoretical workload for any duplicated number of integers and doubles
from N to 1.

For a double and integer heavy benchmark (250k loc with 250k duplicated
doubles and integers) results in:

- 15ms faster
- 64% faster
- ~2.8x faster

Prev commit:
./build/bench +V examples/bench.garden
[   0.0000ms] main::Args_parse: Parsed arguments
[   0.0120ms] io::IO_read_file_to_string: mmaped input of size=2500090B
[   0.0050ms] mem::init: Allocated memory block of size=153092096B
[  23.8300ms] lexer::Lexer_all: lexed tokens count=1000033
[  12.5190ms] parser::Parser_next created AST with node_count=250003
[   9.2090ms] cc::cc: Flattened AST to byte code/global pool length=1500048/4
[  36.3060ms] vm::Vm_run: executed byte code
[   0.3730ms] mem::Allocator::destroy: Deallocated memory space
[   0.0410ms] vm::Vm_destroy: teared vm down
[   0.0000ms] munmap: unmapped input

New:
./build/bench +V examples/bench.garden
[   0.0000ms] main::Args_parse: Parsed arguments
[   0.0170ms] io::IO_read_file_to_string: mmaped input of size=2500090B
[   0.0060ms] mem::init: Allocated memory block of size=153092096B
[   8.5270ms] lexer::Lexer_all: lexed tokens count=1000033
[  12.2070ms] parser::Parser_next created AST with node_count=250003
[   9.4020ms] cc::cc: Flattened AST to byte code/global pool length=1500048/4
[  36.9900ms] vm::Vm_run: executed byte code
[   0.3960ms] mem::Allocator::destroy: Deallocated memory space
[   0.0480ms] vm::Vm_destroy: teared vm down
[   0.0010ms] munmap: unmapped input
Benchmark
So I created what I would consider a fairly heavy lexer benchmark:
SCHEME
(@Some (@Some (@Some (@None))))
true false true false
3.1415 22222222222.12345
"string me this, string me that"
'quoted-strings-is-a-must-do
(@let unquoted-strings-are-just-idents (@None))
unquoted-strings-are-just-idents
(@None) (+) (-) (*) (/) (=)
;; COMMENT COMMENT COMMENT
;; COMMENT COMMENT COMMENT
;; COMMENT COMMENT COMMENT with whitespace for 3 lines



;; whitespace end
And in vim I typed VggyG66666p to blow this up to 1mio lines ( 1000005 ).
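If you want to reproduce this without vim, a shell loop does the same thing (sketch; the seed.garden and bench.garden file names are my own):

```shell
# write the 15-line seed once, then repeat it 66,667 times -> 1,000,005 lines
cat > seed.garden << 'EOF'
(@Some (@Some (@Some (@None))))
true false true false
3.1415 22222222222.12345
"string me this, string me that"
'quoted-strings-is-a-must-do
(@let unquoted-strings-are-just-idents (@None))
unquoted-strings-are-just-idents
(@None) (+) (-) (*) (/) (=)
;; COMMENT COMMENT COMMENT
;; COMMENT COMMENT COMMENT
;; COMMENT COMMENT COMMENT with whitespace for 3 lines



;; whitespace end
EOF
# xargs batches the cat invocations, which is much faster than a shell loop
printf 'seed.garden %.0s' $(seq 66667) | xargs cat > bench.garden
wc -l bench.garden # -> 1000005 bench.garden
```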
On a Laptop
Terminal
$ inxi -CMD
System:
  Host: ************* Kernel: 6.11.0-28-generic arch: x86_64 bits: 64
  Desktop: i3 v: 4.23 Distro: Ubuntu 24.04.2 LTS (Noble Numbat)
Machine:
  Type: Laptop System: LENOVO product: 21F8002TGE v: ThinkPad T14s Gen 4 serial:
  Mobo: LENOVO model: 21F8002TGE v: SDK0T76530 WIN serial:
  UEFI: LENOVO v: R2EET41W (1.22) date: 09/22/2024
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64
  type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 883 min/max: 400/5132 cores: 1: 1388 2: 400 3: 1396 4: 400
  5: 400 6: 400 7: 1374 8: 400 9: 1331 10: 400 11: 1357 12: 400 13: 1357
  14: 1346 15: 1393 16: 400
With the above components.
Terminal
$ make bench
./build/bench +V examples/bench.garden
[   0.0000ms] main::Args_parse: Parsed arguments
[   0.0150ms] io::IO_read_file_to_string: mmaped input of size=25466794B
[   0.0060ms] mem::init: Allocated memory block of size=929537981B
[  43.9190ms] lexer::Lexer_all: lexed tokens count=3133350
[  48.8460ms] parser::Parser_next created AST with node_count=1200006
[  18.2070ms] cc::cc: Flattened AST to byte code/global pool length=2666680/8
[   8.9970ms] vm::Vm_run: executed byte code
[  26.7470ms] mem::Allocator::destroy: Deallocated memory space
[   1.0180ms] vm::Vm_destroy: teared vm down
[   0.0000ms] munmap: unmapped input
I can confidently say I do a million lines or 25,466,794 Bytes in 44ms. Let’s do some math:
$$ \begin{align} 44 \textrm{ms} &\triangleq 25,466,794 \mathrm{B} \\ 1 \textrm{ms} &\triangleq 578,790.77 \mathrm{B} \\ 1000 \textrm{ms} &\triangleq 578,790,770 \mathrm{B} \\ &= \underline{578.79 \mathrm{MB}/\textrm{s}} \end{align} $$
In tokens:
$$ \begin{align} 44 \textrm{ms} &\triangleq 3,133,350 \mathrm{T} \\ 1 \textrm{ms} &\triangleq 71,212.5 \mathrm{T} \\ 1000 \textrm{ms} &\triangleq 71,212,500 \mathrm{T} \\ &= \underline{71,212,500 \mathrm{T}/\textrm{s}} \end{align} $$
That’s pretty fast, but SIMD can probably do a lot better at this point. However, I haven’t started that experiment yet.
On a Tower
Terminal
$ inxi -CMD
System:
  Host: comfyputer Kernel: 6.15.4-arch2-1 arch: x86_64 bits: 64
  Desktop: i3 v: 4.24 Distro: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME B450-PLUS v: Rev X.0x serial:
  UEFI: American Megatrends v: 2008 date: 12/06/2019
CPU:
  Info: 8-core model: AMD Ryzen 7 3700X bits: 64 type: MT MCP cache: L2: 4 MiB
  Speed (MHz): avg: 4052 min/max: 2200/4979 cores: 1: 4052 2: 4052 3: 4052
  4: 4052 5: 4052 6: 4052 7: 4052 8: 4052 9: 4052 10: 4052 11: 4052 12: 4052
  13: 4052 14: 4052 15: 4052 16: 4052
So we are around 14ms faster on my tower.
TEXT
./build/bench +V examples/bench.garden
[   0.0000ms] main::Args_parse: Parsed arguments
[   0.0070ms] io::IO_read_file_to_string: mmaped input of size=25400127B
[   0.0030ms] mem::init: Allocated memory block of size=927104635B
[  30.9930ms] lexer::Lexer_all: lexed tokens count=3133350
[  22.6340ms] parser::Parser_next created AST with node_count=1200006
[  10.1480ms] cc::cc: Flattened AST to byte code/global pool length=2666680/8
[   7.4800ms] vm::Vm_run: executed byte code
[   0.7520ms] mem::Allocator::destroy: Deallocated memory space
[   0.0620ms] vm::Vm_destroy: teared vm down
[   0.0000ms] munmap: unmapped input
The same math as above, just with 30ms instead of 44ms:
$$ \begin{align} 30 \textrm{ms} &\triangleq 25,466,794 \mathrm{B} \\ 1 \textrm{ms} &\triangleq 848,893.13 \mathrm{B} \\ 1000 \textrm{ms} &\triangleq 848,893,133.33 \mathrm{B} \\ &= \underline{848.89 \mathrm{MB}/\textrm{s}} \end{align} $$
In tokens:
$$ \begin{align} 30 \textrm{ms} &\triangleq 3,133,350 \mathrm{T} \\ 1 \textrm{ms} &\triangleq 104,445 \mathrm{T} \\ 1000 \textrm{ms} &\triangleq 104,445,000 \mathrm{T} \\ &= \underline{104,445,000 \mathrm{T}/\textrm{s}} \end{align} $$
Benchmark contexts
For a C input of 7.5 mio loc, which is of course more complex to tokenize than my language, see Some Strategies For Fast Lexical Analysis when Parsing Programming Languages. The following numbers are available, and I added the performance of the purple-garden lexer for a 7.5mio line lexer-heavy benchmark input.
lexer                        performance
flex (default)               13.56 s
stb_lex (w/symbol hashing)    4.84 s
stb_lex                       4.23 s
flex -F (fast)                3.07 s
flex -f (full)                2.92 s
handcoded                     2.45 s
handcoded mmap                2.14 s
wc                           1.73 s
purple-garden (laptop)       0.308 s
purple-garden (tower)        0.150 s
What next
A summary of what I implemented in this article:
Jump table for direct threading
zero-copy and window-based tokens
interned and stateless tokens
bump allocator for unique tokens
inline hashing for atoms that need it (strings, idents, numeric)
fast paths for true and false
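The inline hashing plus fast-path idea in isolation looks like this. This is a standalone sketch, assuming the standard 64-bit FNV-1a constants (the article does not show the concrete FNV_OFFSET_BASIS / FNV_PRIME values):

```c
#include <stddef.h>
#include <stdint.h>

// Assumed: standard 64-bit FNV-1a constants
#define FNV_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV_PRIME 0x00000100000001b3ULL

// FNV-1a over a byte window; the lexer computes this while scanning,
// so hashing costs no extra pass over the input.
static uint64_t fnv1a(const uint8_t *p, size_t len) {
  uint64_t hash = FNV_OFFSET_BASIS;
  for (size_t i = 0; i < len; i++) {
    hash ^= p[i];
    hash *= FNV_PRIME;
  }
  return hash;
}
```

Comparing a scanned identifier's hash against precomputed hashes of "true" and "false" is what enables the fast paths: no character-by-character comparison and no token allocation for either keyword.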
While 580-848 MB/s is already pretty fast, I want to go further, some things I have planned:
use the absurd bit set based is_alphanum checks
use SIMD for skipping comments and whitespace
use SIMD as a preprocessing step to find markers for tokens 16 bytes at a time
replace FNV-1a with a faster hashing algorithm, something like xxHash
prefetch some amount of bytes to reduce L1 & L2 latency
mmap larger inputs with MAP_HUGETLB
maybe align mmap to 64 byte boundaries for SIMD
Putting it all together
C
#include "lexer.h"
#include "common.h"
#include "mem.h"
#include "strings.h"
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SINGLE_TOK(t) ((Token){.type = t})

Str TOKEN_TYPE_MAP[] = {[T_DELIMITOR_LEFT] = STRING("T_DELIMITOR_LEFT"),
                        [T_DELIMITOR_RIGHT] = STRING("T_DELIMITOR_RIGHT"),
                        [T_BRAKET_LEFT] = STRING("T_BRAKET_LEFT"),
                        [T_BRAKET_RIGHT] = STRING("T_BRAKET_RIGHT"),
                        [T_STRING] = STRING("T_STRING"),
                        [T_TRUE] = STRING("T_TRUE"),
                        [T_FALSE] = STRING("T_FALSE"),
                        [T_DOUBLE] = STRING("T_DOUBLE"),
                        [T_INTEGER] = STRING("T_INTEGER"),
                        [T_BUILTIN] = STRING("T_BUILTIN"),
                        [T_IDENT] = STRING("T_IDENT"),
                        [T_PLUS] = STRING("T_PLUS"),
                        [T_MINUS] = STRING("T_MINUS"),
                        [T_ASTERISKS] = STRING("T_ASTERISKS"),
                        [T_SLASH] = STRING("T_SLASH"),
                        [T_EQUAL] = STRING("T_EQUAL"),
                        [T_EOF] = STRING("T_EOF")};

Lexer Lexer_new(Str input) {
  return (Lexer){
      .input = input,
      .pos = 0,
  };
}

#define cur(L) (L->input.p[L->pos])

__attribute__((always_inline)) inline static bool is_alphanum(uint8_t cc) {
  uint8_t lower = cc | 0x20;
  bool is_alpha = (lower >= 'a' && lower <= 'z');
  bool is_digit = (cc >= '0' && cc <= '9');
  return is_alpha || is_digit || cc == '_' || cc == '-';
}

// we can "intern" these, since all of them are the same, regardless of position
Token *INTERN_DELIMITOR_LEFT = &SINGLE_TOK(T_DELIMITOR_LEFT);
Token *INTERN_DELIMITOR_RIGHT = &SINGLE_TOK(T_DELIMITOR_RIGHT);
Token *INTERN_BRAKET_LEFT = &SINGLE_TOK(T_BRAKET_LEFT);
Token *INTERN_BRAKET_RIGHT = &SINGLE_TOK(T_BRAKET_RIGHT);
Token *INTERN_MINUS = &SINGLE_TOK(T_MINUS);
Token *INTERN_PLUS = &SINGLE_TOK(T_PLUS);
Token *INTERN_ASTERISKS = &SINGLE_TOK(T_ASTERISKS);
Token *INTERN_SLASH = &SINGLE_TOK(T_SLASH);
Token *INTERN_FALSE = &SINGLE_TOK(T_FALSE);
Token *INTERN_TRUE = &SINGLE_TOK(T_TRUE);
Token *INTERN_EQUAL = &SINGLE_TOK(T_EQUAL);
Token *INTERN_EOF = &SINGLE_TOK(T_EOF);

size_t Lexer_all(Lexer *l, Allocator *a, Token **out) {
  ASSERT(out != NULL, "Failed to allocate token list");

  // empty input
  if (l->input.len == 0) {
    out[0] = INTERN_EOF;
    return 1;
  }

  size_t true_hash = Str_hash(&STRING("true"));
  size_t false_hash = Str_hash(&STRING("false"));

  size_t count = 0;
  static void *jump_table[256] = {
      [0 ... 255] = &&unknown,
      [' '] = &&whitespace,
      ['\t'] = &&whitespace,
      ['\n'] = &&whitespace,
      [';'] = &&comment,
      ['('] = &&delimitor_left,
      [')'] = &&delimitor_right,
      ['@'] = &&builtin,
      ['.'] = &&number,
      ['0' ... '9'] = &&number,
      ['a' ... 'z'] = &&ident,
      ['A' ... 'Z'] = &&ident,
      ['_'] = &&ident,
      ['\''] = &&quoted,
      ['"'] = &&string,
      ['+'] = &&plus,
      ['-'] = &&minus,
      ['/'] = &&slash,
      ['*'] = &&asterisks,
      ['='] = &&equal,
      ['['] = &&braket_left,
      [']'] = &&braket_right,
      [0] = &&end,
  };

#define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]

  JUMP_TARGET;

delimitor_left:
  out[count++] = INTERN_DELIMITOR_LEFT;
  l->pos++;
  JUMP_TARGET;

delimitor_right:
  out[count++] = INTERN_DELIMITOR_RIGHT;
  l->pos++;
  JUMP_TARGET;

braket_left:
  out[count++] = INTERN_BRAKET_LEFT;
  l->pos++;
  JUMP_TARGET;

braket_right:
  out[count++] = INTERN_BRAKET_RIGHT;
  l->pos++;
  JUMP_TARGET;

builtin: {
  l->pos++;
  // not an ident after @, this is shit
  if (!is_alphanum(cur(l))) {
    out[count++] = INTERN_EOF;
  }
  size_t start = l->pos;
  size_t hash = FNV_OFFSET_BASIS;
  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
    hash ^= cc;
    hash *= FNV_PRIME;
  }

  size_t len = l->pos - start;
  Str s = (Str){
      .p = l->input.p + start,
      .len = len,
      .hash = hash,
  };
  Token *b = CALL(a, request, sizeof(Token));
  b->string = s;
  b->type = T_BUILTIN;
  out[count++] = b;
  JUMP_TARGET;
}

plus:
  out[count++] = INTERN_PLUS;
  l->pos++;
  JUMP_TARGET;

minus:
  out[count++] = INTERN_MINUS;
  l->pos++;
  JUMP_TARGET;

slash:
  out[count++] = INTERN_SLASH;
  l->pos++;
  JUMP_TARGET;

equal:
  out[count++] = INTERN_EQUAL;
  l->pos++;
  JUMP_TARGET;

asterisks:
  out[count++] = INTERN_ASTERISKS;
  l->pos++;
  JUMP_TARGET;

number: {
  size_t start = l->pos;
  size_t i = start;
  bool is_double = false;
  size_t hash = FNV_OFFSET_BASIS;
  for (; i < l->input.len; i++) {
    char cc = l->input.p[i];
    hash ^= cc;
    hash *= FNV_PRIME;
    if (cc >= '0' && cc <= '9')
      continue;
    if (cc == '.') {
      ASSERT(!is_double, "Two dots in double");
      is_double = true;
      continue;
    }
    break;
  }

  l->pos = i;
  Token *n = CALL(a, request, sizeof(Token));
  n->string = (Str){
      .p = l->input.p + start,
      .len = i - start,
      .hash = hash,
  };
  if (is_double) {
    n->type = T_DOUBLE;
  } else {
    n->type = T_INTEGER;
  }

  out[count++] = n;
  JUMP_TARGET;
}

ident: {
  size_t start = l->pos;
  size_t hash = FNV_OFFSET_BASIS;
  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
    hash ^= cc;
    hash *= FNV_PRIME;
  }

  size_t len = l->pos - start;
  Token *t;
  if (hash == true_hash) {
    t = INTERN_TRUE;
  } else if (hash == false_hash) {
    t = INTERN_FALSE;
  } else {
    t = CALL(a, request, sizeof(Token));
    t->type = T_IDENT;
    t->string = (Str){
        .p = l->input.p + start,
        .len = len,
        .hash = hash,
    };
  }
  out[count++] = t;
  JUMP_TARGET;
}

// same as string but only with leading '
quoted: {
  // skip '
  l->pos++;
  size_t start = l->pos;
  size_t hash = FNV_OFFSET_BASIS;
  for (char cc = cur(l); cc > 0 && is_alphanum(cc); l->pos++, cc = cur(l)) {
    hash ^= cc;
    hash *= FNV_PRIME;
  }

  size_t len = l->pos - start;
  Token *t;
  t = CALL(a, request, sizeof(Token));
  t->type = T_STRING;
  t->string = (Str){
      .p = l->input.p + start,
      .len = len,
      .hash = hash,
  };
  out[count++] = t;
  JUMP_TARGET;
}

string: {
  // skip "
  l->pos++;
  size_t start = l->pos;
  size_t hash = FNV_OFFSET_BASIS;
  for (char cc = cur(l); cc > 0 && cc != '"'; l->pos++, cc = cur(l)) {
    hash ^= cc;
    hash *= FNV_PRIME;
  }

  if (UNLIKELY(cur(l) != '"')) {
    Str slice = Str_slice(&l->input, l->pos, l->input.len);
    fprintf(stderr, "lex: Unterminated string near: '%.*s'", (int)slice.len,
            slice.p);
    out[count++] = INTERN_EOF;
  } else {
    Token *t = CALL(a, request, sizeof(Token));
    t->type = T_STRING;
    t->string = (Str){
        .p = l->input.p + start,
        .len = l->pos - start,
        .hash = hash,
    };
    out[count++] = t;
    // skip "
    l->pos++;
  }
  JUMP_TARGET;
}

comment:
  for (char cc = cur(l); cc > 0 && cc != '\n'; l->pos++, cc = cur(l)) {
  }
  JUMP_TARGET;

whitespace:
  l->pos++;
  JUMP_TARGET;

unknown: {
  uint8_t c = cur(l);
  ASSERT(0, "Unexpected byte '%c' (0x%X) in input", c, c)
}

end:
  out[count++] = INTERN_EOF;
  return count;
}

#undef SINGLE_TOK
Thank you for reading this far. If you have any suggestions or feedback, feel free to send me an email at [email protected] or: