
Strategies for Fast Lexers


In this blog post I’ll explain the strategies I used to make the purple-garden lexer really fast.

purple-garden is an s-expr based language I am currently developing for myself. It’s my attempt at building a language I like, with a batteries-included approach, while designing it with performance in mind.

This doesn’t mean all approaches are feasible for your use case, architecture, and design. I tried to bring receipts for my performance claims, so watch out for these blocks at the end of each chapter:

Info - Benchmark

Introduction to Lexing

A lexer (often also referred to as a tokeniser) is the easiest part of any compilation and language pipeline. The idea is to convert a stream of characters into a list of tokens in which each token conveys some meaning. This list of tokens can then be used by the parser to generate an abstract syntax tree (AST), which the compiler consumes, converting it to bytecode, which the VM executes.

A compilation pipeline

As an overview:

┌───────────────────┐
│                   │
│       Lexer       │ <- we are here
│                   │
└─────────┬─────────┘
          │
        Tokens        <- and here
          │
┌─────────▼─────────┐
│                   │
│      Parser       │
│                   │
└─────────┬─────────┘
          │
         AST
          │
┌─────────▼─────────┐
│                   │
│     Compiler      │
│                   │
└─────────┬─────────┘
          │
       Bytecode
          │
┌─────────▼─────────┐
│                   │
│  Virtual Machine  │
│                   │
└───────────────────┘

For a list of characters, let’s say (@std.fmt.println "my pi is: " 3.1415):
