Parsing Advances

I find myself writing yet another toy parser, as one does during a Christmas break. It roughly follows Resilient LL Parsing Tutorial. Not because I need resilience, but mostly because I find producing a syntax tree and a collection of diagnostics a more natural fit for the problem than bailing out on the first error.

One practical pitfall with the approach is infinite loops/recursion. Resilience sometimes means not consuming a token, and, if you do that in a loop or a Pratt recursive call, you’ll get yourself an annoying to debug error:

running 1 test from ./src/corpus_test.ts corpus ...Task test deno test --allow-read=./src/corpus --allow-write=./src/corpus "--" "--update" Check src/corpus_test.ts <--- Last few GCs ---> 4,[26641:0x9d1574000] 7390 ms: Mark-Compact (reduce) 3924.9 (3927.3) -> 3924.9 (3926.3) MB, pooled: 0.0 MB, 1224.00 / 0.00 ms (+ 0.3 ms in 1 steps since start of marking, biggest step 0.3 ms, walltime since start of marking 1232 ms) (average mu = 0.200,[26641:0x9d1574000] 8804 ms: Mark-Compact (reduce) 4009.9 (4011.3) -> 4009.9 (4011.3) MB, pooled: 0.0 MB, 1294.67 / 0.00 ms (+ 0.2 ms in 1 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 1302 ms) (average mu = 0.141, # # Fatal JavaScript out of memory: Ineffective mark-compacts near heap limit # ==== C stack trace =============================== 0 deno 0x0000000102ce8404 v8::base::debug::StackTrace::StackTrace() + 24 1 deno 0x0000000102ceeb9c v8::platform::(anonymous namespace)::PrintStackTrace() + 24 2 deno 0x0000000102ce4094 v8::base::FatalOOM(v8::base::OOMType, char const*) + 68 3 deno 0x0000000102d3a7a8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) + 296 4 deno 0x0000000102f37378 v8::internal::Heap::stack() + 0 5 deno 0x0000000102f3581c v8::internal::Heap::CheckMemoryPressure() + 0 6 deno 0x0000000102ead4f8 v8::internal::StackGuard::HandleInterrupts(v8::internal::StackGuard::InterruptLevel) + 504 7 deno 0x000000010335fe44 v8::internal::Runtime_HandleNoHeapWritesInterrupts(int, unsigned long*, v8::internal::Isolate*) + 304 8 deno 0x00000001043887b4 Builtins_CEntry_Return1_ArgvOnStack_NoBuiltinExit + 84 9 ??? 0x0000000126997874 0x0 + 4942559348 10 ??? 0x000000012698a758 0x0 + 4942505816 ...

For a concrete example, you might parse function argument list using code like this:

const result : ast. Expression [] = []; p. expect ( "(" ); while (!p. eof () && !p. at ( ")" )) { result. push ( expression (p)); if (!p. at ( ")" )) p. expect ( "," ); } p. expect ( ")" ); return result;

The implicit contract here is that expression consumes at least one token, even if there are errors in the source code. If there’s some token that makes expression bail without consuming anything, the code loops forever, and you’ll need a debugger to get at the stack trace!

The way I solved this issue traditionally is via a combination of two techniques:

Fuel: parser has a fuel: Cell field, which is decremented even by “readonly” lookahead methods, and topped up every time the parser consumes a token. Fuel is useful to make you parser crash somewhat cleanly, though the crash is typically still removed from problematic function by several stack frames.

The second technique is to maintain a mental map of functions which always consume at least one token of input, and functions which might bail without consuming anything. And, whenever you write a loop or a recursive call, consult this map to be sure to call at least one token-consuming function. Hard and error prone!

... continue reading