Googling around got me to a few resources that seemed like they’d be relevant, specifically the CommonMark Spec section about parsing, but what really ended up sticking in my brain was a book called Crafting Interpreters, which I’d someday love to go back and really read for its intended purpose. But since I was going to be doing more or less the first half (up to the point where you do something with the tree you’ve created by scanning and parsing the code), I figured this would be a good place to start, and it was! Very well-written, too. So much so that it made sense even though I haven’t pretended to know anything about reading Java in years.
What this meant, anyway, was that I had a clear path forward. Before asking for help, I’d written a half-of-a-half implementation that mixed the lexing and the parsing and the output all together, but this was going to be better: in terms of building it, in terms of architecture, and in terms of being able to do other things with the tree/graph once I had it. So what I would then do was:
Scanning, Lexing, Whatever You Want to Call It
I’m still not sure what the difference between "scanning" and "lexing" is, if there is one at all, but anyway I needed to generate some tokens. I don’t plan on going into too much detail about the why/how of this (instead I refer you back to Crafting Interpreters), but there are a few interesting (annoying?) things about asciidoc that I think are worth mentioning here.
Like markdown, asciidoc is essentially a line-based language. The most significant character is therefore the line break, `\n`, and in some worlds/lights it makes sense to parse asciidoc line-by-line. If I were to go back and do it as a "one-shot" parser (which, according to the chatter in the Asciidoc community chat, isn’t possible anyway), I might do it as a line-by-line thing. Instead, however, I did the scanning character-by-character, in part because that’s what the book told me to do, and in part because keeping track of the newline tokens actually made parsing much easier in the end (I think/hope, anyway).
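To make that concrete, here’s a minimal sketch of the idea (the names and shape here are mine, not the repo’s actual code): a scanner that walks the input one character at a time and emits an explicit token for every newline, so the parser can see line boundaries later.

```rust
// Hypothetical, heavily simplified token type; the real scanner
// has many more variants.
#[derive(Debug, PartialEq)]
enum Token {
    Text(String),
    NewLineChar,
}

// Walk the input one char at a time. Accumulate ordinary characters
// into a text buffer, and flush that buffer plus a NewLineChar token
// every time we hit '\n'.
fn scan(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut buf = String::new();
    for c in input.chars() {
        if c == '\n' {
            if !buf.is_empty() {
                tokens.push(Token::Text(std::mem::take(&mut buf)));
            }
            tokens.push(Token::NewLineChar);
        } else {
            buf.push(c);
        }
    }
    if !buf.is_empty() {
        tokens.push(Token::Text(buf));
    }
    tokens
}
```

The point is that the newline becomes a first-class token rather than being thrown away, which is what makes line-oriented decisions possible later in the parser.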
So the scanning.
Maybe the best "new thing I started using a lot" of 2024 was the humble Enum. I started using them in Python for a specific thing, then started using them more, and one of the things I like best about Rust is that it takes its enums seriously. So, to wit, the first thing I did was create a big-ass TokenType enum:
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenType {
    NewLineChar,
    LineContinuation,
    ThematicBreak,
    PageBreak,
    Comment,
    PassthroughBlock, // i.e., "++++"
    SidebarBlock,     // i.e., "****"
    SourceBlock,      // i.e., "----"
    // ...snip
}
```
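As an illustration of how those delimiter variants might get used (this helper is mine, not from the repo, and the enum is redeclared here in condensed form so the snippet stands alone), a scanner can classify a whole delimiter line into its variant:

```rust
// Condensed subset of the TokenType enum above, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenType {
    PassthroughBlock, // "++++"
    SidebarBlock,     // "****"
    SourceBlock,      // "----"
}

// Hypothetical helper: map a full line to a block-delimiter token,
// or None if the line isn't a recognized delimiter.
fn delimiter_token(line: &str) -> Option<TokenType> {
    match line {
        "++++" => Some(TokenType::PassthroughBlock),
        "****" => Some(TokenType::SidebarBlock),
        "----" => Some(TokenType::SourceBlock),
        _ => None,
    }
}
```

Deriving `PartialEq`/`Eq` (and `Copy` for a payload-free enum like this) is what makes token types cheap to compare and pass around during parsing.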
Note: All source can be found in the GitHub repo. I’m going to condense and remove some comments and things in this post as needed to keep it clean.