How I wrote JustHTML, a Python-based HTML5 parser, using coding agents

How I wrote JustHTML using coding agents

I recently released JustHTML, a python-based HTML5 parser. It passes 100% of the html5lib test suite, has zero dependencies, and includes a CSS selector query API. Writing it taught me a lot about how to work with coding agents effectively.

I thought I knew HTML going into this project, but it turns out I know nothing when it comes to parsing broken HTML5 code. That's the majority of the algorithm.

Henri Sivonen, who implemented the HTML5 parser for Firefox, called the "adoption agency algorithm" (which handles misnested formatting elements) "the most complicated part of the tree builder". It involves a "Noah's Ark" clause (limiting identical elements to 3) and complex stack manipulation that breaks the standard stack model.

I still don't know how to solve those problems. But I still have a parser that solves those problems better than the reference implementation html5lib. Power of AI! :)

When picking a project to build with coding agents, choosing one that already has a lot of tests is a great idea. HTML5 is extremely well-specified, with a long specification and thousands of treebuilder and tokenizer tests available in the html5lib-tests repository.

When using coding agents autonomously, you need a way for them to understand their own progress. A complete test suite is perfect for that. The agent can run the tests, see what failed, and iterate until they pass.

Building the parser (iterations, restarts, and performance work)#

Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours.

Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well!

... continue reading