You can't parse XML with regex. Let's do it anyways

You can't parse XML with regex. Let's do it anyways. this scene came to me in a dream Haruhi says... stuff1 #1 Content is a word of the enemy. Companies will say "content" instead of calling it artworks, writings, pieces, and such, as if all media is something interchangeable meant to fill a box. Referring to "art" as "content" nowadays is often pejorative. If I ever make a "CDN" (Content Delivery Network), I will call it an SDN instead. Sounds much comfier. Has to be some sort of record! Fortunately, this contradiction is far from being the last in this post. " "They didn't even get to the blogpostand they're already making a contradictory statement.Has to be some sort of record! Fortunately, this contradiction isfrom being the last in this post. " Attempting to parse HTML with regular expressions is an infamous pitfall, and a great example of using the wrong tool for the job. It's generally accepted to be a bad idea, for a multitude of reasons. Picture 1 - he keeps on going for like 3 more screens (Stack Overflow link) There's this famous Stack Overflow answer about why you should never, ever do it. In fact, this answer got so popular that it was used like a copypasta in some circles. Every time I stumbled upon it, I would think how there's a lot of truth in it - but at the same time, I couldn't agree in full... But... can't you, really? Picture 2 - did you know that XML has a logo? I'm not joking, I only learnt today too While I assume that all readers of this weblog have at least a vague understanding of XML, it's worth to recap for the sake of later arguments. Quoting the Wikipedia article on XML: Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. I'd like to focus on three parts: It's a markup language: unlike JSON or 3 TOML #3 This sentence originally mentioned YAML. This post isn't about YAML, and yet I got a lot of complaints for implying that YAML could be considered simple. This discussion has absolutely no relevance to this post, so I replaced it with TOML. Don't like TOML? Think of INI. Don't like INI? Think of CSV, etc. XML defines a much more specific structure for the document. Other SGML derivatives are a bit more lax with enforcing said structure - remember this fact for later. It's machine-readable: it's designed to be parsed and interpreted into a tree. It's human-readable: no specialized tools are required to look at and understand the data contained within an XML document. What Wikipedia doesn't immediately convey (you have to scroll down to section 11) is that XML is horribly complex. JSON, TOML and many other human-readable data interchange formats are simple enough that many self-taught developers learn them through osmosis. Heck, RFC8259, "The JavaScript Object Notation (JSON) Data Interchange Format", is 16 pages long, out of which the actual format description takes maybe 8. In contrast, the base XML 1.0 (Second Edition) spec is 59 pages long, and that doesn't include various extensions that have grown onto it since 2000. Unsurprisingly, this larger surface area becomes a security liability when developers aren't familiar with the whole feature set. This lack of in-depth knowledge about the format is why newbies even consider parsing XML with a regex. It's a "you don't know what you don't know" problem, which leads to a vastly different approach when writing a parser. Your parser ≠ My parser Let's get back to the "machine-readable" vs. "human-readable" part; Assume we have a stack-based parser; this makes it easy to illustrate where the parser is in a given structure. (To refresh, a stack is a queue/array where the operations are "push", that adds a value to the end, and "pop", which removes and returns that value to our program.) meow nya Figure 1 - a very simple XML-like object tree Here's a simplified view of how a parser may "walk" a tree: # stack=() # push a; stack=(a) # push b; stack=(a b) meow # push c; stack=(a b c) nya # pop; push d; stack=(a b d) # pop; stack=(a b) # pop; stack=(a) # pop; stack=() Figure 2 - same tree, now hastily annotated with actions and state While the example above doesn't show anything useful happening with our tree, it's actually quite simple to incorporate a DOM-like selector query system on top of this. The following snippet implements a very naïve XML-like parser, which can be used to extract strings from objects: #!/usr/bin/env bash # Please don't actually use this. xoxo, dmi stack=() tokens=() buf= # QUERY=(a b c) QUERY=($@) flush() { if [[ "$buf" ]]; then tokens+=("$buf") fi buf= } search() { (( ${#stack[@]} < ${#QUERY[@]} )) && return [[ ${tokens[-1]} != "lbrack" ]] && return for (( i=0; i<${#QUERY[@]}; i++ )); do if [[ "${QUERY[i]}" != "${stack[-${#QUERY[@]}+i]}" ]]; then return fi done echo "query result: ${tokens[-2]}" } while read -rn1 chr; do if [[ "$chr" == "<" ]]; then flush tokens+=("lbrack") elif [[ "$chr" == ">" ]]; then if [[ "${tokens[-1]}" == "lbrack" ]]; then flush # get tag contents stack+=("${tokens[-1]}") # put it onto the stack elif [[ "${tokens[-1]}" == "slash" ]]; then unset stack[${#stack[@]}-1] # pop last element fi tokens+=("rbrack") elif [[ "$chr" == "/" && "${tokens[-1]}" == "lbrack" ]]; then tokens+=("slash") else buf+="$chr" fi search done Figure 3 - bash parser for our markup. i will invest in syntax coloring next quarter The result is: ## in DOM selector terms, 'a b c' would be 'a > b > c' $ ./parse.sh a b c < test.xml query result: meow $ ./parse.sh a b d < test.xml query result: nya Figure 4 - parser demonstration This "walking" behavior can be visualized even better after adding declare -p stack to every loop: $ ./parse.sh a b d < test declare -a stack=() declare -a stack=() declare -a stack=([0]="a") declare -a stack=([0]="a") declare -a stack=([0]="a") # (...) declare -a stack=([0]="a" [1]="b") # (...) declare -a stack=([0]="a" [1]="b" [2]="c") declare -a stack=([0]="a" [1]="b" [2]="c") declare -a stack=([0]="a" [1]="b" [2]="c") # (...) declare -a stack=([0]="a" [1]="b") # (...) declare -a stack=([0]="a" [1]="b" [2]="d") declare -a stack=([0]="a" [1]="b" [2]="d") declare -a stack=([0]="a" [1]="b" [2]="d") declare -a stack=([0]="a" [1]="b" [2]="d") query result: nya declare -a stack=([0]="a" [1]="b" [2]="d") declare -a stack=([0]="a" [1]="b" [2]="d") declare -a stack=([0]="a" [1]="b" [2]="d") declare -a stack=([0]="a" [1]="b") # (...) declare -a stack=([0]="a") # (...) declare -a stack=() Figure 5 - stack in action Due to the single-pass nature of our parser (which combines tokenization and a few other steps into one), I had to remove some repetition. Furthermore, this parser is for demonstrational purposes only and cannot parse arbitrary XML. Real-world XML has a lot of special objects, self-terminating tags, and other gotchas that have to be accounted for, even during a simple text extraction. How your brain reads XML Now that you have a gist of how an algorithm for parsing XML may work (and hopefully understand that writing a parser is a lot of pain), let's step back and consider how we, creatures of protein and flesh, parse XML. To make things harder, let's look at the raw, true form of XML - no pretty-printing allowed. meownya Figure 6 - example from before, compacted To an untrained eye, this doesn't look like a tree. nya meow Figure 7 - the same structure, with whitespace arranged to form an x-mas tree Ah, much better! This is semantically equivalent to all the snippets I've attached before, but you have to think really hard to picture that a > b > (c, d). To me, this snippet is first and foremost a string. String parsing Approaching XML or any other structured data format as a string is like dumpster-diving for parts. I don't mean this in a bad way; both regex and dumpster diving have awarded me some great stuff. But they also give me the urge to shower immediately afterwards. To continue the analogy, you can't inquire about why something got thrown out (as in, why given data is present and why it is formatted the way it is). This information is lost. You can make educated guesses if you stare at it long enough, but you can't know for sure. Worse even, if your data changes (as may happen with XML returned by an API), the whole tree may get ordered in a slightly different way, rendering your meticulously crafted parser useless. For this - and many other reasons - it's best to parse XML with a real parser. I'll explore actual string parsing techniques later in this post. Before that, we have an elephant in the room to address... HTML: XML but quirky Pedantry Corner In opposition, I'd like to argue that while XML inspires fear in CS majors and hackers alike, virtually nobody knows about SGML. HTML is quirky XML. Some might argue that both HTML and XML were derived from SGML , not from each other, so this section title doesn't make sense.In opposition, I'd like to argue that while XML inspires fear in CS majors and hackers alike, virtually nobody knows about SGML. HTMLquirky XML. HTML is the main language used for presentation online. The web lives and breathes HTML. You can make webapps without WebAssembly, without ECMAScript, or even without CSS. But you absolutely need2 HTML (... or XHTML - hold that thought). #2 Before publication, Lisa argued that you technically can make pages without HTML: SVG, Java Applets, Flash, PDF One could discredit the last three options, as they're external technologies that aren't a part of any Web spec. However, SVG is much tougher to ignore. It's a W3C Recommendation, which makes it at least adjacent. It also specifies the tag, so technically SVG could be used "without HTML" to create a webpage. I remain sceptical. A few thousand bytes ago, I touched on how XML is extremely strict in the layout. HTML is the exact opposite, allowing for unclosed tags and broken grammar. An XML parser would get a heart attack if asked to parse HTML found online. Parsing HTML is near-impossible Well-formed HTML is fine. However, browsers are designed to make educated guesses instead of failing outright when the markup doesn't fit. This was a compromise made for accessibility. Today's devtools make debugging easy, but in the early 90s? There was virtually no tooling for this. Having the parsers accept slightly mangled input no doubt improved adoption when HTML was all new. Sadly, this means that HTML is already two layers removed from XML. Quirks mode is largely based on how things got implemented by IE and Netscape 30 years ago. Standards compliance mode somewhat improves the situation, but it will still accept missing closing tags or quotes. That being said, virtually all of those situations are defined by the standard, and contemporary browsers implement it extremely closely. Why is it "near-impossible" then? HTML living standard dwarfs the base of XML, being over 1500 pages long! ...Okay, perhaps that's a bit unfair - at the time of writing, only 114 of those pages actually deal with parsing (thanks for checking, Linus!). Regardless, that's still over x2 the length of the XML standard, and this growth is mostly defining edge-cases! Unless you're using an actual browser, chances are that your DOM tree will parse slightly differently on pages that aren't well-formed. HTML4.01? Ridiculous! We need to develop a better alternative that suits everyone's needs Situation: there are two sibling standards. XHTML is... a weird creature. It was first introduced in late 1998 and refined into a standard that was adopted as a W3C recommendation in January 2000. Unfortunately, it wasn't widely adopted (unlike later HTML5)... The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn't work. The large HTML-generating public did not move, largely because the browsers didn't complain. Some large communities did shift and are enjoying the fruits of well-formed systems, but not all. ~ Tim Berners Lee , 2005 I'm only mentioning XHTML here because, technically, we've had a strict, well-defined HTML alternative for almost 3 decades by now, despite not many people knowing about it. Heck, XHTML5 exists too! You can use it right now! It's really cool! (famfo keeps telling me about it, so it has to be true.) Finally: actually parsing HTML with regex The following section is entirely a product of my attempts to scrape various webpages over the years. I'm aware how badly the practice of scraping is viewed in some circles, and I'd like to assure the reader that the bots I've built in the past have always been slow to request, and used extensive caching. GenAI scrapers constantly DoSing the internet can go to hell. Benefits Haruhi says... "Bet you didn't expect them to talk about benefits after they spent so long rambling about how hard it is to parse HTML. Ha!" Development speed Modern websites often have hundreds, if not thousands of nested elements. Writing a selector for something really deep down can take a while, especially if additional constraints are present (randomized class names? the developer only knowing about div-s?). Writing a regex takes me 30 seconds. But hacking up a good selector and debugging why it doesn't work on the next request? Tens of minutes of cursing. Adaptability Selectors are strict. They either give you a result or fail. This is great when you trust the other side of the system to send you good, accurate markup. HOWEVER, this is not something you can expect when scraping. For instance: (...) Peterborough 1 1801 On Time

Calling at: Ifield (1805), Crawley (1808), Three Bridges (1812), Gatwick Airport (1817), (...) Figure 8 - Excerpt from a departure table of the least used train station in West Sussex Say, if we wanted to extract all the stations where this train calls at. In ECMAScript, that'd be document.querySelectorAll("#scroll0 > span") ... And then you have to join the strings, so more like let a=""; document.querySelectorAll("#scroll0 > span").forEach((e)=>{a+=e.innerText;}); console.log(a); ... And then you have to join the strings, so more like With regex, I'd start by matching for scroll0".*?

. This leaves us with a lot of spaces, which can be mitigated by matching for (two spaces in a row). My shell one-liner looks something like: curl (...) | tr -d '\r ' | grep -Poh 'scroll0.*?@@g;s/ //g;' This leaves us with the following payload: scroll0" class="scrollable">Calling at:Ifield(1835), Crawley(1838), Three Bridges(1842), Gatwick Airport(1847), Horley(1851), Redhill(1859), Merstham(1903), Coulsdon South(1908), East Croydon(1915), London Bridge(1930), London Blackfriars(1936), City Thameslink(1938), Farringdon(1941), London St Pancras (Intl)(1945), Finsbury Park(1952), Stevenage(2013), Hitchin(2020), Arlesey(2026), Biggleswade(2031), Sandy(2035), St Neots(2042), Huntingdon(2049), Peterborough(2105)

Calling at: Ifield (1805), Crawley (1808), Three Bridges (1812), Gatwick Airport (1817), (...) Figure 10 - Example modification that could happen to a page In this scenario, our existing selector #scroll0 > span only matches the first element, returning the words Calling at:, without the actual list of stations. On the other hand, the shell one-liner doesn't break because it didn't rely on markup for context, only anchoring. Please note: in some scenarios this may garble the output - especially during large layout changes. I still count it as a win for team regex. Complexity HTML is not designed to be consumed by any program that isn't a browser. At heart, it's a language for presentation, not data interchange. This makes traditional selectors unsuitable for some tasks. Consider the following snippet:

Screen size: 21.37"

Battery capacity : 154Wh

Figure 11 - a mock item info list There's nothing that could help the parser extract the key/value pairs.

is generic enough that it's probably present in a few different areas of the page. Even if you had a proper selector for it, you'd still have to split on :, then remove whitespace, and fuzzy-match the key. With regex, extracting the screen size is just a matter of grep -Pohi 'screen size ?:.*?"' | grep -Poh '[0-9]*\.[0-9]*"' - half of which you would need to do either way. Some broad tips Ask yourself whether regex is the right tool for the job. Unless you're scraping, or you're really low on available space for new libraries (writing for embedded?), then the correct approach is using some XML parser library. For some scraping workflows (extremely regular data, i.e. autogenerated tables), an HTML parser may prove more helpful than hacks described below. If you are actually dealing with irregularly generated HTML, check if you can get the data elsewhere. A smart hacker doesn't scrape HTML when there's an Android app with readily available API endpoints returning JSON. Don't try to actually parse the tree structure. This removes all benefits of using regex for scraping and welcomes you to a world of pain. PCRE is a must. Standard regex doesn't have the ? non-greedy match operator. This means that a.*b will match the first "a", and the last "b", whereas doing a.*?b will match the first "a", and the first "b" that follows. This is VERY useful for anchoring to some unique text, and then matching the next closing tag (like this: meow.*? ). In grep, PCRE is enabled with flag -P, if compiled-in. If you don't have PCRE, non-greedy matching can be emulated through: anchoring somewhere ( meow.* ) replacing the first occurence of the end marker with a unique string ( s| Network -> (right click on request) -> Copy as cURL. Extraction will break. Good scraping bots account for it (and fail safe). Very good bots notify the admin out-of-band. Excellent bots may even try a second path for extracting data. The exhibit I've had some trouble trying to settle on a specific thing to use as a sample, so I finally went with something pointless and silly. Here's a small scraper that extracts data from OpenRCT2's download webpage. Of course, in this case the data is available through a variety of different means. Nonetheless, same principles can be applied to proprietary pages. #!/bin/bash unset IFS data="$(curl 'https://openrct2.io/download/release' )" # data="$(cat /tmp/dump)" # a good scraper always prototypes on a local dump # match line that contains 'v' and 'latest', to get the current game version latest_ver="$(grep -Poh 'v.*?(?=\(latest)' <<< "$data" )" # let's split the list of releases into something we can use: # 1. make sure that there's no 0x01/0x02. we'll use those as our delimeters # 2. split the page on "card-header", which occurs only between list items releases="$(tr -d $'\x01\x02' <<< "$data" | sed 's/card-header/\x01/' )" # now, let's iterate over what we found: while read -d $'\x01' -r release; do # 'btn btn-link' seems to always preceed the version. let's use that fact :) version="$(grep -A1 'btn btn-link' <<< "$release" | tail -n1 | sed 's/ //g' )" version= "${version#* }" # trim all whitespace from the beginning of the string if [[ ! "$version" ]]; then # empty? has to be the 1st item, let's fall back version= "$latest_ver" fi # 'card-title' is our top anchor. # then, we can iterate over 'Download', which occurs at the end. # also here: cleanup all closing tags, since they're at the end of each line anyways. meta="$(grep card-title -C10 <<< "$release" | sed 's@//' )" # architecture is right after a line with h6. # let's match h6 and the next line, then discard the first line arch="$(grep -A1 h6 <<< "$entry" | tail -n1 | sed 's/ *//;s/,//' )" # artifact type is just one line below the arch. # match h6 and *two* following lines, then discard first two. type="$(grep -A2 h6 <<< "$entry" | tail -n1 | sed 's/ *//' )" # url is simple, just extract data from quotes. # the space after " is load-bearing! otherwise we'd need to do additional cleanup url="$(grep ' tag, so technically SVG could be used "without HTML" to create a webpage. I remain sceptical. can make pages without HTML: One could discredit the last three options, as they're external technologies that aren't a part of any Web spec. However, SVG is much tougher to ignore. It's a W3C Recommendation, which makes it at least adjacent. It also specifies the tag, so SVG could be used "without HTML" to create a webpage. I remain sceptical. #3 - This sentence originally mentioned YAML. This post isn't about YAML, and yet I got a lot of complaints for implying that YAML could be considered simple. This discussion has absolutely no relevance to this post, so I replaced it with TOML. Don't like TOML? Think of INI. Don't like INI? Think of CSV, etc. Notes: Edited by Miya. Standards technical consulting by Linus. Proofreading by Linus, Riedler, famfo, Lili, ari, lauren, kleines Filmröllchen, shebang, irth, mei and Multi (damn that's a lot). Thanks a lot to everyone for helping out, and thank you for reading until the end. <3 Comments:

You can't parse XML with regex. Let's do it anyways

Share this article

Related Articles