Fast columnar JSON decoding with arrow-rs
Published on: 2025-05-26 06:10:27
JSON is the most common serialization format used in streaming pipelines, so it pays to be able to deserialize it fast. This post covers in detail how the arrow-json library works to perform very efficient columnar JSON decoding, and the additions we've made for streaming use cases.
My day job is working on the Arroyo stream processing engine, which executes complex, stateful queries over high-scale streams of events. Computing things like windowed aggregates, stream joins, and incrementally-computed SQL involves, as you might imagine, a lot of sophisticated algorithms and systems. Doing it much faster than existing systems like Apache Flink took careful performance engineering at every level of the stack.
But…I'm going to let you in on a little secret. The sad truth of this industry, which no one will tell you, is that so, so many data pipelines spend the bulk of their CPU time…deserializing JSON.
That's right. Everybody's favorite human-readable and sorta-human-writable data inte