
Hardwood: A New Parser for Apache Parquet


Hardwood is built with high performance in mind. It applies many of the lessons learned from 1BRC, such as memory-mapping files and multi-threading. I am planning to share more details in a future blog post, so here I'm going to focus on just one specific performance-related aspect: parallelizing the work of parsing Parquet files, so as to utilize the available CPU resources as much as possible and achieve high throughput.

This task is surprisingly complex due to the subtleties of the format, so Hardwood pulls a few tricks to take advantage of all the available cores:

Page-level parallelism, fanning out the work of decoding individual data pages to multiple worker threads. This allows for much higher CPU utilization (and lower memory consumption) than solely processing different column chunks, row groups, or even files in parallel.

Adaptive page prefetching, ensuring that columns that are slower to decode than others (e.g. depending on their data type) receive more resources, so that all columns of a file can be read at the same pace.

Cross-file prefetching, starting to map and decode the pages of file N+1 when approaching the end of file N of a multi-file dataset, avoiding any slowdown at file transitions.
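The first of these tricks, page-level parallelism, can be sketched with a thread pool that decodes one data page per task. This is a minimal illustration of the idea, not Hardwood's actual code; the `EncodedPage` type and the `decode()` method are hypothetical stand-ins:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PageParallelism {

    // Hypothetical stand-in for an encoded Parquet data page.
    record EncodedPage(int id, int[] payload) {}

    // Placeholder for real page decoding (dictionary, RLE, bit-packing, ...).
    static int[] decode(EncodedPage page) {
        int[] values = new int[page.payload().length];
        for (int i = 0; i < values.length; i++) {
            values[i] = page.payload()[i] * 2;
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        List<EncodedPage> pages = List.of(
                new EncodedPage(0, new int[] {1, 2, 3}),
                new EncodedPage(1, new int[] {4, 5, 6}));

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // Fan out one decode task per page; cores stay busy even when
            // a file has only a few row groups or columns. Page order is
            // restored by consuming the futures in submission order.
            List<Future<int[]>> decoded = new ArrayList<>();
            for (EncodedPage page : pages) {
                decoded.add(pool.submit(() -> decode(page)));
            }
            for (Future<int[]> future : decoded) {
                System.out.println(future.get().length);
            }
        }
        finally {
            pool.shutdown();
        }
    }
}
```

Because pages within a column chunk are independent once their offsets are known, this kind of fan-out scales with the page count rather than with the (often small) number of row groups.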

By employing these techniques and some others, such as minimizing allocations and avoiding auto-boxing of primitive values, Hardwood's performance has come quite a long way since I started the project at the end of last year. As an example, the values of three out of 20 columns of the NYC taxi ride data set (a subset of 119 files overall, ~9.2 GB total, ~650M rows) can be summed up in ~2.7 sec using the row reader API with indexed access on my MacBook Pro M3 Max with 16 CPU cores. With the column reader API, the same task takes ~1.2 sec.
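The auto-boxing point is worth making concrete. This small example (illustrative only, not Hardwood's API) contrasts summing a column held as a primitive `double[]` with summing a `List<Double>`, where every element is a heap-allocated wrapper object:

```java
import java.util.List;

public class BoxingDemo {

    // Summing a primitive array: no per-value allocation, values are
    // read straight from a contiguous chunk of memory.
    static double sumPrimitive(double[] column) {
        double sum = 0;
        for (double value : column) {
            sum += value;
        }
        return sum;
    }

    // Summing boxed values: each element was wrapped in a Double object
    // when the list was built, and is unwrapped again on every read.
    static double sumBoxed(List<Double> column) {
        double sum = 0;
        for (Double value : column) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] fares = {12.5, 7.0, 30.25};
        System.out.println(sumPrimitive(fares)); // 49.75
    }
}
```

For hundreds of millions of rows, keeping values in primitive form avoids both the per-element allocations and the pointer-chasing that boxed collections impose.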

The taxi ride data set has a completely flat schema, i.e. it doesn't contain any structs, lists, or maps. Most Parquet-based data sets fall into this category, and thus the focus for optimizing Hardwood has primarily been on these kinds of files so far. While less commonly found, the Parquet format also supports nested schemas. An example of this category are the Parquet files of the Overture Maps project. On the same machine as above, Hardwood can completely parse all the columns of a file with points of interest (~900 MB, ~9M records) in ~2.1 sec using the row reader API and in ~1.3 sec with the column reader API.
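To make the flat-vs-nested distinction concrete, here is what a nested schema looks like in Parquet's schema description language. This is an illustrative example (not the actual Overture Maps schema): a record with one flat string column and one list-of-strings column, using the three-level list structure from the Parquet specification:

```
message place {
  required binary id (STRING);
  optional group categories (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
}
```

Reading the `categories` column correctly requires reassembling repetition and definition levels, which is what makes nested files considerably trickier to parse than flat ones.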

In order to identify bottlenecks, Hardwood comes with support for the JDK Flight Recorder, tracking key performance metrics and events such as prefetch misses and page decoding times.
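Custom JFR events like these are defined via the `jdk.jfr` API that ships with the JDK. The sketch below shows the general pattern with a hypothetical page-decoding event; Hardwood's actual event names and fields may differ:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical event type; timing is captured between begin() and end().
@Name("hardwood.PageDecode")
@Label("Page Decode")
class PageDecodeEvent extends Event {

    @Label("Column")
    String column;

    @Label("Page Size (bytes)")
    int pageSize;
}

public class JfrSketch {

    public static void main(String[] args) {
        PageDecodeEvent event = new PageDecodeEvent();
        event.begin();
        // ... decode the page here ...
        event.column = "tip_amount";
        event.pageSize = 1 << 20;
        event.end();
        event.commit(); // recorded only while a JFR recording is active
    }
}
```

Such events can then be inspected in JDK Mission Control or dumped with `jfr print`, making it easy to spot things like an unusually slow column or a high prefetch-miss rate.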