We've written about individual compression codecs in Vortex before: FastLanes for bit-packing integers, FSST for strings, and ALP for floating point. But we've never explained how Vortex decides which codec to use for a given column, or how it layers multiple codecs on top of each other.
On TPC-H at scale factor 10, Vortex files are 38% smaller and decompress 10–25x faster than Parquet with ZSTD, without using any general-purpose compression. The difference comes down to how the codecs are selected and composed, using a framework inspired by the BtrBlocks paper from TU Munich.
🎓 Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. 2023. BtrBlocks: Efficient Columnar Compression for Data Lakes. Proc. ACM Manag. Data 1, 2, Article 118 (June 2023). https://doi.org/10.1145/3589263
The core idea: don't pick one codec. Try them all, and let the data decide.
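That idea can be sketched in a few lines. The sketch below is purely illustrative (not Vortex's actual API or size accounting): estimate the compressed size of a column under each candidate codec, then keep whichever wins.

```python
def plain_size(values):
    return 4 * len(values)  # assume 4-byte integers, stored as-is

def rle_size(values):
    # One (value, run_length) pair per run; assume 8 bytes per pair.
    runs = 1 + sum(a != b for a, b in zip(values, values[1:]))
    return 8 * runs

def dict_size(values):
    # Dictionary of unique values plus one 1-byte code per row.
    return 4 * len(set(values)) + len(values)

def best_codec(values):
    """Try every candidate codec and let the data decide."""
    sizes = {"plain": plain_size(values),
             "rle": rle_size(values),
             "dict": dict_size(values)}
    return min(sizes, key=sizes.get)

print(best_codec([7] * 1000))         # long runs favor RLE
print(best_codec(list(range(1000))))  # all-unique data favors plain
```

Real systems refine this with sampling (estimating sizes on a slice of the column rather than compressing everything), but the selection principle is the same.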
How Parquet does it
Parquet has two separate compression layers. First, a lightweight encoding transforms column values into a more compact representation: dictionary encoding, delta encoding, byte-stream-split, and so on. Then a general-purpose compression codec (ZSTD, LZ4, Snappy) is applied on top. The encoding is specified per data page, while the compression codec is fixed per column chunk.
Mainstream Parquet writers use a simple strategy: try dictionary encoding first, and if the dictionary grows too large (typically >1 MB), fall back to PLAIN encoding for the remaining pages. More advanced writers (like Spark's V2 writer) use DELTA_BINARY_PACKED for integers and DELTA_BYTE_ARRAY for strings instead of PLAIN, but the overall decision tree is the same.
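The fallback logic amounts to a small state machine. Here is a hedged sketch of that decision tree (the size accounting and fallback point are simplified; real writers track the dictionary page size in encoded bytes and keep already-written dictionary pages as-is):

```python
DICT_SIZE_LIMIT = 1 << 20  # ~1 MB, a typical default threshold

def choose_page_encodings(pages):
    """Per-page encoding choice: dictionary until it overflows,
    then PLAIN for every remaining page. Pages are lists of strings."""
    dictionary = {}
    dict_bytes = 0
    encodings = []
    fallen_back = False
    for page in pages:
        if not fallen_back:
            for value in page:
                if value not in dictionary:
                    dictionary[value] = len(dictionary)
                    dict_bytes += len(value)
            if dict_bytes > DICT_SIZE_LIMIT:
                fallen_back = True  # dictionary overflowed: stop using it
        encodings.append("PLAIN" if fallen_back else "RLE_DICTIONARY")
    return encodings
```

A column of low-cardinality pages stays dictionary-encoded throughout; a single page of many large unique values blows past the limit and flips every subsequent page to PLAIN.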
This is a form of adaptive compression. The writer does respond to the data by falling back when the dictionary overflows. There's even a small amount of cascading: dictionary-encoded pages use the RLE/bit-packing hybrid to encode the integer codes, which is dictionary → RLE → bit-packing if you squint. But it's a single hard-coded cascade path, not a general mechanism.
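The hidden cascade can be made concrete with a worked example (a hedged illustration, not Parquet's wire format): values become dictionary codes, repeated codes collapse into runs, and the code values only need a few bits each.

```python
values = ["red", "red", "red", "blue", "blue", "green"]

# Step 1: dictionary — map each value to a small integer code.
uniques = list(dict.fromkeys(values))       # ['red', 'blue', 'green']
codes = [uniques.index(v) for v in values]  # [0, 0, 0, 1, 1, 2]

# Step 2: RLE — collapse repeated codes into (code, run_length) runs.
runs = []
for c in codes:
    if runs and runs[-1][0] == c:
        runs[-1] = (c, runs[-1][1] + 1)
    else:
        runs.append((c, 1))
# runs == [(0, 3), (1, 2), (2, 1)]

# Step 3: bit-packing — each code fits in ceil(log2(n_uniques)) bits.
bit_width = max(1, (len(uniques) - 1).bit_length())  # 2 bits per code
```

Three distinct 32-bit strings become three runs of 2-bit codes: a dictionary → RLE → bit-packing cascade, but one that Parquet applies only on this single hard-coded path.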