
The two versions of Parquet


A few days ago, the creators of DuckDB published the article Query Engines: Gatekeepers of the Parquet File Format, which explains how the engines that query Parquet files as SQL tables are holding back the evolution of the format: they do not fully support the latest specification, and without that support, the rest of the ecosystem has no incentive to adopt it.

In my experience, this issue is not limited to query engines; it extends to the other tools in the ecosystem. Soon after releasing the first version of Carpet, I discovered that there was a version 2 of the format and that the core Java Parquet library does not activate it by default. Since the specification had been finalized for some time, I decided that the best approach was to make Carpet use version 2 by default.
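To see what that change amounts to, here is a minimal sketch using parquet-java's Avro bindings (the schema and output path are invented for the example): the library defaults to version 1, so version 2 has to be requested explicitly on the writer builder.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteV2Sketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema, just for the example
        Schema schema = SchemaBuilder.record("Measurement").fields()
                .requiredString("sensor")
                .requiredLong("value")
                .endRecord();

        // parquet-java defaults to WriterVersion.PARQUET_1_0;
        // version 2 must be opted into on the builder.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/measurements.parquet"))
                .withSchema(schema)
                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("sensor", "s-1");
            record.put("value", 42L);
            writer.write(record);
        }
    }
}
```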

A week later, I discovered the hard way at work that if you are not up to date with Pandas in Python, you cannot read files written with version 2. I had to roll back the change immediately.

Parquet Version 2

Upon researching the topic, you’ll find that even though the format specification is finalized, it is not fully implemented across the ecosystem. Ideally, the standard would be whatever the specification defines, but in reality, there is no agreement on the minimum set of features an implementation must support to be considered compatible with version 2.

In this pull request on the project that defines the file format, a discussion about what constitutes the core has been going on for four years, with no sign of a resolution anytime soon. Reading this other thread on the mailing list, I came to the conclusion that the specification mixes two concepts that could evolve independently:

Given a series of values in a column, how to encode them efficiently. This allows incorporating new encodings such as RLE_DICTIONARY or DELTA_BYTE_ARRAY, which further improve compression.

Given an encoded column's data, where to write it within the file along with its metadata such as headers, nulls, or statistics. This helps to maximize the available metadata while minimizing its size and the number of file reads. It is what they call Data Page V2.

Many would likely prefer to prioritize improvements in encoding over page structure. Finding a file that uses an unknown encoding would make a column unreadable, but a change in how pages are structured would make the entire file unreadable.
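That per-column granularity is visible in a file's footer metadata. As a rough sketch with parquet-java (the file path is a placeholder), you can list the encodings each column chunk actually uses, which is what lets a reader detect an unknown encoding before touching the column's data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectEncodings {
    public static void main(String[] args) throws Exception {
        Path file = new Path("/tmp/measurements.parquet"); // placeholder path
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(file, new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // The footer records, per column chunk, the set of
                    // encodings used; an encoding the reader does not
                    // implement makes only that column unreadable.
                    System.out.println(column.getPath() + " -> " + column.getEncodings());
                }
            }
        }
    }
}
```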

What I came to understand is that new logical types are not tied to a specific format version. On the one hand, there are the primitive types, which are fixed; on top of them, logical types are defined: a date is a representation of an int32, and a BigDecimal or String is represented with a BYTE_ARRAY. Now the VARIANT type is being defined, and I have not seen it associated with either of the two versions.
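As an illustration of that layering, here is a small sketch with parquet-java's schema builder (the message and field names are invented): each logical type annotates one of the fixed primitive types, and nothing in the declaration refers to a format version.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class LogicalTypesSketch {
    public static void main(String[] args) {
        MessageType schema = Types.buildMessage()
                // DATE annotates the fixed int32 primitive
                .required(PrimitiveTypeName.INT32)
                    .as(LogicalTypeAnnotation.dateType())
                    .named("day")
                // TIMESTAMP annotates int64
                .required(PrimitiveTypeName.INT64)
                    .as(LogicalTypeAnnotation.timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS))
                    .named("created_at")
                // String and decimal both sit on top of BYTE_ARRAY
                // (BINARY in parquet-java's enum)
                .required(PrimitiveTypeName.BINARY)
                    .as(LogicalTypeAnnotation.stringType())
                    .named("name")
                .required(PrimitiveTypeName.BINARY)
                    .as(LogicalTypeAnnotation.decimalType(2, 18))
                    .named("amount")
                .named("Example");
        System.out.println(schema);
    }
}
```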
