Python 3.14 introduced the compression.zstd module, a standard library implementation of Facebook's Zstandard (Zstd) compression algorithm. Zstd was developed a decade ago by Yann Collet, who maintains a blog devoted to compression algorithms.
I am not a compression expert, but Zstd caught my eye because it supports incremental compression. You can feed it data to compress in chunks, and it will maintain an internal state. It's particularly well suited to compressing small data. It's perfect for the classify-text-via-compression trick, which I described in a blog post five years ago.
My previous blog post was based on a suggestion from Artificial Intelligence: A Modern Approach, and is rooted in the idea that compression length approximates Kolmogorov complexity. There’s a 2023 paper called “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors that revisits this approach with encouraging results.
The problem with this approach is practical: popular compression algorithms like gzip and LZW don't support incremental compression. They could, algorithmically speaking, but in practice they don't expose an incremental API. So you have to recompress the training data for each test document, which is very expensive. Zstd, however, does support it, which changes everything. The fact that Python 3.14 added Zstd to its standard library is what got me excited.
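To make that cost concrete, here is roughly what the non-incremental version looks like with gzip. This is a sketch, not code from my earlier post: the gzip_classify helper and the corpus-plus-document trick are just one way to frame it.

import gzip

def gzip_classify(document: bytes, corpora: dict[str, bytes]) -> str:
    # Hypothetical sketch: estimate how many extra bytes the document adds
    # to each class's compressed corpus, then pick the smallest increase.
    costs = {}
    for label, corpus in corpora.items():
        baseline = len(gzip.compress(corpus))
        combined = len(gzip.compress(corpus + b" " + document))
        costs[label] = combined - baseline
    return min(costs, key=costs.get)

Every prediction recompresses the full training corpus of every class, so the cost grows with the amount of training data.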
Before delving into the machine learning part, I'll provide a snippet to build some intuition. The main class we're interested in is ZstdCompressor. It has a compress method that takes a chunk of data and returns the compressed output. The data it compresses is then added to its internal state. You can also provide a ZstdDict to the compressor, which is a pre-trained dictionary that gives it a head start.
>>> from compression.zstd import ZstdCompressor, ZstdDict
>>> tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50
>>> zd_tacos = ZstdDict(tacos, is_raw=True)
>>> comp_tacos = ZstdCompressor(zstd_dict=zd_tacos)
>>> padel = b"racket court serve volley smash lob match game set " * 50
>>> zd_padel = ZstdDict(padel, is_raw=True)
>>> comp_padel = ZstdCompressor(zstd_dict=zd_padel)
>>> input_text = b"I ordered three tacos with extra guacamole"
>>> len(comp_tacos.compress(input_text, mode=ZstdCompressor.FLUSH_FRAME))
43
>>> len(comp_padel.compress(input_text, mode=ZstdCompressor.FLUSH_FRAME))
51
The input text can be classified as “tacos” rather than “padel” because the compressor with the “tacos” dictionary produces a smaller compressed output. This can be turned into a simple classifier by building a compressor for each class, and then classifying a new document by finding the compressor that produces the smallest compressed output for that document.
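Here is a minimal sketch of that classifier, reusing comp_tacos and comp_padel from the snippet above. The classify function is my own naming, not part of the module.

def classify(document: bytes, compressors: dict) -> str:
    # Compress the document with each class's compressor and keep the label
    # that yields the smallest output.
    sizes = {
        label: len(comp.compress(document, mode=ZstdCompressor.FLUSH_FRAME))
        for label, comp in compressors.items()
    }
    return min(sizes, key=sizes.get)

Calling classify(input_text, {"tacos": comp_tacos, "padel": comp_padel}) returns "tacos".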
Note that the compress method doesn't just return the compressed output: it also updates the internal state of the compressor. From a machine learning perspective, this means each compressor gets corrupted with data that does not belong to its class. Unfortunately, there is no public or private method to compress without updating the internal state.
The trick is to rebuild the compressor every time a new labelled document is received. Thankfully, instantiating a ZstdCompressor with a ZstdDict is very fast – tens of microseconds in my experiments. This makes it affordable to rebuild the compressor very frequently.
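To sketch what this might look like (the class and method names below are mine, and this is only one way to arrange the bookkeeping): keep the raw text seen for each class, rebuild that class's raw dictionary when a new labelled document arrives, and build fresh compressors from the dictionaries at prediction time so no shared state is ever polluted.

from collections import defaultdict
from compression.zstd import ZstdCompressor, ZstdDict

class CompressionClassifier:
    # Hypothetical sketch of an online classifier built on Zstd dictionaries.

    def __init__(self):
        self.corpora = defaultdict(bytes)  # raw training text per class
        self.dictionaries = {}

    def learn(self, document: bytes, label: str):
        # Accumulate the class's raw text and rebuild its raw dictionary.
        self.corpora[label] += document + b" "
        self.dictionaries[label] = ZstdDict(self.corpora[label], is_raw=True)

    def predict(self, document: bytes) -> str:
        # Fresh compressors every time, so prediction never mutates state.
        sizes = {
            label: len(
                ZstdCompressor(zstd_dict=zd).compress(
                    document, mode=ZstdCompressor.FLUSH_FRAME
                )
            )
            for label, zd in self.dictionaries.items()
        }
        return min(sizes, key=sizes.get)

The rebuild-on-learn step is cheap because is_raw=True simply wraps the accumulated bytes, and instantiating the compressors is the tens-of-microseconds operation mentioned above.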
Here are the steps to take to turn this into a learning algorithm: