Efficient String Compression for Modern Database Systems

Why you should care about strings

If dealing with strings wasn’t important, why bother thinking about compressing them? Whether you like it or not, “strings are everywhere”. They are by far the most prominent data type in the real world. In fact, roughly 50% of data is stored as strings. This is largely because strings are flexible and convenient: you can store almost anything as a string. As a result, they’re often used even when better alternatives exist. For example, have you ever found yourself storing enum-like values in a text column? Or, even worse, using a text column for UUIDs? Guilty as charged, I’ve done it too, and it turns out this is actually very common.
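To put a rough number on the UUID case, here is a minimal sketch using Python's standard `uuid` module. It compares only the canonical text form against the raw 128-bit value; what a given database actually stores per value depends on its column types and per-value overhead, which this does not model.

```python
import uuid

# A random UUID, purely for illustration.
u = uuid.uuid4()

# Canonical text form: 32 hex digits plus 4 hyphens.
as_text = str(u).encode("ascii")

# Raw binary form: the 128-bit value itself.
as_binary = u.bytes

print(len(as_text), len(as_binary))  # 36 16

# The binary form loses nothing: it round-trips to the same UUID.
assert uuid.UUID(bytes=as_binary) == u
```

So even before any compression, the text representation takes more than twice the space of the value it encodes.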

Recently, Snowflake published insights into one of their analytical workloads. They found that string columns were not only the most common data type, but also the most frequently used in filters.

This means two things: First, it is important to store these strings efficiently, i.e., in a way that does not waste resources, including money. Second, they must be stored in a way that allows queries to be answered efficiently, because that is what users want: fast responses to their queries.

Why you want to compress (your strings)

Compression (usually) reduces the size of your data, and therefore the resources your data consumes. That can mean real money if you’re storing data in cloud object stores where you pay per GB stored, or simply less space used on your local storage device. However, in database systems, size reduction isn’t the only reason to compress. As Prof. Thomas Neumann always used to say in one of my foundational database courses (paraphrased): “In database systems, you don’t primarily compress data to reduce size, but to improve query performance.” Because compression also reduces the data’s memory footprint, your data might all of a sudden fit into CPU caches where it previously only fit into RAM, cutting the access time by more than 10x. And because data must travel through bandwidth-limited physical channels from disk to RAM to the CPU registers, smaller data also means reading more information in the same amount of time, and thus better bandwidth utilization.
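To get a feel for how much smaller repetitive string data can become, here is a minimal sketch in Python. The column contents are made up for illustration, and `zlib` is only a general-purpose stand-in: it is not one of the schemes CedarDB uses, and unlike the lightweight schemes database systems tend to prefer, it does not let you evaluate filters on the compressed data.

```python
import zlib

# Hypothetical column of enum-like status strings (illustrative data only).
column = ["shipped", "pending", "cancelled", "shipped"] * 25_000

# Serialize the column as newline-separated UTF-8 bytes.
raw = "\n".join(column).encode("utf-8")

# General-purpose compression as a stand-in for a real columnar scheme.
compressed = zlib.compress(raw)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.0f}x")

# Compression must be lossless: decompressing yields the original bytes.
assert zlib.decompress(compressed) == raw
```

Every byte saved here is a byte that does not have to travel from disk to RAM to the CPU, which is where the query performance argument comes from.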

String Compression in CedarDB

Before diving deeper into FSST, it makes sense to take a brief detour and look at CedarDB’s current compression suite for text columns, as this helps explain some design decisions and better contextualize the impact of FSST. Until January 22, 2026, CedarDB supported the following compression schemes for strings:

Uncompressed

Single Value
