The mathematics of compression in database systems

This post is a bit of a departure from our usual architecture discussions, but I promise it is worth the detour if you care about your database's performance.

I started thinking about compression when implementing prefix compression for SlateDB. When I ran benchmarks, I noticed that performance seemed "worse" despite improved compression ratios.

This got me thinking more deeply about why databases use compression, and about the right framework for reasoning about whether the tradeoff is worth it. This post is the result of that exploration, and I hope it helps you understand why, when, and how data systems use techniques to reduce data size.

understanding “compression math”

The first thing to understand about why compression matters is that there are effectively three resources a database can be bottlenecked on:

I/O Bandwidth: bytes/sec from disk, network, or even DRAM into your CPU caches
CPU: cycles available on your machine for processing
Memory: transient storage that is orders of magnitude faster to access than disk/network

Compression gives you a knob to tune between these resources. More specifically, compression directly trades I/O for CPU: it takes CPU cycles to compress and decompress data, but moving compressed data requires less I/O bandwidth.
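To make the tradeoff concrete, here is a minimal back-of-the-envelope model. Every number in it (dataset size, bandwidth, ratio, decompression throughput) is an illustrative assumption of mine, not a measurement from SlateDB:

```python
# Back-of-the-envelope model of the I/O-vs-CPU compression tradeoff.
# Every constant here is an illustrative assumption, not a measurement.

GIB = float(1 << 30)  # bytes in a GiB


def read_seconds_uncompressed(size: float, bandwidth: float) -> float:
    """Time to move `size` raw bytes over an I/O path."""
    return size / bandwidth


def read_seconds_compressed(size: float, ratio: float, bandwidth: float,
                            decompress_tput: float) -> float:
    """Time to move the compressed bytes, then decompress them on the CPU."""
    return (size / ratio) / bandwidth + size / decompress_tput


S = 10 * GIB   # uncompressed dataset size: 10 GiB
B = 0.5 * GIB  # I/O bandwidth: 0.5 GiB/s (a slow-ish disk)
R = 3.0        # compression ratio
D = 2.0 * GIB  # decompression throughput: 2 GiB/s of output

print(f"uncompressed: {read_seconds_uncompressed(S, B):.1f}s")      # 20.0s
print(f"compressed:   {read_seconds_compressed(S, R, B, D):.1f}s")  # 11.7s
```

Note how sensitive the result is to the I/O path: rerun the model with a fast NVMe drive at B = 3 GiB/s and the raw read takes about 3.3s while the compressed path takes about 6.1s, so compression now loses. That is exactly the kind of "worse" benchmark result mentioned above.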

Mathematically, we can reason about this tradeoff by computing the cost of compression and comparing how it shifts the utilization of each resource. Let’s assume that, for a given workload and compression algorithm, we have the following properties:

\(\begin{aligned} S &: \text{uncompressed size of your dataset} \\ S_c &: \text{compressed size of your dataset} \\ R = S / S_c &: \text{compression ratio (higher is better)} \\ T_c \text{ and } T_d &: \text{compression / decompression time} \\ B_x &: \text{physical bandwidth of I/O path } x \text{ (bytes/sec)} \end{aligned}\)
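From these definitions we can derive a basic break-even condition (this is my framing of the read path, using the symbols above, not a formula quoted from later in the post): fetching compressed data wins whenever the transfer time it saves exceeds the time spent decompressing.

\(\begin{aligned} \frac{S_c}{B_x} + T_d &< \frac{S}{B_x} \\ \Longleftrightarrow \quad T_d &< \frac{S - S_c}{B_x} = \frac{S}{B_x}\left(1 - \frac{1}{R}\right) \end{aligned}\)

In other words, the higher the ratio \(R\) and the slower the I/O path \(B_x\), the more time the decompressor is allowed to take before compression stops paying off.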

is compression about latency?
