Lies, Damn Lies and Database Benchmarks

Benchmarks, everyone loves benchmarks. People look at a benchmark result and start spreading the word that database X is the top dog, since it is so much faster than database Y.

A decent benchmark might be pictured as a strict Olympic Games-like running competition where the "Citius, Altius, Fortius" principle is precisely implemented. But in reality, when you approach the athletes, you start hearing unexpected noises. What is that? It turns out the competition is more like those weird contests you find on the Internet: the athletes must whistle "Yellow Submarine" accurately while running as fast as they can. The winner is no longer the fastest runner. It is whoever best balances raw speed against a skill that has nothing to do with running, and the quickest sprinter on the track can easily finish last.

The analogy fits something as complex as database benchmarks, especially when they compare quite different categories of databases. A perfect, completely fair database benchmark is like a unicorn: good luck finding one. Today we will try to illustrate this by toying with a public, well-recognized benchmark.

The benchmark we will use is ClickBench, but do not get us wrong: we are here to question all database benchmarks, not ClickBench specifically. ClickBench is just convenient. It is a solid comparison for analytical databases and already includes a large roster of engines.

ClickBench runs the same workload against every system: a single web-analytics table of around 100 million rows and 105 columns (the famous hits dataset), and 43 analytical queries over it. Each engine ships a small set of shell scripts. The flow is always the same: a script installs the database, loads the data (importing from CSV/TSV, or simply pointing the engine at a downloaded Parquet file if it can read external files), and then runs the 43 queries.

Each query is measured in two flavors:

Cold run. This is the first execution of a query, with all operating system page caches and database caches cleared beforehand. It captures the worst case, when nothing is warm.

Hot run. Quoting the ClickBench rules, "each of the 43 queries is run three times," and "the smaller of the 2nd and 3rd runtime is used if both runs are successful." The first run is supposed to populate the caches, so the two later runs are expected to be the fastest.
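To make the two flavors concrete, here is a minimal Python sketch of how a cold and a hot number could be derived from the three timings of one query. The function name `cold_and_hot` and the use of `None` for a failed run are our own assumptions for illustration, not ClickBench's actual code. The quoted rule only covers the case where both later runs succeed, so the sketch simply reports no hot number otherwise.

```python
from typing import Optional, Sequence, Tuple


def cold_and_hot(
    runs: Sequence[Optional[float]],
) -> Tuple[Optional[float], Optional[float]]:
    """Derive (cold, hot) timings in seconds from the three runs of one query.

    `runs` holds the timings in execution order; None marks a failed run
    (an assumption of this sketch).
    """
    if len(runs) != 3:
        raise ValueError("expected exactly three runs per query")

    # Cold run: the first execution, taken as-is.
    cold = runs[0]

    # Hot run: the smaller of the 2nd and 3rd runtime, if both succeeded.
    later = [t for t in runs[1:] if t is not None]
    hot = min(later) if len(later) == 2 else None  # assumption: no hot score otherwise

    return cold, hot
```

For example, `cold_and_hot([1.5, 0.4, 0.3])` yields a cold time of 1.5 seconds and a hot time of 0.3 seconds.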

That cold definition hides an asymmetry the public dashboard does not advertise. Clearing the OS page cache and restarting the server is only possible when the database runs on the benchmark machine. A managed cloud service, say Snowflake, BigQuery, Redshift, or Databricks, runs on the provider's hardware, where the harness has no shell, no drop_caches, and no way to bounce the server, so its three runs all hit the same live, never-restarted service. Its cold number is therefore never forced cold the way a self-hosted engine's is, which tilts the cold-run ranking toward hosted systems, and with it the combined score that folds cold runs in. ClickBench's rules require that restart for a true cold run, and a restart is something you can only ask of a server you control. Every engine in this post runs self-hosted on the same box, so they all play by the same rule, but it is worth remembering the next time you compare cold-run numbers across hosted and self-managed systems.
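The asymmetry is easy to see in a rough Python sketch of a timing loop. This is not the ClickBench harness, which is a set of shell scripts. The names `drop_os_page_cache`, `time_three_runs`, `run_query` and `force_cold` are hypothetical, and running the cold step once before the three runs is our assumption. The one real interface used is Linux's `/proc/sys/vm/drop_caches`, which needs root to write to.

```python
import os
import time
from typing import Callable, List, Optional

# Linux kernel interface for dropping caches; writing to it requires root.
DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_os_page_cache(path: str = DROP_CACHES_PATH) -> None:
    """Flush dirty pages, then ask the kernel to drop its caches."""
    os.sync()  # write dirty pages to disk first so they can be dropped
    with open(path, "w") as f:
        f.write("3\n")  # 3 = page cache plus dentries and inodes


def time_three_runs(
    run_query: Callable[[], None],
    force_cold: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Time three consecutive executions of one query, in seconds.

    `force_cold` is the step that makes the first run truly cold, for
    example dropping the page cache and restarting the server. A hosted
    service offers no such step, so it would be passed as None.
    """
    if force_cold is not None:
        force_cold()

    timings = []
    for _ in range(3):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings
```

For a self-hosted engine, `force_cold` would call `drop_os_page_cache()` and then restart the server with whatever engine-specific command applies (not shown here). For a managed service there is nothing to pass, so the first of the three timings is only as cold as the service happens to be at that moment.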
