Spiral

I've been building data systems for long enough to be skeptical of "revolutionary" claims, and I'm uncomfortable with grandiose statements like "Built for the AI Era". Nevertheless, AI workloads have tipped us into what I'll call the Third Age of data systems, and legacy platforms can't meet the moment.

Three Eras of Data Systems

In the beginning, databases had human-scale inputs and human-scale outputs. Postgres—the king of databases, first released in 1989[1]—is the archetypal application database. A trivial example of a core Postgres workflow is letting a user create a profile, view it, and then update their email address. Postgres needs to support many users doing so at the same time, but it was built for a world in which the rate of database writes was implicitly limited by humans taking discrete actions.
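To make "human-scale" concrete, here is a minimal sketch of that workflow in Python, assuming a hypothetical `profiles` table and the psycopg2 driver; the table, columns, and connection string are illustrative, not from the original post:

```python
import psycopg2

# Connection details are illustrative.
conn = psycopg2.connect("dbname=app user=app")

with conn, conn.cursor() as cur:
    # A user creates a profile: one row, one write.
    cur.execute(
        "INSERT INTO profiles (username, email) VALUES (%s, %s) RETURNING id",
        ("ada", "ada@example.com"),
    )
    profile_id = cur.fetchone()[0]

    # The user views their profile: one point lookup.
    cur.execute("SELECT username, email FROM profiles WHERE id = %s", (profile_id,))
    print(cur.fetchone())

    # The user updates their email address: one more write.
    cur.execute(
        "UPDATE profiles SET email = %s WHERE id = %s",
        ("ada@example.org", profile_id),
    )

# Every statement above is gated on a discrete human action, so write
# volume scales with the number of users, not with machines.
```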

Then came the age of "Big Data", when we automated data collection at "web scale", recording far more granular events. Early internet giants scraped every link on the entire internet and captured every click on their websites. For data systems, this was the dawn of machine-scale inputs. However, the only way for a human to engage with this machine-collected data was to distill it down—into a dashboard, a chart, or even a single number. The inputs to a data system might have been in petabytes, but the end products were still measurable in kilobytes.
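To illustrate that distillation, here is a sketch of a typical Second-Age job using PySpark, assuming a hypothetical Parquet dataset of click events; the path and column names are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distill").getOrCreate()

# Machine-scale input: a lake of raw click events (path is illustrative).
clicks = spark.read.parquet("s3://datalake/clickstream/")

# Human-scale output: one row per day, small enough for any dashboard.
daily = clicks.groupBy(F.to_date("event_time").alias("day")).count()
daily.orderBy("day").show()
```

Petabytes can flow in; a few hundred bytes of aggregates come out.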

This unprecedented scale of data collection also led to a technological schism: on one side, we saw the rise of data lakes, massive shared filesystems where we would dump files and run MapReduce jobs. On the other side were (cloud) data warehouses, which provided both scalability and ergonomics for simple data types like dates, numbers, and short text. The two branches eventually converged into "the Lakehouse", wherein the descendants of Hadoop discovered that tables were useful all along.[2]

Now, we are witnessing another epochal shift: the rise of the "Machine Consumer". In addition to machine-scale inputs, future data systems must be able to produce machine-scale outputs. Editing a few rows or aggregating a few simple columns is no longer enough. Machines don't want dashboards and summaries—they want everything.

What Machines Want

When I say machines want "everything," let me be specific. An NVIDIA H100 has enough memory bandwidth to consume 4 million 100 KiB images per second. A Monte Carlo tree search might need to perform billions of random reads across your entire dataset. Machines want fast scans, fast point lookups, and fast searches over petabyte- or even exabyte-scale data.

This is fundamentally different from the Second Age, when we optimized for human-friendly aggregations and reports. And here's where our current infrastructure completely breaks down: there's an uncanny valley between 1 KB and 25 MB where Parquet files and object storage are both wildly inefficient. If each object is stored individually and each S3 read costs 50 ms of latency, reading 4 million individual 100 KiB images—enough to saturate the H100 for one second—would accrue 55 hours of aggregate network overhead. Vector embeddings, small images, large documents—these are exactly what AI systems need, and exactly what current systems handle poorly.
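The arithmetic is worth spelling out; this snippet just reproduces the numbers quoted above:

```python
# 4 million 100 KiB images per second, as in the H100 example above.
images_per_second = 4_000_000
image_bytes = 100 * 1024  # 100 KiB

bytes_per_second = images_per_second * image_bytes
print(f"{bytes_per_second / 1e9:.0f} GB/s")  # ~410 GB/s of image data

# Each image fetched as its own S3 object at ~50 ms of latency per request:
total_latency_hours = images_per_second * 0.050 / 3600
print(f"{total_latency_hours:.1f} hours")  # ~55.6 hours of aggregate latency
```

One second of GPU work buys more than two days of aggregate round trips; that is the uncanny valley in a single number.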

Symptoms of the Same Disease
