
Supertoast tables


Alexander Belanger, Matt Kaye Postgres wranglers · Hatchet

The first rule of the “just use Postgres” club — of which we're dedicated members — is simple. For web applications, Postgres should be the starting point for any data storage and retrieval problem.

The reasoning is straightforward: Postgres is a general-purpose database engine, you likely already run your core OLTP workload on Postgres, and you probably don't have time to become an expert in specialized storage systems. This, coupled with improvements in hardware and Postgres performance, means you can start with Postgres for task queues or message queues, caching, vector embeddings, search, and file storage.

This has been our primary approach with Hatchet, and so far it's worked out. Every engineer is well-versed in writing Postgres schemas and queries from scratch, and we're aware of the tradeoffs that we make in battling the Postgres query planner and the MVCC model.

But there's a limit to this approach for every growing startup, and we finally reached it: storing large amounts of jsonb data in our Postgres instances. Here's the story of our migration from jsonb columns and toast tables to what we affectionately call supertoast tables.

The foundational data structure in Hatchet is the task queue; it's what durable workflows, DAGs, events, and nearly every other feature is built on. Each task in the queue contains an input, and after completion, an output. These inputs and outputs are arbitrary JSON payloads which enter the system rapidly.

An additional constraint is that Hatchet is designed to be fast; it takes less than 25ms on average (and in the optimistic case, as little as 9ms) for a task to be sent to the engine and start running on a worker. This rules out a set of candidate options. Object stores are much too slow, and many hosted databases can be tricky to work with because networked disks have restrictive IOPS. NVMe disks are a great fit, and we already run most of our hosted infrastructure on NVMe-backed Postgres!

So like pretty much everything else in our system, we persist these payloads to Postgres using the jsonb column type.
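To make this concrete, here's a minimal sketch of what such a table might look like. The table and column names are hypothetical, not Hatchet's actual schema; the point is that the payloads live in `jsonb` columns, and Postgres automatically compresses and moves values over roughly 2 KB out-of-line into the table's associated TOAST table:

```sql
-- Hypothetical, simplified task table. Large jsonb values are
-- TOASTed (compressed and stored out-of-line) by Postgres
-- automatically once they exceed the ~2 KB tuple threshold.
CREATE TABLE tasks (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    queued_at timestamptz NOT NULL DEFAULT now(),
    status    text NOT NULL DEFAULT 'queued',
    input     jsonb NOT NULL,  -- arbitrary JSON payload from the caller
    output    jsonb            -- set when the task completes
);
```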

The downsides are clear. Even small payloads can take up over 50% of our database storage, and larger payloads can take up well over 90%. But only payloads from very recent tasks are accessed frequently. Payload access follows a power law; payloads from over a day ago are very, very infrequently accessed. This leaves a good chunk of the database storage just sitting idle on our NVMe disks, which is not ideal from a cost-efficiency perspective and also bloats our backups.
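One way to see how much of a table's footprint is payload is with Postgres's built-in size functions (again assuming a hypothetical `tasks` table): `pg_relation_size` counts only the main heap, while `pg_table_size` also includes the TOAST table, free space map, and visibility map.

```sql
-- Compare the heap size to the total size including TOAST data.
SELECT
    pg_size_pretty(pg_relation_size('tasks')) AS heap_only,
    pg_size_pretty(pg_table_size('tasks'))    AS heap_plus_toast,
    pg_size_pretty(pg_table_size('tasks') - pg_relation_size('tasks'))
        AS toast_fsm_vm;
```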

What happens if our database starts to fill up rapidly? While NVMe disks give us fantastic IOPS, they're not networked, which means that swapping a disk requires that we provision an entirely new database. To make matters worse, Hatchet is a very high-churn system, which means that our WAL is very large compared to a more traditional SaaS read-heavy workload. New databases can sometimes take many hours to provision, which can get scary as the database nears 100% of its storage capacity.
