When we started building a durable workflows library, the most critical architectural decision we faced was which data store to use for workflow metadata. The core durable workflow operations are simple: checkpointing workflow state and recovering an interrupted workflow from its latest checkpoint. Almost any data store can handle these operations, but choosing the right one is essential to ensuring workflows are scalable and performant.
In this blog post, we’ll dive deep into why we chose to build on Postgres. While there are good nontechnical reasons for the decision (Postgres is popular and open-source with a vibrant community and nearly 40 years of battle-testing), we’ll focus on the technical reasons: key Postgres features that make it easier to develop a robust and performant workflows library. In particular, we’ll look at:
- How Postgres concurrency control (particularly its support for locking clauses) enables scalable distributed queues.
- How the relational data model (plus careful use of secondary indexes) enables performant observability tooling over workflow metadata.
- How Postgres transactions enable exactly-once execution guarantees for steps performing database operations.
Building Scalable Queues
It’s often useful to enqueue durable workflows for later execution. However, using a database table as a queue is tricky because of the risk of contention. To see why that’s a problem, let’s look at how database-backed queues work.
In a database-backed workflow queue, clients enqueue workflows by adding them to a queues table, and workers dequeue and process the oldest enqueued workflows (assuming a FIFO queue). Naively, each worker runs a query like this to find the N oldest enqueued workflows, then dequeues them:
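Here’s a sketch of that naive query, assuming a hypothetical workflow_queue table with id, status, and created_at columns (the table and column names are illustrative, not the library’s actual schema):

```sql
-- Illustrative only: workflow_queue and its columns are hypothetical.
-- Find the N (here, 10) oldest enqueued workflows.
SELECT id
FROM workflow_queue
WHERE status = 'ENQUEUED'
ORDER BY created_at
LIMIT 10;
```

The worker would then mark those rows as dequeued, for example by updating their status.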
The problem is that if you have many workers concurrently pulling new tasks from the queue, they all try to dequeue the same workflows. However, each workflow can only be dequeued by a single worker, so most workers will fail to find new work and have to try again. If there are enough workers, this contention creates a bottleneck in the system, limiting how rapidly tasks can be dequeued.
Fortunately, Postgres provides a solution: locking clauses. Here's an example of a query using them:
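(As before, the schema is hypothetical; the locking clause is the FOR UPDATE SKIP LOCKED at the end.)

```sql
-- FOR UPDATE locks the returned rows; SKIP LOCKED makes the query
-- pass over rows that another worker has already locked.
SELECT id
FROM workflow_queue
WHERE status = 'ENQUEUED'
ORDER BY created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
```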
Selecting rows in this way does two things. First, it locks the rows so that other workers cannot also select them. Second, it skips rows that are already locked, selecting not the N oldest enqueued workflows, but the N oldest enqueued workflows that are not already locked by another worker. That way, many workers can concurrently pull new workflows without contention. One worker selects the oldest N workflows and locks them, the second worker selects the next oldest N workflows and locks those, and so on.
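In practice, a worker would typically combine the locking select with the status update in a single statement, so that rows are claimed atomically. A sketch, again using the hypothetical schema:

```sql
-- Atomically claim the 10 oldest unclaimed workflows: the subquery
-- locks eligible rows (skipping rows locked by other workers), and the
-- outer UPDATE marks them as dequeued before the locks are released.
UPDATE workflow_queue
SET status = 'PENDING'
WHERE id IN (
    SELECT id
    FROM workflow_queue
    WHERE status = 'ENQUEUED'
    ORDER BY created_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

Run as a single statement, this guarantees each workflow is claimed by exactly one worker.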
Thus, by greatly reducing contention, Postgres enables a durable workflow system to process tens of thousands of workflows per second across thousands of workers.