A few weeks ago, we wrote that you should “just use Postgres” for durable workflows.
That post generated a lot of discussion, but also a misunderstanding. We didn't just mean you should use a workflow engine that stores state in Postgres. We meant your workflow system can, and often should, live inside the same Postgres database as your application.
At first glance, this doesn’t sound like a good idea. Shouldn’t those concerns be separated? Shouldn’t workflow state live in one database and application data in another?
Maybe not.
In distributed systems, co-location is a superpower. When workflow metadata and application data live in the same Postgres database, they can be updated in the same database transaction. That rules out a whole class of partial failures, such as application data being updated while the workflow's record of that update is lost. As a result, it becomes far easier to build workflows that correctly handle every edge case.
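To make the co-location argument concrete, here is a minimal sketch of one transaction that touches both kinds of state. It assumes an in-memory SQLite database as a stand-in for Postgres. The `accounts` and `workflow_status` tables and the `complete_workflow` and `snapshot` functions are hypothetical, and a real workflow engine's schema will differ.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres. The tables below are
# hypothetical: one holds application data, the other workflow metadata.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE workflow_status (workflow_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    INSERT INTO accounts VALUES ('alice', 0);
    INSERT INTO workflow_status VALUES ('wf-1', 'PENDING');
""")


def snapshot(conn):
    """Return (alice's balance, wf-1's status)."""
    balance = conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone()[0]
    status = conn.execute(
        "SELECT status FROM workflow_status WHERE workflow_id = 'wf-1'"
    ).fetchone()[0]
    return balance, status


def complete_workflow(conn, crash_midway=False):
    """Update application data and workflow metadata in one transaction."""
    # The connection's context manager commits if the block succeeds and
    # rolls back if it raises, so both writes land together or not at all.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'alice'")
        if crash_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE workflow_status SET status = 'SUCCESS' WHERE workflow_id = 'wf-1'"
        )


try:
    complete_workflow(conn, crash_midway=True)
except RuntimeError:
    pass
after_crash = snapshot(conn)    # (0, 'PENDING'): neither write survived

complete_workflow(conn)
after_success = snapshot(conn)  # (100, 'SUCCESS'): both writes committed
print(after_crash, after_success)
```

If the two tables lived in separate databases, the same crash could leave the balance updated while the workflow still reads as pending; here that state cannot be observed.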
In this post, we'll explain why that's possible, and how transactions can simplify tough problems like idempotency and atomicity.
Idempotency with Transactional Steps
One fundamental challenge in distributed systems is idempotency, especially for operations that modify database state.
Durable workflows achieve fault tolerance by checkpointing the result of each step after it completes. If a workflow is interrupted, it resumes from its last checkpointed step instead of starting from the beginning. However, a workflow may be interrupted after completing a step but before recording its checkpoint. When it recovers, it has no record that the step already ran and will execute it again.
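The gap is easy to reproduce. The sketch below is a hypothetical, in-memory step runner (not any real engine's API) that executes a step and only afterwards records its checkpoint. A simulated crash between those two actions means recovery finds no checkpoint and runs the step a second time.

```python
# Hypothetical, in-memory stand-ins: a real engine persists checkpoints in a
# database, and real steps have external side effects.
checkpoints = {}   # (workflow_id, step_id) -> recorded step result
side_effects = []  # one entry each time the step's side effect happens


class SimulatedCrash(Exception):
    pass


def run_step(workflow_id, step_id, step_fn, crash_before_checkpoint=False):
    key = (workflow_id, step_id)
    if key in checkpoints:
        return checkpoints[key]      # already checkpointed: reuse the result
    result = step_fn()               # 1. run the step (side effect happens here)
    if crash_before_checkpoint:
        raise SimulatedCrash()       # interrupted before step 2
    checkpoints[key] = result        # 2. record the checkpoint
    return result


def step_with_side_effect():
    side_effects.append("effect")
    return "done"


try:
    run_step("wf-1", 1, step_with_side_effect, crash_before_checkpoint=True)
except SimulatedCrash:
    pass

# Recovery: no checkpoint exists for step 1, so it executes again.
run_step("wf-1", 1, step_with_side_effect)
print(len(side_effects))  # 2: the side effect happened twice
```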
As a result, durable workflows alone do not solve the idempotency problem. Workflow engines typically require steps to be idempotent so they can safely be retried without duplicate side effects. For example, consider a step that credits (adds money to) a bank account. This is not an idempotent operation: if the step adds $100 to an account, is interrupted before its checkpoint is recorded, then reruns and adds $100 again, the account receives $200 in total, which is wrong.
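Co-location closes this gap for steps that write to the database. If the credit and its checkpoint are written in the same transaction, they commit or roll back together, so a committed credit always has a committed checkpoint and a retry can skip it. Below is a minimal sketch, again assuming SQLite as a stand-in for Postgres. The `step_checkpoints` table and the `credit_step` and `balance` functions are hypothetical, not any engine's actual schema or API.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Postgres; the schema and the
# credit_step function are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE step_checkpoints (
        workflow_id TEXT NOT NULL,
        step_id INTEGER NOT NULL,
        PRIMARY KEY (workflow_id, step_id)
    );
    INSERT INTO accounts VALUES ('alice', 0);
""")


def balance(conn, account):
    return conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account,)
    ).fetchone()[0]


def credit_step(conn, workflow_id, step_id, account, amount, crash_before_commit=False):
    """Apply a credit and record its checkpoint in one transaction.

    Returns True if the credit was applied now, False if a committed
    checkpoint shows it already ran.
    """
    already_done = conn.execute(
        "SELECT 1 FROM step_checkpoints WHERE workflow_id = ? AND step_id = ?",
        (workflow_id, step_id),
    ).fetchone()
    if already_done:
        return False
    with conn:  # commit if the block succeeds, roll back if it raises
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, account)
        )
        # The primary key rejects a second checkpoint for the same step; that
        # error would also roll back the credit above.
        conn.execute(
            "INSERT INTO step_checkpoints (workflow_id, step_id) VALUES (?, ?)",
            (workflow_id, step_id),
        )
        if crash_before_commit:
            raise RuntimeError("simulated crash before commit")
    return True


try:
    credit_step(conn, "wf-1", 1, "alice", 100, crash_before_commit=True)
except RuntimeError:
    pass
balance_after_crash = balance(conn, "alice")               # 0: both writes rolled back
first_retry = credit_step(conn, "wf-1", 1, "alice", 100)   # True: the step runs
second_retry = credit_step(conn, "wf-1", 1, "alice", 100)  # False: skipped
print(balance_after_crash, first_retry, second_retry, balance(conn, "alice"))
```

The design choice that matters is that the checkpoint is part of the step's own transaction: there is no moment at which the credit is committed but its checkpoint is not, so the account ends at $100 no matter how many times the step is retried.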