Stategraph: Terraform state as a distributed systems problem

Why We're Building Stategraph: Terraform State as a Distributed Systems Problem

TL;DR
• Terraform state shows distributed coordination issues but uses file primitives.
• File blob (100% read/lock) vs. change cone (~3%).
• Stategraph → graph state, ACID transactions, subgraph isolation.

The Terraform ecosystem has spent a decade working around a fundamental architectural mismatch: we're using filesystem semantics to solve a distributed systems problem. The result is predictable and painful.

When we started building infrastructure automation at scale, we discovered that Terraform's state management exhibits all the classic symptoms of impedance mismatch between data representation and access patterns. Teams implement increasingly elaborate workarounds: state file splitting, wrapper orchestration, external locking mechanisms. These aren't solutions; they're evidence that we're solving the wrong problem.

Stategraph addresses this by treating state for what it actually is: a directed acyclic graph of resources with partial update semantics, not a monolithic document.
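
To make the graph framing concrete, here is a minimal Python sketch. It is not Stategraph's implementation. The resource names and dependency edges are hypothetical, and `change_cone` is an illustrative helper that assumes "change cone" means the changed resources plus everything that transitively depends on them.

```python
from collections import deque

# Hypothetical dependency edges: resource -> resources it depends on.
DEPENDS_ON = {
    "vpc": [],
    "subnet": ["vpc"],
    "sg": ["vpc"],
    "rds": ["subnet", "sg"],
    "alb": ["subnet", "sg"],
    "asg": ["subnet", "alb"],
}

def dependents(graph):
    """Invert the edges: resource -> resources that depend on it."""
    rev = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            rev[dep].append(name)
    return rev

def change_cone(graph, changed):
    """Return the changed resources plus everything downstream of them."""
    rev = dependents(graph)
    cone = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in rev[node]:
            if child not in cone:
                cone.add(child)
                queue.append(child)
    return cone

print(sorted(change_cone(DEPENDS_ON, ["alb"])))  # only alb and asg
print(sorted(change_cone(DEPENDS_ON, ["vpc"])))  # every resource
```

In this toy graph, a change to the load balancer touches two of six resources, while a monolithic document would force a read and lock of all six. The gap widens as the graph grows.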

The Pathology of File-Based State

Terraform state, at its core, is a coordination problem. Multiple actors (engineers, CI systems, drift detection) need to read and modify overlapping subsets of infrastructure state concurrently. This is a well-studied problem in distributed systems, with established solutions around fine-grained locking, multi-version concurrency control, and transaction isolation.

Instead, Terraform implements the simplest possible solution: a global mutex on a JSON file.

Observation: The probability of lock contention in a shared state file increases super-linearly with both team size and resource count. At 100 resources and 5 engineers, you're coordinating 500 potential interaction points through a single mutex.

Consider the actual data access patterns in a typical Terraform operation:

[Figure: two panels. Current Model: tfstate.json (2.3MB); Read: 100%, Lock: 100%, Modify: 0.5%. Actual Requirement: resource nodes VPC, Subnet, RDS, ALB, ASG, SG; Read: 3%, Lock: 3%, Modify: 3%.]

This mismatch between granularity of operation and granularity of locking is the root cause of every Terraform scaling problem. It violates the fundamental principle of isolation in concurrent systems: non-overlapping operations should not block each other.

The standard response, splitting state files, doesn't solve the problem. It redistributes it. Now you have N coordination problems instead of one, plus the additional complexity of managing cross-state dependencies. You've traded false contention for distributed transaction coordination, which is arguably worse.
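
The isolation principle above, that non-overlapping operations should not block each other, can be sketched with a toy in-memory lock table. This is an illustration under stated assumptions, not how Terraform or Stategraph implements locking: `SubgraphLockTable`, the operation IDs, and the resource names are all hypothetical, and a real system would back this with a transactional store rather than a Python dict.

```python
import threading

class SubgraphLockTable:
    """Toy lock table: each operation locks only the resources it touches."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the table itself
        self._owners = {}               # resource name -> operation id

    def try_acquire(self, op_id, resources):
        """Lock all requested resources atomically, or none if any is held."""
        with self._guard:
            if any(r in self._owners for r in resources):
                return False
            for r in resources:
                self._owners[r] = op_id
            return True

    def release(self, op_id):
        """Drop every lock held by the given operation."""
        with self._guard:
            self._owners = {
                r: owner for r, owner in self._owners.items() if owner != op_id
            }

table = SubgraphLockTable()
a = table.try_acquire("op-a", {"alb", "asg"})  # change to the load balancer tier
b = table.try_acquire("op-b", {"rds"})         # disjoint change to the database
c = table.try_acquire("op-c", {"asg"})         # overlaps op-a, so it is refused
print(a, b, c)  # True True False
```

Under a single global mutex, `op-b` would have waited behind `op-a` even though the two share no resources. Here only `op-c`, which genuinely overlaps, is refused until `op-a` releases.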
