
Why we chose OCaml to write Stategraph

Tags: OCaml, Type Systems, Functional Programming, Infrastructure, Stategraph

Josh Pollara • November 6th, 2025

TL;DR

• Stategraph manages Terraform state, so correctness isn't optional
• Strongly-typed data structures catch field errors at compile time
• Type-safe SQL queries prevent schema drift before deployment
• Immutability by default eliminates race conditions
• PPX generates correct JSON serialization automatically

We're building infrastructure that manages other people's infrastructure. State corruption can't be "rare." It has to be impossible. That's why we chose OCaml.

Stategraph stores Terraform state as a dependency graph in PostgreSQL with resource-level locking. The challenge isn't building a database-backed state store. The challenge is ensuring that concurrent operations from multiple users can never corrupt state, that database schema changes break the build instead of production, and that JSON transformations are correct.
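
To make that concrete, here is a purely illustrative sketch of what "state as a dependency graph with resource-level locking" can look like as OCaml types. These are hypothetical types for illustration, not Stategraph's actual data model: each resource is a node keyed by its address, edges record dependencies, and a lock is held per resource rather than per state file.

(* Hypothetical types for illustration only, not Stategraph's real schema. *)
type resource_address = string (* e.g. "aws_instance.web" *)

type lock_holder = {
  operation_id : string; (* the Terraform run currently holding the lock *)
  acquired_at : float;   (* Unix timestamp when the lock was taken *)
}

type node = {
  address : resource_address;
  depends_on : resource_address list; (* edges of the dependency graph *)
  lock : lock_holder option;          (* None = unlocked; per resource, not per state file *)
}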

We chose OCaml because its type system catches entire categories of bugs at compile time that would require extensive testing and still slip through in other languages.

Type-safe data structures

Here's a scenario every infrastructure engineer has seen. Two Terraform operations run concurrently and both read a resource in an active state. One updates it while the other destroys it. Without proper coordination, you risk marking the resource as destroyed in state while it's still being modified in the cloud. Most systems handle this defensively with locks and runtime validation, but race conditions are hard to test, and the resulting state corruption usually appears in production, not CI.

Stategraph tackles this in two ways. Immutability and database-level locking prevent concurrent writes from corrupting state, while OCaml's type system makes the underlying data structures themselves safer by construction. Resources, outputs, and instances are all defined as strongly-typed records, so you can't access a field that doesn't exist or mix up field types. The compiler enforces correctness before anything runs.

state.ml

type t = {
  lineage : string;
  outputs : Outputs.t option;
  resources : Resources.t;
  serial : int;
  terraform_version : string;
  version : int;
}

If you try to access state.versions (a typo) instead of state.version, you get a compiler error. If you try to assign a string to serial, you get a compiler error. If you forget to handle None in the outputs field, you get a compiler error from exhaustiveness checking.

This extends throughout the codebase. Every Terraform resource type, every state transition, and every database record is strongly typed. The compiler catches entire categories of bugs at compile time, like accessing non-existent fields, missing null checks, or database schema mismatches.
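
To illustrate that last point, here is a minimal sketch, assuming the record above lives in a State module and that Outputs exposes a to_string function (an assumption made only for this example), of how the option type forces callers to handle the missing-outputs case:

let describe_outputs (state : State.t) =
  match state.State.outputs with
  | Some outputs -> Outputs.to_string outputs (* hypothetical accessor *)
  | None -> "(no outputs recorded)"
(* Delete the None branch and the compiler flags the match as non-exhaustive,
   so the unhandled case is caught before anything runs. *)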

The database schema drift problem

You're iterating on your database schema by renaming a column, changing a type, or adding a constraint. In most languages, you update the schema, deploy the migration, and hope you caught all the queries that reference the old structure. You didn't, because a query somewhere references the old column name. It works in dev with the old schema but crashes in staging with the new schema.

Stategraph uses typed SQL, where every query declares explicit types for its parameters and return values. When you change a query's type signature, every call site in the codebase must be updated to match, and the compiler enforces this.

ingestion.ml

let insert_resource_sql () =
  Pgsql_io.Typed_sql.(
    sql
    // Ret.bigint
    /^ "INSERT INTO resources (state_id, mode, type, name, provider_id, module_) VALUES ($state_id, $mode, $type, $name, $provider_id, $module_) RETURNING id"
    /% Var.uuid "state_id"
    /% Var.text "mode"
    /% Var.text "type"
    /% Var.text "name"
    /% Var.uuid "provider_id"
    /% (Var.option @@ Var.text "module_"))

This query expects specific types. The state_id must be a UUID, mode must be text, and module_ is optional text. The return value is typed as bigint. If you try to pass a string where a UUID is expected, you get a compiler error. If you forget to handle the optional return value, you get a compiler error.

When you update a query to match a new schema, the type system ensures every place that calls that query gets updated too. You can't deploy code where query definitions and their usage are out of sync.
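
To show the mechanism in miniature, here is a toy sketch of the same idea. It is not the Pgsql_io combinators used above; the point is only that parameter and result types live in the query value's type, so a schema change that alters the query's type breaks every call site at compile time.

(* Toy illustration only; the real combinators are Pgsql_io's, shown above. *)
type ('params, 'row) query = {
  text : string;                   (* the SQL text *)
  encode : 'params -> string list; (* typed parameters -> wire values *)
  decode : string list -> 'row;    (* wire row -> typed result *)
}

(* Before a migration: resources are looked up by name alone. *)
let find_resource : (string, int64) query = {
  text = "SELECT id FROM resources WHERE name = $1";
  encode = (fun name -> [ name ]);
  decode = (function [ id ] -> Int64.of_string id | _ -> failwith "unexpected row");
}

(* If the schema changes so the key becomes (state_id, name), the query's type
   becomes (string * string, int64) query, and every caller that still passes a
   bare string stops compiling. *)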

JSON transformations that can't lose data

Stategraph ingests Terraform state as JSON, normalizes it into a graph, stores it in PostgreSQL, and reconstructs it back to JSON when Terraform requests it. Every transformation is a place where data can get lost or corrupted, whether from a field you forgot to serialize, a nested structure you flattened incorrectly, or a type that doesn't round-trip. Testing can catch some of this, and round-trip tests help, but you're fundamentally relying on test coverage. Missed cases show up when someone's Terraform state comes back missing a field.

OCaml has a feature called PPX (preprocessor extensions) that generates serialization code automatically. You define the type, and the serializer is generated from the type definition.

resource_types.ml

type aws_instance = {
  instance_id : string;
  instance_type : string;
  ami : string;
  availability_zone : string option;
  tags : (string * string) list;
} [@@deriving yojson]

When you add a field, the serializer is regenerated. When you change a type, the serializer is regenerated. If you forget to handle a case, the exhaustiveness checker catches it at compile time. You don't write serialization tests because the type system guarantees serialization is correct.

This is how Stategraph handles Terraform's resource types. Every AWS resource, every GCP resource, and every Azure resource is an OCaml type with automatic JSON serialization. We don't write serialization code. We don't test round-trips manually. The type system handles it.
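
As a hedged sketch of what the derived code gives you: assuming the attribute above comes from ppx_deriving_yojson, it generates aws_instance_to_yojson and aws_instance_of_yojson, so a round trip needs no hand-written serialization. The concrete values below are made up for illustration.

(* Assumes the aws_instance type above with [@@deriving yojson] via
   ppx_deriving_yojson; the values are illustrative only. *)
let round_trip_example () =
  let inst = {
    instance_id = "i-0123456789abcdef0";
    instance_type = "t3.micro";
    ami = "ami-12345678";
    availability_zone = Some "us-east-1a";
    tags = [ ("Name", "web-1") ];
  } in
  match aws_instance_of_yojson (aws_instance_to_yojson inst) with
  | Ok decoded -> assert (decoded = inst) (* every field survives the round trip *)
  | Error msg -> failwith msg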
