While I was working on a project with Wayfair, I got the opportunity to build a system that generated daily business reports aggregated from multiple data sources flowing through event streams across the company. At a high level, Kafka consumers listened to these events, hydrated them with additional data by calling downstream services, and finally persisted the enriched events into a durable datastore: CloudSQL PostgreSQL on GCP.
When everything was healthy, the pipeline worked exactly as expected. Events flowed in, got enriched, and were stored reliably. The real challenge started when things went wrong, which, in distributed systems, is not an exception but a certainty.
There were multiple failure scenarios we had to deal with. Sometimes the APIs we depended on for hydration were down or slow. Sometimes the consumer itself crashed midway through processing. In other cases, events arrived with missing or malformed fields that could not be processed safely. These were all situations outside our direct control, but they still needed to be handled gracefully.
This is where the concept of a Dead Letter Queue (DLQ) came into the picture. Whenever we knew an event could not be processed successfully, instead of dropping it or blocking the entire consumer, we redirected it to a DLQ so it could be inspected and potentially reprocessed later.
Our first instinct was to use Kafka itself as a DLQ. While this is a common pattern, it quickly became clear that it wasn't a great fit for our needs. Kafka is excellent for moving data, but once messages land in a DLQ topic, they are not particularly easy to inspect. Querying by failure reason, retrying a specific subset of events, or even answering simple questions like "what failed yesterday and why?" required extra tooling and custom consumers. For a system that powered business-critical daily reports, this lack of visibility was a serious drawback.
That's when we decided to treat PostgreSQL itself as the Dead Letter Queue.
Instead of publishing failed events to another Kafka topic, we persisted them directly into a DLQ table in PostgreSQL. We were already using CloudSQL as our durable store, so operationally this added very little complexity. Conceptually, it also made failures first-class citizens in the system rather than opaque messages lost in a stream.
Whenever an event failed processing—due to an API failure, consumer crash, schema mismatch, or validation error—we stored the raw event payload along with contextual information about the failure. Each record carried a simple status field. When the event first landed in the DLQ, it was marked as PENDING. Once it was successfully reprocessed, the status was updated to SUCCEEDED. Keeping the state model intentionally minimal made it easy to reason about the lifecycle of a failed event.
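As a rough sketch of that lifecycle, assuming a DLQ table named dlq_events with payload, failure, and status columns (the names here are placeholders for illustration; the actual table layout is discussed in the next section):

    -- A failed event lands in the DLQ as PENDING, together with its failure context.
    INSERT INTO dlq_events (source_topic, event_payload, failure_type, failure_reason, status)
    VALUES (
        'order-events',
        '{"order_id": 1234, "warehouse": "DE-01"}'::jsonb,      -- raw event payload, stored as-is
        'API_FAILURE',
        'hydration call to downstream service timed out',        -- contextual information for debugging
        'PENDING'
    );

    -- Once reprocessing succeeds, the same row is flipped to SUCCEEDED.
    UPDATE dlq_events
    SET status = 'SUCCEEDED', updated_at = now()
    WHERE id = 42;  -- id of the reprocessed DLQ record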
DLQ Table Schema and Indexing Strategy
To support inspection, retries, and long-term operability, the DLQ table was designed to be simple, query-friendly, and retry-aware.
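A minimal DDL sketch along those lines is shown below; the column names and index choices are my own assumptions to illustrate the idea, not the production schema:

    -- Illustrative DLQ table: raw payload, failure context, retry bookkeeping, minimal status model.
    CREATE TABLE IF NOT EXISTS dlq_events (
        id             BIGSERIAL PRIMARY KEY,
        source_topic   TEXT        NOT NULL,                      -- Kafka topic the event came from
        event_payload  JSONB       NOT NULL,                      -- raw event, stored as-is
        failure_type   TEXT        NOT NULL,                      -- e.g. API_FAILURE, VALIDATION_ERROR, SCHEMA_MISMATCH
        failure_reason TEXT,                                      -- human-readable context for debugging
        retry_count    INT         NOT NULL DEFAULT 0,            -- how many reprocessing attempts have been made
        status         TEXT        NOT NULL DEFAULT 'PENDING',    -- PENDING or SUCCEEDED
        created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
        updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    -- The retry job only scans PENDING rows, so a partial index keeps that scan cheap.
    CREATE INDEX IF NOT EXISTS idx_dlq_events_pending
        ON dlq_events (created_at)
        WHERE status = 'PENDING';

    -- Supports "what failed yesterday and why?" style queries by failure type and time.
    CREATE INDEX IF NOT EXISTS idx_dlq_events_failure_type_created_at
        ON dlq_events (failure_type, created_at);

With a shape like this, inspection and retries become ordinary SQL: filter on status and created_at to find what is still pending, or group by failure_type to see which failure modes dominate on a given day.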