Much of the developer world is familiar with the AWS outage in us-east-1 on October 20th, caused by a race condition bug inside a DNS management service. The backlog of events we needed to process after that outage stretched our system to its limits, so we decided to increase our headroom for event-handling throughput. When we attempted that infrastructure upgrade on October 23rd, we ran into yet another race condition bug, this time in Aurora RDS. This is the story of how we figured out it was an AWS bug (later confirmed by AWS) and what we learned.
Background
The Hightouch Events product enables organizations to gather and centralize user behavioral data such as page views, clicks, and purchases. Customers can set up syncs to load events into a cloud data warehouse for analytics or stream them directly to marketing, operational, and analytics tools to support real-time personalization use cases.
Here is the portion of Hightouch’s architecture dedicated to our events system:
[Image: Hightouch events system architecture]
Our system scales on three levers: Kubernetes clusters that contain event collectors and batch workers, Kafka for event processing, and Postgres as our virtual queue metadata store.
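To make the "virtual queue" pattern concrete, here is a minimal sketch of how a batch worker might claim pending work from a Postgres metadata table. The table name (sync_batches), columns, and connection setup are hypothetical illustrations, not our actual schema or code.

```typescript
// Illustrative only: a hypothetical sync_batches table tracks batches of
// events waiting to be loaded into a destination. Workers claim the oldest
// pending batch with SELECT ... FOR UPDATE SKIP LOCKED so multiple workers
// can pull from the virtual queue without contending on the same row.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function claimNextBatch(workerId: string) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const { rows } = await client.query(
      `SELECT id, destination_id
         FROM sync_batches
        WHERE status = 'pending'
        ORDER BY created_at
        FOR UPDATE SKIP LOCKED
        LIMIT 1`
    );
    if (rows.length === 0) {
      await client.query("COMMIT");
      return null; // nothing to claim right now
    }
    await client.query(
      `UPDATE sync_batches
          SET status = 'processing', claimed_by = $1, claimed_at = now()
        WHERE id = $2`,
      [workerId, rows[0].id]
    );
    await client.query("COMMIT");
    return rows[0];
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

The design choice this illustrates is that Kafka carries the event payloads while Postgres only coordinates which worker owns which batch, which is why Postgres capacity becomes a scaling lever alongside Kubernetes and Kafka.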
When our pagers went off during the AWS outage on the 20th, we observed:
Services were unable to connect to Kafka brokers managed by AWS MSK.
Services struggled to autoscale because we couldn’t provision new EC2 nodes.
Customer functions for real-time data transformation were unavailable due to AWS STS errors, which caused our retry queues to balloon in size.