Bruin Agent: How I Survived (and Thrived) in the Zombie Apocalypse
Gather round, everyone. The bonfire's warm, but what I'm about to tell you will freeze you anyway: the night I went toe-to-toe with zombies and somehow came out on top.
Bruin is an open-source data platform that brings together data ingestion, transformation, quality, and governance. We run different types of data workloads for our customers, allowing them to extract insights from their data without having to deal with the boring infrastructure problems.
Our customers typically write SQL and Python definitions, which we call "assets", and we build pipelines around these assets. When the time comes to execute these workloads, we take care of provisioning the infrastructure, running them, and gathering observability data for our customers.
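To make that vocabulary concrete, here is a minimal sketch in Go of how SQL and Python assets might be grouped into a pipeline. The types and fields (Asset, Pipeline, DependsOn, and so on) are hypothetical illustrations, not Bruin's actual internal model:

```go
package main

import "fmt"

// AssetType distinguishes the kinds of definitions customers write.
type AssetType string

const (
    AssetTypeSQL    AssetType = "sql"
    AssetTypePython AssetType = "python"
)

// Asset is a single customer-provided workload definition.
// Field names are illustrative, not Bruin's real schema.
type Asset struct {
    Name      string
    Type      AssetType
    Content   string   // the SQL query or Python script
    DependsOn []string // upstream asset names
}

// Pipeline groups assets so they can be scheduled and executed together.
type Pipeline struct {
    Name   string
    Assets []Asset
}

func main() {
    p := Pipeline{
        Name: "marketing",
        Assets: []Asset{
            {Name: "raw.events", Type: AssetTypeSQL, Content: "select * from source.events"},
            {Name: "reports.daily", Type: AssetTypePython, Content: "print('daily report')", DependsOn: []string{"raw.events"}},
        },
    }
    fmt.Printf("pipeline %q has %d assets\n", p.Name, len(p.Assets))
}
```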
Traditionally, these workloads have run entirely within Bruin-owned cloud environments. We provision multi-tenant environments across different cloud providers and regions, and we distribute the workloads accordingly. The infrastructure is owned and maintained by us, and from our clients' perspective it is a fully serverless experience.
While this has worked great so far, there has always been demand for running these workloads on our clients' own infrastructure. Due to regulatory requirements, special networking and infrastructure constraints, or purely for peace of mind, our clients might prefer to run all of their workloads on infrastructure they control. To satisfy these needs, we have been working on a distributed execution topology where Bruin Cloud hosts the control plane and our customers host the data plane. We call the control plane "Orchestrator", henceforth “oXr”, and the individual runners "Bruin Agent".
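To sketch what that topology looks like from the data-plane side, here is a simplified Go example of an Agent polling the hosted control plane for work. The endpoint path, payload shape, auth scheme, and polling interval are hypothetical placeholders, not the real oXr API:

```go
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// Task is a unit of work handed out by the control plane.
// The shape is illustrative only.
type Task struct {
    ID      string `json:"id"`
    Asset   string `json:"asset"`
    Command string `json:"command"`
}

// pollOnce asks the control plane for the next task assigned to this agent.
// The endpoint and auth scheme are made-up placeholders.
func pollOnce(client *http.Client, controlPlaneURL, agentToken string) (*Task, error) {
    req, err := http.NewRequest(http.MethodGet, controlPlaneURL+"/v1/agent/tasks/next", nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+agentToken)

    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusNoContent {
        return nil, nil // nothing to do right now
    }
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status: %s", resp.Status)
    }

    var t Task
    if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
        return nil, err
    }
    return &t, nil
}

func main() {
    client := &http.Client{Timeout: 30 * time.Second}
    for {
        task, err := pollOnce(client, "https://oxr.example.com", "agent-token")
        if err != nil {
            fmt.Println("poll failed:", err)
        } else if task != nil {
            fmt.Println("picked up task", task.ID, "for asset", task.Asset)
            // ... execute the workload and report results back ...
        }
        time.Sleep(5 * time.Second)
    }
}
```

One appeal of a pull-based model like this is that the customer's network never has to accept inbound connections; the Agent only makes outbound calls to the control plane.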
Building these individual pieces has been an interesting journey, plagued with challenges such as authentication, error handling, keeping task statuses in sync across scheduling → oXr → Agent, and reporting task results, all in a multi-tenant environment.
At some point, we noticed that certain failed task attempts came with no logs.
Logs are supposed to be produced by Agent and sent back to oXr for collection, but these tasks had none. We dug into the “history” of these tasks: oXr logged that they were picked up, but logs were never sent, and the tasks were never heard from again, leading oXr to mark them as… ZOMBIE TASKS. These are a special type of failure where oXr loses contact with the Agent running the task: no heartbeats, no logs, nothing.
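Conceptually, zombie detection comes down to a heartbeat deadline: a task counts as running only as long as the Agent keeps reporting in. The Go sketch below is our own paraphrase of that idea, not oXr's actual implementation; the names, statuses, and threshold are made up:

```go
package main

import (
    "fmt"
    "time"
)

// TaskState tracks what the control plane last heard about a running task.
type TaskState struct {
    ID            string
    Status        string    // e.g. "running", "succeeded", "failed", "zombie"
    LastHeartbeat time.Time // last time the Agent reported in for this task
}

// zombieThreshold is an illustrative cutoff; a real value would be tuned.
const zombieThreshold = 5 * time.Minute

// markZombies flags running tasks whose Agent has gone silent for too long.
func markZombies(tasks []*TaskState, now time.Time) []*TaskState {
    var zombies []*TaskState
    for _, t := range tasks {
        if t.Status == "running" && now.Sub(t.LastHeartbeat) > zombieThreshold {
            t.Status = "zombie"
            zombies = append(zombies, t)
        }
    }
    return zombies
}

func main() {
    now := time.Now()
    tasks := []*TaskState{
        {ID: "task-1", Status: "running", LastHeartbeat: now.Add(-1 * time.Minute)},
        {ID: "task-2", Status: "running", LastHeartbeat: now.Add(-20 * time.Minute)},
    }
    for _, z := range markZombies(tasks, now) {
        fmt.Println("no heartbeats, no logs:", z.ID, "is now a zombie")
    }
}
```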
This was very strange. Looking at the logs, it seemed that while oXr reported the task as picked up, Agent timed out. Since these are different log streams, it was hard to confirm. To verify, we made Agent generate and log a random request ID and pass it along using the standard HTTP header `X-Request-Id`. We also logged this ID on the oXr side so we could match the two streams and see the complete request lifecycle.
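In Go, the Agent-side change is small: generate an ID, log it, and attach it as the `X-Request-Id` header, while the server logs the same header. The sketch below uses a local httptest server to stand in for oXr; the endpoint path and log format are placeholders:

```go
package main

import (
    "crypto/rand"
    "encoding/hex"
    "log"
    "net/http"
    "net/http/httptest"
)

// newRequestID generates a random ID we can log on both sides of a call.
func newRequestID() string {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        panic(err)
    }
    return hex.EncodeToString(b)
}

func main() {
    // Stand-in for an oXr endpoint: log the incoming X-Request-Id.
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        log.Printf("oXr: handling request id=%s path=%s", r.Header.Get("X-Request-Id"), r.URL.Path)
        w.WriteHeader(http.StatusOK)
    }))
    defer server.Close()

    // Agent side: generate the ID, log it, and send it with the request.
    reqID := newRequestID()
    log.Printf("agent: sending request id=%s", reqID)

    req, err := http.NewRequest(http.MethodPost, server.URL+"/v1/tasks/pickup", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("X-Request-Id", reqID)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        // A timeout here would show up only in the Agent's logs,
        // which is exactly the gap the shared request ID closes.
        log.Fatalf("agent: request id=%s failed: %v", reqID, err)
    }
    defer resp.Body.Close()
    log.Printf("agent: request id=%s completed with status %s", reqID, resp.Status)
}
```

With the same ID appearing in both log streams, a single search for it reconstructs the full lifecycle of a request, even when one side times out and the other thinks everything went fine.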