*I'd say this is a setup for a joke later on in the blog post, except the joke doesn't even make sense, so I don't really know what this is. Generated by ChatGPT.*

There's a semi-well-known adage in software development that says when you have a hard code change, you should "first make the hard change easy, and then make the easy change." In other words, refactor the code (or do whatever else you need to do) to simplify the change you're trying to make before trying to make the change. This is especially good advice if you have a good automated test suite: you can first do your refactor without changing any program behaviour, and be relatively confident that you haven't introduced any bugs or regressions; then you can make a smaller change that actually introduces your new feature.

Today I'm going to talk about how I followed the opposite of this advice for SimKube: I first made the easy change hard, and then I made the hard change. Because I can never do anything easy. Anyways, this is going to be a fairly technical post, with a bit of a deep dive into SimKube internals and also a bunch of Rust programming techniques, so I hope you've had your coffee!

## Who owns the owned objects?

First, a quick overview of the problem I needed to solve: in Kubernetes, resources can be owned by other resources. For example, a Deployment owns zero or more ReplicaSets, and each ReplicaSet owns zero or more Pods. These relationships are tracked by setting an `OwnerReference` on the owned resource; if you wanted to know what Deployment a Pod belongs to, you would follow the chain of `OwnerReference`s from the Pod to the ReplicaSet to the Deployment.

Recall that in SimKube, there is a component called the tracer, which lives in your production Kubernetes cluster and watches for changes to the resources in that cluster. You can then export those changes to a trace file for replay in your simulated environment. In the tracer, you can configure what resources to watch like this:

```yaml
trackedObjects:
  apps/v1.Deployment:
    podSpecTemplatePaths:
      - /spec/template
```

The above config file tells the tracer to watch for changes to Deployments in the cluster. In this case, when you replay a trace in your simulated environment, SimKube would create all of the Deployments specified in the trace file, and then the controller manager in your simulated cluster would create the ReplicaSets and Pods that are owned by those Deployments.

This is all fine and good, but what would happen if you had a config file that looked like this?

```yaml
trackedObjects:
  apps/v1.Deployment:
    podSpecTemplatePaths:
      - /spec/template
  apps/v1.ReplicaSet:
    podSpecTemplatePaths:
      - /spec/template
```

Now the tracer would record changes to all Deployments and ReplicaSets in the cluster; when we replay this trace in the simulated environment, SimKube would create all of the Deployments, as well as all of the ReplicaSets owned by those Deployments. At the same time, the controller manager in the simulation cluster would create a second copy of the ReplicaSets owned by those Deployments. Whoops! Now we have two copies of our ReplicaSets. That's no good.

Normally there would be no reason to have a config file like this; you could just track the "root objects" and call it a day. But now imagine a more complicated scenario, where you have a custom resource that you want to simulate (say, an Airflow DAG resource) that creates Deployments; now you're in a situation where some Deployments in your trace have owners and other Deployments are the root objects.
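To make that distinction concrete, here's a hypothetical kube-rs-flavoured sketch (the names here are mine for illustration, not SimKube's actual code) of the check we'd like to be able to make: look at an object's `ownerReferences` and ask whether any owner is a resource type we're also tracking.

```rust
// Hypothetical sketch, not SimKube's actual code: given an object's metadata,
// check whether it's owned by some other resource type the tracer is tracking.
use std::collections::HashSet;

use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ObjectMeta, OwnerReference};

/// `tracked` holds (apiVersion, kind) pairs, e.g. ("apps/v1", "Deployment").
fn owned_by_tracked_object(meta: &ObjectMeta, tracked: &HashSet<(String, String)>) -> bool {
    meta.owner_references
        .as_deref()
        .unwrap_or(&[])
        .iter()
        .any(|o: &OwnerReference| tracked.contains(&(o.api_version.clone(), o.kind.clone())))
}
```

An object with no owners (or whose owners aren't in the tracked set) is a root object; anything else would get recreated by its controller during replay.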
So this is the core problem I was trying to solve: in the tracer, we should be able to tell the difference between these two types of resources using the `OwnerReference` field; if an object that the tracer is tracking is owned by some other object that the tracer is tracking, we shouldn't store it in the exported trace.

## Arcs, Mutexes, and Channels, oh my!

To understand how to fix this bug, we need to know a little bit about the internals of the tracer. Here's how it looks:

Because the resources the tracer is watching are configured at runtime, we use the dynamic object API to set watches on the Kubernetes API server for these resources; there are n of these watchers, one for each configured resource type. There is also one watcher that tracks changes to Pod objects, for the purposes of recording Pod lifecycle events (creation, deletion, etc.). Each of these watchers communicates updates to the `TraceStore`, which is the in-memory storage backend for the tracer. Lastly, there is a Rocket server that exposes an `/export` endpoint on the `TraceStore` so that users can export a trace to long-term storage.

The way this used to work is that each of the watchers (as well as the Rocket server) held a reference to the store through an `Arc<Mutex<TraceStore>>`. If you've done any async Rust programming, you'll recognize this pattern; it's a way for objects on multiple threads or coroutines to safely read and write to a shared singleton. In the old architecture, the Pod watcher also had a cache of `OwnerReference`s, so that we only track lifecycle data for Pods that are owned by some higher-level resource in the trace.

Now we can get to the change I wanted to make: we want to look up owners for all resources, not just Pods. Easy change, right? Just stick the ownership cache in the `TraceStore` and call it a day! Boom, done.

I'm sure if you've done much async Rust programming, you see the problem with this plan, but let's walk through it anyways. Specifically: why do we have a cache for the owner references? Looking up a recursive chain of owners to find the root owner involves repeated calls to the Kubernetes API server, which a) could add additional load to the server, and b) will block other threads or coroutines while we hold an exclusive lock on the `TraceStore`. So instead, we look up the ownership chain once and then cache it, because ownership doesn't change. This was approximately fine when the Pod watcher was the only thing that needed access to the ownership cache, but now every resource update is going to potentially involve a bunch of API calls if there's a cache miss, so we're back to square one with regards to our mutex locks.

The solution here, as I'm sure you've already seen, is to remove the `TraceStore` handles from each of the watchers, and instead communicate with the store via channels. Et voilà, we've just made the easy change hard!

## The Dude Abides Refactors

Yeah, I dunno where that headline came from either. Anyways, let's see what our new proposed architecture is going to look like:

Instead of having each watcher hold an `Arc<Mutex<TraceStore>>`, the watchers now hold the write end of an mpsc channel (in tokio, an `mpsc::Sender`), which they use to send messages to the `TraceStore`. This decouples the watchers from the store: the watchers can now get updates from the Kubernetes API server as fast as it can send them, without blocking on any "expensive" operations that the `TraceStore` needs to do. If you've done any Go programming, this is the standard pattern for goroutine communication.
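Here's a minimal, self-contained sketch of that pattern in tokio (the message type and names are illustrative, not SimKube's actual types): each watcher holds a clone of the `Sender`, and the store side owns the `Receiver` and processes updates one at a time.

```rust
use tokio::sync::mpsc;

// Illustrative message type; the real tracer would send full object payloads.
#[allow(dead_code)]
#[derive(Debug)]
enum StoreMessage {
    ObjApplied(String),
    ObjDeleted(String),
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<StoreMessage>(128);

    // Each watcher gets its own clone of the sender and never touches the
    // store's state directly.
    for i in 0..3 {
        let tx = tx.clone();
        tokio::spawn(async move {
            tx.send(StoreMessage::ObjApplied(format!("deployment-{i}")))
                .await
                .expect("store hung up");
        });
    }
    drop(tx); // drop the original sender so the channel closes once the watchers finish

    // The store processes updates sequentially; no Mutex needed around its state.
    while let Some(msg) = rx.recv().await {
        println!("store got: {msg:?}");
    }
}
```

Because only one task ever touches the store's state, the shared `Mutex` disappears entirely, and the bounded channel gives you backpressure for free.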
This is objectively a better design, and it's probably the one I should have gone with at the beginning, but making the change now requires digging into some of the oldest code in the SimKube codebase. This is all code that I wrote back when a) I didn't know as much about Rust as I do now, and b) I didn't really know what I was doing with SimKube. So it's gonna be a messy change. The whole giant change is on GitHub (1, 2) if you just wanna look at the code, but here I'll talk about some of the specific challenges I ran into.

## Tokio versus the Standard Library

Most of the primitives that we need (channels, mutexes, etc.) are implemented in the Rust standard library. They are, confusingly, also implemented in tokio, which is the most popular async runtime for Rust, and also the async runtime SimKube uses. Why are there both? Well, a lock on a `tokio::Mutex` can be held across `.await` points, whereas a lock on the standard library `Mutex` cannot. The tokio documentation states that you probably don't want to use its `Mutex`, because it is more expensive and you can "probably" refactor your code to not hold the lock across an `.await`.

Unfortunately, we can't use the standard library's mpsc channels in async code: their blocking `recv` can't be `.await`ed, so waiting on one would stall the async runtime. We must use tokio's versions of these channels, which then implies that we must hold a `Mutex` lock across the channel's `.await`. The tokio documentation doesn't really explain what the performance impacts of its fancy `Mutex` are, and suggests that it is "easy" to not use it, but I spent quite a while trying to figure out a suitable way to architect this without using `tokio::Mutex` and eventually gave up. It would be nice if there were some better documentation about the tradeoffs I'm subjecting myself to here.

## Mutable recv

A consequence of the above decision is that the standard library's channels do not take a mutable self reference when you call `recv`, but tokio's channels do. I don't know if this is an intentional design difference between the APIs or not (I assume it must be?), but the implication is that you can't stick the channel receivers inside the `TraceStore` itself without wrapping them in an `Arc<Mutex<...>>`; aside from being really annoying to use, it didn't make any sense at all to me to have an `Arc<Mutex<TraceStore>>` (which is actually a singleton object) holding a bunch of `Arc<Mutex<Receiver>>`s.

I was able to refactor around this by introducing an intermediate "manager" object that facilitates the communication between the watchers and the store. However, this introduces another wrinkle: tokio channels are async, which means ideally we'd spawn the message-handling loop as a separate task and let it go, but tokio tasks must be `'static`, and the channels I'm constructing are created dynamically at runtime, so references to them can't be `'static`. Figuring out how to coerce the compiler into accepting all this code was a bunch of time spent that I'll never get back. The short version is that the main "message coordinator" routine is a static function that takes ownership of the channel receivers (a sketch of the pattern follows below), but getting this to work required a bunch of fighting with the Rust ownership model.
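Here's a minimal sketch of that coordinator pattern, under the assumption of a single channel and a toy `Store` type (the names are mine for illustration, not SimKube's actual code). Because the coordinator takes the receiver and the store by value, the future it returns owns everything it touches, which is what satisfies `tokio::spawn`'s `'static` bound even though the channel itself was created at runtime:

```rust
use tokio::sync::mpsc;

struct Store {
    events: Vec<String>,
}

impl Store {
    fn apply(&mut self, msg: String) {
        self.events.push(msg);
    }
}

// The coordinator owns the receiver and the store outright, so the future
// returned here is 'static and can be handed straight to tokio::spawn.
async fn run_coordinator(mut rx: mpsc::Receiver<String>, mut store: Store) {
    while let Some(msg) = rx.recv().await {
        store.apply(msg);
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let coordinator = tokio::spawn(run_coordinator(rx, Store { events: vec![] }));

    tx.send("objApplied: deployment/foo".to_string()).await.unwrap();
    drop(tx); // dropping the last sender closes the channel and ends the loop
    coordinator.await.unwrap();
}
```

Note that the receiver has to be declared `mut`, which is the tokio "mutable recv" quirk from above showing up again.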
## Rust ownership, Kubernetes ownership, it's all ownership

Bringing this back to the quote at the beginning: I split this change across two PRs (1, 2). The first PR attempted to make the hard change easy by implementing the channel communication with the store, and the second was my attempt to make the easy change. It turned out, though, that "making the easy change" wasn't actually that easy.

I realized extremely late in the process that filtering out resources owned by other tracked resources at ingest time was a bad decision: there are no guarantees on ordering for the messages the `TraceStore` receives, so it's possible that it receives a message about a new ReplicaSet before it receives a message about the Deployment that owns that ReplicaSet. When it then looks up the ReplicaSet's ownership information, it won't find anything, and we'll end up with a duplicate in our simulation.

Instead, what we need to do is check the ownership conditions at export time: when a user requests an exported trace, we filter out all of the stored resources that are owned by something else. But this required me to go in and do another refactor on how we filter out objects in our exported trace. At this point, I had spent multiple weeks in the refactoring mines and was very eager to get out of them, so I'm not 100% certain that this bit of code was the best thing ever, but it all seems to work (for now). Whew!

There is a new version of SimKube (2.4.0) on GitHub, crates.io, and quay.io that contains this new code, along with a couple of smaller bug fixes and improvements. Anyways, I hope you enjoyed coming along for the ride on my SimKube refactoring adventure, and maybe hopefully you learned something about Kubernetes and/or Rust along the way!

As always, thanks for reading; until next time,

~drmorr