Ten Years of Deploying to Production

Back in 2018, the company I worked at had an operations team. “Ops”, we called them. For the time, this company was behind the curve, but not far from typical. We were only beginning to think about AWS; by the tail end of my time there, we had just started adopting it for some internal-only systems. Still, from what I’ve heard from friends who worked at more mature companies, it wasn’t uncommon in that era to have an operations team that owned production.

Funny thing: the ops team literally sat in a corner of the office, in their own room. That’s where ops is, in that little room. It sounds like a meme.

The ops team had a nice tool to spin up a VM inside the company’s infrastructure. I appreciated that – my whole team used it all the time. I needed to train recurrent neural networks using GPUs and 20+ gigabytes of RAM. No way that was going to run on my laptop, so this workflow was invaluable to my work.

Here’s the big catch: production deployments happened once every two weeks. Full stop.

If something went wrong, the deployment had to wait another two weeks. Unless you were lucky: if the ops engineer on the weekly rotation was particularly nice and had no evening plans, and if you were online to respond to their questions, you could push through and fix that random error that only happened in production.

From time to time, I would wander into the ops corner and chat with people about strange issues my team saw in the production database, in our latest attempt to deploy to production, and so on.

The production deployment challenge

My team was fundamentally a data science team. We were training ML models, building and running data pipelines to collect training data and train models on the latest data. All Python code. That’s all fine.

There was a big problem: the models in production were misbehaving, and customers were noticing:

“Your API returned this classifier result. That makes no sense. Why?”
