TLDR
LiteLLM is an open-source AI gateway (36K+ GitHub stars) that routes hundreds of millions of LLM API calls daily for companies like NASA, Adobe, Netflix, Stripe, and Nvidia. We're at $7M ARR, 10 people, YC W23.
When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.
You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.
What this job actually is
We'll be straight with you: this role is roughly 60% operational reliability and 40% deep performance engineering. In any given week you might be:
Hunting a memory leak in our async streaming handler that causes OOMs after 4 hours under load
Fixing a race condition where PodLockManager releases another pod's lock
Profiling why update_database() does 7 deep copies per request in the spend tracking hot path
Helping a Fortune 500 customer debug why their 20-pod deployment is exhausting Postgres connections