Cedana (YC S23) Is Hiring a Systems Engineer

At Cedana, we are solving what many thought was impossible: the seamless, live migration of active CPU+GPU containers across global compute.

We're building the next generation of AI orchestration systems, founded on our pioneering work in checkpoint/restore technology. This isn't just an incremental improvement; it's a fundamental shift that makes distributed computing truly portable, elastic, and resilient across planet-scale compute. This is an exceptionally difficult systems problem that requires a rare combination of kernel engineering, distributed systems design, and a relentless pursuit of perfection.

We’re backed by leading investors, including a co-founder of OpenAI, the former Chief Architect of Slack, the founding team of Meta AI, YC, Initialized Capital, and Garry Tan. To achieve our mission, we’re looking for brilliant systems engineers: the kind who are obsessed with understanding how computing works from the silicon up, who live deep in the container stack, and who understand Kubernetes well beyond the surface.

If you thrive on solving deep, complex problems in uncharted territory, we invite you to join us.

What You Will Do

As a core member of our engineering team, you will build and fortify the "magic" that powers our platform. You will operate across the entire compute stack, from the Linux kernel to our managed Kubernetes offering, to deliver a product that is both powerful and exceptionally reliable.

Design and Build New Orchestration Primitives: Architect and implement core components of our system, leveraging our unique insights into checkpointing, virtualization, and container orchestration to create capabilities that don't exist anywhere else. Design and implement novel scheduling and resource management capabilities by integrating our core checkpoint/restore engine directly into the control planes of Kubernetes, SLURM, and other orchestrators.

Engineer Unbreakable Reliability: Enhance the stability and performance of our entire system, from kernel-level interactions and hypervisor optimizations to our managed Kubernetes cloud platform. Dive deep into the Linux kernel, container runtimes, and hypervisors to ensure our live migration capability is bulletproof.

Partner with Customers: Work directly with customers to solve their most complex infrastructure challenges, acting as a trusted technical partner and gathering insights that drive our product roadmap.

Develop Sophisticated Tooling: Build and refine our internal observability and alerting infrastructure to proactively identify and resolve issues anywhere in the stack, ensuring our systems meet the highest standards of performance and availability.
