AutoKernel
Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
Inspired by @karpathy/autoresearch -- which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: an agent modifies one file, runs a fixed evaluation, keeps or reverts the change, and repeats forever.
How It Works
Give AutoKernel any PyTorch model. It will:
1. Profile the model to find which GPU kernels are bottlenecks
2. Extract each bottleneck as a standalone Triton kernel
3. Optimize each kernel autonomously (edit, benchmark, keep/revert -- forever)
4. Verify end-to-end correctness and report the total speedup
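Step 3 above is a greedy keep/revert loop. A minimal pure-Python sketch of that loop, with toy stand-ins (all names here are illustrative, not AutoKernel's actual API):

```python
import random

def optimize(benchmark, mutate, initial, budget=40, seed=0):
    """Greedy keep/revert loop: propose an edit, run the fixed
    benchmark, keep the edit only if it is correct and faster."""
    rng = random.Random(seed)
    best = initial
    best_time = benchmark(best)
    for _ in range(budget):
        candidate = mutate(best, rng)        # agent edits the kernel
        t = benchmark(candidate)             # fixed evaluation
        if t is not None and t < best_time:  # None = failed correctness
            best, best_time = candidate, t   # keep the change
        # otherwise: revert (leave `best` untouched)
    return best, best_time

# Toy stand-in: the "kernel" is just a block size, and the
# benchmark is fastest at 128. Non-positive sizes are "incorrect".
def toy_benchmark(block):
    if block <= 0:
        return None
    return abs(block - 128) + 1.0

def toy_mutate(block, rng):
    return block + rng.choice([-32, -16, 16, 32])

best, t = optimize(toy_benchmark, toy_mutate, initial=64)
```

The key property mirrored from the source: the benchmark is fixed, the agent only ever moves downhill in measured time, and a failed correctness check is an automatic revert.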
The agent reads program.md -- the "research org code" -- which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl's law.
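The Amdahl's-law reasoning behind "when to move on" can be sketched as follows: if a kernel accounts for fraction p of total runtime and has been sped up by factor s, the end-to-end speedup is 1 / ((1 - p) + p / s), and the remaining headroom from that kernel alone shrinks as s grows. (The decision rule itself is my paraphrase, not AutoKernel's exact code.)

```python
def end_to_end_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of total
    runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

def remaining_headroom(p, s):
    """Further end-to-end gain still available from this kernel:
    the speedup if s went to infinity, relative to the current s."""
    return end_to_end_speedup(p, float("inf")) / end_to_end_speedup(p, s)

# A kernel that was 40% of runtime and is now 3x faster:
print(end_to_end_speedup(0.4, 3.0))   # ~1.36x overall
print(remaining_headroom(0.4, 3.0))   # ~1.22x still on the table
```

Once the remaining headroom drops near 1.0, further experiments on that kernel cannot move the end-to-end number much, so the orchestrator's attention is better spent on the next bottleneck.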
Each experiment takes ~90 seconds. That's ~40 experiments/hour, or ~320 over an 8-hour night, spread across all kernels.
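The throughput numbers follow directly from the per-experiment time (taking "overnight" to mean 8 hours, which is my reading, not a stated figure):

```python
seconds_per_experiment = 90
per_hour = 3600 // seconds_per_experiment  # 3600 s / 90 s = 40 experiments/hour
overnight = per_hour * 8                   # 40 * 8 h = 320 experiments
```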
Quick Start
Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.