You're going to build a high-performance LLM inference engine in C++ and CUDA: tiny-vllm, a younger and smaller sibling of vLLM.
We will learn a lot along the way, make mistakes, and derive the ideas and the maths from scratch.
This repository consists of two things: (1) the full source code of the inference server and (2) a course in which I lead you through implementing the engine. Feel free to use it as a learning tool or, if you are a lecturer, as a teaching resource at your university.
The inference engine includes:
loading a real LLM from Safetensors (Llama 3.2 1B Instruct)
a full LLM forward pass (prefill + decode)
all computation in CUDA kernels
a KV cache
static batching
continuous batching