LLM Inference for Large-Context Offline Workloads
oLLM is a lightweight Python library for large-context LLM inference, built on top of Huggingface Transformers and PyTorch. It enables running models like gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with 100k-token contexts on a ~$200 consumer GPU with 8GB VRAM. No quantization is used; everything runs in fp16/bf16 precision.
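For orientation, here is a minimal sketch of the general weights-on-SSD idea using plain Transformers with Accelerate's `device_map`/`offload_folder` disk offload. This is not oLLM's own API; the model id, memory budget, and folder path are illustrative assumptions, not values from the project.

```python
# Not oLLM's API: a generic sketch of weight offloading to CPU/SSD with
# Transformers + Accelerate, keeping GPU usage within a small VRAM budget.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,              # fp16/bf16 only, no quantization
    device_map="auto",                       # place layers on GPU, then CPU, then disk
    max_memory={0: "7GiB", "cpu": "24GiB"},  # assumed budgets for an 8GB card
    offload_folder="./offload",              # remaining weights spill to SSD
)

prompt = "Summarize the following report: ..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

The figures in the table further below come from oLLM's own offloading path, not from this generic setup.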
Latest updates (0.4.0) 🔥
- qwen3-next-80B (160 GB model) added with ⚡️1tok/2s throughput (fastest model so far)
- Llama3 custom chunked attention replaced with flash-attention2 for stability
- gpt-oss-20B flash-attention-like implementation added to reduce VRAM usage
- gpt-oss-20B chunked MLP added to reduce VRAM usage (sketched after this list)
- KVCache is replaced with DiskCache, keeping the KV cache on SSD instead of VRAM (conceptual sketch after this list)
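To make the chunked-MLP note concrete: the idea is to run the feed-forward block over slices of the sequence so that the intermediate activations for the whole 100k-token prompt never exist in VRAM at once. A minimal sketch of that pattern follows; it is not oLLM's actual implementation, and `chunk_size` and the `mlp` module are placeholders.

```python
import torch

def chunked_mlp(x: torch.Tensor, mlp: torch.nn.Module, chunk_size: int = 2048) -> torch.Tensor:
    """Apply a per-token MLP over the sequence dimension in slices.

    x has shape (batch, seq_len, hidden). Peak activation memory scales with
    chunk_size instead of seq_len, at the cost of a Python-level loop.
    """
    chunks = [mlp(x[:, i : i + chunk_size]) for i in range(0, x.shape[1], chunk_size)]
    return torch.cat(chunks, dim=1)
```

This works because the MLP acts on each token independently, so slicing along the sequence dimension does not change the result.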
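The DiskCache change trades VRAM for SSD space: key/value states are stored on disk and read back per layer during decoding. The class below is only a conceptual illustration of that trade, not oLLM's actual DiskCache; a real implementation would use memory-mapped files and hook into the model's attention layers.

```python
import os
import torch

class DiskKVCacheSketch:
    """Toy per-layer KV cache kept on disk instead of GPU memory."""

    def __init__(self, cache_dir: str = "./kv_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, layer_idx: int) -> str:
        return os.path.join(self.cache_dir, f"layer_{layer_idx}.pt")

    def update(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor):
        """Append new key/value states (batch, heads, seq, head_dim) for one layer."""
        path = self._path(layer_idx)
        if os.path.exists(path):
            k_old, v_old = torch.load(path, map_location="cpu")
            key = torch.cat([k_old, key.cpu()], dim=-2)    # grow along the sequence axis
            value = torch.cat([v_old, value.cpu()], dim=-2)
        else:
            key, value = key.cpu(), value.cpu()
        torch.save((key, value), path)
        return key, value  # caller moves these back to the GPU for attention
```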
Inference memory usage on an 8GB Nvidia 3060 Ti:
| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| qwen3-next-80B | 160 GB (bf16) | 10k | 1.4 GB | ~170 GB | ~5.4 GB | 162 GB |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |