GPT-OSS-120B runs on just 8GB VRAM & 64GB+ system RAM
Here is the thing: the expert (MoE) layers run surprisingly well on CPU (roughly 17-25 T/s on a 14900K), and you can force that with the new llama.cpp option --cpu-moe. You can then offload just the attention layers to the GPU (requiring about 5 to 8 GB of VRAM) for fast prefill. What stays resident on the GPU is only:

- KV cache for the sequence
- Attention weights & activations
- Routing tables
- LayerNorms and other "non-expert" parameters

No giant MLP (expert) weights live on the GPU, so memory use stays low. This yields an amazingly snappy system for a 120B model.
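For concreteness, here is a minimal sketch of what that invocation can look like with llama-server. The model filename, context size, and thread count are placeholders you should adapt to your own setup, and flag behavior can shift between llama.cpp builds, so check --help for your version:

```
# Placeholder GGUF path: use whatever GPT-OSS-120B quant you downloaded.
# --cpu-moe keeps the MoE expert weights in system RAM and runs them on the CPU.
# -ngl 99 offloads the remaining layers (attention, norms, router) to the GPU,
# which is what keeps VRAM usage in the ~5-8 GB range.
llama-server \
  -m ./gpt-oss-120b-Q4_K_M.gguf \
  --cpu-moe \
  -ngl 99 \
  -c 16384 \
  -t 16
```

Recent builds also seem to have a related --n-cpu-moe N option that keeps only the first N layers' experts on CPU, if you have spare VRAM and want to push some experts onto the GPU.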