Here is the thing: the expert layers run amazingly well on CPU (~17-25 T/s on a 14900K), and you can force that with the new llama.cpp option --cpu-moe.
You can offload just the attention layers to the GPU (requiring about 5-8 GB of VRAM) for fast prefill; an example invocation follows the list below. Only the following stays resident on the GPU:
- KV cache for the sequence
- Attention weights & activations
- Routing tables
- LayerNorms and other "non-expert" parameters
No giant MLP weights are resident on the GPU, so memory use stays low.
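For reference, here is a minimal sketch of what that launch could look like with llama-server. The model filename, context size, and thread count are placeholders for your own setup; -ngl, -c, -t, and -m are standard llama.cpp options, and --cpu-moe is the new flag mentioned above.

    # -ngl 99   : offload all non-expert layers (attention, norms, routing) to the GPU
    # --cpu-moe : keep the MoE expert (MLP) weights on the CPU
    # -c / -t   : context size and CPU threads, tune for your machine
    llama-server -m model-120b-mxfp4.gguf -ngl 99 --cpu-moe -c 16384 -t 16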
This yields an amazingly snappy system for a 120B model! Even something like a 3060 Ti works great. A GPU with BF16 support (RTX 3000 series or newer) is best, because all layers except the MoE layers (which are MXFP4) are BF16.
64 GB of system RAM is the minimum, and 96 GB is ideal. (Linux mmaps the model, so it keeps the 'hot' experts in the page cache even if the whole model doesn't fit in memory.)
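If you want to see how much of the model file Linux is actually keeping resident in the page cache, a tool like vmtouch (assumed installed separately; not part of llama.cpp) can report the resident fraction:

    # prints the number of resident pages for the file (filename is a placeholder)
    vmtouch model-120b-mxfp4.gguf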
prompt eval time = 28044.75 ms / 3440 tokens ( 8.15 ms per token, 122.66 tokens per second)
       eval time =  5433.28 ms /   98 tokens (55.44 ms per token,  18.04 tokens per second)