I'd had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the "Gemma 4 now runs 2x faster with MTP" Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running.
I wanted a local coding agent setup that:
was fast enough to actually use on my Mac
worked through an OpenAI compatible API (so I could use it in other tools)
and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made.
And I did! This video is realtime. And shows the agent responding at a perfectly usable speed.
After a bit of testing the final setup I ended up with is:
llama.cpp built with Metal on macOS
Gemma 4 26B-A4B in GGUF format
A Q8 MTP draft model for speculative decoding
... continue reading