Tech News

Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

Why This Matters

Running models like Google Gemma 4 locally offers significant advantages such as cost savings, enhanced privacy, and reduced latency, making AI more accessible and reliable for users. The mixture-of-experts architecture enables high performance on modest hardware, broadening the potential for local deployment in various applications. This development signals a shift towards more efficient, privacy-conscious AI solutions that can operate independently of cloud services.


Why run models locally?

Cloud AI APIs are great until they are not. Rate limits, usage costs, privacy concerns, and network latency all add up. For quick tasks like code review, drafting, or testing prompts, a local model that runs entirely on your hardware has real advantages: zero API costs, no data leaving your machine, and consistent availability.

Google’s Gemma 4 is interesting for local use because of its mixture-of-experts architecture. The 26B parameter model only activates 4B parameters per forward pass, which means it runs well on hardware that could never handle a dense 26B model. On my 14” MacBook Pro M4 Pro with 48 GB of unified memory, it fits comfortably and generates at 51 tokens per second, though in my experience there are significant slowdowns when it’s used within Claude Code.
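As a back-of-envelope check on why the model fits comfortably in 48 GB, here is a rough weight-memory estimate. The quantization level and overhead factor are my assumptions; the article doesn’t say which quant was actually used:

```python
# Rough weight-memory estimate for a 26B-parameter model.
# Assumes ~4-bit quantization (0.5 bytes/param) plus ~10% overhead for
# higher-precision embeddings, buffers, etc. -- both are assumptions.
total_params = 26e9
bytes_per_param = 0.5   # 4-bit quant
overhead = 1.1          # ~10% extra

weights_gb = total_params * bytes_per_param * overhead / 1e9
print(f"~{weights_gb:.0f} GB")  # ~14 GB, comfortably under 48 GB of unified memory
```

At that size the weights leave plenty of headroom for the KV cache and the rest of the system.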

Google Gemma 4 26B-a4b LLM model served via LM Studio API with Claude Code alias command claude-lm
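The setup pictured above can be sketched roughly as follows. The model identifier, port, and the exact shape of the `claude-lm` alias are my assumptions, not details given in the article, so treat this as a starting point rather than a recipe:

```shell
# Start LM Studio's headless server (the `lms` CLI ships with LM Studio)
lms server start --port 1234

# Load the model; run `lms ls` to see the exact identifier on your machine
lms load google/gemma-4-26b-a4b

# Sanity-check the OpenAI-compatible endpoint LM Studio exposes
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-4-26b-a4b", "messages": [{"role": "user", "content": "Hello"}]}'

# Hypothetical alias pointing Claude Code at the local server.
# Claude Code expects an Anthropic-style API, so a translation proxy
# between the two API shapes may be required in practice.
alias claude-lm='ANTHROPIC_BASE_URL=http://localhost:1234 claude'
```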

The Gemma 4 model family

Google released Gemma 4 as a family of four models, not just one. The lineup spans a wide range of hardware targets:

Google Gemma 4 LLM models

The “E” models (E2B, E4B) use Per-Layer Embeddings to optimize for on-device deployment and are the only variants that support audio input (speech recognition and translation). The 31B dense model is the most capable, scoring 85.2% on MMLU Pro and 89.2% on AIME 2026.

Why I picked the 26B-A4B. The mixture-of-experts architecture is the key. It has 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. A common rule of thumb estimates a MoE model’s dense-equivalent quality as roughly sqrt(total × active parameters), which puts this model around 10B effective. In practice, it delivers inference cost comparable to a 4B dense model with quality that punches well above that weight class. On benchmarks, it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B (85.2% and 89.2%) while running dramatically faster.
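The rule of thumb is easy to check against the numbers given (26B total, 3.8B active):

```python
import math

# Dense-equivalent size per the sqrt(total × active) rule of thumb
total_params = 26e9    # Gemma 4 26B-A4B total parameters
active_params = 3.8e9  # parameters activated per token (8 of 128 experts + 1 shared)

effective = math.sqrt(total_params * active_params)
print(f"~{effective / 1e9:.1f}B effective")  # ~9.9B, i.e. roughly 10B
```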

The chart below tells the story. It plots Elo score against total model size on a log scale for recent open-weight models with thinking enabled. The blue-highlighted region in the upper left is where you want to be: high performance, small footprint.
