Gemma 4 12B: A unified, encoder-free multimodal model

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.

Here’s an overview of what makes Gemma 4 12B unique:

Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone. Advanced reasoning: Benchmark performance nearing our 26B model, unlocking powerful multi-step reasoning and agentic workflows.

Benchmark performance nearing our 26B model, unlocking powerful multi-step reasoning and agentic workflows. Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory.

Small enough to run locally with just 16GB of VRAM or unified memory. Open and accessible: Released under an Apache 2.0 license with support across the developer ecosystem.

Released under an Apache 2.0 license with support across the developer ecosystem. Drafter-ready: Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this.

Run state-of-the-art agents locally

... continue reading