The first Gemma model launched early last year and has since grown into a thriving Gemmaverse of over 160 million collective downloads. This ecosystem includes our family of over a dozen specialized models for everything from safeguarding to medical applications and, most inspiringly, the countless innovations from the community. From innovators like Roboflow building enterprise computer vision to the Institute of Science Tokyo creating highly capable Japanese Gemma variants, your work has shown us the path forward.

Building on this incredible momentum, we're excited to announce the full release of Gemma 3n. While last month's preview offered a glimpse, today unlocks the full power of this mobile-first architecture. Gemma 3n is designed for the developer community that helped shape Gemma. It's supported by your favorite tools, including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others, so you can fine-tune and deploy for your specific on-device applications with ease. This post is the developer deep dive: we'll explore some of the innovations behind Gemma 3n, share new benchmark results, and show you how to start building today.

What's new in Gemma 3n?

Gemma 3n represents a major advancement for on-device AI, bringing powerful multimodal capabilities to edge devices with performance previously only seen in last year's cloud-based frontier models.

- Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs.
- Optimized for on-device: Engineered with a focus on efficiency, Gemma 3n models are available in two sizes based on effective parameters: E2B and E4B. While their raw parameter counts are 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory.
- Groundbreaking architecture: At its core, Gemma 3n features novel components such as the MatFormer architecture for compute flexibility, Per-Layer Embeddings (PLE) for memory efficiency, and new audio and MobileNet-v5-based vision encoders optimized for on-device use cases.
- Enhanced quality: Gemma 3n delivers quality improvements across multilinguality (supporting 140 languages for text and multimodal understanding of 35 languages), math, coding, and reasoning. The E4B version achieves an LMArena score over 1300, making it the first model under 10 billion parameters to reach that threshold.

Achieving this leap in on-device performance required rethinking the model from the ground up. The foundation is Gemma 3n's unique mobile-first architecture, and it all starts with MatFormer.

MatFormer: One model, many sizes

At the core of Gemma 3n is the MatFormer (🪆Matryoshka Transformer) architecture, a novel nested transformer built for elastic inference. Think of it like Matryoshka dolls: a larger model contains smaller, fully functional versions of itself. This approach extends the concept of Matryoshka Representation Learning from just embeddings to all transformer components. During the MatFormer training of the 4B effective parameter (E4B) model, a 2B effective parameter (E2B) sub-model is simultaneously optimized within it, as shown in the figure above.
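To make the nesting idea concrete, here is a minimal, illustrative sketch in PyTorch (not the Gemma 3n implementation) of a Matryoshka-style feed-forward block: the smaller configuration uses a prefix slice of the full layer's weights, so training the large path also trains the nested small path. The 8192/16384 hidden sizes echo the Mix-n-Match range discussed next; the model width, class, and variable names are made up for illustration.

```python
# Conceptual sketch of Matryoshka-style nesting (illustrative only, not Gemma 3n code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFeedForward(nn.Module):
    """Feed-forward block whose smaller variant is a prefix slice of the full weights."""

    def __init__(self, d_model: int = 2048, d_ff_full: int = 16384):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff_full, bias=False)
        self.w_out = nn.Linear(d_ff_full, d_model, bias=False)

    def forward(self, x: torch.Tensor, d_ff: int) -> torch.Tensor:
        # The "small" path reuses the first d_ff rows/columns of the same weight
        # matrices, so it is literally nested inside the "large" path.
        h = F.gelu(x @ self.w_in.weight[:d_ff].T)
        return h @ self.w_out.weight[:, :d_ff].T

ffn = NestedFeedForward()
x = torch.randn(1, 16, 2048)
y_large = ffn(x, d_ff=16384)  # full-capacity path (E4B-like)
y_small = ffn(x, d_ff=8192)   # nested sub-model path (E2B-like), same weights
```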
The MatFormer architecture provides developers with two powerful capabilities and use cases today:

1: Pre-extracted models: You can directly download and use either the main E4B model for the highest capabilities, or the standalone E2B sub-model, which we have already extracted for you, offering up to 2x faster inference.

2: Custom sizes with Mix-n-Match: For more granular control tailored to specific hardware constraints, you can create a spectrum of custom-sized models between E2B and E4B using a method we call Mix-n-Match. This technique allows you to precisely slice the E4B model's parameters, primarily by adjusting the feed-forward network hidden dimension per layer (from 8192 to 16384) and selectively skipping some layers. We are releasing the MatFormer Lab, a tool that shows how to retrieve these optimal models, which were identified by evaluating various settings on benchmarks like MMLU.

Figure: MMLU scores for the pre-trained Gemma 3n checkpoints at different model sizes (using Mix-n-Match)

Looking ahead, the MatFormer architecture also paves the way for elastic execution. While not part of today's launched implementations, this capability allows a single deployed E4B model to dynamically switch between E4B and E2B inference paths on the fly, enabling real-time optimization of performance and memory usage based on the current task and device load.

Per-Layer Embeddings (PLE): Unlocking more memory efficiency

Gemma 3n models incorporate Per-Layer Embeddings (PLE). This innovation is tailored for on-device deployment, as it dramatically improves model quality without increasing the high-speed memory footprint required on your device's accelerator (GPU/TPU).

While the Gemma 3n E2B and E4B models have total parameter counts of 5B and 8B respectively, PLE allows a significant portion of these parameters (the embeddings associated with each layer) to be loaded and computed efficiently on the CPU. This means only the core transformer weights (approximately 2B for E2B and 4B for E4B) need to sit in the typically more constrained accelerator memory (VRAM). With Per-Layer Embeddings, you can use Gemma 3n E2B while only having ~2B parameters loaded in your accelerator.

KV Cache sharing: Faster long-context processing

Processing long inputs, such as the sequences derived from audio and video streams, is essential for many advanced on-device multimodal applications. Gemma 3n introduces KV Cache Sharing, a feature designed to significantly accelerate time-to-first-token for streaming response applications.

KV Cache Sharing optimizes how the model handles the initial input processing stage (often called the "prefill" phase). The keys and values from the middle layer's local and global attention are shared directly with all the top layers, delivering a notable 2x improvement in prefill performance compared to Gemma 3 4B. This means the model can ingest and understand lengthy prompt sequences much faster than before.

Audio understanding: Introducing speech to text and translation

Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates one token for every 160ms of audio (about 6 tokens per second); these tokens are then integrated as input to the language model, providing a granular representation of the sound context.

This integrated audio capability unlocks key features for on-device development, including:

- Automatic Speech Recognition (ASR): Enable high-quality speech-to-text transcription directly on the device.
- Automatic Speech Translation (AST): Translate spoken language into text in another language. We've observed particularly strong AST results for translation between English and Spanish, French, Italian, and Portuguese, offering great potential for developers targeting applications in these languages.

For tasks like speech translation, leveraging Chain-of-Thought prompting can significantly enhance results. Here's an example:

```
user
Transcribe the following speech segment in Spanish, then translate it into English:

model
```
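As a starting point, the sketch below shows one way this prompt could be wired up through the Hugging Face Transformers support mentioned earlier. Treat it as an illustrative sketch, not official sample code: the model id (google/gemma-3n-E4B-it), the audio message schema, and audio input support in your installed Transformers version are assumptions you should verify against the current model card.

```python
# Illustrative sketch: speech translation (AST) with a Chain-of-Thought style prompt
# via Hugging Face Transformers. Model id, message schema, and audio support are
# assumptions; adjust to your environment and the Gemma 3n model card.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3n-E4B-it"  # assumed Hugging Face model id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The prompt mirrors the example above: transcribe first, then translate.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "speech_clip_es.wav"},  # hypothetical local file
            {
                "type": "text",
                "text": "Transcribe the following speech segment in Spanish, "
                        "then translate it into English:",
            },
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
new_tokens = generated[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```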