
Google’s latest trick gets Gemma 4 running 3x faster right on your phone

Why This Matters

Google's new Multi-Token Prediction drafters for Gemma 4 significantly enhance local AI model performance by enabling faster processing—up to three times quicker—while maintaining privacy. This advancement addresses the resource challenges of running powerful AI models on consumer hardware, making AI more accessible and efficient for users. It marks a notable step forward in optimizing edge AI for everyday devices, balancing speed, privacy, and resource management.

Key Takeaways


TL;DR Google has introduced new assistant models, called “drafters,” that could significantly speed up Gemma 4.

Drafters work by predicting upcoming chunks of the model's response, which the main model can then verify in bigger batches.

This lets the model use memory and compute more efficiently.

Google’s recently launched Gemma 4 edge AI models are designed to run locally on consumer hardware. While favorable from a privacy standpoint, local models can easily hog resources and slow to a crawl, making them impractical for everyday use. So Google is now offering a potential solution, which it claims can speed up Gemma 4 models by up to three times.

Google recently released Multi-Token Prediction (MTP) drafters for Gemma 4. These drafters are essentially smaller, assistive models that help the primary model by “predicting” the next several tokens of its response before the main model generates them. Because the drafters run alongside the main model, the available compute is used more effectively.


How does MTP improve Gemma 4? The process uses a technique called “speculative decoding”: the small drafter model races ahead, proposing the next several tokens of the response, while the main Gemma model verifies the drafted tokens in a single batched pass. While the main model checks one chunk, the drafter is already proposing the next one.

If the main model accepts the drafted chunk, it moves on to verify the next one. If it disagrees, it replaces the first incorrect token and the drafter resumes from that point.
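The draft-then-verify loop can be sketched in a few lines. This is a toy illustration of speculative decoding in general, not Google's MTP implementation: `draft_next` and `target_next` are hypothetical stand-ins for the drafter and the main Gemma model, reduced to simple deterministic rules over integer "tokens" so the example runs on its own. The key property the sketch preserves is that the final output is exactly what the main model alone would have produced, only reached in fewer verification rounds.

```python
DRAFT_LEN = 4  # tokens the drafter proposes per round (assumed value)

def draft_next(tokens):
    # Drafter: a cheap guess for the next token (toy rule).
    return (tokens[-1] + 1) % 10

def target_next(tokens):
    # Main model: the authoritative next token. It mostly agrees
    # with the drafter, but disagrees whenever the drafter says 7.
    n = (tokens[-1] + 1) % 10
    return 0 if n == 7 else n

def speculative_decode(prompt, total_new):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + total_new:
        # 1. Drafter proposes DRAFT_LEN tokens sequentially (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(DRAFT_LEN):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Main model verifies the whole draft as one batch:
        #    accept the matching prefix, replace the first mismatch.
        for t in draft:
            correct = target_next(tokens)
            tokens.append(correct)
            if correct != t:
                break  # discard the rest of the draft; redraft from here
    return tokens[len(prompt) : len(prompt) + total_new]

print(speculative_decode([1], 8))  # → [2, 3, 4, 5, 6, 0, 1, 2]
```

When the drafter guesses well, the main model accepts several tokens per verification round instead of one, which is where the claimed speedup comes from; a wrong guess only costs the discarded tail of that one draft.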
