
The IBM Granite 4.1 family of models

Why This Matters

The IBM Granite 4.1 family of models represents a significant advancement in enterprise AI, integrating multiple capabilities such as language understanding, speech, vision, and safety features to create more robust and versatile AI workflows. This release enables developers to deploy powerful, enterprise-grade AI systems that can handle complex tasks with improved accuracy and safety, reflecting the industry's move towards more integrated and reliable AI solutions.

Key Takeaways

AI is increasingly at the heart of enterprise applications and software workflows. But even today's most powerful AI systems rarely rely on a single model or capability. Instead, these systems combine many technologies and capabilities, including language understanding, perception, retrieval, and forecasting, along with rigorous safety mechanisms such as guardrails for harm detection. All of these can work together in tightly integrated AI workflows.

That's why today IBM released the Granite 4.1 collection, the latest versions of its family of Granite models, which reflect this reality. The release covers small language models (SLMs), as well as Granite speech, vision, embedding, and Guardian models. The aim is for developers to easily consume these models in real-world, enterprise-grade AI systems. And despite their size, these models pack a punch.

Across the collection, Granite 4.1 features impressive language model performance in tool calling and instruction following; state-of-the-art transcription accuracy from the Granite speech models; harm detection capabilities delivered via Granite Guardian; and strong leaderboard performance for Granite vision in table and chart extraction.

Language models with impressive instruction following and tool calling capabilities

At the heart of Granite 4.1 is a new generation of dense, decoder‑only language models, offered in 3B, 8B, and 30B parameter sizes, each with base and instruct variants. Across weight classes, the models significantly outperform similarly sized Granite 4.0 language models. The team found, for example, that the new Granite 4.1 8B instruct model consistently matches or outperforms the Granite 4.0 32B Mixture‑of‑Experts model, while using a simpler — and therefore more flexible — architecture for fine-tuning on downstream tasks.

These models also perform competitively with other open-source, dense, decoder-only models on the market today, including the most recent Gemma and Qwen models (with thinking disabled), on two metrics that matter for enterprise use: instruction following and tool calling.
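In practice, tool calling means the model is handed JSON schemas for the functions it may invoke and returns a structured call rather than prose, which the application then executes. A minimal sketch of that round trip, assuming the common OpenAI-style tool schema; the `get_inventory` function and its fields are hypothetical examples for illustration, not part of the Granite 4.1 release:

```python
import json

# Hypothetical tool schema in the widely used OpenAI-style format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_inventory",
        "description": "Look up stock for a product SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}]

def get_inventory(sku: str) -> dict:
    # Stand-in for a real backend lookup.
    return {"sku": sku, "in_stock": 42}

# A tool-calling model, given TOOLS and a user request, emits a
# structured call instead of prose. Here we hard-code the shape
# such a response takes rather than running a model.
model_response = {
    "tool_calls": [{
        "function": {
            "name": "get_inventory",
            "arguments": json.dumps({"sku": "A-100"}),
        }
    }]
}

# Dispatch each requested call and feed the result back to the
# model as a "tool" role message for its next turn.
registry = {"get_inventory": get_inventory}
messages = []
for call in model_response["tool_calls"]:
    fn = registry[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])
    messages.append({"role": "tool", "content": json.dumps(fn(**args))})

print(messages[0]["content"])
```

The benchmark question, then, is how reliably the model produces a call with the right name and well-formed arguments — which is why instruction following and tool calling are usually measured together.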

While reasoning models have grown in popularity in recent years, their long chains of thought aren't always the most efficient way to get a result. In enterprise settings, token costs and latency often matter as much as raw benchmark scores. For select tasks like instruction following and tool calling, turning to less expensive, non-reasoning models with comparable benchmark performance makes sense for enterprise users.
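The cost argument is back-of-the-envelope arithmetic: a reasoning model spends extra output tokens on its chain of thought before the answer, and output tokens are what you pay for. A quick sketch, where every token count and price is hypothetical and chosen only for illustration:

```python
# All numbers below are hypothetical, for illustration only.
PRICE_PER_1K_OUTPUT = 0.002  # USD per 1K output tokens (assumed)

reasoning_output_tokens = 1200  # chain of thought + final answer
direct_output_tokens = 150      # final answer only

reasoning_cost = reasoning_output_tokens / 1000 * PRICE_PER_1K_OUTPUT
direct_cost = direct_output_tokens / 1000 * PRICE_PER_1K_OUTPUT

print(f"reasoning: ${reasoning_cost:.4f}  direct: ${direct_cost:.4f}")
print(f"cost ratio: {reasoning_cost / direct_cost:.1f}x")
```

If the two models land on the same answer, the ratio of output tokens is, to first order, the ratio of both cost and latency per request — which is the trade the article is pointing at.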

The performance breakthrough in the Granite 4.1 language models was driven by IBM’s training philosophy. The team prioritized data quality and staged refinement over just the raw amount of data used. The Granite 4.1 models are trained on approximately 15 trillion tokens across multiple phases, beginning with broad pre-training and progressively annealing toward higher-quality, technical, scientific and mathematical data that’s focused on instruction following. The last few training stages help extend the models’ context length to as much as 512K tokens, which ensures the models can work through long documents they’re presented with — without any performance hit on shorter-context tasks.

After pre-training, the models are refined through carefully curated supervised fine-tuning and a multi‑stage reinforcement learning (RL) pipeline. Each RL phase targets a distinct capability — such as how well the models can adhere to instructions, the quality of their ability to hold a conversation, factual accuracy, or mathematical reasoning. This helps to avoid the trade‑offs often introduced in single‑stage optimization. The result is a model family designed not just to answer questions, but to behave reliably across a wide range of enterprise workloads.

“Granite 4.1 delivers competitive instruction‑following and tool‑calling performance without relying on long chains of thought, offering predictable latency, stable token usage, and lower operational cost,” said Rameswar Panda, a distinguished engineer at IBM Research and the key architect of the Granite language models. “This makes it a strong, production‑ready choice for enterprise workloads, where efficiency and reliability matter most.”
