Sarvam 105B, the first competitive Indian open source LLM

We're releasing Sarvam 30B and Sarvam 105B as open-source models. Both are reasoning models trained from scratch on large-scale, high-quality datasets curated in-house across every stage of training: pre-training, supervised fine-tuning, and reinforcement learning. Training was conducted entirely in India on compute provided under the IndiaAI mission.

These models represent a true full-stack effort. Beyond datasets, we optimized tokenization, model architecture, execution kernels, scheduling, and inference systems to make deployment efficient across a wide range of hardware, from flagship GPUs to personal devices like laptops. Both models are already in production. Sarvam 30B powers Samvaad, our conversational agent platform. Sarvam 105B powers Indus, our AI assistant built for complex reasoning and agentic workflows.

The Sarvam models are globally competitive for their class. Sarvam 105B performs well on reasoning, programming, and agentic tasks across a wide range of benchmarks. Sarvam 30B is optimized for real-time deployment, with strong performance on real-world conversational use cases. Both models achieve state-of-the-art results on Indian language benchmarks, outperforming models significantly larger in size.

This release marks an important milestone for Sarvam. Building these models required developing end-to-end capability across data, training, inference, and product deployment. With that foundation in place, we are ready to scale to significantly larger and more capable models, including models specialised for coding, agentic, and multimodal conversational tasks.

You can experience Sarvam 105B is available on Indus. Both models are accessible via our API at the API dashboard. Weights can be downloaded from AI Kosh (30B, 105B) and Hugging Face (30B, 105B). If you want to run inference locally with Transformers, vLLM, and SGLang, please refer the Hugging Face models page for sample implementations.

Architecture

Both models share a common architectural principle: high-capacity reasoning with efficient training and deployment. At the core is a Mixture-of-Experts (MoE) Transformer backbone that uses sparse expert routing to scale parameter count without increasing the compute required per token, while keeping inference costs practical. The architecture supports long-context inputs through rotary positional embeddings, RMSNorm-based stabilization, and attention designs optimized for efficient KV-cache usage during inference.

While the two models share the same design philosophy , they differ in scale and attention mechanism. Sarvam 30B uses Grouped Query Attention (GQA) to reduce KV-cache memory while maintaining strong performance. Sarvam 105B extends the architecture with greater depth and Multi-head Latent Attention (MLA), a compressed attention formulation that further reduces memory requirements for long-context inference.

Both models use sparse expert feedforward layers with 128 experts, but differ in expert capacity and routing configuration. This allows the larger model to scale to higher total parameters while keeping active compute bounded.

... continue reading