
Bolmo’s architecture unlocks efficient byte‑level LM training without sacrificing quality


Enterprises that want tokenizer-free multilingual models are increasingly turning to byte-level language models to reduce brittleness in noisy or low-resource text. To tap into that niche, and to make the approach practical at scale, the Allen Institute for AI (Ai2) introduced Bolmo, a new family of models that builds on its Olmo 3 models by "byteifying" them, reusing their backbone and capabilities. The company launched two versions, Bolmo 7B and Bolmo 1B, and describes Bolmo as "the first fully open byte-level language model." Ai2 said the two models performed competitively with, and in some cases surpassed, other byte-level and character-based models.

Byte-level language models operate directly on raw UTF-8 bytes, eliminating the need for a predefined vocabulary or tokenizer. This allows them to handle misspellings, rare languages, and unconventional text more reliably: key requirements for moderation, edge deployments, and multilingual applications.

For enterprises deploying AI across multiple languages, noisy user inputs, or constrained environments, tokenizer-free models offer a way to reduce operational complexity. Ai2's Bolmo is an attempt to make that approach practical at scale without retraining from scratch.

How Bolmo works and how it was built

Ai2 said it trained the Bolmo models on its Dolma 3 data mix, which also helped train its flagship Olmo models, supplemented with open code datasets and character-level data. The company said its goal "is to provide a reproducible, inspectable blueprint for byteifying strong subword language models in a way the community can adopt and extend." To meet this goal, Ai2 will release its checkpoints, code, and a full paper to help other organizations build byte-level models on top of its Olmo ecosystem.

Since training a byte-level model completely from scratch can get expensive, Ai2 researchers instead chose an existing Olmo 3 7B checkpoint and byteified it in two stages. In the first stage, Ai2 froze the Olmo 3 transformer and trained only the new byte-level components: the local encoder and decoder, the boundary predictor, and the language modeling head. This stage was designed to be "cheap and fast," requiring just 9.8 billion tokens. The second stage unfreezes the model and trains it on additional tokens. Ai2 said the byte-level approach allows Bolmo to avoid the vocabulary bottlenecks that limit traditional subword models.
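To make the byte-level input concrete, here is a minimal Python illustration (standard library only) of what "operating directly on raw UTF-8 bytes" means: the model's input IDs are just the bytes of the text, so the vocabulary is at most 256 symbols plus any special tokens, and nothing can fall out of vocabulary.

    # The input to a byte-level model is simply the UTF-8 encoding of the
    # text; diacritics, typos, and rare scripts all map to in-vocabulary bytes.
    text = "Héllo wörld"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)  # [72, 195, 169, 108, 108, 111, 32, 119, 195, 182, 114, 108, 100]

    # Decoding is the exact inverse; no detokenization heuristics are needed.
    assert bytes(byte_ids).decode("utf-8") == text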
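The two-stage recipe also lends itself to a short sketch. The PyTorch below is a hypothetical illustration, not Ai2's code: the module names (local_encoder, boundary_predictor, and so on) are stand-ins taken from the description above, and the tiny layer sizes exist only to make the example runnable.

    import torch

    class ByteifiedLM(torch.nn.Module):
        """Hypothetical stand-in for a byteified Olmo checkpoint."""
        def __init__(self):
            super().__init__()
            # Pretrained subword backbone (frozen in stage 1).
            self.transformer = torch.nn.TransformerEncoder(
                torch.nn.TransformerEncoderLayer(d_model=64, nhead=4),
                num_layers=2)
            # New byte-level components (trained in stage 1).
            self.local_encoder = torch.nn.Linear(256, 64)
            self.local_decoder = torch.nn.Linear(64, 256)
            self.boundary_predictor = torch.nn.Linear(64, 1)
            self.lm_head = torch.nn.Linear(64, 256)

    def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
        for p in module.parameters():
            p.requires_grad = trainable

    model = ByteifiedLM()

    # Stage 1: freeze the pretrained transformer; train only the new
    # byte-level components (the stage Ai2 calls "cheap and fast").
    set_trainable(model.transformer, False)
    for part in (model.local_encoder, model.local_decoder,
                 model.boundary_predictor, model.lm_head):
        set_trainable(part, True)

    # Stage 2: unfreeze everything and continue training on more tokens.
    set_trainable(model, True)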
Strong performance among its peers

Byte-level language models are not as mainstream as small language models or LLMs, but research interest in the field is growing. Meta released its Byte Latent Transformer (BLT) architecture research last year, aiming to offer a model that is robust, processes raw data, and doesn't rely on fixed vocabularies. Other research models in this space include ByT5, Stanford's MrT5, and Canine.

Ai2 evaluated Bolmo with its own evaluation suite, covering math, STEM reasoning, question answering, general knowledge, and code. Bolmo 7B showed strong performance, scoring well on character-focused benchmarks like CUTE and EXECUTE and improving accuracy over its base model, Olmo 3. It also outperformed models of comparable size in coding, math, multiple-choice QA, and character-level understanding.

Why enterprises may choose byte-level models

Enterprises find value in a hybrid model structure, using a mix of models and model sizes. Ai2 makes the case that organizations should also consider byte-level models, not only for robustness and multilingual understanding, but because the architecture "naturally plugs into an existing model ecosystem." "A key advantage of the dynamic hierarchical setup is that compression becomes a toggleable knob," the company said; a rough sketch of that idea appears below.

For enterprises already running heterogeneous model stacks, Bolmo suggests that byte-level models may no longer be purely academic. By retrofitting a strong subword model rather than training from scratch, Ai2 is signaling a lower-risk path for organizations that want robustness without abandoning existing infrastructure.
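Ai2's announcement does not spell out the mechanics of that knob, but in dynamic hierarchical byte models of this kind (Meta's BLT is the best-known example), a learned boundary predictor scores each byte position and a threshold over those scores decides where patches end. The hypothetical Python sketch below shows only that general idea: raising the threshold produces fewer, longer patches, trading fidelity for less compute in the global transformer.

    import torch

    def patch_ends(scores: torch.Tensor, threshold: float) -> torch.Tensor:
        """Boolean mask of patch-end positions, given per-byte boundary
        scores in [0, 1]; a higher threshold means more compression."""
        return scores >= threshold

    scores = torch.rand(64)       # stand-in for learned boundary scores
    for t in (0.3, 0.6, 0.9):     # the "toggleable knob"
        n = int(patch_ends(scores, t).sum())
        print(f"threshold={t}: ~{64 / max(n, 1):.1f} bytes per patch")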