We are releasing Steerling-8B, the first interpretable model that can trace any token it generates back to its input context, to concepts a human can understand, and to its training data. Trained on 1.35 trillion tokens, the model achieves downstream performance within range of models trained on 2–7× as much data. Steerling-8B unlocks several new capabilities: suppressing or amplifying specific concepts at inference time without retraining, training-data provenance for any generated chunk, and inference-time alignment via concept control, replacing thousands of safety training examples with explicit concept-level steering.
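The concept suppression and amplification described above can be sketched schematically. Everything below is an illustrative assumption rather than Steerling's actual interface: we assume a concept corresponds to a direction vector in the model's hidden space, and that steering adds a scaled copy of that direction to a hidden state, with positive strength amplifying the concept and negative strength suppressing it.

```python
import numpy as np

def steer_hidden_state(hidden, concept_vector, strength):
    """Add a scaled concept direction to a hidden state.

    strength > 0 amplifies the concept; strength < 0 suppresses it.
    Function and argument names are illustrative, not Steerling's real API.
    """
    direction = concept_vector / np.linalg.norm(concept_vector)
    return hidden + strength * direction

# Toy example: a 4-dim hidden state and a made-up "clinical tone" direction.
hidden = np.array([0.5, -1.0, 0.2, 0.8])
clinical = np.array([1.0, 0.0, 1.0, 0.0])

amplified = steer_hidden_state(hidden, clinical, strength=2.0)
suppressed = steer_hidden_state(hidden, clinical, strength=-2.0)
```

Because steering happens at inference time, the same base weights serve both the amplified and suppressed behaviors; no retraining is involved.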
Overview
For the first time, a language model at the 8-billion-parameter scale can explain every token it produces in three key ways. For any group of output tokens that Steerling generates, we can trace those tokens to:
[Input context] the prompt tokens,
[Concepts] human-understandable topics in the model’s representations, and
[Training data] the training data that drove the output.
Artifacts
We are releasing the weights of a base model trained on 1.35T tokens, along with companion code to interact with and explore the model.
Steerling-8B in Action
Below we show Steerling-8B generating text from prompts across several categories. You can select an example, then click on any highlighted chunk of the output. The panel below will update to show:
- Input feature attribution: which tokens in the input prompt strongly influenced that chunk.
- Concept attribution: the ranked list of concepts, both tone (e.g. analytical, clinical) and content (e.g. Genetic alteration methodologies), that the model routed through to produce that chunk.
- Training data attribution: how the concepts in that chunk distribute across training sources (ArXiv, Wikipedia, FLAN, etc.), showing where in the training data the model’s knowledge originates.
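As a rough illustration of the training-data attribution view, the snippet below shows one plausible shape for a per-chunk attribution record and how raw per-source scores could be normalized into a distribution. The record structure, field names, and values are invented for illustration; they are not Steerling's actual output format.

```python
# Hypothetical per-chunk attribution record; all names and values are invented.
chunk_attribution = {
    "input_tokens": [("genome", 0.41), ("editing", 0.27), ("therapy", 0.12)],
    "concepts": [("Genetic alteration methodologies", 0.63), ("clinical", 0.21)],
    "training_sources": {"ArXiv": 3.2, "Wikipedia": 1.1, "FLAN": 0.7},
}

def source_distribution(raw_scores):
    """Normalize raw per-source attribution scores into a distribution."""
    total = sum(raw_scores.values())
    return {src: score / total for src, score in raw_scores.items()}

dist = source_distribution(chunk_attribution["training_sources"])
# dist maps each training source to its fraction of the attribution mass.
```

A panel like the one described above could then render `dist` directly, e.g. as a bar per training source.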