Tech News

Steering interpretable language models with concept algebra


We show that Steerling-8B enables concept algebra: you can add, remove, and compose human-understandable concepts at inference time to directly control what the model generates, without retraining or prompt engineering.

Concept Algebra with Steerling-8B

What if you could directly edit the internal representations of a model towards any concept you care about, without changing the prompt? Steerling-8B’s architecture natively supports injecting and suppressing any concept the model has learned, directly at inference time.
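To make the idea concrete, here is a minimal sketch of what injecting and suppressing concept directions in a hidden state could look like. The concept directions, the `steer` function, and the `alpha` strength are illustrative stand-ins, not Steerling-8B's actual internals.

```python
# Illustrative sketch of inference-time concept steering: injection adds a
# concept direction to the hidden state; suppression projects it out.
HIDDEN = 8

def basis(i):
    """Unit vector along axis i; orthogonal directions keep the math exact."""
    return [1.0 if j == i else 0.0 for j in range(HIDDEN)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical learned concept directions (toy stand-ins).
concepts = {"formality": basis(0), "toxicity": basis(1)}

def steer(h, inject=(), suppress=(), alpha=4.0):
    """Add injected concept directions; project out suppressed ones."""
    h = list(h)
    for name in inject:
        v = concepts[name]
        h = [x + alpha * c for x, c in zip(h, v)]
    for name in suppress:
        v = concepts[name]
        coef = dot(h, v)
        h = [x - coef * c for x, c in zip(h, v)]  # zero the concept component
    return h

h = [1.0] * HIDDEN
steered = steer(h, inject=["formality"], suppress=["toxicity"])
print(dot(steered, concepts["formality"]))  # 5.0 (original 1.0 + alpha)
print(dot(steered, concepts["toxicity"]))   # 0.0
```

After steering, the suppressed concept's component is exactly zero while the injected one is amplified: the prompt never changes, only the representation does.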

In multi-turn dialog settings, steering one concept at a time is insufficient. You need compositional control, not just on a neutral prompt, but on a conversation that is already shaped by prior context. Consider a content moderation system that must suppress toxicity yet preserve fluency, or a health assistant that must provide medical guidance while navigating the legal ramifications of its advice.
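Compositionality falls out of the algebra: several signed concept edits can be folded into a single delta applied to a hidden state already shaped by earlier turns. The sketch below is hypothetical; the concept names and weights are illustrative choices, not values from the model.

```python
# Illustrative sketch of compositional steering: positive weights inject
# concepts, negative weights suppress them, all in one composed edit.
HIDDEN = 8

def basis(i):
    return [1.0 if j == i else 0.0 for j in range(HIDDEN)]

# Toy stand-ins for learned concept directions.
concepts = {
    "toxicity": basis(0),
    "fluency": basis(1),
    "legal_guidance": basis(2),
}

def compose(edits):
    """Sum signed concept directions into a single steering delta."""
    delta = [0.0] * HIDDEN
    for name, weight in edits.items():
        delta = [d + weight * c for d, c in zip(delta, concepts[name])]
    return delta

h_context = [1.0] * HIDDEN  # stand-in for a context-laden hidden state

# One composed edit: damp toxicity, lift fluency, add legal grounding.
delta = compose({"toxicity": -3.0, "fluency": 1.5, "legal_guidance": 2.0})
h_steered = [x + d for x, d in zip(h_context, delta)]
print(h_steered[:3])  # [-2.0, 2.5, 3.0]
```

Because the edits combine linearly, the same composed delta can be reused across turns or adjusted per concept without rebuilding the others.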

The demonstration below shows how Steerling-8B enables exactly this capability with concept algebra.

[Interactive demo] Step 1: Inject the tenant-landlord legal concept.

Current LLMs are not built to be reliably steered

Current methods for controlling language model behavior are blunt instruments.

Prompting is accessible but often unreliable. System prompts can be overridden through adversarial inputs. Few-shot examples consume context and don’t reliably generalize. More critically, prompting doesn’t reveal which internal mechanisms drove the result, so if your goal changes, nothing from one session transfers to the next.
