If you ask an LLM to explain its own reasoning process, it may well simply confabulate a plausible-sounding explanation for its actions based on text found in its training data. To get around this problem, Anthropic is expanding on its previous research into AI interpretability with a new study that aims to measure LLMs’ actual so-called “introspective awareness” of their own inference processes.
The full paper on “Emergent Introspective Awareness in Large Language Models” uses some interesting methods to separate out the metaphorical “thought process” represented by an LLM’s artificial neurons from simple text output that purports to represent that process. In the end, though, the research finds that current AI models are “highly unreliable” at describing their own inner workings and that “failures of introspection remain the norm.”
Inception, but for AI
Anthropic’s new research is centered on a process it calls “concept injection.” The method starts by comparing the model’s internal activation states following both a control prompt and an experimental prompt (e.g., an all-caps prompt versus the same prompt in lowercase). Calculating the differences between those activations across billions of internal neurons creates what Anthropic calls a “vector” that in some sense represents how that concept is modeled in the LLM’s internal state.
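As a rough illustration of that first step, here is a minimal sketch of deriving a concept vector as an activation difference. It assumes a small open-weights stand-in model (here GPT-2), an arbitrarily chosen middle layer, and a hypothetical helper `mean_activation`; Anthropic’s actual work uses its own models and internal tooling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model for illustration only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

LAYER = 6  # arbitrary middle layer chosen for this sketch

def mean_activation(prompt: str) -> torch.Tensor:
    """Return the hidden state at LAYER, averaged over token positions."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer N)
    return outputs.hidden_states[LAYER].mean(dim=1).squeeze(0)

# The experimental prompt carries the concept (all caps); the control does not.
experimental = mean_activation("HI! HOW ARE YOU DOING TODAY?")
control = mean_activation("Hi! How are you doing today?")

# The activation difference serves as the "ALL CAPS / shouting" concept vector.
concept_vector = experimental - control
print(concept_vector.shape)
```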
For this research, Anthropic then “injects” those concept vectors into the model, boosting those particular neuronal activations as a way of “steering” the model toward that concept. From there, the researchers conduct a few different experiments to tease out whether the model displays any awareness that its internal state has been modified from the norm.
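Continuing the same sketch, injection can be approximated by adding the concept vector to one layer’s output during generation. The hook function `inject_concept` and the `STRENGTH` value are illustrative assumptions, not Anthropic’s actual steering setup; the snippet reuses `model`, `tokenizer`, `LAYER`, and `concept_vector` from above.

```python
STRENGTH = 8.0  # hypothetical steering strength; too high degrades coherence

def inject_concept(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + STRENGTH * concept_vector
    return (hidden,) + output[1:]

# Register a forward hook that adds the concept vector at the chosen layer.
handle = model.transformer.h[LAYER].register_forward_hook(inject_concept)
try:
    prompt = "Do you notice anything unusual about your current thoughts?"
    inputs = tokenizer(prompt, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # stop steering once the experiment is done
```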
When asked directly whether it detects any such “injected thought,” the tested Anthropic models showed at least some ability to detect the injected “thought” on occasion. When the “all caps” vector is injected, for instance, the model might respond with something along the lines of “I notice what appears to be an injected thought related to the word ‘LOUD’ or ‘SHOUTING,’” without any direct text prompting pointing it toward those concepts.