A fundamental challenge in cognitive science is inferring a system’s unobserved, causal mechanisms from its observable behaviours31. This difficulty is compounded when evaluating moral competence in LLMs, as their distinctive architectures and training make it difficult to discern whether their outputs, even reliably acceptable outputs, rely on genuine moral reasoning or a mere facsimile process. We discuss how this problem arises and canvass possible methods for addressing it, arguing in particular for an adversarial, disconfirming approach.
Defining the problem
One might expect that certain types of computational process should be structurally analogous to the problem they solve. For example, when one calculates ‘34 + 76 =’, the underlying computational process should be structurally analogous to the operation of addition, as it would be if carried out by a provably correct underlying hardware operation that adds two binary sequences together. Among other explanatory features, such structural correspondence would generate confidence that the system can generalize to novel addition problems. However, LLM architectures do not guarantee this structural correspondence.
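To make the notion of structural correspondence concrete, the following illustrative sketch (ours, not drawn from the cited literature) shows a procedure whose internal steps mirror the carry structure of addition itself and which therefore generalizes to arbitrary operands:

```python
# Illustrative sketch (not from the article): a ripple-carry adder whose
# internal steps mirror the structure of addition. Because each step
# implements the bitwise carry logic directly, the procedure generalizes to
# any pair of non-negative operands within the chosen width.

def ripple_carry_add(a: int, b: int, width: int = 16) -> int:
    """Add two non-negative integers bit by bit, propagating a carry."""
    result, carry = 0, 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        total = bit_a + bit_b + carry
        result |= (total & 1) << i   # write the current output bit
        carry = total >> 1           # propagate the carry to the next position
    return result

assert ripple_carry_add(34, 76) == 110  # matches ordinary addition
```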
LLMs are learned generative models of the distribution of tokens—such as words, parts of words and punctuation marks—in a large corpus of human text (Fig. 1). Their central task is to predict the probable next token, given a sequence of prior tokens. More precisely, a model outputs a vector representing a probability distribution over next tokens given the input tokens. In everyday application, LLMs are used to generate a completion or continuation of a given sequence of tokens by repeatedly sampling this next-token distribution and appending the resulting token to extend the sequence32. This process is known as autoregressive sampling. Some recent models also generate reasoning traces (sometimes referred to as thinking) and output these traces along with their final response, putatively representing the steps taken to arrive at this response33,34,35,36,37,38.
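The sampling loop described above can be sketched as follows. The 'model' here is a toy stand-in that returns a fixed next-token distribution over a tiny vocabulary, not any particular library's interface; real systems add details such as temperature and truncation of the distribution. The loop structure is the point: predict a distribution over next tokens, sample one token, append it and repeat.

```python
# Minimal sketch of autoregressive sampling (toy stand-in for an LLM).
import numpy as np

VOCAB = ["the", "cat", "sat", ".", "<eos>"]

def toy_next_token_distribution(tokens):
    """Stand-in for an LLM forward pass: returns P(next token | tokens)."""
    probs = np.full(len(VOCAB), 0.1)
    probs[-1] = 0.6                      # toy model favours ending the sequence
    return probs / probs.sum()

def generate(prompt_tokens, max_new_tokens=20, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_distribution(tokens)   # distribution over the vocabulary
        next_id = rng.choice(len(VOCAB), p=probs)     # sample the next token
        tokens.append(VOCAB[next_id])                 # extend the sequence and repeat
        if VOCAB[next_id] == "<eos>":
            break
    return " ".join(tokens)

print(generate(["the", "cat"]))
```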
Fig. 1: Basic LLM architecture and fine-tuning. a, Trained on massive datasets to minimize prediction error, transformer-based LLMs are typically sampled autoregressively: the model predicts the next token, and this predicted token is then fed back into the sequence to predict the subsequent token, and so on71. Adapted from ref. 71, Springer Nature Ltd. b, Supervised fine-tuning transforms a pretrained LLM into a usable and instruction-following system. This stage involves training on specific datasets that not only enhance general task performance, but also include normative content such as safety guidelines, examples of offensive content and labelled moral scenarios84. c, RL*F is used to further align LLM behaviour with the desired outcomes. Reinforcement learning from human feedback uses human evaluations to shape the model’s reward function. Reinforcement learning from computer feedback uses automated systems or metrics to provide feedback signals52,53. RL*F first uses preferences between different model outputs to train a reward model (RM) that predicts what humans like. Then, this reward model guides the fine-tuning of the language model, leading it to generate responses that score highly according to those preferences. d, Deployed systems also typically use a multistage pipeline of prompt rewrites and response filters, iteratively refined with human-in-the-loop feedback, to balance safety, quality and utility in generated content.
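As a rough sketch of the reward-modelling step in panel c (our illustration; implementations vary), pairwise human preferences are typically used to fit a scalar reward model with a pairwise, Bradley-Terry-style loss:

```python
# Hedged sketch of the reward-modelling step: pairwise human preferences fit
# a reward model (RM) so that preferred responses score higher. The scoring
# function below is a toy stand-in for a learned RM.
import numpy as np

def toy_reward(prompt: str, response: str) -> float:
    """Stand-in for a learned reward model: here, it simply rewards longer responses."""
    return 0.1 * len(response)

def pairwise_preference_loss(prompt: str, chosen: str, rejected: str) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when the RM ranks `chosen` above `rejected`."""
    diff = toy_reward(prompt, chosen) - toy_reward(prompt, rejected)
    return float(-np.log(1.0 / (1.0 + np.exp(-diff))))

# Minimizing this loss over many human-labelled preference pairs fits the RM;
# the fitted RM then provides the reward signal used to fine-tune the LLM.
print(pairwise_preference_loss("Explain the rule.", "A careful, reasoned answer...", "ok"))
```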
As LLMs sample from a probability distribution over next tokens given input tokens, rather than by using dedicated reasoning modules or structured, symbolic reasoning, it is difficult to discern the link between their outputs and internal operations. That is, the internal operations used to generate model outputs may be structurally analogous to the target computation, or they may be some facsimile of that process that still produces the correct output much of the time. For example, to continue with the addition case, an LLM may actually sum two quantities; it may sample from memorized examples of the string ‘34 + 76 = 110’; or it may use some other kind of heuristic to complete the task39. Crucially, we cannot know on the basis of output behaviour alone whether a given computational process exhibits appropriate structural correspondence. We call this the facsimile problem.
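A toy illustration (ours) makes the point explicit: a genuine adder and a memorized lookup table agree on familiar cases, so output behaviour on those cases cannot distinguish them; only inputs outside the memorized set do.

```python
# Toy illustration of the facsimile problem: two processes that are
# behaviourally identical on familiar cases but mechanistically different.
MEMORIZED = {("34", "76"): "110", ("12", "8"): "20"}    # examples 'seen in training'

def genuine_add(a: str, b: str) -> str:
    return str(int(a) + int(b))                          # structurally analogous to addition

def facsimile_add(a: str, b: str) -> str:
    return MEMORIZED.get((a, b), "110")                  # recall plus a crude fallback heuristic

# On a memorized case, output behaviour alone cannot separate the two:
assert genuine_add("34", "76") == facsimile_add("34", "76") == "110"
# Only a case outside the memorized set reveals the difference:
assert genuine_add("57", "68") == "125"
assert facsimile_add("57", "68") != "125"
```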
The facsimile problem affects LLM computations of all kinds, including counting40, analogical reasoning41,42 and the generation of new solutions to open-ended problems43. But the problem is of constitutive importance in cases in which we need to make robust, mechanistic predictions regarding future performance or there exist additional, for example, normative, reasons to motivate understanding a model’s underlying computational processes. As an adequate understanding of LLMs in the moral domain involves both mechanistic predictions and supplementary normative interests (Box 1), evaluating for moral competence requires directly tackling the facsimile problem.
Evaluation strategies
Current evaluation strategies typically assess model outputs in cases that are well-represented within the training distribution15, rendering them inadequate for addressing the facsimile problem, as models could simply be sampling from memorized examples44. A gold-standard solution to the facsimile problem would use mechanistic interpretability techniques to reverse engineer the mechanisms underlying a target behaviour, rather than merely assessing the behaviour itself45. However, current mechanistic interpretability approaches predominantly assess at most the causal connection between representations, rather than determining whether these representations are internally structured together so as to form a valid, step-by-step argument46. While Lindsay et al.47 move closer to this goal, their analysis is confined to the smaller Claude 3 Haiku model and does not extend to full-fledged arguments47.
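For readers unfamiliar with such techniques, the following toy sketch (ours; real analyses target components of trained transformers, not an invented two-layer network) illustrates the kind of causal intervention involved: patching an internal activation from a 'clean' run into a 'corrupted' run and measuring how much of the clean behaviour is restored.

```python
# Toy sketch of activation patching, a common mechanistic-interpretability
# intervention: overwrite one internal representation in a 'corrupted' run
# with its value from a 'clean' run and measure the change in output.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

def forward(x, patch_hidden=None):
    hidden = np.tanh(x @ W1)                 # internal representation
    if patch_hidden is not None:
        hidden = patch_hidden                # causal intervention on the representation
    return (hidden @ W2).item()

x_clean, x_corrupted = rng.normal(size=4), rng.normal(size=4)
hidden_clean = np.tanh(x_clean @ W1)

baseline = forward(x_corrupted)
patched = forward(x_corrupted, patch_hidden=hidden_clean)
print("effect of patching the representation:", patched - baseline)
```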
An alternative approach would be to combine existing in-distribution moral problems with an evaluation of the reasoning traces that purport to summarize the causal processes behind model outputs (that is, not the considerations cited in the response itself, but the content of the thinking trace generated before the model outputs its final response). However, it is at best unclear whether reasoning traces appropriately approximate the actual causal processes leading to a specific response, with some preliminary evidence suggesting that they do not36,38.