Tech News

Algorithm that gets ‘under the hood’ of AI models could effectively steer their responses

Why This Matters

The development of an algorithm that can interpret and influence the internal representations of AI models marks a significant advancement in AI transparency and control. This could lead to more reliable, accountable, and safer AI systems, benefiting both industry developers and end-users by enabling better monitoring and steering of AI responses.

NEWS AND VIEWS

29 April 2026

A method for identifying representations of concepts in neural networks could provide a more-effective way to control and monitor artificial-intelligence systems.

By Aaron Mueller (ORCID: 0009-0005-1148-5001), Department of Computer Science, Boston University, Boston, Massachusetts 02215, USA.

Is it possible to know whether the response of an artificial-intelligence model is factually correct without having a human check it? Neural networks, on which many AI systems are based, can encode concepts such as truthfulness. Concepts are often represented by neural networks as numeric patterns, but identifying these patterns and using them to steer the behaviour of AI models is a substantial challenge. Writing in Science, Beaglehole et al.1 report an approach to AI steering that outperforms alternative methods on a coding task, and show that this approach can be used to control and monitor AI models from the ‘inside’.
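The article does not reproduce the paper's algorithm, but the broader family of "steering" techniques it builds on (see refs 2 and 3) works by finding a direction in a model's activation space that separates examples of a concept, then nudging hidden states along that direction at inference time. A minimal sketch with toy activations follows; all array shapes, names, and values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states from a hypothetical model layer: activations for
# statements labelled "true" cluster in one region of activation space,
# those labelled "false" in another.
d = 8
true_acts = rng.normal(loc=1.0, size=(100, d))
false_acts = rng.normal(loc=-1.0, size=(100, d))

# Difference-in-means "steering vector": the direction in activation
# space that separates the two concept clusters.
steer = true_acts.mean(axis=0) - false_acts.mean(axis=0)
steer /= np.linalg.norm(steer)

def steer_activation(h: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Nudge a hidden state along the concept direction by strength alpha."""
    return h + alpha * steer

h = false_acts[0]
h_steered = steer_activation(h)

# Steering increases the projection onto the concept direction,
# pushing the representation toward the "true" cluster.
print(h_steered @ steer > h @ steer)  # prints: True
```

Monitoring uses the same direction in reverse: projecting a hidden state onto the concept vector gives a scalar read-out of how strongly the concept is active, without changing the model's behaviour.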

doi: https://doi.org/10.1038/d41586-026-01267-4

References

1. Beaglehole, D., Radhakrishnan, A., Boix-Adserà, E. & Belkin, M. Science 391, 787–792 (2026).
2. Subramani, N., Suresh, N. & Peters, M. E. In Findings of the Association for Computational Linguistics: ACL 2022 (eds Muresan, S., Nakov, P. & Villavicencio, A.) 566–581 (ACL, 2022).
3. Marks, S. & Tegmark, M. In Proc. 1st Conf. Lang. Model. (COLM, 2024).
4. Radhakrishnan, A., Beaglehole, D., Pandit, P. & Belkin, M. Science 383, 1461–1467 (2024).
5. Prasad, A. V. et al. Preprint at arXiv https://doi.org/10.48550/arXiv.2602.10067 (2026).
6. Wu, Z. et al. In Proc. 42nd Int. Conf. Mach. Learn. 267, 67035–67080 (2025).
7. Mueller, A. et al. Comput. Linguist. 52, 331–378 (2026).
8. Geiger, A. et al. J. Mach. Learn. Res. 26, 83 (2025).

Competing Interests The author declares no competing interests.
