The quest to build increasingly powerful artificial-intelligence systems demands a clear definition of what counts as intelligence and how it should be measured. AI systems are typically assessed using tests called benchmarks. These are often sets of question–answer pairs in which each question has a definitive, verifiable answer, enabling the system's responses to be scored automatically. Benchmarks have been used to assess how quickly frontier AI models (such as those behind OpenAI’s ChatGPT and Google’s Gemini systems) are improving in capabilities ranging from general common sense1 and domain-specific knowledge2 to code generation3 and mathematical problem-solving4. Over time, however, many benchmarks become less effective at identifying genuine advances, a phenomenon known as benchmark saturation.
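The "definitive, verifiable answer" property is what makes automatic scoring possible: a model's output is simply compared with the reference answer, and accuracy is reported as the fraction of matches. The sketch below is a minimal illustration of that idea under stated assumptions, not a description of any particular benchmark; the toy question–answer items and the ask_model function are hypothetical placeholders.

```python
# Minimal sketch of automatic benchmark scoring by exact-match comparison.
# The benchmark items and ask_model() below are hypothetical placeholders,
# not drawn from any real benchmark discussed in the article.

def normalise(text: str) -> str:
    """Lower-case and strip whitespace so trivial formatting differences are not counted as errors."""
    return text.strip().lower()

def score(benchmark: list[dict], ask_model) -> float:
    """Return the fraction of questions whose model answer exactly matches the reference answer."""
    correct = sum(
        normalise(ask_model(item["question"])) == normalise(item["answer"])
        for item in benchmark
    )
    return correct / len(benchmark)

if __name__ == "__main__":
    # Toy question-answer pairs, each with a definitive, verifiable answer.
    benchmark = [
        {"question": "What is 7 * 8?", "answer": "56"},
        {"question": "Which planet is closest to the Sun?", "answer": "Mercury"},
    ]
    # Stand-in for a call to an AI model; this stub always answers "56".
    ask_model = lambda question: "56"
    print(f"Accuracy: {score(benchmark, ask_model):.2f}")  # prints 0.50 for this stub
```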
Nature 649, 1115-1116 (2026)
doi: https://doi.org/10.1038/d41586-025-04098-x
References
1. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. in Proc. 57th Annu. Meet. Assoc. Comput. Linguist. 4791–4800 (Association for Computational Linguistics, 2019).
2. Hendrycks, D. et al. in 8th Int. Conf. Learn. Represent. (ICLR, 2020).
3. Jimenez, C. E. et al. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.06770 (2024).
4. Cobbe, K. et al. Preprint at arXiv https://doi.org/10.48550/arXiv.2110.14168 (2021).
5. Center for AI Safety, Scale AI & HLE Contributors Consortium. Nature 649, 1139–1146 (2026).
6. Collins, K. M. et al. Nature Hum. Behav. 8, 1851–1863 (2024).
7. Chu, J., Tenenbaum, J. B. & Schulz, L. E. Trends Cogn. Sci. 28, 628–642 (2024).
8. Getzels, J. W. in Frontiers of Creativity Research: Beyond the Basics (ed. Isaksen, S. G.) 88–102 (Bearly, 1987).
Competing Interests
The authors know some colleagues who participated in the HLE question generation and review process.