
A benchmark of expert-level academic questions to assess AI capabilities


Related works

LLM benchmarks

Benchmarks are important tools for tracking the rapid advancement of LLM capabilities, including general and scientific knowledge1,10,12,13,14,15, mathematical reasoning16,17,18,19,20,21, code generation22,23,24,25,26,27,28 and general-purpose human assistance7,29,30,31,32,33,34,35. Owing to their objectivity and ease of automated scoring at scale, evaluations commonly include multiple-choice and short-answer questions31,36,37,38,39, with benchmarks such as MMLU1 also spanning a broad range of academic disciplines and levels of complexity.
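To make the automated-scoring point concrete, the sketch below shows the kind of exact-match grader that multiple-choice and short-answer benchmarks typically rely on; the normalization rules and the example answers are illustrative assumptions, not the scoring harness of any particular benchmark.

```python
# Minimal exact-match grader for multiple-choice and short-answer questions.
# The normalization rules here are illustrative assumptions.

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so that superficial
    formatting differences do not affect scoring."""
    return " ".join(text.lower().split())


def score(predictions: list[str], references: list[str]) -> float:
    """Return accuracy as the fraction of exact (normalized) matches."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0


# Example: two multiple-choice answers and one short answer, all correct.
print(score(["B", "c", "photosynthesis"], ["B", "C", "Photosynthesis"]))  # 1.0
```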

Saturation and frontier benchmark design

However, state-of-the-art models now achieve nearly perfect scores on many existing evaluations, obscuring the full extent of current and future frontier AI capabilities40,41,42,43. This has motivated the development of more challenging benchmarks that test for multi-modal capabilities17,22,24,44,45,46,47,48,49,50, strengthen existing benchmarks32,44,45,51,52, filter questions over multiple stages of review9,12,19,42,53,54 and use experts to write tests for advanced academic knowledge9,12,19,54,55,56. HLE (Humanity's Last Exam) combines these approaches: its questions are developed by subject-matter experts and undergo multiple rounds of review, and the benchmark preserves the broad subject-matter coverage of MMLU. As a result, HLE provides a clear measurement of the gap between current AI capabilities and human expertise on closed-ended academic tasks, complementing other assessments of advanced capabilities in open-ended domains57,58.

Dataset

Submission process

To ensure question difficulty, we automatically check the accuracy of frontier LLMs on each question before submission. Our testing process uses multi-modal LLMs for text-and-image questions (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet and o1) and adds two non-multi-modal models (o1-mini and o1-preview) for text-only questions. We use different submission criteria by question type: exact-match questions must stump all models, whereas multiple-choice questions must stump all but one model, to account for potential lucky guesses. Users are instructed to submit only questions that meet these criteria. We note that, owing to non-determinism in the models and the non-zero guessing floor of multiple-choice questions, subsequent evaluation on the dataset exhibits low but non-zero accuracy.
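As a rough illustration of this filter, the sketch below encodes the submission criteria described above; the model lists follow the paper, while the question fields and the two callables for querying a model and grading its answer are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, Dict, List

# Models used for the pre-submission difficulty check (from the paper);
# text-only questions are additionally tested against o1-mini and o1-preview.
MULTI_MODAL_MODELS = ["GPT-4o", "Gemini 1.5 Pro", "Claude 3.5 Sonnet", "o1"]
TEXT_ONLY_EXTRAS = ["o1-mini", "o1-preview"]


def passes_difficulty_filter(
    question: Dict,
    ask_model: Callable[[str, str], str],     # hypothetical: (model name, prompt) -> model answer
    is_correct: Callable[[str, Dict], bool],  # hypothetical: (answer, question) -> matches reference?
) -> bool:
    """Return True if the question meets the submission criteria.

    Exact-match questions must stump every model; multiple-choice
    questions may be answered correctly by at most one model, to
    account for lucky guesses.
    """
    models: List[str] = list(MULTI_MODAL_MODELS)
    if not question.get("has_image", False):  # text-only questions face the two extra models
        models += TEXT_ONLY_EXTRAS

    n_correct = sum(1 for m in models if is_correct(ask_model(m, question["prompt"]), question))
    allowed = 1 if question.get("is_multiple_choice", False) else 0
    return n_correct <= allowed
```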

Post-release

Late contributions
