Our research methodology is centred around two core automated systems: an AI scientist for generating new scientific research and an automated reviewer for rigorous evaluation. These systems work in concert to explore the potential of AI in accelerating scientific discovery.
The AI Scientist
The AI Scientist is an agentic system designed to autonomously conduct machine learning research. We present results for two modes: a template-based system that extends human-provided code and a more open-ended template-free system that operates with much less prior guidance. The detailed prompts used for each system are provided in Supplementary Information sections A.1.1 and A.2.6. More results and analyses of the papers generated by each system are provided in Supplementary Information sections B.1, C.1, C.2, D.1 and D.2.
Foundational technologies
Both versions are built upon autoregressive LLMs3,4,5, which learn to generate text by modelling the conditional probability of each new token given the preceding tokens. Through scaling of data and model size, LLMs exhibit human-like abilities, including reasoning and code generation. The AI Scientist leverages agentic patterns49, such as few-shot prompting50 and self-reflection51, to improve performance and reliability. For code generation, the template-based system uses the state-of-the-art open-source coding assistant Aider52, which is designed to implement features, fix bugs or refactor code in existing codebases. To make more effective use of test-time compute, the template-free system instead uses LLMs to power a tree search, without relying on Aider.
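The next-token objective underlying these models can be made explicit. Writing a token sequence as $x_{1:T}$, an autoregressive LLM factorizes its probability into per-token conditionals and is trained to maximize the resulting log-likelihood:

```latex
% Autoregressive factorization of a token sequence x_{1:T}
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{1:t-1}\right),
\qquad
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{1:t-1}\right).
```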
Template-based AI Scientist
The system is provided with a starting code template that reproduces a simple training run from a popular algorithm on a standard benchmark (for example, training a small transformer53 on the works of Shakespeare). Its workflow unfolds in three phases:
1. Idea generation: The process begins with a simple experiment defined by a human-provided code template. The system then enters an iterative loop of idea generation and refinement, using LLMs as a mutation operator. In each iteration, it proposes a batch of new research ideas that are variations or extensions of existing ideas in its growing archive. Each idea is a structured object containing a descriptive title, a summary of the core hypothesis, a detailed experimental plan, and self-assessed scores for interestingness, novelty and feasibility (each on a 1–10 scale). This iterative growth of an idea archive was inspired by open-endedness algorithms that maintain a diverse collection of artefacts20,54. To enforce novelty, each proposed idea is automatically checked against the scientific literature through the Semantic Scholar API31; ideas with high semantic similarity to existing works are discarded. The system is prompted to act as an ‘ambitious AI PhD student who is looking to publish a paper that will contribute significantly to the field’. For the novelty assessment, the system conducts up to ten rounds of literature search queries, refining its search in each round on the basis of previous results.

2. Experiment execution: Once a promising idea is selected from the archive, the system devises a multi-step experimental plan comprising up to five experiments, which it executes sequentially using Aider to modify the codebase. A key feature of this phase is its robustness to runtime errors: the system automatically detects execution failures, captures the error logs and invokes an instance of the Aider agent52 to perform automated debugging. The agent is prompted with the failing code and the error message and generates a patch, with up to four reattempt cycles per experiment. The corrected code is then used to rerun the experiment, subject to a timeout of 7,200 s per experiment.
All experimental outcomes, including metrics, generated plots and observations, are logged in an experimental journal. This journal serves as a form of memory and informs the subsequent steps in the experimental plan.

3. Manuscript generation: Upon completing the experimental phase, the system synthesizes the findings into a full scientific paper. To do so, it uses Aider to populate a standard conference LaTeX template, writing sections including the introduction, methods, results and conclusion. The results section is written by analysing the experimental journal, summarizing key findings and embedding the generated figures. To situate the work within the broader scientific context, the system constructs a related-work section by querying the Semantic Scholar API for relevant literature (up to 20 search rounds) and generating summaries for each cited paper. The manuscript then undergoes several passes of automated editing and refinement to improve clarity and coherence. Finally, the system compiles the LaTeX source and automatically corrects any compilation errors (up to five correction rounds) to produce the final PDF.
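The structured idea objects and the novelty filter of the idea-generation phase can be sketched as follows. This is a minimal illustration, not the system's actual schema: the field names, the similarity function and the 0.8 threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    """Structured research idea, as described in the idea-generation phase.
    Field names are illustrative, not the system's actual schema."""
    title: str          # descriptive title
    hypothesis: str     # summary of the core hypothesis
    plan: list[str]     # detailed experimental plan
    interestingness: int  # self-assessed, 1-10
    novelty: int          # self-assessed, 1-10
    feasibility: int      # self-assessed, 1-10

def filter_novel(ideas, similarity_to_literature, threshold=0.8):
    """Discard ideas whose semantic similarity to existing work is too high.

    `similarity_to_literature` is a hypothetical stand-in for the check
    against the Semantic Scholar API; `threshold` is an assumed value.
    """
    return [idea for idea in ideas
            if similarity_to_literature(idea) < threshold]
```

In the system itself, the surviving ideas are appended to the growing archive, from which later iterations draw variations and extensions.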
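The automated debugging behaviour of the experiment-execution phase can be sketched as a retry loop. In this minimal sketch, `execute` and `debug` are hypothetical stand-ins for the experiment runner and the Aider debugging agent; only the control flow (capture the error log, request a patch, rerun, up to four reattempt cycles) reflects the description above.

```python
def run_with_auto_debug(code, execute, debug, max_attempts=4):
    """Run an experiment; on failure, request a patch and rerun.

    `execute` maps code -> (success, log); `debug` maps
    (code, error_log) -> patched code. Up to `max_attempts`
    reattempt cycles follow the initial run.
    Returns (final code, final log, reattempts used).
    """
    for attempt in range(max_attempts + 1):  # initial run + reattempts
        success, log = execute(code)
        if success:
            return code, log, attempt
        if attempt < max_attempts:
            # Capture the error log and ask the debugging agent for a patch.
            code = debug(code, log)
    return code, log, max_attempts
```

In the real system each `execute` call would additionally be bounded by the 7,200-s per-experiment timeout.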
Template-free AI Scientist
To overcome the limitations of a fixed starting codebase, we developed a template-free version capable of more open-ended discovery. We use OpenAI’s o3 for idea generation and code critique during experiments due to its strong reasoning capabilities, Anthropic’s Claude Sonnet 4 for code generation, OpenAI’s GPT-4o for cost-efficient vision-language tasks and OpenAI’s o4-mini for cost-efficient reasoning during the review stage. This version introduces several key enhancements.
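The division of labour among models described above can be summarized as a stage-to-model mapping. This is a sketch only: the dictionary keys and model identifier strings are illustrative assumptions, not the system's configuration or actual API identifiers.

```python
# Illustrative mapping of pipeline stages to the models named in the text.
# Keys and identifier strings are assumptions, not actual API names.
MODEL_ROLES = {
    "idea_generation": "o3",              # strong reasoning
    "code_critique": "o3",                # critique during experiments
    "code_generation": "claude-sonnet-4",
    "vision_language": "gpt-4o",          # cost-efficient vision-language tasks
    "review_reasoning": "o4-mini",        # cost-efficient reasoning at review
}
```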