
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning


Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

Paper: https://arxiv.org/abs/2507.19457

TL;DR

What was done? The authors introduced GEPA (Genetic-Pareto), a novel algorithm for optimizing prompts in complex, multi-module AI systems. Instead of relying on traditional reinforcement learning (RL), GEPA employs a language-driven, evolutionary approach. Its core innovation is "reflective prompt mutation," where an LLM analyzes its own performance—including reasoning steps, tool usage, and detailed evaluation feedback—in natural language to diagnose failures and propose targeted improvements to its instructional prompts. This process is guided by a genetic algorithm that uses Pareto selection to maintain a diverse set of high-performing prompts, preventing the optimizer from getting stuck in local optima.
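To make the loop concrete, here is a minimal Python sketch of a GEPA-style optimizer under simplifying assumptions: a single-module system, a fixed small training set, and stubbed-out helpers (`run_system`, `score`, `llm_reflect`) that stand in for the real pipeline, evaluator, and reflection LLM. These names, the acceptance rule, and the budget accounting are illustrative, not the paper's actual implementation.

```python
import random

# --- Illustrative stubs (assumptions, not the paper's code) ---

def run_system(prompt: str, example: dict) -> dict:
    """Execute the LLM system on one example; return its output and a natural-language trace.
    Stub: replace with a real call into your LLM pipeline."""
    return {"answer": "", "trace": f"ran with a prompt of length {len(prompt)}"}

def score(output: dict, example: dict) -> tuple[float, str]:
    """Return (scalar score, textual evaluation feedback) for one rollout.
    Stub: replace with your task's evaluator."""
    return 0.0, "no feedback (stub evaluator)"

def llm_reflect(prompt: str, traces: list[str], feedback: list[str]) -> str:
    """Ask a reflection LLM to diagnose the failures visible in the traces and feedback,
    then rewrite the instruction prompt. Stub: replace with a real LLM call."""
    return prompt + "\n(Refined after reflecting on recent failures.)"

# --- Simplified GEPA-style optimization loop ---

def optimize(seed_prompt: str, train_set: list[dict], budget: int, batch_size: int = 4) -> str:
    # Candidate pool: each entry is (prompt, per-example scores on train_set).
    pool = [(seed_prompt, [score(run_system(seed_prompt, ex), ex)[0] for ex in train_set])]
    rollouts = len(train_set)

    while rollouts < budget:
        # Pareto-style selection: keep candidates that are best on at least one
        # training example, then sample among them to preserve diverse strategies
        # instead of always exploiting a single global winner.
        best_per_example = {i: max(pool, key=lambda c: c[1][i]) for i in range(len(train_set))}
        parent_prompt, parent_scores = random.choice(list(best_per_example.values()))

        # Roll out the parent on a small minibatch, collecting traces and textual feedback.
        batch = random.sample(train_set, k=min(batch_size, len(train_set)))
        traces, feedback = [], []
        for ex in batch:
            out = run_system(parent_prompt, ex)
            _, fb = score(out, ex)
            traces.append(out["trace"])
            feedback.append(fb)
        rollouts += len(batch)

        # Reflective prompt mutation: the LLM reads its own behavior in natural language
        # and proposes a targeted rewrite of the instruction.
        child_prompt = llm_reflect(parent_prompt, traces, feedback)

        # Keep the child only if it improves aggregate performance on the training set.
        child_scores = [score(run_system(child_prompt, ex), ex)[0] for ex in train_set]
        rollouts += len(train_set)
        if sum(child_scores) > sum(parent_scores):
            pool.append((child_prompt, child_scores))

    # Return the candidate with the best aggregate score.
    return max(pool, key=lambda c: sum(c[1]))[0]
```

The real algorithm handles multi-module systems (mutating one module's prompt at a time) and manages the rollout budget more carefully, but the skeleton above captures the two ingredients the paper emphasizes: language-based reflection as the mutation operator and Pareto selection over per-instance scores as the survival rule.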

Why does it matter? This work signals a potential paradigm shift in how we optimize LLM-based agents. GEPA demonstrates that learning through language-based self-reflection is dramatically more sample-efficient than learning from sparse, scalar rewards. It outperforms the RL method GRPO by an average of 10% while using up to 35x fewer "rollouts" (system executions). It also surpasses the state-of-the-art prompt optimizer MIPROv2 (https://aclanthology.org/2024.emnlp-main.525/) and, surprisingly, shows that evolving detailed instructions alone can be more effective than optimizing few-shot examples. This approach makes adapting powerful AI systems far more practical and affordable, especially in settings where data is scarce or system executions are expensive.

Details

The High Cost of Learning by Doing

Optimizing the performance of sophisticated AI agents—systems that combine multiple LLM modules, tool calls, and complex logic—is a central challenge in modern AI. A popular approach has been reinforcement learning (RL), where an agent learns through trial and error, guided by a scalar reward signal. However, this method often proves to be a brute-force endeavor, requiring tens or even hundreds of thousands of system executions ("rollouts") to achieve meaningful improvements. This high sample cost is a major bottleneck, making RL impractical for many real-world applications where each rollout may be computationally expensive, time-consuming, or financially costly.

A new paper from a large collaboration of researchers across UC Berkeley, Stanford, Databricks, and MIT challenges this paradigm. The authors argue that for systems built on Large Language Models (LLMs), the very language they process offers a far richer and more efficient learning medium than a simple numerical reward. Their proposed algorithm, GEPA (Genetic-Pareto), demonstrates that an AI system can learn more effectively by "reflecting" on its behavior in natural language, leading to a method that is not only more powerful but also vastly more efficient.

GEPA: Learning by Reflective Evolution

... continue reading