OpenEvolve: Teaching LLMs to Discover Algorithms Through Evolution
How do we teach machines to discover algorithms? Traditional approaches rely on hand-crafted heuristics, exhaustive search, or gradient-based optimization. But what if we could harness the creative potential of large language models (LLMs) within an evolutionary framework?
OpenEvolve is an open-source evolutionary coding agent that integrates large language models into a quality-diversity search framework for algorithm discovery. Candidate programs are produced via LLM-guided edits (diff-based by default), evaluated with user-defined metrics, and organized using MAP-Elites while an island model with migration supports parallel, diversified exploration. The evaluation pipeline supports cascade staging and an artifact side-channel that feeds execution traces and errors back into subsequent prompts; optional LLM-based feedback can be incorporated into scoring.
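To make the moving parts concrete, here is a toy sketch of the loop's overall shape: an island supplies a parent, a stand-in for the LLM edits it, a stand-in evaluator scores it, and a MAP-Elites archive keeps the best program per feature cell. Every name and data structure below is illustrative only, not OpenEvolve's actual API.

```python
# Toy sketch of the evolution loop (MAP-Elites archive + islands).
# All names, the "LLM" stub, and the evaluator stub are illustrative placeholders.
import random
from dataclasses import dataclass, field

@dataclass
class Program:
    code: str
    fitness: float
    features: tuple  # coordinates in the MAP-Elites feature grid

@dataclass
class Island:
    members: list = field(default_factory=list)

    def sample_parent(self):
        # Mildly greedy parent selection: best of a small random pool
        pool = random.sample(self.members, min(3, len(self.members)))
        return max(pool, key=lambda p: p.fitness)

def llm_edit(parent_code: str) -> str:
    # Stand-in for an LLM-guided edit; the real system asks a model for a diff or rewrite
    return parent_code + f"\n# tweak {random.randint(0, 9)}"

def evaluate(code: str) -> Program:
    # Stand-in for the user-defined evaluator: returns fitness and feature coordinates
    return Program(code, random.random(), (len(code) % 5, code.count("#") % 5))

def evolve(iterations=200, n_islands=3, migration_every=25):
    archive = {}                                    # feature cell -> best program so far
    islands = [Island([evaluate("# seed")]) for _ in range(n_islands)]
    for step in range(iterations):
        island = islands[step % n_islands]          # round-robin over islands
        parent = island.sample_parent()
        child = evaluate(llm_edit(parent.code))     # generate and score a candidate
        island.members.append(child)
        incumbent = archive.get(child.features)
        if incumbent is None or child.fitness > incumbent.fitness:
            archive[child.features] = child         # MAP-Elites: keep the best per cell
        if step and step % migration_every == 0:    # periodic migration between islands
            for src, dst in zip(islands, islands[1:] + islands[:1]):
                dst.members.append(max(src.members, key=lambda p: p.fitness))
    return archive

if __name__ == "__main__":
    cells = evolve()
    print(len(cells), "cells filled; best fitness",
          round(max(p.fitness for p in cells.values()), 3))
```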
OpenEvolve has been applied across many domains, including systems optimization, scientific discovery, geospatial algorithms, scaling law discovery, GPU kernel optimization, and prompt optimization.
Architecture Overview
Figure 1: OpenEvolve architecture showing the five interconnected components of the evolution loop
The Evolution Loop
Prompt Sampler: Constructs context-rich prompts by selecting a parent program from the current island and curating evidence sets (top performers by fitness, lineage ancestors, diverse extremes across feature bins, and random samples). Prompts include the parent's code, evaluation metrics, feature coordinates for MAP-Elites, evolution history, and (optionally) execution artifacts. Template selection supports diff-based editing by default or full rewrites, with controlled stochasticity.
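Reusing the toy Program objects from the sketch above, a prompt builder along these lines conveys the kind of context the sampler assembles. The section layout, wording, and field names are assumptions for illustration, not OpenEvolve's real prompt templates.

```python
# Illustrative prompt assembly: parent code, reference programs, artifacts, and
# an instruction for diff-based or full-rewrite output. Layout is an assumption.
def build_prompt(parent, top_programs, artifacts=None, use_diff=True):
    sections = [
        "Improve the program below. High-scoring prior variants follow for reference.",
        f"## Parent program (fitness={parent.fitness:.3f}, features={parent.features})",
        parent.code,
    ]
    for i, prog in enumerate(top_programs, 1):
        sections.append(f"## Reference {i} (fitness={prog.fitness:.3f})")
        sections.append(prog.code)
    if artifacts:
        # Execution traces / errors captured from the previous evaluation
        sections.append("## Last execution output")
        sections.append(artifacts)
    sections.append(
        "Respond with SEARCH/REPLACE diff blocks."
        if use_diff else
        "Respond with a complete rewritten program."
    )
    return "\n\n".join(sections)
```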
LLM Ensemble: Generates candidate code using a weighted ensemble of OpenAI-compatible models (deterministic under seeds). In standard mode, a model is sampled by weight; in model-based islands, each island uses a fixed model. Responses drive either diff-based edits (SEARCH/REPLACE blocks) or full rewrites (JSON/code-block extraction), with generation parameters drawn from configuration.
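Two mechanics from this component can be sketched briefly: weighted sampling over an ensemble, and applying SEARCH/REPLACE blocks to the parent code. The exact block delimiters and model names shown are assumptions, not the project's actual format.

```python
# Sketch of weighted model selection and SEARCH/REPLACE diff application.
# The delimiter format and model names are illustrative assumptions.
import random
import re

def pick_model(models, rng=random):
    # models: list of (name, weight) pairs; deterministic when rng is seeded
    names, weights = zip(*models)
    return rng.choices(names, weights=weights, k=1)[0]

DIFF_BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_diff(code: str, llm_response: str) -> str:
    # Apply each SEARCH/REPLACE block as an exact substring substitution
    for search, replace in DIFF_BLOCK.findall(llm_response):
        code = code.replace(search, replace)
    return code

# Example: an ensemble weighted 4:1 toward a primary model
chosen = pick_model([("primary-model", 0.8), ("secondary-model", 0.2)])
```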
Evaluator: Executes the user-provided evaluate(program_path) with timeouts and retries; optionally applies cascade evaluation (evaluate_stage1/2/3) with thresholds to filter weak candidates early. It can incorporate LLM-based feedback into metrics and captures artifacts (e.g., stderr, tracebacks) for subsequent prompt context. Parallel evaluations are supported via an internal task pool.
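A user evaluator might look roughly like the sketch below. The function names evaluate and evaluate_stage1 come from the description above; the benchmark task, the hypothetical solve() entry point, and the metric-dictionary return shape are assumptions made for illustration.

```python
# Sketch of a user-provided evaluator module. The candidate program is assumed
# (hypothetically) to define solve(); the metric dict shape is also an assumption.
import importlib.util
import time

def _load(program_path):
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)        # runs the candidate program's top level
    return module

def evaluate_stage1(program_path):
    # Cheap first gate for cascade evaluation: does the program load and expose solve()?
    try:
        module = _load(program_path)
        return {"valid": 1.0 if callable(getattr(module, "solve", None)) else 0.0}
    except Exception:
        return {"valid": 0.0}

def evaluate(program_path):
    # Full evaluation: correctness and speed of the hypothetical solve() on toy inputs
    module = _load(program_path)
    start = time.perf_counter()
    correct = 0
    for n in (10, 100, 1000):
        data = list(range(n))[::-1]        # reversed input to sort
        correct += module.solve(data) == sorted(data)
    elapsed = time.perf_counter() - start
    return {"correctness": correct / 3, "speed": 1.0 / (1.0 + elapsed)}
```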