SWE-Bench Pro

Code and data for the following works:

👋 Overview

SWE-Bench Pro is a challenging benchmark evaluating LLMs/Agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

The dataset is inspired from SWE-Bench: https://github.com/SWE-bench/SWE-bench

To access SWE-bench Pro, copy and run the following code:

from datasets import load_dataset swebench = load_dataset ( 'ScaleAI/SWE-bench_Pro' , split = 'test' )

🚀 Set Up

SWE-bench Pro uses Docker for reproducible evaluations. In addition, the evaluation script requires Modal to scale the evaluation set.

Follow the instructions in the Docker setup guide to install Docker on your machine. If you're setting up on Linux, we recommend seeing the post-installation steps as well.

... continue reading