At Qodo, we’ve created a new benchmark dataset of real-world questions derived from large, complex code repositories. We are excited to release the dataset, methodology, and prompts used in its creation to support further research and development.
Motivation
Enterprises often maintain massive codebases that are difficult for any individual developer to navigate and fully understand. Whether onboarding, doing routine development, or using AI-assisted workflows, teams often have questions about their codebase. To effectively address this, we’ve developed specialized retrieval capabilities within our research agents. However, to benchmark and validate these systems effectively, we require a robust set of real-world questions and answers.
Prior Work
Existing benchmarks, such as CodeQA, primarily contain artificially generated code with questions limited to provided code snippets, requiring no retrieval from broader contexts. Another recent work (arXiv:2407.02883) involves real-world scenarios but focuses on retrieval from databases rather than code repositories, which does not adequately represent common real-world use-cases.
To address this gap, we propose a new approach. We introduce a benchmark based on realistic questions derived from pull requests that require retrieval across multiple files in a codebase.
Dataset Generation
To effectively challenge retrieval systems, questions in our benchmark must:
Require deep retrieval, often spanning multiple interconnected files.
Reflect realistic questions developers encounter when solving actual issues.
We identified that pull requests (PRs) are good sources for complex code changes with proper context that can be used for question and answer generation. PRs naturally link related code, not always through explicit imports or function calls, but through functional changes made together. We leveraged this insight to generate context:
For each code change within a PR, we retrieved its containing method, class or file from the current default branch.
We bundled these retrieved code snippets along with the PR’s title and description to form a meaningful context.
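The sketch below illustrates this step under simplifying assumptions: a local checkout of the default branch, Python source files, and hypothetical helper names (enclosing_block, build_context) that are not part of any released tooling.

```python
import ast
from pathlib import Path

def enclosing_block(repo_root: str, file_path: str, changed_line: int) -> str:
    """Return the source of the innermost def/class containing a changed line,
    falling back to the whole file for module-level changes."""
    source = Path(repo_root, file_path).read_text()
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.lineno <= changed_line <= (node.end_lineno or node.lineno):
                if best is None or node.lineno > best.lineno:  # keep the innermost match
                    best = node
    return ast.get_source_segment(source, best) if best else source

def build_context(pr_title: str, pr_description: str, snippets: list[str]) -> str:
    """Bundle the retrieved code blocks with the PR title and description."""
    return (
        f"PR title: {pr_title}\n"
        f"PR description: {pr_description}\n\n"
        "Code context:\n" + "\n\n".join(snippets)
    )
```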
Using the above-mentioned PR data as context, we prompt large language models (LLMs) to generate questions relevant to real developer workflows, ensuring authenticity and practical value (see the prompt in Appendix A). The same context is also used to generate the ground-truth answer.
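Conceptually, generation is two LLM calls over the same bundled context: one for the question (with the Appendix A system prompt) and one for the ground-truth answer. Below is a minimal sketch assuming an OpenAI-style chat client and a placeholder model name; it is not the exact pipeline we used.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

def chat(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

def generate_qa(question_prompt: str, pr_context: str, code_context: str) -> tuple[str, str]:
    """Generate one question from the bundled PR context, then answer it
    from the same context to form the ground-truth answer."""
    user = f"PR info:\n{pr_context}\n\nCode context:\n{code_context}"
    question = chat(
        question_prompt,
        user + "\n\nBased on the PR information and code above, "
               "write ONE question and return only the requested JSON.",
    )
    answer = chat(
        "Answer the question using only the provided code context.",
        user + f"\n\nQuestion:\n{question}",
    )
    return question, answer
```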
Example
Consider PR 39363 of Hugging Face’s Transformers repository. This PR touches 4 files in different places, including methods such as:
BaseImageProcessorFast.__init__ (src/transformers/image_processing_utils_fast.py)
BaseVideoProcessor.__init__ (src/transformers/video_processing_utils.py)
We are not concerned with the exact code changes in the PR. Instead, we use the PR as a signal to locate blocks of code that are functionally related and should be considered together.
We try to find these methods in the updated code to avoid asking questions about irrelevant code. We don’t always find them, due to renames and refactors, but in this PR we found both BaseImageProcessorFast and BaseVideoProcessor.
In the next step, we pass the complete methods, along with the PR title, description, and the prompt in Appendix A, to an LLM, asking it to generate a question.
For this PR, the generated question is:
How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?
And the answer is:
Both the fast image and video processor bases deep-copy their mutable defaults when you instantiate them. In BaseImageProcessorFast.__init__ (src/transformers/image_processing_utils_fast.py) and BaseVideoProcessor.__init__ (src/transformers/video_processing_utils.py) they loop over all valid kwargs and do:
If you passed a value, use it;
Otherwise, setattr(self, key, deepcopy(getattr(self, key))), and they rebuild size/crop_size via get_size_dict.
By deep-copying every default dict/list on init, no two instances share the same mutable object.
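The pattern described in the answer boils down to the following simplified illustration (not the actual Transformers code):

```python
from copy import deepcopy

class ProcessorBase:
    # Class-level defaults are shared objects; copying them per instance
    # prevents one instance's mutations from leaking into another.
    size = {"height": 224, "width": 224}
    valid_kwargs = ("size",)

    def __init__(self, **kwargs):
        for key in self.valid_kwargs:
            if key in kwargs:
                setattr(self, key, kwargs[key])                    # caller-provided value wins
            else:
                setattr(self, key, deepcopy(getattr(self, key)))   # deep-copy the class default

a, b = ProcessorBase(), ProcessorBase()
a.size["height"] = 512
assert b.size["height"] == 224  # no shared mutable state between instances
```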
Dataset Statistics
We generated 1,144 questions derived from eight open-source repositories. Below are detailed insights into the characteristics of the dataset:
Context Distribution
The histograms illustrate the distribution of context used for each question:
Number of Context Blocks: Indicates how many individual code blocks were involved in generating each question.
Number of Context Files: Reflects the number of distinct files utilized per question.
In the example above, there are two blocks across two files. However, PRs often touch multiple methods within the same file, resulting in more blocks than files.
Categorical Breakdown
Scope:
Deep: Questions focusing on specific, detailed aspects of a single block of code.
Broad: Questions involving interactions or relationships across multiple code blocks or files.
Core Questions: Questions targeting fundamental, core functionality versus those focusing on peripheral technical details.
Searchable Questions: Questions containing specific keywords or identifiers that facilitate direct searches within the codebase.
Evaluation Mechanism: LLM as a Judge
Evaluating model predictions requires an objective and scalable approach. Rather than relying solely on subjective LLM judgment, we:
Extracted discrete, verifiable facts from each ground-truth (GT) answer.
Checked whether each fact appeared in the predicted answer using a simple LLM call.
This method, which we call “fact recall,” was introduced in the 2003 TREC (Text REtrieval Conference) QA Track (paper, overview) and is widely used today – for example in Google/DeepMind’s SAFE and in the TREC 2024 RAG Track (e.g., MSR/Waterloo’s AutoNuggetizer). It ensures robust, objective, and scalable assessment of model performance.
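In code, the metric looks roughly like the sketch below, where ask_llm stands in for any single LLM call that returns text; the actual prompts we use are more detailed.

```python
from typing import Callable

def extract_facts(gt_answer: str, ask_llm: Callable[[str], str]) -> list[str]:
    """Ask an LLM to break a ground-truth answer into discrete, verifiable facts."""
    reply = ask_llm(
        "List the discrete, verifiable facts stated in the following answer, "
        f"one per line:\n\n{gt_answer}"
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def fact_recall(gt_answer: str, predicted_answer: str, ask_llm: Callable[[str], str]) -> float:
    """Fraction of ground-truth facts that the predicted answer states or implies."""
    facts = extract_facts(gt_answer, ask_llm)
    supported = sum(
        ask_llm(
            "Does the answer below state or clearly imply this fact? Reply YES or NO.\n\n"
            f"Fact: {fact}\n\nAnswer: {predicted_answer}"
        ).strip().upper().startswith("YES")
        for fact in facts
    )
    return supported / len(facts) if facts else 0.0
```

In practice, one would extract the facts once per ground-truth answer and reuse them when scoring every agent.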
Baselines
To better understand our dataset, we established several baseline evaluations:
Ground Truth (GT) answers: Verifies both the accuracy of fact extraction and the reliability of the automated fact verification method.
LLM with full context: Provides an LLM with all context used to generate the questions, setting an upper-bound performance baseline.
LLM with no context: Evaluates how well an LLM could answer questions using only the repository name, capturing inherent model knowledge and setting a lower-bound baseline.
These baselines help evaluate the quality of the dataset, validate our evaluation methods, and measure the inherent knowledge of different LLMs.
Results
We evaluated Codex CLI, Claude Code, Gemini CLI, and our Deep Research agent in Qodo Aware.
Overall: Qodo’s deep-research agent achieves the best fact recall (~76%), just ahead of OpenAI’s Codex (~74%), while being about twice as fast. With high reasoning enabled it reaches ~80%, at the cost of longer runtimes, though our agent remains roughly 10 seconds faster. Both outperform Claude (~64%) and Gemini (~45%).
Searchable: All agents improved when the question contained searchable keywords, but our Deep Research agent’s gain was smallest, thanks to strong semantic search.
Scope: Codex and Claude performed better on deep questions than on broad ones, while Deep Research performed equally well on both due to its wide search capabilities.
Overall results
Results by data segment
Scope        codex-cli   claude-code   gemini-cli   deep-research (Qodo)
Broad        0.72        0.60          0.41         0.76
Deep         0.76        0.67          0.48         0.77
Searchable   codex-cli   claude-code   gemini-cli   deep-research (Qodo)
False        0.73        0.59          0.43         0.76
True         0.76        0.68          0.47         0.77
What We’re Releasing
Dataset: 1,144 carefully curated question-answer pairs – deep_code_bench.
Metadata and context: Each question is linked to the pull request (PR) it was generated from and tagged with category labels (e.g., broad/deep, is searchable).
Prompts: The exact prompts used to guide question and answer generation.
Appendix A – prompt for question generation
System Prompt
You are helping build a high-quality dataset of real-world codebase questions to test our search AI agents. Each question should require the agent to search through the codebase to find the relevant code.
Guidelines
Your task is to generate exactly ONE onboarding question adhering to these guidelines:
The question must be clearly grounded in the provided code context.
Do not include exact file paths, line numbers, or raw code snippets in the question text.
Prefer questions involving relationships across multiple functions, components, or files.
Keep the wording concise, clear, and readable.
Avoid vague references to code elements like ‘the function’ or ‘that class’.
Don’t make identifier references (function names, class names, variables, etc.) too obvious, so that the search will be as challenging as possible.
Despite the above, the question should still be answerable, and the context should be unambiguous.
The question should be answerable with a short, concise response—ideally, a single short sentence.
Scopes
There are 2 kinds of scopes. When provided with only 1–2 short code blocks, generate a DEEP question: a highly specific question that explores internal logic, error handling, edge cases, or detailed behaviors. When provided with multiple code blocks or a larger context, generate a BROAD question: a higher-level question about architecture, overall flow, interactions between modules, or general system design.
Core questions
Core questions target fundamental, core functionality, while non-core questions focus on peripheral technical aspects.
PR details
If a PR title and description are provided, use them only to infer the high-level subject of the question. Think of questions that the developer needs to know in order to address the PR. The question must still be answerable using the code context. If the PR text lacks details, base the question solely on the code.
Examples
Here are examples to illustrate the desired style and scope:
Broad question examples:
What is the general workflow for training and deploying a transformer-based language model?
Can you describe the internal steps involved in performing hyperparameter tuning with a grid search?
What’s the end-to-end flow involved in generating images using diffusion-based models?
Deep question examples:
How are gradient updates managed when training gradient-boosted decision trees on sparse data?
Which parameter directly controls the number of leaves permitted in each decision tree of a gradient boosting algorithm?
How does a functional deep learning API internally handle merging layers with multiple input tensors?
Core question examples:
How are token and positional embeddings combined and fed into the BERT model?
How does the Keras Layer base class manage weight creation and the build/call lifecycle?
What happens in one XGBoost boosting iteration—how are new trees grown and combined?
Output format
Return the question, its type, whether it is a core question, and the relevant NODE IDENTIFIER headers from the context as a JSON object with keys ‘question’, ‘scope’, ‘is_core_question’, and ‘nodes’ (a list of strings). Wrap the JSON in triple backticks.
User message prompt
PR info:
{pr_context}
Code context:
{context}
Based on the PR information and code above, write ONE question and return only the requested JSON.
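For completeness, here is a minimal sketch of how a reply in the requested format might be parsed downstream, assuming the JSON is wrapped in triple backticks as instructed:

```python
import json
import re

def parse_question_json(reply: str) -> dict:
    """Extract and validate the JSON object requested by the Appendix A prompt."""
    match = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", reply, re.DOTALL)
    payload = json.loads(match.group(1) if match else reply)
    assert {"question", "scope", "is_core_question", "nodes"} <= payload.keys()
    return payload
```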