End-to-end (E2E) tests sit at the top of the test pyramid because they're slow, fragile, and expensive. But they're also the only tests that verify complete user workflows actually work across systems.
Due to time constraints, most teams run their E2E suite nightly to avoid CI bottlenecks. But this means bugs can slip into production, and they become harder to fix because a whole day of changes has to be sifted through to isolate the root cause.
But what if we could run only the relevant E2E tests for specific code changes of a PR?
Instead of waiting hours for the entire suite, we could get the results in under 10 minutes, catch bugs before they ship, and keep our master branch always clean.
Thinking with globs
The first logical step toward running only relevant tests would be using glob patterns. We tell the system to test what changed by matching file paths.
Here's how a typical ./labeler.yml could work:
```yaml
user-authentication-test: # Trigger this test
  # If any of these files change
  - "src/auth/**/*"
  - "src/components/LoginForm.tsx"
  - "api/auth/**/*"

checkout-flow-test:
  - "src/pages/checkout/**/*"
  - "src/components/PaymentForm.tsx"
  - "api/payments/**/*"

admin-dashboard-test:
  - "src/admin/**/*"
  - "api/admin/**/*"
```
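To make the mechanics concrete, here's a minimal sketch of glob-based selection in shell. The `case` patterns stand in for the labeler globs above (the function name is made up for illustration):

```shell
# Minimal sketch of glob-based test selection. The case patterns mirror
# the labeler globs above; `*` in a case pattern matches slashes too.
tests_for_file() {
  case "$1" in
    src/auth/*|src/components/LoginForm.tsx|api/auth/*)
      echo user-authentication-test ;;
    src/pages/checkout/*|src/components/PaymentForm.tsx|api/payments/*)
      echo checkout-flow-test ;;
    src/admin/*|api/admin/*)
      echo admin-dashboard-test ;;
  esac
}

tests_for_file "src/auth/session/token.ts"   # → user-authentication-test
```

Files that match no pattern trigger nothing, which is exactly the coverage gap discussed below.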
But globs are very limited. They require constant maintenance as the codebase evolves. Every new feature would require updating the glob patterns file.
More importantly, they cast too wide a net. A change to components/Button.tsx might need to trigger every E2E test that involves any page with a button interaction, depending on how deep the change is.
Enter Claude Code SDK
So, how can we determine which E2E tests should run for a given PR with both coverage and precision?
We need coverage because missing a critical test could let bugs slip through to production.
But we also need precision because running tests that will obviously pass just wastes time and resources.
The naive approach would be to dump the entire repository plus the diff into an LLM and ask it to figure out which tests are relevant. This completely falls apart in practice: a repository can easily contain millions of tokens' worth of code, far more than the context window of any available model.
Claude Code takes a fundamentally different approach because of one key differentiator: tool calls. Instead of trying to process your entire codebase, Claude Code strategically examines specific files, searches for patterns, traces dependencies, and incrementally builds up an understanding of your changes.
So here's the hypothesis: when I see a PR, I know which E2E tests it should run, because I know the codebase. The question is: can Claude Code replicate that human intuition by searching the codebase itself?
Let's build and find out.
Building the gatekeeper
For the E2E selection to be successful, Claude needs to know what I know: the PR modifications, the E2E tests, and the codebase structure. We need to glue all three together in a well-crafted prompt.
PR modifications
This is perhaps the easiest piece - we can leverage git to get exactly what we need. We start with the basic command:
```shell
git diff main...HEAD
```
This gives us the changes of a branch, but we can do much better. First, we want git to be less verbose, so we add `--minimal --ignore-all-space` to focus on the actual code changes rather than whitespace noise.

We also don't care about deleted files: deleting a file forces you to update the references to it in the surviving files anyway (unless those tests no longer matter), and those updates already show up in the diff. So we add `--diff-filter=ACMR` to keep only (A)dded, (C)opied, (M)odified, and (R)enamed files.

Finally, we need some strategic excludes, because PRs often contain large generated files like `package.lock` that would blow up our token count. We add `':(exclude)package.lock'` to keep things manageable.
Putting it all together:
```shell
git diff main...HEAD --minimal --ignore-all-space --diff-filter=ACMR -- . ':(exclude)package.lock'
```
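For reuse in CI scripts, the command can be wrapped in a small function. This is a sketch: the function name is made up, and the `main` default should match your repository's base branch.

```shell
# Sketch: the diff command above as a reusable function.
# Assumes the base branch is `main` unless one is passed in.
pr_diff() {
  base="${1:-main}"
  git diff "${base}...HEAD" --minimal --ignore-all-space --diff-filter=ACMR \
    -- . ':(exclude)package.lock'
}
```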
The result is a clean diff showing the actual code modifications:
```diff
diff --git a/src/components/LoginForm.tsx b/src/components/LoginForm.tsx
index 1234567..abcdefg 100644
--- a/src/components/LoginForm.tsx
+++ b/src/components/LoginForm.tsx
@@ -15,7 +15,7 @@ export const LoginForm = () => {
   const handleSubmit = async (values: LoginValues) => {
-    await authService.login(values.email, values.password);
+    await authService.loginWithValidation(values.email, values.password);
   };
```
E2E test inventory
We could hardcode a list of test files in our prompt, but that violates the single-source-of-truth principle. We already maintain this list for our daily benchmarks, so let's reuse it. For example, if the test configuration lives in a WebdriverIO config file (`wdio.conf.ts`), we can extract it programmatically:
```shell
bun -e "import('./wdio.conf.ts').then(m => console.log(JSON.stringify(m.config.suites.liveTests, null, 2)))"
```
This script dynamically reads the wdio.conf.ts file and outputs our exact test suite configuration:
[ "./test/specs/Admin/Login.spec.ts", "./test/specs/Admin/Vendors.spec.ts", ...rest of files ]
Crafting the prompt
The prompt needs to be precise about what we want. We start by setting clear expectations:
Read the "Branch git diff" below and match it against the active "Available E2E files" below, think deep, and inspect relevant files further to decide which E2E test should run.
The key phrase here is "think deep". It tells Claude Code not to be lazy with its analysis (at the cost of more thinking tokens). Without it, the output was very inconsistent. I used to joke that without it, Claude runs in "engineering manager mode", delegating the work.
Next, we set boundaries:
```
You should only run tests listed in "Available E2E files". If there's a reasonable chance a change could affect a test based on the actual code modifications, include it. When in doubt, **include the test**.
```
The "only run tests listed" constraint was added because Claude was being "too smart," finding work-in-progress spec files and scheduling them to run. We added the last piece because it is better to run more specs than leave a test out.
Getting structured output
Since I didn't want Claude's judgment to be a black box, I asked for two keys in the output: the list of tests to run and an explanation. The explanation makes it easy to benchmark whether the reasoning is sound.

My first attempt was to ask Claude to output raw JSON and nothing else:
```
2. Output a JSON containing 2 keys, JSON only
- explanation: The explanation why you chose those E2E tests to run, be brief and concise, and use markdown format
- tests: An array of spec file paths that should be run, like this, with no other explanation: ["./test/specs/Admin/Login.spec.ts", "./test/specs/Guest/Guest.spec.ts"]

REALLY JUST OUTPUT JSON ONLY, NO YAPPIN, PLEASE DUDE JUST THE JSON
```
But Claude has strong internal system instructions and couldn't stop adding commentary. I first worked around this with a regex-based JSON parser that stripped the commentary, but when you use regex to solve a problem, you get two problems.
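For illustration, the brittle extraction looked roughly like this (a sketch, not the exact script we used): keep only the lines between the outermost braces, and hope the commentary never contains a brace of its own.

```shell
# Brittle fallback (sketch): keep only the lines from the first line that
# starts with '{' through the first line that is exactly '}'.
# Breaks as soon as the commentary itself contains braces.
strip_commentary() {
  sed -n '/^{/,/^}$/p'
}

printf 'Sure! Here you go:\n{\n  "tests": ["./a.spec.ts"]\n}\nHope that helps!\n' | strip_commentary
```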
But then I realized: Claude Code is built to write files, duh.
So instead of fighting with JSON mode and regex, I asked:
```
Create a file called test-recommendations.json in the @/gatekeeper directory
```
Works every time!
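To close the loop, CI then reads that file and runs only the selected specs. Here's a sketch of the consuming side; the helper name is made up, and the WebdriverIO invocation in the comment is an assumption about this setup:

```shell
# Sketch: read Claude's recommendations and build a comma-separated spec list.
# Assumes ./gatekeeper/test-recommendations.json contains a "tests" array.
gatekeeper_specs() {
  node -e 'console.log(require("./gatekeeper/test-recommendations.json").tests.join(","))'
}

# In CI (hypothetical invocation; adjust to your runner):
#   npx wdio run wdio.conf.ts --spec "$(gatekeeper_specs)"
```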
Stitching it all together
The final pipeline combines everything with what might be the ugliest bash command known to humankind:
"e2e:gatekeeper": cat ./gatekeeper/analysis-prompt.md \ <(echo '
Available E2E files:
') \ <(bun list-e2e-tests) \ <(echo '
Branch git diff:
') \ <(bun git-diff)
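For context, the `bun list-e2e-tests` and `bun git-diff` aliases would live alongside `e2e:gatekeeper` in the package.json scripts, roughly like this (the script names and exact commands are assumptions reconstructed from the snippets above):

```json
{
  "scripts": {
    "list-e2e-tests": "bun -e \"import('./wdio.conf.ts').then(m => console.log(JSON.stringify(m.config.suites.liveTests, null, 2)))\"",
    "git-diff": "git diff main...HEAD --minimal --ignore-all-space --diff-filter=ACMR -- . ':(exclude)package.lock'"
  }
}
```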
The resulting prompt is piped to Claude:
```shell
bun e2e:gatekeeper | claude -p --allowedTools "Edit Write"
```
We add `--allowedTools "Edit Write"` so Claude can write our `test-recommendations.json` file.
By the way, you should never use `--dangerously-skip-permissions`, which grants all permissions, including `Fetch`. I am surprised by how many people are taught to do this. If we added that flag, someone could write into the prompt file and instruct Claude to read our environment variables and send them to a URL using `Fetch`.
Since the CI runs when a PR is opened, not when it's merged, this would be similar to a "0-click" exploit.
Results: 44 minutes down to 7 minutes
I won't lie - this exceeded my expectations. We used to run all core tests, which took 44 minutes (and today it would take more than 2 hours, since we keep adding tests). Now most PRs complete E2E testing in under 7 minutes, even for larger changes.
Even if it performed worse, it would still be an incredible success because our system has so many complexities that other types of tests (unit and integration) are nowhere near as effective as E2E.
The solution scales well: the list of E2E test names consumes few tokens, and the size of a PR diff stays roughly constant as the codebase grows. Claude doesn't read every test file; it focuses on the ones whose names are semantically relevant and explores the patterns around the modified files, which is surprisingly effective.
Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.
How much does it cost? Without getting into sensitive details, the solution costs about $30 per contributor per month. Despite the steep price, it actually saves money on mobile device farm runners. And I expect these costs will drop as models become cheaper.
Overall, we're saving money, developer time, and preventing bugs that would make it to production. So it's a win-win-win!