Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

agent-skills-eval

A test runner for Agent Skills. Write a SKILL.md, drop in some evals, and find out empirically whether your skill actually makes the model better at the task.

Why this exists

Agent Skills — the open standard from Anthropic for giving agents domain knowledge — make it easy to ship a SKILL.md and assume your agent is now better at the task. The hard part is proving it.

agent-skills-eval is the missing piece. It runs your skill against the same prompts twice, once with the skill loaded into context (with_skill) and once without it (without_skill, the baseline), has a judge model grade both outputs, and gives you a side-by-side report. If the skill doesn't make a measurable difference, you'll see it. If it does, you have receipts.
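The with/without comparison can be sketched in a few lines of Python. This is an illustrative sketch only, not agent-skills-eval's actual implementation: `compare`, `Comparison`, `fake_model`, and `fake_judge` are hypothetical names, the judge is assumed to return a simple pass/fail, and "the skill helped" is assumed to mean "passes with the skill, fails without it". The model and judge are plain callables so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical signatures (assumptions, not the tool's API):
#   RunModel: (prompt, skill_text_or_None) -> model output
#   Judge:    (prompt, output) -> pass/fail verdict
RunModel = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], bool]


@dataclass
class Comparison:
    prompt: str
    with_skill_passed: bool
    without_skill_passed: bool

    @property
    def skill_helped(self) -> bool:
        # Assumed definition: the skill passes where the baseline fails.
        return self.with_skill_passed and not self.without_skill_passed


def compare(prompt: str, skill_text: str, run_model: RunModel, judge: Judge) -> Comparison:
    """Run one prompt twice (with and without the skill) and grade both outputs."""
    with_output = run_model(prompt, skill_text)  # skill loaded into context
    without_output = run_model(prompt, None)     # baseline: no skill
    return Comparison(
        prompt=prompt,
        with_skill_passed=judge(prompt, with_output),
        without_skill_passed=judge(prompt, without_output),
    )


# Stand-in model and judge so the sketch runs without any API access.
def fake_model(prompt: str, skill: Optional[str]) -> str:
    return f"{prompt} [followed skill]" if skill else prompt


def fake_judge(prompt: str, output: str) -> bool:
    return "[followed skill]" in output


result = compare("Summarize the changelog", "SKILL.md contents", fake_model, fake_judge)
print(result.skill_helped)
```

Passing the model and judge in as callables keeps the comparison logic independent of any particular provider, which mirrors the runtime-agnostic goal described here.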

It's the test framework for the Agent Skills ecosystem, separated from any specific agent runtime so it works wherever your skills do.

Quickstart

    npx agent-skills-eval ./skills \
      --target gpt-4o-mini \
      --judge gpt-4o-mini \
      --baseline \
      --strict

That's it. Point it at a folder of skills, give it a target model and a judge model, and it produces a workspace with full artifacts and a static HTML report.

    agent-skills-workspace/
    └── iteration-1/
        ├── meta.json            # run metadata
        ├── benchmark.json       # rolled-up pass/fail per skill
        ├── eval-basic/
        │   ├── with_skill/      # output, timing, judge grading
        │   └── without_skill/   # ↑ same, with the skill stripped
        └── report/
            └── index.html       # the visual report
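One quick way to confirm that a run produced the artifacts listed above is to check for them on disk. The following is a minimal sketch that assumes only the layout shown in the tree. `missing_artifacts` is a hypothetical helper, not part of agent-skills-eval, and the eval name `eval-basic` is simply the one from the example.

```python
from pathlib import Path

# Relative paths taken from the workspace tree above.
EXPECTED = [
    "meta.json",
    "benchmark.json",
    "eval-basic/with_skill",
    "eval-basic/without_skill",
    "report/index.html",
]


def missing_artifacts(iteration_dir: Path, expected=EXPECTED) -> list[str]:
    """Return the expected relative paths that do not exist under iteration_dir."""
    return [rel for rel in expected if not (iteration_dir / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifacts(Path("agent-skills-workspace/iteration-1"))
    print("missing:", missing or "none")
```

The check deliberately does not parse meta.json or benchmark.json, since their schemas are not described here.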
