Tech News

Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

Why This Matters

agent-skills-eval lets developers check empirically whether a domain-specific Agent Skill improves a language model's output on a task, instead of assuming that it does. Because each skill is compared against a no-skill baseline, any claimed improvement is measured rather than asserted, which gives teams evidence to rely on before they deploy a skill.

Key Takeaways

agent-skills-eval

A test runner for Agent Skills. Write a SKILL.md, drop in some evals, and find out, empirically, whether your skill actually makes the model better at the task.

Why this exists

Agent Skills — the open standard from Anthropic for giving agents domain knowledge — make it easy to ship a SKILL.md and assume your agent is now better at the task. The hard part is proving it.

agent-skills-eval is the missing piece. It runs your skill against the same prompts twice — once with_skill loaded into context, once without_skill (baseline) — has a judge model grade both outputs, and gives you a side-by-side report. If the skill doesn't make a measurable difference, you'll see it. If it does, you have receipts.
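The paired comparison described above can be sketched in a few lines of Python. This is an illustration of the idea only, not agent-skills-eval's implementation or API: the `Target` and `Judge` callables, the `PairedResult` type, and the stub model and judge are all hypothetical stand-ins, and a pass/fail judge is assumed for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical signatures, not the tool's API:
#   Target: (prompt, skill_text or None) -> model output
#   Judge:  (prompt, output) -> True if the output passes
Target = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], bool]


@dataclass
class PairedResult:
    prompt: str
    with_skill_passed: bool
    without_skill_passed: bool


def run_paired_eval(prompts: List[str], skill_text: str,
                    target: Target, judge: Judge) -> List[PairedResult]:
    """Run every prompt twice (skill loaded vs. baseline) and grade both outputs."""
    results = []
    for prompt in prompts:
        with_output = target(prompt, skill_text)   # skill in context
        base_output = target(prompt, None)         # baseline, no skill
        results.append(PairedResult(
            prompt=prompt,
            with_skill_passed=judge(prompt, with_output),
            without_skill_passed=judge(prompt, base_output),
        ))
    return results


def pass_rate_delta(results: List[PairedResult]) -> float:
    """Pass rate with the skill minus pass rate without it."""
    if not results:
        return 0.0
    with_rate = sum(r.with_skill_passed for r in results) / len(results)
    base_rate = sum(r.without_skill_passed for r in results) / len(results)
    return with_rate - base_rate


# Stub target and judge so the sketch runs without any model calls.
def fake_target(prompt: str, skill_text: Optional[str]) -> str:
    return f"{prompt} [formatted]" if skill_text else prompt


def fake_judge(prompt: str, output: str) -> bool:
    return output.endswith("[formatted]")


results = run_paired_eval(["a", "b"], "Always append [formatted].",
                          fake_target, fake_judge)
delta = pass_rate_delta(results)
```

A delta near zero is the "no measurable difference" case; a clearly positive delta is the evidence that the skill helps. In a real run the stubs would be replaced by calls to the target and judge models.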

It's the test framework for the Agent Skills ecosystem, separated from any specific agent runtime so it works wherever your skills do.

Quickstart

    npx agent-skills-eval ./skills \
      --target gpt-4o-mini \
      --judge gpt-4o-mini \
      --baseline \
      --strict

That's it. Point it at a folder of skills, give it a target model and a judge model, and it produces a workspace with full artifacts and a static HTML report.

    agent-skills-workspace/
    └── iteration-1/
        ├── meta.json            # run metadata
        ├── benchmark.json       # rolled-up pass/fail per skill
        ├── eval-basic/
        │   ├── with_skill/      # output, timing, judge grading
        │   └── without_skill/   # ↑ same, with the skill stripped
        └── report/
            └── index.html       # the visual report
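Because the workspace is plain files on disk, it can be inspected with a short script. The sketch below relies only on the directory names shown in the tree above. It assumes that later runs follow the same `iteration-N` naming and that each eval directory sits directly under its iteration folder; neither is confirmed by the source. It does not parse `meta.json` or `benchmark.json`, since their schemas are not described here. The function name `find_paired_evals` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def find_paired_evals(workspace: str) -> Dict[str, List[str]]:
    """Map each iteration folder to the evals that have both run variants.

    Assumption: iterations are named "iteration-*" and eval folders are
    direct children containing with_skill/ and without_skill/.
    """
    root = Path(workspace)
    paired: Dict[str, List[str]] = {}
    for iteration in sorted(root.glob("iteration-*")):
        if not iteration.is_dir():
            continue
        paired[iteration.name] = [
            child.name
            for child in sorted(iteration.iterdir())
            # Files such as meta.json fail both checks and are skipped.
            if (child / "with_skill").is_dir()
            and (child / "without_skill").is_dir()
        ]
    return paired
```

Calling `find_paired_evals("agent-skills-workspace")` after a run should list the evals for which both a with-skill and a baseline result exist, which is a quick sanity check before opening the HTML report.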
