Eval-Driven Development for AI Agent Skills
Why skills need testing, not just writing — and how to do it systematically.
· 6 min read
Technical decisions and lessons learned from rewriting a Python CLI tool as an OpenCode plugin.
Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.
But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.
OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.
The original has this structure:
Anthropic skill-creator/
├── SKILL.md # The skill instructions
├── scripts/
│ ├── run_loop.py # Eval→improve optimization loop
│ ├── improve_description.py # LLM-powered description improvement
│ ├── aggregate_benchmark.py # Benchmark aggregation
│ └── generate_review.py # HTML report generation
└── evals/
└── evals.json # Test query definitions
My version:
opencode-skill-creator/
├── skill-creator/ # The SKILL
│ ├── SKILL.md # Main skill instructions
│ ├── agents/
│ │ ├── grader.md # Assertion evaluation
│ │ ├── analyzer.md # Benchmark analysis
│ │ └── comparator.md # Blind A/B comparison
│ ├── references/
│ │ └── schemas.md # JSON schema definitions
│ └── templates/
│ └── eval-review.html # Eval set review/edit UI
└── plugin/ # The PLUGIN (npm package)
├── package.json # npm package metadata
├── skill-creator.ts # Entry point
└── lib/
├── utils.ts # SKILL.md frontmatter parsing
├── validate.ts # Skill structure validation
├── run-eval.ts # Trigger evaluation
├── improve-description.ts # Description optimization
├── run-loop.ts # Eval→improve loop
├── aggregate.ts # Benchmark aggregation
├── report.ts # HTML report generation
└── review-server.ts # HTTP eval review server
Key difference: the skill provides workflow knowledge, the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.
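As an illustration of how small these modules can stay, here is a hedged sketch of what the frontmatter parsing in lib/utils.ts might look like for the flat name/description case. The `parseFrontmatter` name and the exact field handling are assumptions, not the actual implementation:

```typescript
// Minimal SKILL.md frontmatter parser: extracts flat `key: value`
// pairs between the leading `---` markers. A sketch only -- the
// real lib/utils.ts may handle more YAML features.
export interface SkillFrontmatter {
  name?: string;
  description?: string;
  [key: string]: string | undefined;
}

export function parseFrontmatter(markdown: string): SkillFrontmatter {
  const match = markdown.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  const result: SkillFrontmatter = {};
  if (!match) return result;
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    if (key) result[key] = value;
  }
  return result;
}
```

Hand-rolling this instead of pulling in a YAML library is what keeps the dependency tree down to the single peer dependency.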
Original: Python scripts invoked via CLI
# Run the optimization loop
python -m scripts.run_loop --skill-path /path/to/skill --eval-set evals.json
New: Plugin tool calls in OpenCode sessions
skill_optimize_loop with:
evalSetPath: /path/to/evals.json
skillPath: /path/to/skill
maxIterations: 5
Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.
This is not only cleaner integration but also more composable. The agent can interleave tool calls with other work — read files, ask the user questions, make decisions — between optimization iterations.
The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).
All pipeline components — validation, eval, description improvement, loop runner, aggregation, report generation, review server — are TypeScript modules in the plugin. ~256kB unpacked on npm.
Dependency tree is minimal: the plugin only depends on @opencode-ai/plugin (peer dependency).
Original: Python script generates a static HTML file and opens it in the browser.
generate_review.py --workspace /path/to/workspace
# Opens /path/to/workspace/review.html in browser
New: Plugin starts a local HTTP server that serves an interactive eval viewer.
skill_serve_review with:
workspace: /path/to/workspace
skillName: "my-skill"
The HTTP server approach has a key advantage over a static file: the review UI is live, so eval sets can be inspected and edited in place rather than regenerated after every change.
The server can also generate static HTML for headless environments:
skill_export_static_review with:
workspace: /path/to/workspace
outputPath: /path/to/report.html
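To make the static path concrete, here is a hedged sketch of what the export step could look like: graded eval results rendered into one self-contained HTML string. The `renderStaticReview` name and the result shape are illustrative assumptions, not the plugin's actual types:

```typescript
// Sketch of a static review exporter: renders paired eval results
// into a self-contained HTML table. Illustrative only.
interface EvalResult {
  query: string;
  withSkill: string;
  baseline: string;
  passed: boolean;
}

function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

export function renderStaticReview(skillName: string, results: EvalResult[]): string {
  const rows = results
    .map(
      (r) =>
        `<tr class="${r.passed ? "pass" : "fail"}">` +
        `<td>${escapeHtml(r.query)}</td>` +
        `<td>${escapeHtml(r.withSkill)}</td>` +
        `<td>${escapeHtml(r.baseline)}</td></tr>`
    )
    .join("");
  return (
    `<!DOCTYPE html><html><head><title>${escapeHtml(skillName)} review</title></head>` +
    `<body><table><tr><th>Query</th><th>With skill</th><th>Baseline</th></tr>` +
    rows +
    `</table></body></html>`
  );
}
```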
Original: Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.
New: OpenCode's Task tool with general and explore subagent types. The SKILL.md instructs the agent to spawn tasks for grading eval outputs, analyzing benchmark results, and running blind A/B comparisons (the grader, analyzer, and comparator agents).
The agent orchestrates these tasks and synthesizes their results.
Original: Evals and benchmarks run alongside the skill in the same directory.
New: Draft skills and eval artifacts go to the system temp directory:
/tmp/opencode-skills/<skill-name>/ # Staged skill
/tmp/opencode-skills/<skill-name>-workspace/ # Eval artifacts
Only the final validated skill gets installed to:
.opencode/skills/&lt;skill-name&gt;/              # project-local install
~/.config/opencode/skills/&lt;skill-name&gt;/    # global install
This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.
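A helper computing this staging layout might look like the following sketch; the function name and return shape are assumptions, and `os.tmpdir()` stands in for the hard-coded /tmp above so it works across platforms:

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch of the staging layout: draft skills and eval artifacts
// live under the system temp directory until validated.
export function stagingPaths(skillName: string): { staged: string; workspace: string } {
  const root = join(tmpdir(), "opencode-skills");
  return {
    staged: join(root, skillName),
    workspace: join(root, `${skillName}-workspace`),
  };
}
```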
Added a "review workflow guard" that enforces paired comparison data by default:
skill_serve_review and skill_export_static_review require each eval directory to include both with_skill AND a baseline (without_skill or old_skill). Setting allowPartial: true bypasses the check, but only when intentionally reviewing incomplete data.
This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.
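The core of such a guard is a simple pairing check over eval directories. This is a hedged sketch: the with_skill / without_skill / old_skill names and allowPartial flag come from the description above, but the `checkReviewPairing` function and its input shape are assumptions:

```typescript
// Sketch of the review workflow guard: every eval directory must
// contain a with_skill result and at least one baseline variant
// unless allowPartial is set.
type EvalDirContents = string[]; // subdirectory names inside one eval dir

export function checkReviewPairing(
  evalDirs: Record<string, EvalDirContents>,
  allowPartial = false
): { ok: boolean; missing: string[] } {
  const missing: string[] = [];
  for (const [name, entries] of Object.entries(evalDirs)) {
    const hasSkill = entries.includes("with_skill");
    const hasBaseline = entries.includes("without_skill") || entries.includes("old_skill");
    if (!hasSkill || !hasBaseline) missing.push(name);
  }
  return { ok: allowPartial || missing.length === 0, missing };
}
```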
They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.
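The 80%/30% example boils down to measuring trigger rates over labeled queries. A minimal sketch, with all names and shapes as illustrative assumptions:

```typescript
// Sketch: measure how often a skill triggers on relevant vs
// irrelevant queries. A good description maximizes the first
// rate and minimizes the second.
interface TriggerRecord {
  relevant: boolean;  // should the skill have triggered?
  triggered: boolean; // did it trigger?
}

export function triggerRates(records: TriggerRecord[]): {
  relevantRate: number;
  irrelevantRate: number;
} {
  const rate = (subset: TriggerRecord[]) =>
    subset.length === 0 ? 0 : subset.filter((r) => r.triggered).length / subset.length;
  return {
    relevantRate: rate(records.filter((r) => r.relevant)),
    irrelevantRate: rate(records.filter((r) => !r.relevant)),
  };
}
```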
The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.
Same lesson as ML hyperparameter tuning. If you only evaluate on the queries you optimize for, descriptions become overfit. The 60/40 split keeps you honest about generalization.
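A deterministic optimize/holdout split keeps that honesty reproducible across runs. Here is one way to sketch it, using a tiny seeded PRNG so the holdout set is stable; every name here is an illustrative assumption:

```typescript
// Sketch of a deterministic 60/40 optimize/holdout split over
// eval queries, stable across runs for a fixed seed.
export function splitEvals<T>(
  items: T[],
  optimizeFraction = 0.6,
  seed = 42
): { optimize: T[]; holdout: T[] } {
  // Seeded linear congruential generator for reproducible shuffles.
  let state = seed;
  const next = () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 0x100000000;
  };
  const shuffled = [...items];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(next() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.round(shuffled.length * optimizeFraction);
  return { optimize: shuffled.slice(0, cut), holdout: shuffled.slice(cut) };
}
```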
Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.
Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.
npx opencode-skill-creator install --global
Apache 2.0, free, open source. Works with any of OpenCode's supported models. If you find it useful, star it on GitHub.
GitHub: https://github.com/antongulin/opencode-skill-creator
npm: https://www.npmjs.com/package/opencode-skill-creator