I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works

7 min read


I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.

As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that grew test coverage by 300%.

So when I started working with AI agent skills, I noticed something: nobody was testing them.

You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.

There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.

That's a QA problem. I built opencode-skill-creator to solve it.

Then I dogfooded it on a real project. Here's what happened.

The Project: AdLoop Skills for Google Ads

AdLoop is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.

I created 4 skills for AdLoop using opencode-skill-creator, each handling a different aspect of Google Ads management:

  1. adloop-planning — Keyword research, competition analysis, and budget forecasting
  2. adloop-read — Performance analysis, campaign reporting, and conversion diagnostics
  3. adloop-write — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)
  4. adloop-tracking — GA4 event validation, conversion tracking diagnosis, and code generation

Each skill contains:

  • A detailed SKILL.md with orchestration patterns, safety rules, and domain-specific best practices
  • An evals set with test queries (both should-trigger and should-not-trigger)
  • The full lifecycle: validate → eval → optimize loop → benchmark
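To make the evals-set bullet concrete, here is roughly what such a set can look like. This is a minimal sketch with illustrative field names, not opencode-skill-creator's actual schema:

```typescript
// Hypothetical eval-set shape (illustrative, not the tool's real format).
interface EvalCase {
  query: string;           // the user prompt fed to the agent
  shouldTrigger: boolean;  // should this query activate the skill?
  expectations?: string[]; // behaviors the agent's response must exhibit
}

const adloopWriteEvals: EvalCase[] = [
  {
    query: "Pause the underperforming campaign and lower its budget to $30/day",
    shouldTrigger: true,
    expectations: [
      "previews the change before applying it",
      "keeps the budget under the safety cap",
    ],
  },
  {
    // Negative case: an off-topic query must NOT activate the skill.
    query: "What's the weather in Berlin?",
    shouldTrigger: false,
  },
];
```

The should-not-trigger cases are what catch over-eager descriptions: a skill that fires on everything is as broken as one that never fires.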

The Benchmark: With Skill vs. Without Skill

opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:

  • With skill loaded — the AI agent has access to the full SKILL.md with all domain knowledge, safety rules, and orchestration patterns
  • Without skill — the AI agent only has the bare MCP tool names and descriptions from the schema
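Conceptually, the comparison reduces to computing a pass rate twice over the same queries and taking the delta. A minimal sketch, assuming hypothetical types (the tool's real internals may differ):

```typescript
// Sketch of the with/without-skill comparison (hypothetical types).
interface EvalResult {
  passed: boolean; // did the agent meet every expectation for this query?
}

function passRate(results: EvalResult[]): number {
  const passed = results.filter((r) => r.passed).length;
  return Math.round((passed / results.length) * 100);
}

// Each eval query runs twice: once with SKILL.md loaded, once with only
// the bare MCP tool schema. The delta is the skill's measured value.
function improvement(withSkill: EvalResult[], withoutSkill: EvalResult[]): number {
  return passRate(withSkill) - passRate(withoutSkill); // percentage points
}

const demo = improvement(
  [{ passed: true }, { passed: true }, { passed: true }, { passed: true }],
  [{ passed: true }, { passed: false }, { passed: false }, { passed: false }],
); // 100% - 25% = +75 percentage points
```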

The results were striking:

| Skill           | Evals | With Skill | Without Skill | Improvement |
|-----------------|-------|------------|---------------|-------------|
| adloop-write    | 8     | 100%       | 17%           | +83 pp      |
| adloop-planning | 6     | 100%       | 21%           | +79 pp      |
| adloop-read     | 8     | 100%       | 27%           | +73 pp      |
| adloop-tracking | 6     | 100%       | 33%           | +67 pp      |

100% pass rate across the board — every eval, every expectation — with skills loaded. Without skills, pass rates ranged from 17% to 33%.

But the raw numbers only tell part of the story. Let me show you what actually failed without the skills, because the failures aren't just wrong answers — they're dangerous actions.

The Scariest Failure: adloop-write (17% without skill)

adloop-write manages campaigns, ads, keywords, and budgets. These are operations that spend real money. Without the skill, the AI made these mistakes:

1. Added BROAD match keywords to MANUAL_CPC campaigns

The #1 cause of wasted ad spend. BROAD match on MANUAL_CPC means Google matches irrelevant queries and drains your budget. The skill explicitly checks bidding strategy before allowing BROAD match. Without it? The AI just adds them.

2. Set budget above safety caps

The tool has a max_daily_budget of $50. Without the skill, the AI set the budget to $100 — exceeding the cap by 2x. The skill enforces the cap as a guardrail. Without it, no guardrail exists.

3. Performed irreversible deletions without warning

The skill has a critical safety rule: "Always prefer pause_entity over remove_entity. remove_entity is IRREVERSIBLE." Without the skill, the AI called remove_entity directly — no warning, no confirmation, no pause-as-alternative.

4. Batched multiple write operations in one call

The skill enforces "one change at a time" — draft, preview, confirm, then next change. Without it, the AI batched pause + sitelink changes together, bypassing review.
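The four rules above live in SKILL.md as prose, but their logic is simple enough to sketch as code. The function names are illustrative; only the $50 cap and the rule contents come from the article:

```typescript
// Illustrative guardrail checks mirroring adloop-write's safety rules.
const MAX_DAILY_BUDGET = 50; // USD cap from the tool config

type MatchType = "EXACT" | "PHRASE" | "BROAD";
type BiddingStrategy = "MANUAL_CPC" | "SMART_BIDDING";

// Rule: never exceed the budget safety cap.
function checkBudget(requestedDaily: number): string | null {
  return requestedDaily > MAX_DAILY_BUDGET
    ? `Budget $${requestedDaily} exceeds the $${MAX_DAILY_BUDGET} safety cap`
    : null;
}

// Rule: BROAD match without Smart Bidding is the #1 cause of wasted spend.
function checkMatchType(match: MatchType, bidding: BiddingStrategy): string | null {
  return match === "BROAD" && bidding === "MANUAL_CPC"
    ? "BROAD match on MANUAL_CPC drains budget; use EXACT/PHRASE or Smart Bidding"
    : null;
}

// Rule: always prefer pause_entity, because remove_entity is irreversible.
function checkRemoval(operation: string): string | null {
  return operation === "remove_entity"
    ? "remove_entity is IRREVERSIBLE; prefer pause_entity"
    : null;
}
```

Without the skill loaded, none of these checks exist anywhere in the agent's context, which is exactly why the benchmark caught all four failure modes.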

This isn't about "better answers." This is about preventing real financial harm.

GDPR Is Not Broken Tracking: adloop-tracking (33% without skill)

The most interesting failure was in adloop-tracking.

A common scenario: a user sees 500 clicks in Google Ads but only 180 sessions in GA4. "Is my tracking broken?"

Without the skill, the AI immediately diagnosed this as a tracking issue and offered to investigate further. It suggested running attribution checks and validating tracking code.

With the skill, the AI recognized this immediately: "A 2.8:1 click-to-session ratio is completely normal with GDPR consent banners. Google Ads counts all clicks regardless of consent. GA4 only records sessions from users who accept analytics cookies. Your tracking is not broken."

This is the #1 false positive in digital marketing analytics. Every marketer who runs EU-targeted ads has seen this panic. The skill prevents hours of investigation and prevents "fixing" something that isn't broken.
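The diagnosis boils down to a ratio check. A sketch with illustrative thresholds (the skill encodes this reasoning as prose, not code):

```typescript
// Consent-gap diagnosis sketch. Thresholds are illustrative assumptions.
function diagnoseClickSessionGap(clicks: number, sessions: number): string {
  const ratio = clicks / sessions;
  if (ratio <= 1.5) return "Tracking looks healthy";
  if (ratio <= 4) {
    // Google Ads counts every click; GA4 only records sessions from users
    // who accepted analytics cookies, so a 2-4x gap is normal under GDPR.
    return "Likely GDPR consent gap, not broken tracking";
  }
  return "Gap is unusually large; investigate tracking code";
}

// The article's scenario: 500 clicks, 180 sessions, ratio ≈ 2.8.
const verdict = diagnoseClickSessionGap(500, 180);
```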

Don't Trust Google Blindly: adloop-read (27% without skill)

Google Ads provides automated recommendations. Without the skill, the AI endorsed them at face value:

  • "Raise budget" — with zero conversions? That's bad advice until tracking works.
  • "Add BROAD match" — without Smart Bidding? That wastes money.
  • "More keywords" — with quality scores below 5? The problem is relevance, not volume.

The skill explicitly states: "Google recommendations optimize for Google's revenue, not yours." It cross-references every recommendation against actual conversion data and quality scores before accepting.
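That cross-referencing can be sketched as a vetting function. The recommendation names and account fields below are hypothetical; only the three rules come from the article:

```typescript
// Hypothetical cross-check of Google's automated recommendations
// against the account's actual data.
interface AccountSnapshot {
  conversions: number;      // conversions recorded in the period
  usesSmartBidding: boolean;
  avgQualityScore: number;  // 1-10
}

function vetRecommendation(rec: string, account: AccountSnapshot): boolean {
  switch (rec) {
    case "raise_budget":
      // More budget with zero conversions just scales the waste.
      return account.conversions > 0;
    case "add_broad_match":
      // BROAD match is only safe when Smart Bidding can filter queries.
      return account.usesSmartBidding;
    case "add_keywords":
      // Low quality scores signal a relevance problem, not a volume problem.
      return account.avgQualityScore >= 5;
    default:
      return false; // unknown recommendations need human review
  }
}
```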

The 73% improvement comes from teaching the AI critical thinking, not compliance.

Wrong Country, Wrong Budget: adloop-planning (21% without skill)

Without the skill, the AI defaulted to Germany (the tool's default geo target) when the user asked about US keywords. It didn't group results by competition level. It didn't mention the 5x CPA budget sufficiency rule. It didn't suggest the planning-to-campaign transition workflow.

These aren't edge cases — they're the fundamental decisions that determine whether a campaign succeeds or wastes money.

Why This Matters: Skills Are Safety Guards, Not Nice-to-Haves

The benchmark data tells a clear story: the same AI model, the same tools, the same prompts — the only variable is whether the skill is loaded. And the difference is 67-83 percentage points.

Skills do three things that bare tool access doesn't:

1. Inject domain expertise

The adloop-write skill knows that BROAD + MANUAL_CPC is the #1 cause of wasted spend. The adloop-tracking skill knows GDPR consent mechanics. The adloop-planning skill knows keyword competition levels and budget rules.

2. Enforce safety guardrails

Budget caps, irreversible deletion warnings, one-change-at-a-time rules, confirmation prompts before destructive operations. These aren't "context" — they're safety guards that prevent real harm.

3. Provide orchestration patterns

The skill doesn't just know what each tool does. It knows when to call which tool, in what order, with what validation. It's the difference between a junior dev who knows the API and a senior architect who knows the system.

How to Run Your Own Benchmarks

opencode-skill-creator is free and open source (Apache 2.0). Here's how to benchmark your own skills:

  1. Install: npx opencode-skill-creator install --global
  2. Create a skill with the guided interview: opencode-skill-creator walks you through it.
  3. Run evals with baseline comparison: the tool auto-generates test cases and runs them with and without the skill.
  4. Run the description optimization loop: train/test split, iterative improvement, variance analysis.
  5. Benchmark with visual review: an HTML viewer for human QA sign-off.

Works with any of OpenCode's 300+ supported models. Zero Python dependency — pure TypeScript.

The Lesson for QA Architects

I've spent my career architecting test systems. At Apple, the standard wasn't "does it work?" — it was "is it perfect?" At CooperVision, I proved that 300% test coverage growth and 50% faster deployments aren't mutually exclusive.

The same discipline applies to AI. Skills are software. They have inputs (prompts), outputs (agent behavior), and a triggering mechanism (the description). They deserve the same testing rigor we apply to any other software.

If you're building AI agent skills and you're not running evals, you're flying blind. Start here:

 github.com/antongulin/opencode-skill-creator

Skills are software. Software should be tested.

Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test at CooperVision. Find him at anton.qa or on LinkedIn.

AI QA · SDET · eval-driven development · opencode-skill-creator · AdLoop · benchmark · open source
