Three years ago, I built a test framework that fixed itself. Nobody called it AI. "Agent" was still the thing antivirus ran on your laptop.
I split the framework into three parts. I called them Planner, Generator, and Healer. Not because I read a paper. Those were the three jobs I needed. I was out of good names.
Last October, Playwright shipped three Test Agents in version 1.56. Three of them.
They call them Planner, Generator, and Healer.
This month, version 1.59 shipped the rest of the plumbing. It added video recording inside tests (page.screencast). It added browser.bind(), so Claude or Cursor can connect to a running browser. It added async disposables (auto-cleanup for test resources). The agents shipped in October. Their plumbing shipped last week.
So this post is about one thing. The same three-part system that saved my career just shipped as a feature in the tool everyone uses.
Here is what Playwright got right. Here is what is still missing. And here is how to start using it today. Even if you stay on your own framework.
If your tests fail at random, and someone keeps asking you to "just make the flaky tests pass" — read this.
The Problem: the flaky-test cost nobody budgets for
Here is a cost every engineering manager forgets: the flaky-test cost.
One team I worked with had 1,200 end-to-end tests. About 4% failed at random on each run. Sounds small. It was not.
- 4% of 1,200 tests ≈ 48 fake failures per run × 20 PR runs a day ≈ 1,000 fake failures a day
- Every fake failure starts a re-run, a check, a Slack thread
- On a good week, 3 engineers each lost a day to false failures
- On a bad stretch, it consumed the whole team for two sprints
That is the flaky-test cost. It costs you people, not money. That is why budgets miss it. It shows up as missed deadlines, canceled demos, and tired engineers.
The normal fix is "try harder."
- Better locators (how tests find buttons on the page)
- Wait on the right event
- Don't trust the backend
- Park the bad tests in quarantine
- Review the quarantine every week
All true. None is enough. You can try harder. Flaky tests keep growing.
So I stopped fixing each test. I started fixing how all tests work together.
The Drama: two weeks that broke me
I won't name the company. I will say this. My tests passed on my laptop. They failed only on clean CI builds. They failed when they ran beside another team's tests.
Sometimes they failed. Not every time. Always on Tuesday, between 10:14 AM and 10:22 AM.
We lost two weeks. I tried everything. I tried everything again. I tried everything in a new order.
On day 11, I stood at a whiteboard at 9 PM. The board was full of arrows. I finally saw the truth.
The tests were fine. The framework was the problem.
My framework thought the app was the only thing under test. It was not. The CI server was under test too. So was the database snapshot job. So was the deploy timing on the staging server.
We fixed that one bug. But the two weeks taught me the big lesson:
Fixing flaky tests is not a writing problem. It is a design problem.
The tests don't need more rules. The framework around them needs to be smarter.
That is where the three-part system was born.
The Solution: Planner, Generator, Healer
Here is the whole system in short. The names are mine. The ideas are obvious once you stop pretending they are one job.
Planner
Job: read a feature, a user story, or a bug. Write a test plan.
Not code. A plan. A list of flows, edge cases, set-up, clean-up. In plain Markdown.
Why it is its own job: planning and writing are not the same skill. If one thing does both, tests drift from the plan. You get tests the agent can't explain. And gaps where it had no example to copy.
Plan first. Write later.
What I built three years ago: a plan generator that read from PR descriptions, Jira tickets, and production alerts. It produced a Markdown plan. Engineers reviewed it before any code was written. About 85% of plans were approved as-is. The 15% that were rejected were caught in minutes. Not days of debugging.
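To make that concrete, here is a minimal sketch of what one of those Markdown plans might look like. The feature, flows, and names are invented for illustration; the shape (flows, edge cases, set-up, clean-up) is the real contract.

```markdown
# Test Plan: Password Reset

## Flows
1. Happy path: request reset, open the email link, set a new password, log in.
2. Expired link: open a link older than 24 hours, expect a clear error.
3. Reused link: open the same link twice, expect the second attempt to fail.

## Edge cases
- Email with a plus alias (user+test@example.com)
- Rapid repeat requests (rate limiting)

## Setup
- Seed one verified user per flow.

## Cleanup
- Delete seeded users and any pending reset tokens.
```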
Generator
Job: take an approved plan. Write the test code. Pick the button names. Write the checks. Set up the test.
Why it is its own job: code writing works best with a narrow goal (one plan). Not a wide goal (the whole codebase). A focused generator with one plan beats a smart generator with the whole repo.
What I built: a generator that turned plan Markdown into Playwright tests in TypeScript. It picked button names in a fixed order (data-testid first, then role, then text as last resort). It set up fixtures. It used soft checks by default. No creativity. One plan in, one test file out.
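For a feel of the output, here is a hedged sketch of a generated test. The page, labels, and test IDs are invented; the fixed locator order (data-testid, then role, then text) and the soft checks are the real rules.

```typescript
import { test, expect } from '@playwright/test';

test('password reset: happy path', async ({ page }) => {
  await page.goto('/login');

  // First choice: data-testid.
  await page.getByTestId('forgot-password-link').click();

  // Second choice: role.
  await page.getByRole('textbox', { name: 'Email' }).fill('user@example.com');
  await page.getByRole('button', { name: 'Send reset link' }).click();

  // Soft checks by default: record the failure, keep running.
  await expect.soft(page.getByTestId('reset-confirmation')).toBeVisible();
  // Last resort: text.
  await expect.soft(page.getByText('Check your inbox')).toBeVisible();
});
```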
Healer
Job: a test fails. Check why.
Is it a real bug? A button name that moved? Or a slow server that day?
Fix the things you can. Flag the real bugs. Park the rest with notes.
Why it is its own job: this is the part no one wanted to hear. Healing is not "run it again until it passes." That is hiding. Healing is three steps: check, propose a fix, get it reviewed.
What I built: a Healer that compared the current page to the last green run. If the button name was stale, it proposed three new candidates. It scored each one. It opened a pull request with the best one-line change. A human reviewed it.
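Here is a hedged sketch of that scoring step. The weights, field names, and helpers are invented; the idea is the real logic: prefer stable locator kinds, reward agreement with the last green run, punish ambiguity.

```typescript
// One replacement candidate for a stale locator.
type Candidate = {
  locator: string;                // e.g. "getByTestId('submit-order')"
  kind: 'testid' | 'role' | 'text';
  matchesLastGreenText: boolean;  // same visible text as the last passing run
  uniqueOnPage: boolean;          // resolves to exactly one element now
};

function scoreCandidate(c: Candidate): number {
  let score = 0;
  // Prefer stable locator kinds, in the same order the Generator uses.
  score += { testid: 50, role: 30, text: 10 }[c.kind];
  // Reward candidates that still show the text from the last green run.
  if (c.matchesLastGreenText) score += 30;
  // A locator that matches several elements is a future flake. Punish it.
  if (!c.uniqueOnPage) score -= 40;
  return score;
}

// The top candidate becomes the one-line change in the proposed PR.
function bestCandidate(candidates: Candidate[]): Candidate | undefined {
  return [...candidates].sort((a, b) => scoreCandidate(b) - scoreCandidate(a))[0];
}
```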
Humans merged about 80% of those fixes. The other 20% were caught in review. That is exactly what a good Healer looks like.
- 100% merged means humans aren't reading
- 20% merged means the Healer is broken
- 80% merged means both sides are working
The Numbers
I don't love numbers without a shop name. My rules don't let me name the shop. So here is what I can tell you plainly:
- On one project, the three-part system let the test suite grow 3× in 18 months. The flaky-test rate stayed flat.
- On another, each engineer spent a third less time on broken tests in the first quarter.
- On a third, one CSS rename broke 100+ tests overnight. The Healer fixed it in one pull request by morning. The old way was a 3-week cleanup.
These numbers are not magic. They come from splitting the work into three small jobs. And from watching the handoff between each job. If you already do this with your services, you already know why it works.
Now Playwright Ships This As A Feature
In version 1.56, Playwright shipped a set of Test Agents in VS Code and on the command line:
- Planner agent — explores the app, writes test plans
- Generator agent — turns plans into test code
- Healer agent — fixes failing tests with AI help
The release notes: v1.56 and v1.59. The plumbing APIs are browser.bind() and page.screencast.
Same three jobs. Same split. Microsoft built what I built. They built it better in some ways. They missed one big thing.
What Microsoft got right
Each agent works alone. You can run Planner by itself. Pass its output to Generator. Never touch Healer. That split is the whole point. An agent system where everything is tangled is just one big prompt.
The agents are optional. You don't have to buy in all at once. Drop the Healer into your old tests. Leave Planner and Generator for later. That is how real teams adopt new tools.
They shipped the plumbing, not just the agents. Two pieces matter:
- browser.bind() — added in v1.59. It lets any AI tool like Claude or Cursor connect to a running browser. No fresh browser. No lost cookies. No mocking your login.
- Playwright MCP Bridge — a free Chrome extension. It connects your open tabs to a local Playwright server. Your real cookies. Your real profile. Your real logged-in session.
Together, those two things solve a problem QA teams have been hacking around for years. Let an AI agent work on your real browser. Not a fresh empty one. Microsoft built the plumbing. You don't have to.
What Microsoft missed
The review loop.
Self-healing is not a feature. It is a deal between the test, the app, and the team.
The Healer will happily propose fixes. But who reviews them? Who sets the merge rules? Who steps in when the Healer's fix rate drops? Playwright ships the agent. It does not ship the rules around the agent.
Those rules are the hard part. And you have to build them. Whether you use Microsoft's agents or your own.
A Healer with no review loop is just a bug generator with a nice screen.
What to Do Next (Even if You Don't Migrate)
If you already use Playwright, the path is simple. Try the Planner agent in VS Code next sprint. Feed it one real user story. Compare its plan to your plan. Do that 10 times. If you would hand its plans to a junior engineer, it works. That means you found a 2–3× speed boost.
If you use Selenium, Cypress, or something older, migration got easier this month. But the system is portable. You don't need Microsoft's tools to build it. You need three things:
- Plans as files. Markdown. In git. Reviewable.
- A generator with a narrow goal. One plan in. One file out. No repo-wide thinking.
- A Healer with a review loop. It proposes. A human approves. CI enforces. 80% merge rate means it works.
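To make that rule enforceable instead of folklore, track the merge rate in code. A minimal sketch, assuming you can pull the Healer's recent pull requests from your own PR API; the exact thresholds are judgment calls, not Playwright features.

```typescript
// Judge the Healer by its recent merge rate (assumes a non-empty window).
type HealerPR = { merged: boolean };

function healerStatus(recent: HealerPR[]): 'healthy' | 'rubber-stamping' | 'broken' {
  const mergeRate = recent.filter(pr => pr.merged).length / recent.length;
  if (mergeRate > 0.95) return 'rubber-stamping'; // humans are not reading
  if (mergeRate < 0.5) return 'broken';           // proposals are not trusted
  return 'healthy';                               // both sides are working
}

// Example: 8 of 10 recent fixes merged -> 'healthy'.
console.log(healerStatus([
  ...Array(8).fill({ merged: true }),
  ...Array(2).fill({ merged: false }),
]));
```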
Start with the Healer if flaky tests block releases. Start with the Planner if you are short-staffed. Start with the Generator last. It is the flashy one. But it is the least useful without the other two.
If your team doesn't have this yet, print this post. Paste it in your design doc. Replace "I built" with "we can build." Take it to your next architecture review.
The Takeaway
Three years ago, this system was a weird thing a weird architect built. Nothing off the shelf solved the problem.
This month, it ships as a native feature in the tool serious web teams use. Last October, the agents shipped inside Playwright. This month's v1.59 release added the production parts: video receipts, MCP interop (AI tool bridge), and async disposables.
If you are still treating flaky tests as a writing problem, you are three years behind.
If you treat them as a design problem, you are on time.
If you have been treating them as a design problem for years, you are ahead of the team that ships the framework.
That is a fine place to be.
The system worked then. It ships natively now: agents in v1.56, plumbing in v1.59. The rules around it are still yours to build.
That is the job.
Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test systems where AI agents and human engineers work together on quality. Former Apple SDET (Apple.com and Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.