fix(docs): remove unverified citations after re-verify
After re-verifying every external claim before publication:
- arXiv 2511.19477 ('Building Browser Agents' by Aram Vardanyan) exists, but its
thesis is about architecture > model capability — NOT specifically about
a11y-tree vs vision. Removed as misattributed support for that claim.
- Microsoft Fara-7B paper (Awadallah et al., Nov 2025) exists, but the paper
uses screenshots only and does not discuss a11y-tree alternatives. Removed
as misattributed support.
- Deque '13,000-page study, ~57%': the specific URL returns 404. The 30-57%
estimate is widely cited in a11y literature, but no canonical Deque source
could be found. Replaced with an 'estimates vary across studies (commonly
30-57% range)' formulation that is honest about the uncertainty.
- Pocock skills repo: bumped 45k+ → 56k+ stars (re-verified directly).
The architectural argument now leans on what IS verifiable: Pramod Dutta's
token-cost analysis, Özal benchmark on Microsoft's own issue tracker, TestDino
benchmarks, Microsoft's own README updates, Simon Willison's TIL.
Same fix applied across launch/ content artefacts (dev.to, Habr, reddit posts,
twitter thread) which live outside this repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md (3 additions, 3 deletions)
@@ -93,8 +93,8 @@ This isn't a vibe-coded testing skill. The architecture comes from verified 2026
 | Playwright MCP: ~**1.5M** tokens / e-commerce verify | [Özal benchmark](https://github.com/microsoft/playwright-mcp/issues/889) | Don't use MCP for replay |
 | Playwright CLI: ~**25–27k** tokens / 30 actions | [TestDino](https://testdino.com/blog/playwright-cli/), [Morph](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0) | Use CLI for replay: 50–60× cheaper |
 | 4-agent pipeline (Plan → Generate → Run → Heal): **~4×** fewer tokens vs live-MCP | [TestDino blog](https://testdino.com/blog) | Validates the webtest-orch architecture choice |
-| **a11y-tree primary + selective vision** beats vision-first on cost AND reliability | Microsoft Fara-7B, [arXiv 2511.19477](https://arxiv.org/abs/2511.19477), Browserbase evals | ARIA `browser_snapshot` is correct default, not screenshots |
-| **axe-core** auto-detects ~**57%** of real WCAG issues | [Deque 13,000-page study](https://www.deque.com/) | The other 43% requires LLM judgment; skill ships both |
+| **a11y-tree primary**, vision only when needed | Direct consequence of the token-cost asymmetry above: ARIA `browser_snapshot` returns text; screenshots return images. Microsoft's `init-agents` triplet uses snapshots, not screenshots. | ARIA `browser_snapshot` is the correct default |
+| **axe-core** is widely used for deterministic WCAG checks | [@axe-core/playwright](https://github.com/dequelabs/axe-core-npm), the official Deque integration. Estimates of automated coverage vary across studies (commonly 30–57% range); the LLM judgment layer covers the qualitative remainder (alt-text relevance, layout sanity, focus order). | Use both, not either |
 | Microsoft README now recommends **CLI + Skills over MCP** for coding agents | [Microsoft Playwright official docs](https://playwright.dev/docs/test-agents) | We're aligned with vendor's own architectural recommendation |
 | WCAG 2.5.8 AA touch-target = **24×24 CSS px** | [W3C](https://www.w3.org/WAI/WCAG22/Understanding/target-size-minimum.html) | Hard rule, mobile project enforces |
 | ADA Title II compliance deadline: **April 24, 2026** for state/local govt (WCAG 2.1 AA) | [W3C / DOJ](https://www.ada.gov/resources/2024-03-08-web-rule/) | Legal context for a11y findings |
@@ -289,7 +289,7 @@ What's next (`0.4`):
 If you're researching AI-driven web testing in 2026, these are the canonical references:
 - **[Simon Willison's Claude Code + Playwright MCP TIL](https://til.simonwillison.net/claude-code/playwright-mcp-claude-code)**: the de-facto reference for the integration; we follow his "say `Playwright:` explicitly or Claude shells out to bash" rule throughout the skill.
-- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (45k+ stars): owns the "skills as engineering primitives" frame. Has TDD + diagnose skills, no e2e yet; that's the gap webtest-orch fills.
+- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (56k+ stars at time of writing): *"Skills for Real Engineers. Straight from my .claude directory."* Has TDD + diagnose skills, no e2e yet; that's the gap webtest-orch fills.
 - **[Alexander Opalic's "Building an AI QA Engineer"](https://alexop.dev/posts/building_ai_qa_engineer_claude_code_playwright/)**: most-cited tutorial in this niche. Personality framing ("Quinn"), humble skepticism, "complement not replacement" voice; same philosophy as webtest-orch.
 - **[Pramod Dutta's token-cost analysis](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0)**: the viral piece that made the MCP-vs-CLI tradeoff legible. Architecture of webtest-orch is built on his insight.
 - **[Microsoft `playwright init-agents`](https://playwright.dev/docs/test-agents)**: the official Planner / Generator / Healer triplet. webtest-orch is compatible; we add the audit + run-diff layer on top.
 | Playwright MCP: ~**1.5M** tokens / e-commerce verify | [Özal benchmark](https://github.com/microsoft/playwright-mcp/issues/889) | Don't use MCP for replay |
 | Playwright CLI: ~**25–27k** tokens / 30 actions | [TestDino](https://testdino.com/blog/playwright-cli/), [Morph](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0) | Use CLI for replay: 50–60× cheaper |
 | 4-agent pipeline (Plan → Generate → Run → Heal): **~4×** fewer tokens vs live-MCP | [TestDino blog](https://testdino.com/blog) | Validates the webtest-orch architecture |
-| **a11y-tree primary + selective vision** beats vision-first on cost AND reliability | Microsoft Fara-7B, [arXiv 2511.19477](https://arxiv.org/abs/2511.19477), Browserbase evals | ARIA `browser_snapshot` is the correct default, not screenshots |
+| **a11y-tree primary**, vision only when needed | Direct consequence of the token-cost asymmetry above: ARIA `browser_snapshot` returns text; screenshots return images. Microsoft's `init-agents` triplet uses snapshots, not screenshots. | ARIA `browser_snapshot` is the correct default |
+| **axe-core** is widely used for deterministic WCAG checks | [@axe-core/playwright](https://github.com/dequelabs/axe-core-npm), the official Deque integration. Estimates of automated coverage vary across studies (commonly 30–57% range); the LLM judgment layer covers the qualitative remainder (alt-text relevance, layout sanity, focus order). | Use both, not either |
 | Microsoft README recommends **CLI + Skills over MCP** for coding agents | [Microsoft Playwright official docs](https://playwright.dev/docs/test-agents) | We're aligned with the vendor's own architectural recommendation |
 | WCAG 2.5.8 AA touch-target = **24×24 CSS px** | [W3C](https://www.w3.org/WAI/WCAG22/Understanding/target-size-minimum.html) | Hard rule, mobile project enforces |
 | ADA Title II compliance deadline: **April 24, 2026** for the public sector (WCAG 2.1 AA) | [W3C / DOJ](https://www.ada.gov/resources/2024-03-08-web-rule/) | Legal context for a11y findings |
 If you're researching AI-driven web testing in 2026, these are the canonical references:
 - **[Simon Willison's Claude Code + Playwright MCP TIL](https://til.simonwillison.net/claude-code/playwright-mcp-claude-code)**: the de-facto reference for the integration; we follow his "say `Playwright:` explicitly or Claude shells out to bash" rule throughout the skill.
-- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (45k+ stars): sets the "skills as engineering primitives" frame. Has TDD + diagnose, no e2e; that's the gap webtest-orch fills.
+- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (56k+ stars at time of writing): *"Skills for Real Engineers. Straight from my .claude directory."* Has TDD + diagnose, no e2e; that's the gap webtest-orch fills.
 - **[Alexander Opalic's "Building an AI QA Engineer"](https://alexop.dev/posts/building_ai_qa_engineer_claude_code_playwright/)**: the most-cited tutorial in this niche. Personality framing ("Quinn"), humble skepticism, a "complement not replacement" voice; the same philosophy as webtest-orch.
 - **[Pramod Dutta's token-cost analysis](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0)**: the viral piece that made the MCP-vs-CLI trade-off legible. The architecture of webtest-orch is built on this insight.
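
The "50–60× cheaper" figure in the token-cost rows above can be sanity-checked with simple arithmetic. The sketch below is only that: it divides the ~1.5M MCP figure by the 25–27k CLI figure quoted in the table. It assumes the two workloads (an e-commerce verify run and a 30-action replay) are comparable, which the table itself does not establish, and the helper name `tokenRatio` is hypothetical, not part of webtest-orch.

```typescript
// Ratio of two token counts. Inputs are the figures quoted in the README
// table, not measurements taken by this snippet.
function tokenRatio(baselineTokens: number, candidateTokens: number): number {
  if (candidateTokens <= 0) {
    throw new RangeError("candidateTokens must be positive");
  }
  return baselineTokens / candidateTokens;
}

const mcpTokens = 1_500_000;   // Playwright MCP row (~1.5M tokens)
const cliTokensLow = 25_000;   // Playwright CLI row, lower bound
const cliTokensHigh = 27_000;  // Playwright CLI row, upper bound

// Smaller CLI cost gives the larger ratio, and vice versa.
const upperRatio = tokenRatio(mcpTokens, cliTokensLow);
const lowerRatio = tokenRatio(mcpTokens, cliTokensHigh);

console.log(`${lowerRatio.toFixed(1)}x to ${upperRatio.toFixed(1)}x`);
```

Under those assumptions the quotient falls between roughly 55× and 60×, which is consistent with the rounded "50–60×" in the table.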