Commit e7f578a

CreatmanCEO and claude committed
fix(docs): remove unverified citations after re-verify
After re-verifying every external claim before publication:

- arXiv 2511.19477 ('Building Browser Agents' by Aram Vardanyan) exists, but its thesis is about architecture > model capability, NOT specifically about a11y-tree vs vision. Removed as misattributed support for that claim.
- Microsoft Fara-7B paper (Awadallah et al., Nov 2025) exists, but the paper uses screenshots only and does not discuss a11y-tree alternatives. Removed as misattributed support.
- Deque '13,000-page study, ~57%': the specific URL is 404. The 30-57% estimate is widely cited in a11y literature but no canonical Deque source was findable. Replaced with 'estimates vary across studies (commonly 30-57% range)' formulation that is honest about the uncertainty.
- Pocock skills repo: bumped 45k+ → 56k+ stars (re-verified directly).

The architectural argument now leans on what IS verifiable: Pramod Dutta's token-cost analysis, Özal benchmark on Microsoft's own issue tracker, TestDino benchmarks, Microsoft's own README updates, Simon Willison's TIL.

Same fix applied across launch/ content artefacts (dev.to, Habr, reddit posts, twitter thread) which live outside this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6512d02 commit e7f578a
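
The re-verification pass described in the commit message could be partly mechanised. The following is a minimal sketch, not the author's actual process: `extract_links` and `check_url` are hypothetical helper names, and the regex only handles inline markdown links of the kind used in the README tables.

```python
import re
import urllib.error
import urllib.request
from typing import List, Optional, Tuple

# Matches inline markdown links of the form [text](http...). This is a
# simplification: reference-style links and URLs containing ")" are ignored.
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def extract_links(markdown: str) -> List[Tuple[str, str]]:
    """Return (link text, URL) pairs for every inline markdown link."""
    return LINK_RE.findall(markdown)


def check_url(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for url, or None if no response was received."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a dead citation
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...
```

A 200 status only shows that a page exists. As the arXiv and Fara-7B items above illustrate, whether the source actually supports the claim still has to be checked by reading it.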

2 files changed

Lines changed: 6 additions & 6 deletions

README.md

Lines changed: 3 additions & 3 deletions
@@ -93,8 +93,8 @@ This isn't a vibe-coded testing skill. The architecture comes from verified 2026
| Playwright MCP: ~**1.5M** tokens / e-commerce verify | [Özal benchmark](https://github.com/microsoft/playwright-mcp/issues/889) | Don't use MCP for replay |
| Playwright CLI: ~**25–27k** tokens / 30 actions | [TestDino](https://testdino.com/blog/playwright-cli/), [Morph](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0) | Use CLI for replay — 50–60× cheaper |
| 4-agent pipeline (Plan → Generate → Run → Heal): **~** less tokens vs live-MCP | [TestDino blog](https://testdino.com/blog) | Validate webtest-orch architecture choice |
96-
| **a11y-tree primary + selective vision** beats vision-first on cost AND reliability | Microsoft Fara-7B, [arXiv 2511.19477](https://arxiv.org/abs/2511.19477), Browserbase evals | ARIA `browser_snapshot` is correct default, not screenshots |
97-
| **axe-core** auto-detects ~**57%** of real WCAG issues | [Deque 13,000-page study](https://www.deque.com/) | The other 43% requires LLM judgment — skill ships both |
96+
| **a11y-tree primary**, vision only when needed | Direct consequence of the token-cost asymmetry above — ARIA `browser_snapshot` returns text; screenshots return images. Microsoft's `init-agents` triplet uses snapshots, not screenshots. | ARIA `browser_snapshot` is the correct default |
97+
| **axe-core** is widely used for deterministic WCAG checks | [@axe-core/playwright](https://github.com/dequelabs/axe-core-npm) — official Deque integration. Estimates of automated coverage vary across studies (commonly 30–57% range); the LLM judgment layer covers the qualitative remainder (alt-text relevance, layout sanity, focus order). | Use both, not either |
| Microsoft README now recommends **CLI + Skills over MCP** for coding agents | [Microsoft Playwright official docs](https://playwright.dev/docs/test-agents) | We're aligned with vendor's own architectural recommendation |
| WCAG 2.5.8 AA touch-target = **24×24 CSS px** | [W3C](https://www.w3.org/WAI/WCAG22/Understanding/target-size-minimum.html) | Hard rule, mobile project enforces |
| ADA Title II compliance deadline: **April 24, 2026** for state/local govt (WCAG 2.1 AA) | [W3C / DOJ](https://www.ada.gov/resources/2024-03-08-web-rule/) | Legal context for a11y findings |
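
The "50–60× cheaper" figure in the table above can be sanity-checked against the two token counts quoted next to it. This is a rough sketch only: the two numbers come from different workloads (one e-commerce verify run versus 30 actions), so the ratio is indicative rather than a like-for-like benchmark. The constant and function names are made up for illustration.

```python
# Figures quoted in the README table (approximate, from the cited benchmarks).
MCP_TOKENS = 1_500_000               # Playwright MCP, one e-commerce verify run
CLI_TOKENS_RANGE = (25_000, 27_000)  # Playwright CLI, 30 actions


def cost_ratio(expensive: int, cheap: int) -> float:
    """How many times more tokens the expensive path uses."""
    return expensive / cheap


low = cost_ratio(MCP_TOKENS, CLI_TOKENS_RANGE[1])   # against the 27k figure
high = cost_ratio(MCP_TOKENS, CLI_TOKENS_RANGE[0])  # against the 25k figure
print(f"MCP / CLI token ratio: {low:.1f}x to {high:.1f}x")
# prints "MCP / CLI token ratio: 55.6x to 60.0x"
```

Under these assumptions the quoted range is consistent with the table's own numbers, sitting at the upper end of "50–60×".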
@@ -289,7 +289,7 @@ What's next (`0.4`):
If you're researching AI-driven web testing in 2026, these are the canonical references:

- **[Simon Willison's Claude Code + Playwright MCP TIL](https://til.simonwillison.net/claude-code/playwright-mcp-claude-code)** — the de-facto reference for the integration; we follow his "say `Playwright:` explicitly or Claude shells out to bash" rule throughout the skill.
292-
- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (45k+ stars) — owns the "skills as engineering primitives" frame. Has TDD + diagnose skills, no e2e yet — that's the gap webtest-orch fills.
292+
- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (56k+ stars at time of writing) — *"Skills for Real Engineers. Straight from my .claude directory."* Has TDD + diagnose skills, no e2e yet — that's the gap webtest-orch fills.
- **[Alexander Opalic's "Building an AI QA Engineer"](https://alexop.dev/posts/building_ai_qa_engineer_claude_code_playwright/)** — most-cited tutorial in this niche. Personality framing ("Quinn"), humble skepticism, "complement not replacement" voice — same philosophy as webtest-orch.
- **[Pramod Dutta's token-cost analysis](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0)** — the viral piece that made the MCP-vs-CLI tradeoff legible. Architecture of webtest-orch is built on his insight.
- **[Microsoft `playwright init-agents`](https://playwright.dev/docs/test-agents)** — the official Planner / Generator / Healer triplet. webtest-orch is compatible; we add the audit + run-diff layer on top.

README.ru.md

Lines changed: 3 additions & 3 deletions
@@ -93,8 +93,8 @@ Image-budget invariant — архитектурный якорь: **parent ча
| Playwright MCP: ~**1.5M** tokens / e-commerce verify | [Özal benchmark](https://github.com/microsoft/playwright-mcp/issues/889) | Don't use MCP for replay |
| Playwright CLI: ~**25–27k** tokens / 30 actions | [TestDino](https://testdino.com/blog/playwright-cli/), [Morph](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0) | Use CLI for replay: 50–60× cheaper |
| 4-agent pipeline (Plan → Generate → Run → Heal): **~** fewer tokens vs live-MCP | [TestDino blog](https://testdino.com/blog) | Validates webtest-orch architecture |
96-
| **a11y-tree primary + selective vision** beats vision-first on cost AND reliability | Microsoft Fara-7B, [arXiv 2511.19477](https://arxiv.org/abs/2511.19477), Browserbase evals | ARIA `browser_snapshot` is the correct default, not screenshots |
97-
| **axe-core** auto-detects ~**57%** of real WCAG issues | [Deque 13,000-page study](https://www.deque.com/) | The remaining 43% requires LLM judgment; the skill ships both |
96+
| **a11y-tree primary**, vision only when needed | A direct consequence of the token-cost asymmetry above: ARIA `browser_snapshot` returns text; screenshots return images. Microsoft's `init-agents` triplet uses snapshots, not screenshots. | ARIA `browser_snapshot` is the correct default |
97+
| **axe-core** is widely used for deterministic WCAG checks | [@axe-core/playwright](https://github.com/dequelabs/axe-core-npm), the official Deque integration. Estimates of automated coverage vary across studies (commonly in the 30–57% range); the LLM judgment layer covers the qualitative remainder (alt-text relevance, layout sanity, focus order). | Use both, not either |
| Microsoft README recommends **CLI + Skills over MCP** for coding agents | [Microsoft Playwright official docs](https://playwright.dev/docs/test-agents) | We're aligned with the vendor's own architectural recommendation |
| WCAG 2.5.8 AA touch-target = **24×24 CSS px** | [W3C](https://www.w3.org/WAI/WCAG22/Understanding/target-size-minimum.html) | Hard rule, mobile project enforces |
| ADA Title II compliance deadline: **April 24, 2026** for public-sector entities (WCAG 2.1 AA) | [W3C / DOJ](https://www.ada.gov/resources/2024-03-08-web-rule/) | Legal context for a11y findings |
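
The 24×24 CSS px row in the table above lends itself to a simple deterministic check. The sketch below is one way such a rule might be expressed, not how webtest-orch implements it. `Box` and `meets_min_target` are hypothetical names, and the exceptions that WCAG 2.5.8 allows (spacing, inline targets, user-agent controls, equivalent controls, essential presentation) are deliberately ignored.

```python
from dataclasses import dataclass

MIN_TARGET_PX = 24  # WCAG 2.5.8 (AA) minimum target size, in CSS pixels


@dataclass(frozen=True)
class Box:
    """Bounding box of an interactive element, in CSS pixels (hypothetical type)."""
    width: float
    height: float


def meets_min_target(box: Box, minimum: float = MIN_TARGET_PX) -> bool:
    """True when both dimensions reach the minimum; the spec's exceptions are not modelled."""
    return box.width >= minimum and box.height >= minimum
```

Because the exceptions are skipped, a `False` result here marks a candidate finding for review rather than a confirmed violation.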
@@ -276,7 +276,7 @@ Lingua-dogfood выдал 6 feedback-айтемов, ставших `0.2.0` фи
If you're researching AI-driven web testing in 2026, these are the canonical references:

- **[Simon Willison's Claude Code + Playwright MCP TIL](https://til.simonwillison.net/claude-code/playwright-mcp-claude-code)**: the de-facto reference for the integration; we follow his "say `Playwright:` explicitly or Claude shells out to bash" rule throughout the skill.
279-
- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (45k+ stars): sets the "skills as engineering primitives" frame. Has TDD + diagnose, no e2e; that's the gap webtest-orch fills.
279+
- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (56k+ stars at time of writing): *"Skills for Real Engineers. Straight from my .claude directory."* Has TDD + diagnose, no e2e; that's the gap webtest-orch fills.
- **[Alexander Opalic's "Building an AI QA Engineer"](https://alexop.dev/posts/building_ai_qa_engineer_claude_code_playwright/)**: the most-cited tutorial in this niche. Personality framing ("Quinn"), humble skepticism, a "complement not replacement" voice; the same philosophy as webtest-orch.
- **[Pramod Dutta's token-cost analysis](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0)**: the viral article that made the MCP-vs-CLI trade-off legible. The architecture of webtest-orch is built on this insight.
- **[Microsoft `playwright init-agents`](https://playwright.dev/docs/test-agents)**: the official Planner / Generator / Healer triplet. webtest-orch is compatible; we add an audit + run-diff layer on top.