fix(docs): remove unverified citations after re-verify

CreatmanCEO · claude · CreatmanCEO · commit e7f578a3caec · 2026-05-03T08:24:12.000-04:00
After re-verifying every external claim before publication:
- arXiv 2511.19477 ('Building Browser Agents' by Aram Vardanyan) exists, but its
  thesis is about architecture > model capability, NOT specifically about
  a11y-tree vs vision. Removed as misattributed support for that claim.
- Microsoft Fara-7B paper (Awadallah et al., Nov 2025) exists, but the paper
  uses screenshots only and does not discuss a11y-tree alternatives. Removed
  as misattributed support.
- Deque '13,000-page study, ~57%': the specific URL returns 404. The 30-57%
  estimate is widely cited in a11y literature, but no canonical Deque source
  could be found. Replaced with the wording 'estimates vary across studies
  (commonly 30-57% range)', which is honest about the uncertainty.
- Pocock skills repo: bumped 45k+ → 56k+ stars (re-verified directly).

The architectural argument now leans on what IS verifiable: Pramod Dutta's
token-cost analysis, Özal benchmark on Microsoft's own issue tracker, TestDino
benchmarks, Microsoft's own README updates, Simon Willison's TIL.

Same fix applied across launch/ content artefacts (dev.to, Habr, reddit posts,
twitter thread) which live outside this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -93,8 +93,8 @@ This isn't a vibe-coded testing skill. The architecture comes from verified 2026
 | Playwright MCP: ~**1.5M** tokens / e-commerce verify | [Özal benchmark](https://github.com/microsoft/playwright-mcp/issues/889) | Don't use MCP for replay |
 | Playwright CLI: ~**25–27k** tokens / 30 actions | [TestDino](https://testdino.com/blog/playwright-cli/), [Morph](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0) | Use CLI for replay — 50–60× cheaper |
 | 4-agent pipeline (Plan → Generate → Run → Heal): **~4×** less tokens vs live-MCP | [TestDino blog](https://testdino.com/blog) | Validate webtest-orch architecture choice |
-| **a11y-tree primary + selective vision** beats vision-first on cost AND reliability | Microsoft Fara-7B, [arXiv 2511.19477](https://arxiv.org/abs/2511.19477), Browserbase evals | ARIA `browser_snapshot` is correct default, not screenshots |
-| **axe-core** auto-detects ~**57%** of real WCAG issues | [Deque 13,000-page study](https://www.deque.com/) | The other 43% requires LLM judgment — skill ships both |
+| **a11y-tree primary**, vision only when needed | Direct consequence of the token-cost asymmetry above — ARIA `browser_snapshot` returns text; screenshots return images. Microsoft's `init-agents` triplet uses snapshots, not screenshots. | ARIA `browser_snapshot` is the correct default |
+| **axe-core** is widely used for deterministic WCAG checks | [@axe-core/playwright](https://github.com/dequelabs/axe-core-npm) — official Deque integration. Estimates of automated coverage vary across studies (commonly 30–57% range); the LLM judgment layer covers the qualitative remainder (alt-text relevance, layout sanity, focus order). | Use both, not either |
 | Microsoft README now recommends **CLI + Skills over MCP** for coding agents | [Microsoft Playwright official docs](https://playwright.dev/docs/test-agents) | We're aligned with vendor's own architectural recommendation |
 | WCAG 2.5.8 AA touch-target = **24×24 CSS px** | [W3C](https://www.w3.org/WAI/WCAG22/Understanding/target-size-minimum.html) | Hard rule, mobile project enforces |
 | ADA Title II compliance deadline: **April 24, 2026** for state/local govt (WCAG 2.1 AA) | [W3C / DOJ](https://www.ada.gov/resources/2024-03-08-web-rule/) | Legal context for a11y findings |
@@ -289,7 +289,7 @@ What's next (`0.4`):
 If you're researching AI-driven web testing in 2026, these are the canonical references:
 
 - **[Simon Willison's Claude Code + Playwright MCP TIL](https://til.simonwillison.net/claude-code/playwright-mcp-claude-code)** — the de-facto reference for the integration; we follow his "say `Playwright:` explicitly or Claude shells out to bash" rule throughout the skill.
-- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (45k+ stars) — owns the "skills as engineering primitives" frame. Has TDD + diagnose skills, no e2e yet — that's the gap webtest-orch fills.
+- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (56k+ stars at time of writing) — *"Skills for Real Engineers. Straight from my .claude directory."* Has TDD + diagnose skills, no e2e yet — that's the gap webtest-orch fills.
 - **[Alexander Opalic's "Building an AI QA Engineer"](https://alexop.dev/posts/building_ai_qa_engineer_claude_code_playwright/)** — most-cited tutorial in this niche. Personality framing ("Quinn"), humble skepticism, "complement not replacement" voice — same philosophy as webtest-orch.
 - **[Pramod Dutta's token-cost analysis](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0)** — the viral piece that made the MCP-vs-CLI tradeoff legible. Architecture of webtest-orch is built on his insight.
 - **[Microsoft `playwright init-agents`](https://playwright.dev/docs/test-agents)** — the official Planner / Generator / Healer triplet. webtest-orch is compatible; we add the audit + run-diff layer on top.
diff --git a/README.ru.md b/README.ru.md
@@ -93,8 +93,8 @@ Image-budget invariant — архитектурный якорь: **parent ча
 | Playwright MCP: ~**1.5M** токенов / e-commerce verify | [Özal benchmark](https://github.com/microsoft/playwright-mcp/issues/889) | Не использовать MCP для replay |
 | Playwright CLI: ~**25–27k** токенов / 30 действий | [TestDino](https://testdino.com/blog/playwright-cli/), [Morph](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0) | Использовать CLI для replay — в 50–60× дешевле |
 | 4-агентный pipeline (Plan → Generate → Run → Heal): **~4×** меньше токенов vs live-MCP | [TestDino blog](https://testdino.com/blog) | Validates webtest-orch architecture |
-| **a11y-tree primary + selective vision** обходит vision-first по cost AND reliability | Microsoft Fara-7B, [arXiv 2511.19477](https://arxiv.org/abs/2511.19477), Browserbase evals | ARIA `browser_snapshot` — корректный default, не скриншоты |
-| **axe-core** auto-detects ~**57%** реальных WCAG issues | [Deque 13,000-page study](https://www.deque.com/) | Остальные 43% требуют LLM judgment — skill ships оба |
+| **a11y-tree primary**, vision только когда нужен | Прямое следствие token-cost asymmetry выше — ARIA `browser_snapshot` возвращает текст; скриншоты возвращают изображения. Microsoft'овский `init-agents` triplet использует snapshots, не screenshots. | ARIA `browser_snapshot` — корректный default |
+| **axe-core** широко используется для детерминированных WCAG-проверок | [@axe-core/playwright](https://github.com/dequelabs/axe-core-npm) — официальная Deque-интеграция. Estimates автоматизированного покрытия варьируются между исследованиями (обычно диапазон 30–57%); LLM judgment layer покрывает качественный остаток (alt-text relevance, layout sanity, focus order). | Use both, not either |
 | Microsoft README рекомендует **CLI + Skills over MCP** для coding-агентов | [Microsoft Playwright official docs](https://playwright.dev/docs/test-agents) | Мы выровнены с архитектурной рекомендацией самого вендора |
 | WCAG 2.5.8 AA touch-target = **24×24 CSS px** | [W3C](https://www.w3.org/WAI/WCAG22/Understanding/target-size-minimum.html) | Hard rule, mobile project enforces |
 | ADA Title II compliance deadline: **24 апреля 2026** для гос-секторов (WCAG 2.1 AA) | [W3C / DOJ](https://www.ada.gov/resources/2024-03-08-web-rule/) | Legal context для a11y findings |
@@ -276,7 +276,7 @@ Lingua-dogfood выдал 6 feedback-айтемов, ставших `0.2.0` фи
 Если ты исследуешь AI-driven web-тестирование в 2026 — это canonical references:
 
 - **[Simon Willison's Claude Code + Playwright MCP TIL](https://til.simonwillison.net/claude-code/playwright-mcp-claude-code)** — де-факто справочник по интеграции; мы следуем его правилу «say `Playwright:` explicitly or Claude shells out to bash» по всему скиллу.
-- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (45k+ stars) — задаёт фрейм «skills as engineering primitives». Есть TDD + diagnose, нет e2e — gap который заполняет webtest-orch.
+- **[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills)** (56k+ stars на момент написания) — *«Skills for Real Engineers. Straight from my .claude directory.»* Есть TDD + diagnose, нет e2e — gap который заполняет webtest-orch.
 - **[Alexander Opalic's "Building an AI QA Engineer"](https://alexop.dev/posts/building_ai_qa_engineer_claude_code_playwright/)** — самый цитируемый туториал в нише. Personality framing («Quinn»), humble skepticism, голос «complement not replacement» — та же философия что у webtest-orch.
 - **[Pramod Dutta's token-cost analysis](https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0)** — вирусная статья сделавшая trade-off MCP-vs-CLI legible. Архитектура webtest-orch построена на этом insight.
 - **[Microsoft `playwright init-agents`](https://playwright.dev/docs/test-agents)** — официальный triplet Planner / Generator / Healer. webtest-orch совместим; добавляем audit + run-diff layer сверху.