feat(qa,qa-only): add --evidence-per-finding evidence layout #1484

Open
itstimwhite wants to merge 2 commits into garrytan:main from itstimwhite:feat/qa-evidence-per-finding
@itstimwhite

Contributor

Why

The default flat `screenshots/issue-001-step-1.png` layout works well for 1-5 findings but gets noisy past that, and handing a single finding to a developer is a chore: you have to pluck that finding's screenshots out of a shared bucket and explain which lines of the report apply.

Companion PR #1483 adds `$B record` (video evidence at the browse layer). This PR adds the QA-side complement so that when an interactive bug is captured on video, the `.webm` lives next to the rest of the finding's evidence in its own folder.

The shape mirrors the report-folder pattern that ships in Vercel Labs' `agent-browser` `dogfood` skill. That skill is report-only; ours integrates the same evidence shape into both the report-only `/qa-only` and the fix-loop `/qa`, with our existing severity/health-score model on top.

What

Opt-in flag `--evidence-per-finding` (or natural-language variants such as `evidence per finding` or `one folder per bug`). It writes one self-contained folder per finding under a per-run report dir:

```
.gstack/qa-reports/qa-report-{domain}-{date}/
├── REPORT.md
├── findings/
│   ├── 001-critical-checkout-500-on-submit/
│   │   ├── finding.md       # severity, repro, env, expected/actual
│   │   ├── step-1.png
│   │   ├── step-2.png
│   │   ├── result.png
│   │   └── repro.webm       # OPTIONAL: present iff `$B record` was active
│   └── 002-high-search-no-results/
│       └── ...
└── baseline.json
```
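As a rough illustration of the naming scheme in the tree above, here is a minimal sketch of how a finding's folder path could be derived. The helper names (`slugify`, `findingFolderName`, `findingDir`), the slug rule, and the `medium`/`low` severity levels are assumptions made for this example; the PR itself describes the layout to the LLM through the methodology text rather than through code like this.

```typescript
// Hypothetical severity set: only "critical" and "high" appear in this PR.
type Severity = "critical" | "high" | "medium" | "low";

// Assumed slug rule: lowercase, runs of non-alphanumerics become "-",
// and leading/trailing dashes are trimmed.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds "{NNN}-{severity}-{slug}", matching the folder names in the tree.
function findingFolderName(index: number, severity: Severity, title: string): string {
  const ordinal = String(index).padStart(3, "0");
  return `${ordinal}-${severity}-${slugify(title)}`;
}

// Joins the per-run report dir and the per-finding folder into one path.
function findingDir(
  domain: string,
  date: string,
  index: number,
  severity: Severity,
  title: string,
): string {
  return [
    ".gstack/qa-reports",
    `qa-report-${domain}-${date}`,
    "findings",
    findingFolderName(index, severity, title),
  ].join("/");
}
```

Zero-padding the ordinal keeps folders sorted in discovery order in any file browser, and putting severity second makes the worst findings easy to spot in a plain directory listing.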

`finding.md` schema (defined in the shared methodology):

  • Severity / Category / Page / Detected
  • What's wrong (one paragraph)
  • Repro steps (referencing the step-N.png files)
  • Expected vs actual
  • Environment (browser, viewport, auth)
  • Evidence file index
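To make the schema above concrete, the sketch below renders a `finding.md` from a typed record. The `Finding` interface, the field names, and the exact Markdown headings are hypothetical; the real template lives in the shared methodology and may be laid out differently. The one behavior taken directly from this PR is that `repro.webm` is listed only when a recording exists.

```typescript
// Hypothetical shape covering the six schema bullets.
interface Finding {
  title: string;
  severity: string;
  category: string;
  page: string;
  detected: string;
  whatsWrong: string;
  reproSteps: string[]; // step i is paired with step-{i+1}.png
  expected: string;
  actual: string;
  environment: { browser: string; viewport: string; auth: string };
  hasVideo: boolean; // true only if a recording was captured
}

// Evidence file index: one screenshot per step, the result shot,
// and the optional video.
function evidenceFiles(f: Finding): string[] {
  const steps = f.reproSteps.map((_, i) => `step-${i + 1}.png`);
  return [...steps, "result.png", ...(f.hasVideo ? ["repro.webm"] : [])];
}

// Assumed Markdown layout; only the section contents come from the schema.
function renderFindingMd(f: Finding): string {
  return [
    `# ${f.title}`,
    "",
    `- Severity: ${f.severity}`,
    `- Category: ${f.category}`,
    `- Page: ${f.page}`,
    `- Detected: ${f.detected}`,
    "",
    "## What's wrong",
    f.whatsWrong,
    "",
    "## Repro steps",
    ...f.reproSteps.map((step, i) => `${i + 1}. ${step} (step-${i + 1}.png)`),
    "",
    "## Expected vs actual",
    `- Expected: ${f.expected}`,
    `- Actual: ${f.actual}`,
    "",
    "## Environment",
    `- Browser: ${f.environment.browser}`,
    `- Viewport: ${f.environment.viewport}`,
    `- Auth: ${f.environment.auth}`,
    "",
    "## Evidence",
    ...evidenceFiles(f).map((name) => `- ${name}`),
    "",
  ].join("\n");
}
```

Because every repro step names its own screenshot, the folder can be zipped and attached to a ticket without the reader needing `REPORT.md` for context.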

When to use it (also defined in the methodology, so the LLM picks correctly):

  • Run produces ≥5 findings
  • Any critical or high severity finding
  • An interactive bug has video evidence
  • Findings handed off as Linear/Jira tickets (folder zips to one attachment)

When NOT to use it: quick smoke runs, 1-2 findings, regression-mode reruns. The default flat layout stays exactly as it was.
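The criteria above are written for the LLM to weigh, not implemented as code, and auto-detection is explicitly out of scope for this PR. Purely to illustrate how the signals relate, the sketch below collects which "use" and "skip" conditions apply to a run. The `RunSummary` type and `layoutSignals` function are hypothetical, and the sketch deliberately does not pick a winner when both kinds of signal fire (for example, a two-finding run that includes a critical), since the PR does not define a precedence.

```typescript
// Hypothetical summary of a finished QA run.
interface RunSummary {
  findingSeverities: string[]; // one entry per finding, e.g. "critical"
  hasVideoEvidence: boolean;
  handingOffAsTickets: boolean;
  quickSmokeRun: boolean;
  regressionRerun: boolean;
}

// Returns the matching signals from both lists; the caller decides.
function layoutSignals(run: RunSummary): { use: string[]; skip: string[] } {
  const count = run.findingSeverities.length;
  const use: string[] = [];
  const skip: string[] = [];

  if (count >= 5) use.push("5 or more findings");
  if (run.findingSeverities.some((s) => s === "critical" || s === "high")) {
    use.push("critical or high severity finding");
  }
  if (run.hasVideoEvidence) use.push("video evidence");
  if (run.handingOffAsTickets) use.push("ticket handoff");

  if (run.quickSmokeRun) skip.push("quick smoke run");
  if (count >= 1 && count <= 2) skip.push("1-2 findings");
  if (run.regressionRerun) skip.push("regression-mode rerun");

  return { use, skip };
}
```

Returning both lists mirrors how the methodology leaves the final call to the LLM and to the explicit flag, rather than hard-coding a threshold.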

Implementation note (shape, not size)

Per gstack's prompt-size guidance, the structure and `finding.md` template go through the shared `generateQAMethodology()` resolver (loaded into both /qa and /qa-only via `{{QA_METHODOLOGY}}`). Each leaf `SKILL.md.tmpl` only adds one new Setup-table row and a one-line pointer in the Output Structure section. No copy-paste across leaves.
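For readers unfamiliar with the resolver pattern, the following is a minimal sketch of how a shared placeholder can expand identically in two leaf templates. Only the names `generateQAMethodology` and `{{QA_METHODOLOGY}}` come from this PR; the function body, the `resolvers` map, `renderTemplate`, and the template strings are stand-ins, not gstack's actual generator.

```typescript
// Stand-in body: the real resolver returns the full shared methodology.
function generateQAMethodology(): string {
  return "## Evidence layout\nOne folder per finding under findings/.";
}

// Hypothetical registry mapping placeholder names to resolver functions.
const resolvers: Record<string, () => string> = {
  QA_METHODOLOGY: generateQAMethodology,
};

// Replaces each {{NAME}} with its resolver output; unknown placeholders
// are left untouched so they remain visible in the rendered file.
function renderTemplate(tmpl: string): string {
  return tmpl.replace(/\{\{([A-Z_]+)\}\}/g, (match: string, name: string) => {
    const resolve = resolvers[name];
    return resolve ? resolve() : match;
  });
}

// Two illustrative leaf templates sharing one placeholder.
const qaTmpl = "# /qa\n{{QA_METHODOLOGY}}\nThen fix and re-verify.";
const qaOnlyTmpl = "# /qa-only\n{{QA_METHODOLOGY}}\nReport only.";

const qaDoc = renderTemplate(qaTmpl);
const qaOnlyDoc = renderTemplate(qaOnlyTmpl);
```

With this shape, a change to the shared section is made once in the resolver and reaches both rendered files on the next generation pass, which is why each leaf template needs only a pointer.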

Commits are bisect-friendly per the contributor guide:

  1. `feat(qa,qa-only): add --evidence-per-finding evidence layout` — resolver + both .tmpl source files (78 insertions).
  2. `docs(qa,qa-only): regenerate SKILL.md` — `bun run gen:skill-docs --host all` output (146 insertions, generated).

Verified

  • `bun test test/gen-skill-docs.test.ts test/skill-validation.test.ts` — 712 pass, 0 fail. No regressions in resolver or template validation.
  • `bun run skill:check` — all 10 host-output freshness checks green (one pre-existing `claude/SKILL.md` missing-generated warning reproduces on `main` without these changes; unrelated).
  • Spot-checked rendered output:
    • `qa/SKILL.md` and `qa-only/SKILL.md` both contain the new Setup row, the Document-phase Evidence layout section, and the Output Structure pointer.
    • The `{{QA_METHODOLOGY}}` block expands identically in both files.

Out of scope

  • VERSION bump / CHANGELOG entry — left for the merge so the entry stays in your voice. The flag carries its own discoverability via the Setup table.
  • Telemetry on which layout users pick. Worth measuring after this ships if you care.
  • Auto-detection ("pick per-finding when ≥5 findings"). Kept explicit for now: the heuristic is documented in the methodology so the LLM can choose, but the flag itself is opt-in.

Pairs with

PR #1483 (`feat(browse): add record command for video evidence of interactive bug repros`). The `record` primitive is what produces the `.webm` that lands in each finding folder. Each PR is reviewable independently; the QA flag works without `record` (the .webm just won't be there).

Tim White added 2 commits May 13, 2026 20:16
Adds an opt-in per-finding evidence layout to /qa and /qa-only. When the
user passes --evidence-per-finding (or natural-language variants like
"evidence per finding" / "one folder per bug"), the run writes one
self-contained folder per finding instead of the flat shared-screenshots
layout:

  .gstack/qa-reports/qa-report-{domain}-{date}/
  ├── REPORT.md
  ├── findings/
  │   ├── 001-critical-checkout-500-on-submit/
  │   │   ├── finding.md       (severity, repro, env, expected/actual)
  │   │   ├── step-1.png
  │   │   ├── step-2.png
  │   │   ├── result.png
  │   │   └── repro.webm       (optional — present iff $B record was active)
  │   └── 002-high-search-no-results/
  │       └── ...
  └── baseline.json

The default flat layout is unchanged.

When per-finding is the right call (now in the shared methodology):
- Run produces ≥5 findings — the flat layout gets noisy past that.
- Any finding is critical or high — those tickets travel further and need
  self-contained evidence.
- An interactive bug needs video evidence — pairs with $B record (a
  separate PR adds the recording primitive at the browse layer).
- Findings will be handed off as Linear/Jira tickets — each folder zips
  into a single attachment.

Skip per-finding for quick smoke runs, 1-2 findings, or regression-mode
reruns where baseline.json is the canonical artifact.

Why a shared resolver: the structure and finding.md template are
identical for /qa and /qa-only. Per gstack's "no copy-paste across
leaves" prompt-size guidance, the shared content goes through
generateQAMethodology() (loaded into both via {{QA_METHODOLOGY}}). Each
leaf SKILL.md.tmpl only gets one new Setup-table row and a one-line
Output-Structure pointer to the shared section.

712 existing tests in test/gen-skill-docs.test.ts and
test/skill-validation.test.ts still pass.

Output of `bun run gen:skill-docs --host all` after the prior commit.
Picks up the new Setup-table row + the shared Document-phase Evidence
Layout section in both qa/SKILL.md and qa-only/SKILL.md.