| 1 | +# Confidence Gate |
| 2 | + |
| 3 | +## Principle |
| 4 | + |
| 5 | +When generating tests, scaffolding fixtures, classifying risk, or proposing any non-trivial test artifact, emit a confidence assessment before writing code. If confidence is below the stop threshold of 5, stop and ask the user instead of generating plausible-looking output built on guesses. |
| 6 | + |
| 7 | +## Rationale |
| 8 | + |
| 9 | +The failure mode of LLM-generated tests is rarely "refused to try" — it is "generated something plausible that passes locally and breaks silently in CI." Hallucinated selectors, invented endpoint paths, fabricated risk scores, and reverse-engineered schemas all produce code that looks correct and tests nothing real. A confidence gate makes that failure mode loud by forcing the agent to declare its evidence and its unknowns before any artifact is committed. |
| 10 | + |
| 11 | +## Required output shape |
| 12 | + |
| 13 | +Every non-trivial test artifact proposal must include: |
| 14 | + |
| 15 | +``` |
| 16 | +Confidence: <1-10> |
| 17 | +Rationale: <one or two sentences citing concrete evidence from the repo or contract> |
| 18 | +Unknowns: <bulleted list of things the agent does not know> |
| 19 | +``` |
| 20 | + |
| 21 | +The Rationale must cite a file path, a contract document, an existing pattern, or a captured observation. Vague rationale ("based on standard patterns", "looks similar to other tests") is not evidence and forces the score down. |
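To make the three-field shape concrete, here is a minimal TypeScript sketch that renders an assessment in the required format. The `ConfidenceAssessment` type, the `formatAssessment` helper, and the `- none` placeholder for an empty Unknowns list are hypothetical names and choices made for illustration; the fragment itself does not prescribe them.

```typescript
// Hypothetical shape for the three-field assessment; names are illustrative only.
interface ConfidenceAssessment {
  confidence: number; // integer from 1 to 10
  rationale: string;  // one or two sentences citing concrete evidence
  unknowns: string[]; // things the agent does not know
}

// Render the assessment in the required output shape.
function formatAssessment(a: ConfidenceAssessment): string {
  if (!Number.isInteger(a.confidence) || a.confidence < 1 || a.confidence > 10) {
    throw new RangeError("confidence must be an integer from 1 to 10");
  }
  // Assumption: an empty Unknowns list is rendered as "- none".
  const unknowns =
    a.unknowns.length > 0 ? a.unknowns.map((u) => `- ${u}`).join("\n") : "- none";
  return [
    `Confidence: ${a.confidence}`,
    `Rationale: ${a.rationale}`,
    "Unknowns:",
    unknowns,
  ].join("\n");
}
```

A formatter like this only enforces the shape. Whether the Rationale actually cites evidence still has to be judged by the reader or the agent.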
| 22 | + |
| 23 | +## Threshold rule |
| 24 | + |
| 25 | +- **Confidence ≥ 7** — proceed with generation. |
| 26 | +- **Confidence 5–6** — proceed but surface the assumptions to the user in the output so they can correct mid-flight. |
| 27 | +- **Confidence < 5** — STOP. Do not generate. Ask the user to resolve the most-blocking Unknown first. |
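The three bands above can be sketched as a small decision function. This is an illustrative sketch, not part of the fragment: the `GateAction` labels and the `gateAction` name are hypothetical, and the range check assumes the 1-10 integer scale from the required output shape.

```typescript
// Hypothetical labels for the three outcomes of the threshold rule.
type GateAction = "proceed" | "proceed-with-assumptions" | "stop-and-ask";

// Map a confidence score to the action the threshold rule prescribes.
function gateAction(confidence: number): GateAction {
  if (!Number.isInteger(confidence) || confidence < 1 || confidence > 10) {
    throw new RangeError("confidence must be an integer from 1 to 10");
  }
  if (confidence >= 7) return "proceed";                  // 7-10: generate
  if (confidence >= 5) return "proceed-with-assumptions"; // 5-6: generate, surface assumptions
  return "stop-and-ask";                                  // 1-4: do not generate
}
```

For example, `gateAction(6)` returns `"proceed-with-assumptions"`, while `gateAction(4)` returns `"stop-and-ask"`.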
| 28 | + |
| 29 | +## When to apply |
| 30 | + |
| 31 | +Apply the gate when generating or proposing: |
| 32 | + |
| 33 | +- **Selectors and page objects.** Must have explored the live application via `playwright-cli` or read existing page object patterns. Confidence < 5 if neither. |
| 34 | +- **Endpoint paths and request shapes.** Must have read the OpenAPI / Swagger contract or existing endpoint enums. Confidence < 5 if the endpoint is being invented. |
| 35 | +- **Risk classification (test-design, NFR).** Must cite probability and impact evidence. Confidence < 5 if scoring is vibes-based. |
| 36 | +- **Fixture composition.** Must understand existing `mergeTests` patterns and fixture boundaries in the repo. Confidence < 5 if composing blindly. |
| 37 | +- **Schema authoring (Zod, Ajv, JSON Schema).** Must have a documented contract source (OpenAPI, JSON schema, existing schema file). Confidence < 5 if reverse-engineering from a single sample response. |
| 38 | +- **Data factories.** Must understand the production data shape and constraints. Confidence < 5 if guessing field validity rules. |
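One way to read the list above is as an evidence requirement per artifact kind, with a cap that forces the score below 5 when that evidence is missing. The sketch below assumes this reading. The `ArtifactKind` names, the `REQUIRED_EVIDENCE` table, and the `effectiveConfidence` helper are all hypothetical, and capping at exactly 4 is an assumption (the list only says the score must be below 5).

```typescript
// Hypothetical identifiers for the artifact kinds listed above.
type ArtifactKind =
  | "selector"
  | "endpoint"
  | "risk"
  | "fixture"
  | "schema"
  | "data-factory";

// Evidence each kind requires, paraphrased from the list above.
const REQUIRED_EVIDENCE: Record<ArtifactKind, string> = {
  selector: "explored the live application or read existing page object patterns",
  endpoint: "read the OpenAPI / Swagger contract or existing endpoint enums",
  risk: "cited probability and impact evidence",
  fixture: "understood existing mergeTests patterns and fixture boundaries",
  schema: "has a documented contract source",
  "data-factory": "understood the production data shape and constraints",
};

// Assumption: missing evidence caps the score at 4, the highest value below 5.
function effectiveConfidence(reported: number, hasRequiredEvidence: boolean): number {
  return hasRequiredEvidence ? reported : Math.min(reported, 4);
}
```

With this cap, a self-reported 9 for an invented endpoint becomes `effectiveConfidence(9, false)`, which is 4, so the stop rule applies regardless of how confident the agent felt.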
| 39 | + |
| 40 | +## When NOT to apply |
| 41 | + |
| 42 | +- Mechanical refactors with clear scope (rename a variable, add a tag, update an import). |
| 43 | +- Reading or summarizing existing artifacts. |
| 44 | +- Producing reports from already-gathered data. |
| 45 | +- Trivial test additions that copy an existing pattern exactly. |
| 46 | + |
| 47 | +The gate exists to prevent fabrication, not to bureaucratize obvious work. |
| 48 | + |
| 49 | +## Anti-patterns |
| 50 | + |
| 51 | +❌ **Vanity scores.** `Confidence: 9` with no Rationale, or Rationale that does not cite evidence. Score the evidence, not the optimism. |
| 52 | + |
| 53 | +❌ **Listing then ignoring Unknowns.** Listing unknowns and then proceeding anyway when Confidence is below 5. Below that threshold, the only valid next action is to ask the user. |
| 54 | + |
| 55 | +❌ **Asking generically.** Asking "should I proceed?" instead of resolving the most-blocking Unknown with a concrete one-sentence question. |
| 56 | + |
| 57 | +❌ **Inflating to clear the bar.** Adjusting Confidence upward to avoid the stop rule. If the evidence is weak, the score is weak; resolve the evidence, not the number. |
| 58 | + |
| 59 | +## Patterns that work |
| 60 | + |
| 61 | +✅ **Cite the source.** "Confidence: 8 — Rationale: read `src/openapi/users.yaml` lines 142-167 and existing schema at `tests/api/users.schema.ts`." |
| 62 | + |
| 63 | +✅ **One concrete Unknown.** When below threshold, ask one specific question: "Is `POST /users/{id}/role` documented anywhere? I can't find it in the OpenAPI spec and there are no existing tests for it." |
| 64 | + |
| 65 | +✅ **Promote evidence.** When the user answers the Unknown, the Rationale gets stronger and Confidence rises legitimately. The gate is a feedback loop, not a checkpoint. |
| 66 | + |
| 67 | +## Related fragments |
| 68 | + |
| 69 | +- `test-quality.md` — Definition of Done for tests; the gate protects DoD compliance. |
| 70 | +- `risk-governance.md` — risk scoring discipline that informs Rationale for risk-related gates. |
| 71 | +- `probability-impact.md` — scoring scales used in risk-related Rationale. |
| 72 | +- `selector-resilience.md` — selector confidence specifically. |
| 73 | +- `playwright-cli.md` — the sanctioned exploration tool that promotes selector Confidence. |