Commit a6b2caa

Merge pull request #106 from bmad-code-org/feat/bmad-tea-audit
feat: audit confidence score
2 parents fcee9e3 + 838adb4 commit a6b2caa

3 files changed

Lines changed: 75 additions & 0 deletions

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# Confidence Gate

## Principle

When generating tests, scaffolding fixtures, classifying risk, or proposing any non-trivial test artifact, emit a confidence assessment before writing code. If confidence is below the threshold, stop and ask the user instead of generating plausible-looking output built on guesses.

## Rationale

The failure mode of LLM-generated tests is rarely "refused to try"; it is "generated something plausible that passes locally and breaks silently in CI." Hallucinated selectors, invented endpoint paths, fabricated risk scores, and reverse-engineered schemas all produce code that looks correct and tests nothing real. A confidence gate makes that failure mode loud by forcing the agent to declare its evidence and its unknowns before any artifact is committed.

## Required output shape

Every non-trivial test artifact proposal must include:

```
Confidence: <1-10>
Rationale: <one or two sentences citing concrete evidence from the repo or contract>
Unknowns: <bulleted list of things the agent does not know>
```

The Rationale must cite a file path, a contract document, an existing pattern, or a captured observation. Vague rationale ("based on standard patterns", "looks similar to other tests") is not evidence and forces the score down.
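
As an illustration of the output shape above, a minimal TypeScript sketch of a parser for the three-field block follows. The `ConfidenceAssessment` type and the `parseAssessment` function are hypothetical names, not part of this commit, and the rule of returning `null` for malformed input is an assumption about how a consumer might treat a missing field.

```typescript
// Hypothetical type: field names mirror the three labels in the required output.
interface ConfidenceAssessment {
  confidence: number; // integer from 1 to 10
  rationale: string;
  unknowns: string[];
}

// Parse the three-field block. Returns null when a field is missing, the
// Rationale is empty, or the score is outside 1-10, so malformed output is
// never mistaken for a passing assessment.
function parseAssessment(text: string): ConfidenceAssessment | null {
  let confidence: number | null = null;
  let rationale: string | null = null;
  let unknowns: string[] | null = null;
  let inUnknowns = false;

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    const conf = /^Confidence:\s*(\d+)$/.exec(line);
    const rat = /^Rationale:\s*(.*)$/.exec(line);
    const unk = /^Unknowns:\s*(.*)$/.exec(line);
    if (conf) {
      confidence = Number(conf[1]);
      inUnknowns = false;
    } else if (rat) {
      rationale = rat[1] ?? "";
      inUnknowns = false;
    } else if (unk) {
      // An inline value after "Unknowns:" counts as the first entry.
      const first = unk[1] ?? "";
      unknowns = first ? [first] : [];
      inUnknowns = true;
    } else if (inUnknowns && unknowns !== null && /^[-*]\s+/.test(line)) {
      // Bulleted lines after "Unknowns:" are collected as entries.
      unknowns.push(line.replace(/^[-*]\s+/, ""));
    }
  }

  if (confidence === null || rationale === null || unknowns === null) return null;
  if (confidence < 1 || confidence > 10 || rationale.length === 0) return null;
  return { confidence, rationale, unknowns };
}
```

A caller could reject any proposal for which `parseAssessment` returns `null` before looking at the score at all.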

## Threshold rule

- **Confidence ≥ 7**: proceed with generation.
- **Confidence 5–6**: proceed, but surface the assumptions to the user in the output so they can correct them mid-flight.
- **Confidence < 5**: STOP. Do not generate. Ask the user to resolve the most-blocking Unknown first.
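
The three bands above can be sketched as a small decision function. This is a minimal sketch, not the agent's actual implementation; `GateAction` and `gateAction` are hypothetical names, and rejecting non-integer or out-of-range scores is an added assumption based on the 1-10 scale.

```typescript
// Hypothetical action labels, one per band of the threshold rule.
type GateAction = "proceed" | "proceed-with-assumptions" | "stop-and-ask";

function gateAction(confidence: number): GateAction {
  // Assumption: scores are integers on the 1-10 scale; anything else is an error.
  if (!Number.isInteger(confidence) || confidence < 1 || confidence > 10) {
    throw new RangeError(`Confidence must be an integer from 1 to 10, got ${confidence}`);
  }
  if (confidence >= 7) return "proceed"; // generate
  if (confidence >= 5) return "proceed-with-assumptions"; // generate, list assumptions
  return "stop-and-ask"; // do not generate; ask about the most-blocking Unknown
}
```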

## When to apply

Apply the gate when generating or proposing:

- **Selectors and page objects.** Must have explored the live application via `playwright-cli` or read existing page object patterns. Confidence < 5 if neither.
- **Endpoint paths and request shapes.** Must have read the OpenAPI / Swagger contract or existing endpoint enums. Confidence < 5 if the endpoint is being invented.
- **Risk classification (test-design, NFR).** Must cite probability and impact evidence. Confidence < 5 if scoring is vibes-based.
- **Fixture composition.** Must understand existing `mergeTests` patterns and fixture boundaries in the repo. Confidence < 5 if composing blindly.
- **Schema authoring (Zod, Ajv, JSON Schema).** Must have a documented contract source (OpenAPI, JSON schema, existing schema file). Confidence < 5 if reverse-engineering from a single sample response.
- **Data factories.** Must understand the production data shape and constraints. Confidence < 5 if guessing field validity rules.
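
One way to read the list above is as a lookup from artifact kind to the evidence that must exist before a score of 5 or more is allowed. The sketch below assumes that reading; the `ArtifactKind` values, the evidence labels, and `capConfidence` are all hypothetical names that paraphrase the bullets, and the cap of 4 is simply the highest integer that satisfies "Confidence < 5".

```typescript
type ArtifactKind = "selector" | "endpoint" | "risk" | "fixture" | "schema" | "data-factory";

// Hypothetical evidence labels, one entry per bullet. Where a bullet says
// "X or Y", either label on its own is treated as sufficient.
const REQUIRED_EVIDENCE: Record<ArtifactKind, string[]> = {
  selector: ["explored-live-app", "read-page-objects"],
  endpoint: ["read-api-contract", "read-endpoint-enums"],
  risk: ["probability-impact-evidence"],
  fixture: ["read-merge-tests-patterns"],
  schema: ["documented-contract-source"],
  "data-factory": ["production-data-shape"],
};

// Cap the claimed score at 4 when none of the required evidence was gathered.
// A score that is already below 5 is returned unchanged.
function capConfidence(kind: ArtifactKind, claimed: number, evidence: string[]): number {
  const hasEvidence = REQUIRED_EVIDENCE[kind].some((e) => evidence.includes(e));
  return hasEvidence ? claimed : Math.min(claimed, 4);
}
```

For example, `capConfidence("selector", 9, [])` yields 4, which falls in the stop band, while the same claim backed by `"read-page-objects"` keeps its 9.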

## When NOT to apply

- Mechanical refactors with clear scope (rename a variable, add a tag, update an import).
- Reading or summarizing existing artifacts.
- Producing reports from already-gathered data.
- Trivial test additions that copy an existing pattern exactly.

The gate exists to prevent fabrication, not to bureaucratize obvious work.

## Anti-patterns

**Vanity scores.** `Confidence: 9` with no Rationale, or Rationale that does not cite evidence. Score the evidence, not the optimism.

**Listing then ignoring Unknowns.** Listing unknowns and then proceeding anyway when Confidence is below the stop threshold (< 5). Below that threshold, the only valid next action is to ask the user.

**Asking generically.** Asking "should I proceed?" instead of resolving the most-blocking Unknown with a concrete one-sentence question.

**Inflating to clear the bar.** Adjusting Confidence upward to avoid the stop rule. If the evidence is weak, the score is weak; resolve the evidence, not the number.
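
Two of these anti-patterns lend themselves to a mechanical check. The sketch below is a rough heuristic under stated assumptions, not a rule from this fragment: `findAntiPatterns` and its labels are hypothetical, and "cites evidence" is approximated as "mentions something that looks like a file path", which is narrower than the fragment's definition (contract documents, existing patterns, and captured observations also count).

```typescript
interface Assessment {
  confidence: number;
  rationale: string;
  unknowns: string[];
}

// Assumption: a path-like token (contains a slash, or ends in a common file
// extension) is treated as a cited source. This is a heuristic only.
const PATH_LIKE = /[\w.-]+\/[\w./-]+|\b[\w-]+\.(ya?ml|json|ts|js|md)\b/;

// `proceeded` says whether the agent went on to generate an artifact.
function findAntiPatterns(a: Assessment, proceeded: boolean): string[] {
  const problems: string[] = [];
  // Vanity score: a proceed-level score whose Rationale cites no source.
  if (a.confidence >= 7 && !PATH_LIKE.test(a.rationale)) {
    problems.push("vanity-score");
  }
  // Ignored Unknowns: generation happened although the score was in the stop band.
  if (a.confidence < 5 && proceeded) {
    problems.push("ignored-unknowns");
  }
  return problems;
}
```

Inflated scores and generic questions are harder to detect this way, since both depend on judging the evidence itself rather than the shape of the output.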

## Patterns that work

**Cite the source.** "Confidence: 8. Rationale: read `src/openapi/users.yaml` lines 142-167 and existing schema at `tests/api/users.schema.ts`."

**One concrete Unknown.** When below threshold, ask one specific question: "Is `POST /users/{id}/role` documented anywhere? I can't find it in the OpenAPI spec and there are no existing tests for it."

**Promote evidence.** When the user answers the Unknown, the Rationale gets stronger and Confidence rises legitimately. The gate is a feedback loop, not a checkpoint.

## Related fragments

- `test-quality.md` - Definition of Done for tests; the gate protects DoD compliance.
- `risk-governance.md` - risk scoring discipline that informs Rationale for risk-related gates.
- `probability-impact.md` - scoring scales used in risk-related Rationale.
- `selector-resilience.md` - selector confidence specifically.
- `playwright-cli.md` - the sanctioned exploration tool that promotes selector Confidence.

src/agents/bmad-tea/resources/knowledge/test-quality.md

Lines changed: 1 addition & 0 deletions
@@ -647,6 +647,7 @@ test('admin action', async ({ page }) => {
- `data-factories.md` - Isolated, parallel-safe data patterns
- `fixture-architecture.md` - Setup extraction and cleanup
- `test-levels-framework.md` - Choosing appropriate test granularity for speed
- `confidence-gate.md` - Agent reliability gate that protects DoD compliance during LLM-assisted test generation

## Core Quality Checklist

src/agents/bmad-tea/resources/tea-index.csv

Lines changed: 1 addition & 0 deletions
@@ -50,3 +50,4 @@ webhook-waiting,Webhook Waiting and Querying,"waitFor, waitForCount, getReceived
webhook-timeout-error,WebhookTimeoutError Debugging,"templateName, timeoutMs, totalReceived, receivedWebhooks, matcherDetails, toJSON — inspect what arrived vs what was expected","webhook,debugging,errors,playwright-utils",extended,knowledge/webhook-timeout-error.md
webhook-providers,Webhook Provider Patterns,"WireMock (deleteById supported), MockServer (deleteById no-op), Mockoon (deleteById no-op, 100-entry limit), custom WebhookProvider interface","webhook,providers,playwright-utils,wiremock,mockserver,mockoon",extended,knowledge/webhook-providers.md
webhook-risk,Webhook Testing Risk Guidance,"When webhook tests are required, P2×I3 default risk score, complete test checklist, failure patterns and mitigations, TA assessment checklist","webhook,risk,assessment,event-driven,async,playwright-utils,governance",core,knowledge/webhook-risk-guidance.md
confidence-gate,Confidence Gate,"1-10 confidence scoring with stop-and-ask rule below threshold for selectors, endpoints, risk classification, fixtures, schemas, and data factories — prevents agent fabrication","reliability,agent-safety,generation,quality,governance",core,knowledge/confidence-gate.md
