title	Specification
description	Agent QC v0.5.0 portable draft specification.

Specification

Agent QC v0.5.0 is a portable draft standard for evidence-driven quality control of Agent projects.

An Agent project can be a runtime CLI, SDK, tool server, MCP/ACP gateway, multi-channel bot, GUI/TUI/desktop client, skills or plugin ecosystem, background scheduler, distribution package, or evaluation suite. Agent QC does not assume one product shape. It starts by classifying the project profile and then selects gates that match its risk.

Scope

Agent QC standardizes:

Project profiles for Agent systems.
Test plan, case, gate, run, evidence, verdict, and report objects.
Gate taxonomy from static checks to live provider and release smoke.
Evidence-backed pass/fail semantics.
qcloop-compatible batch QC for repeated independent cases.
Benchmark and hill-climbing evidence for runtime, prompt, tool, and context improvement.
Case-study mapping for representative runtime, TUI, gateway, scheduler, UI, skills, release, and eval projects.

Agent QC does not standardize any single programming language, CI vendor, test framework, browser driver, model protocol, storage backend, or UI skin.

Document set

The latest standard is split by use:

Page	Purpose
Quickstart	fastest path to a QC plan
Best practices	authoring rules and anti-patterns
Test techniques and compositions	snapshots, smoke tests, black-box, white-box, runtime/UI/skills testing, and advanced evidence braids
Benchmark and hill climbing	frozen tasks, trials, rewards, and trajectories for proving Lime improves
Project classification	profile taxonomy and mixed-profile rules
Gate matrix	profile/surface/risk to gate mapping
Interaction surface testing	CLI/TUI/WebUI/desktop/browser/channel/eval UI evidence
Evidence contract	portable evidence, verdict, waiver fields
Performance and reliability metrics	timing, flake, cleanup, scheduler, release metrics
Flow and taxonomy	complete lifecycle and taxonomy reference
Star project testing systems	representative Agent project testing-system case studies

Project profiles

A qc_plan.project_profiles array declares which project shapes apply.

Profile	Typical risks	Example gates
`agent-runtime-cli`	tool execution, sandboxing, permission, streams, resume, subprocess cleanup	unit, protocol, fake model server, CLI e2e, sandbox tests
`agent-sdk-api`	public API compatibility, generated contracts, fake server behavior, async cancellation	signature tests, generated contract diff, fake server integration
`agent-tool-mcp-gateway`	tool declaration drift, stdio/http transport, recovery, resource access, audit refs	protocol conformance, mock server, transport recovery, contract tests
`multi-channel-agent-gateway`	channel adapters, auth, secrets, webhook verification, provider drift, media routing	channel contract tests, secret isolation, live opt-in, docker smoke
`agent-ui-tui-desktop`	rendering, terminal/browser state, user controls, screenshots, accessibility, bridge readiness	UI unit, snapshot, Playwright, terminal fixtures, GUI smoke
`agent-skills-plugins`	manifest shape, loader, package boundary, trust, marketplace or registry drift	schema, discovery, package export, fixture install, security scan
`background-agent-scheduler`	cron, queues, leases, retries, concurrency, idempotency, stuck-loop recovery	deterministic scheduler tests, race tests, stress tests, checkpoint/reclaim
`agent-distribution-release`	install, package contents, Docker, cross-platform, lockfiles, supply-chain	install smoke, package dry run, Docker smoke, OS matrix, lock checks
`agent-evals-quality`	model behavior regressions, prompt drift, rubric quality, answer grounding	eval suite, baseline comparison, LLM/human judge, qcloop batch

A project MAY combine profiles. For example OpenClaw combines channel gateway, tool gateway, distribution, live provider, and plugin profiles.

Interaction surfaces

A project profile says what the project owns. An interaction surface says where users or operators observe the Agent. qc_case.surface is optional in the JSON schema but SHOULD be present for user-visible gates.

Surface	Applies to	Extra evidence required
`cli-stream`	command output, JSONL/NDJSON, stdout/stderr	exit status, stdout/stderr transcript, structured event sample
`tui`	terminal UI, Ink, ratatui, curses	terminal snapshot, viewport size, key sequence, runtime transcript
`webui`	browser dashboard, extension UI, admin/QA console	screenshot/trace, console log, route state, browser-only assertion
`desktop-gui`	Tauri, Electron, native shell	shell start evidence, bridge health, workspace/session readiness, OS note
`browser-automation`	CDP, Playwright, browser-use, remote browser providers	screenshot, DOM/a11y snapshot, console/network log, cleanup evidence
`channel-ui`	mobile, QR, chat apps, webhook surfaces	channel transcript, media fixture, auth/webhook replay, device/emulator log
`eval-ui`	QA dashboards and semantic evaluation reports	rubric, judge output, baseline delta, reviewer note

A ui-interaction gate SHOULD name one of these surfaces. A pass without surface-specific evidence is incomplete. Surface proof SHOULD connect entrypoint, user action, visible frame, runtime backing, and cleanup evidence.

Core objects

Object	Purpose
`qc_plan`	A test plan for one change, release, investigation, or regression sweep.
`qc_case`	One behavior-level item with steps, expected result, required gates, and evidence.
`qc_gate`	A validation boundary such as static, unit, contract, integration, e2e, live, stress, release, or review.
`qc_run`	One execution attempt with command, executor, environment, output refs, duration, and result.
`qc_evidence`	A reference to logs, reports, traces, screenshots, fixtures, qcloop attempts, CI runs, or review notes.
`qc_verdict`	A judgment over evidence: passed, failed, blocked, exhausted, waived, or needs-review.
`qc_report`	The aggregate result, remaining risk, waivers, and next action.

Gate families

Gate families define the quality boundary. They are implemented with concrete techniques such as static checks, white-box unit tests, property/fuzz tests, golden transcripts, snapshots, contract tests, fake integrations, black-box smoke, runtime E2E, surface E2E, replay/regression, stress/chaos, security/adversarial tests, semantic evals, benchmark evals, and release install smoke.

Family	Purpose	Evidence examples
`static`	format, lint, type, dependency and policy hygiene	command logs, SARIF, lockfile check output
`unit`	deterministic local behavior	test report, coverage, fixture output
`property-fuzz`	invariants and generated input	seed, corpus, failing case artifact
`contract-protocol`	schemas, APIs, generated clients, command/tool surfaces	contract report, schema diff, mock transcript
`fake-integration`	integration against fake servers or local adapters	fake server log, request/response transcript
`runtime-e2e`	real CLI/runtime/task flow without external provider risk	CLI transcript, process cleanup evidence, state snapshot
`ui-interaction`	GUI/TUI/browser/terminal behavior	screenshot, trace, video, accessibility report
`live-provider`	opt-in real provider or network path	redacted transcript, credentials-scope note, cost/budget
`stress-concurrency`	races, leases, retries, long-running loops	stress report, worker timeline, seed, benchmark
`distribution-release`	install, package, Docker, cross-platform release readiness	tarball manifest, Docker smoke, OS matrix, release check
`semantic-eval`	model output quality, grounding, policy, user intent	eval result, rubric, judge output, baseline delta
`benchmark-eval`	runtime/prompt/tool/context changes outperforming a baseline	dataset/task version, trial trajectory, reward details, pass@k or delta
`review`	human or LLM review	reviewer decision, rubric, evidence refs

Status values

qc_case.status, qc_gate.status, and qc_report.status use:

planned
running
passed
failed
blocked
exhausted
waived
skipped
needs-review

A waived gate MUST include waiver.reason, waiver.approver, and waiver.expires when the project has a waiver process.

Evidence rules

A passed verdict MUST include evidence. A failed verdict MUST include the smallest actionable failure. A blocked verdict MUST identify the missing environment fact. An exhausted verdict MUST preserve attempts and verifier feedback.

Self-report is not evidence. The sentence "the agent checked it" is only valid when it links to command output, test report, transcript, trace, screenshot, or review record.

qcloop mapping

A qc_case can become a qcloop item_value. qcloop attempt maps to qc_run; qcloop qc_round maps to qc_verdict; qcloop exhausted maps to Agent QC exhausted, not generic failure.

Use qcloop when cases are repeated, independent, and verifier-friendly. Do not use qcloop to replace required project gates or to hide live-provider risk.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Specification

Scope

Document set

Project profiles

Interaction surfaces

Core objects

Gate families

Status values

Evidence rules

qcloop mapping

Uh oh!

FilesExpand file tree

specification.md

Latest commit

History

specification.md

File metadata and controls

Specification

Scope

Document set

Project profiles

Interaction surfaces

Core objects

Gate families

Status values

Evidence rules

qcloop mapping