2 changes: 1 addition & 1 deletion AGENTS.md
@@ -14,7 +14,7 @@ pip install -e . && agent-harness run scenarios/goal_hijack/basic.yaml --dry-run

```python
src/agent_harness/
- cli.py # Entry point. argparse-based. Subcommands: version, validate, run
+ cli.py # Entry point. argparse-based. Subcommands: version, validate, run, suite
scenario.py # Loads & validates YAML scenarios (Scenario dataclass)
trace.py # Trace dataclass (messages, tool_calls, events)
assertions.py # Evaluates assertions against traces. Each assertion = one function
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- **`suite` subcommand** — `agent-harness suite <paths...> --trace-dir <dir>`
runs a directory of scenarios against trace files (mapped by scenario id to
`<trace-dir>/<scenario_id>.json`) and emits one aggregate summary plus
optional per-scenario result JSON via `--out-dir`. Scenarios that cannot run
(missing trace, malformed trace, invalid scenario, duplicate id) are recorded
as per-scenario `error`s without aborting the suite, and `--exit-on-fail`
behaves the same way as it does for `run`. Output validates against the new
`schemas/suite_result.schema.json`. Single-scenario `run` is unchanged.
- **`--junit-out` flag** — write assertion results as JUnit XML for CI
systems while preserving the existing result JSON output.
- **MCP host CLI wiring** — add `agent-harness run --mcp-host-target ...`
@@ -26,6 +34,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
HTTP targets via `agent-harness run --live` (default 30).
- **`version` field on `schemas/scenario.schema.json` and `schemas/result.schema.json`** — the authoritative numeric state of each schema, per the versioning policy in `docs/schema-versioning.md`. Both schemas now carry `"version": 1`.

### Changed

- **Scenario `id` charset** — scenario ids are now constrained to
`[A-Za-z0-9._-]` (enforced by both the Python validator and
`schemas/scenario.schema.json`). Ids are used as filesystem path components
by the new `suite` runner, so this prevents an id from traversing paths
outside the configured trace or output directory. All bundled scenarios
already comply.

## [0.1.0] — 2026-05-17

First packaged release. Consolidates the v0.0.x development series into
70 changes: 70 additions & 0 deletions docs/ci-github-actions.md
@@ -98,6 +98,76 @@ correctly, so both gate steps treat it as a CI failure.
harness writes JSON → gate (flag or post-scan) decides exit code → job pass/fail
```

## Running a whole suite at once

`agent-harness suite` runs many scenarios against a directory of trace files in
one invocation and emits a single aggregate summary. It keeps single-scenario
`run` unchanged — use `suite` when you have a folder of scenarios to gate on.

```bash
agent-harness suite scenarios/ \
--trace-dir traces/ \
--out-dir results/ \
--exit-on-fail
```

### Directory conventions

- **Scenarios**: the positional arguments accept scenario files, directories
(searched recursively for `.yaml`/`.yml`), and glob patterns — the same
discovery rules as `agent-harness validate`.
- **Traces**: each scenario is mapped to a trace file by its **scenario id**:
`<trace-dir>/<scenario_id>.json`. For a scenario whose id is
`goal_hijack.basic_001`, the suite looks for
`<trace-dir>/goal_hijack.basic_001.json`. Mapping by id (rather than by file
path) keeps the mapping stable when scenario files move, and scenario ids are
constrained to a filename-safe charset (`[A-Za-z0-9._-]`) so a trace lookup
can never escape `--trace-dir`.

> Note: this id-based convention is specific to `suite`. The example traces
> under `examples/traces/` use descriptive names and are not laid out this way;
> to use them with `suite`, copy or rename each to `<scenario_id>.json`.
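
The id-to-trace mapping described above can be sketched in a few lines. This is an illustrative sketch, not the harness's actual implementation: `trace_path_for` and `SCENARIO_ID_RE` are hypothetical names, and the character class is copied from the documented id charset.

```python
import re
from pathlib import Path

# Assumption: same character class as the documented scenario id charset.
SCENARIO_ID_RE = re.compile(r"[A-Za-z0-9._-]+")


def trace_path_for(trace_dir: Path, scenario_id: str) -> Path:
    """Hypothetical helper: map a scenario id to <trace-dir>/<scenario_id>.json."""
    # fullmatch rejects empty ids, path separators, and trailing newlines.
    if not SCENARIO_ID_RE.fullmatch(scenario_id):
        raise ValueError(f"invalid scenario id: {scenario_id!r}")
    return trace_dir / f"{scenario_id}.json"
```

Because the charset excludes `/` and `\`, the id contributes exactly one path component, so the returned path is always a direct child of `trace_dir`.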

### Output

- `--out-dir` writes one `<scenario_id>.json` per scenario that ran (the same
shape as `agent-harness run`), plus an aggregate `summary.json`.
- The aggregate summary is always printed to stdout. It contains the overall
`result`, the `counts` object (`total` plus one count each for `pass`, `fail`,
`error`, and `not_run`), and one `scenarios` entry per scenario. Each entry
carries the scenario path and `result` and, where available, the scenario id,
category, severity, the trace path used, an `error_reason`, and the full
`detail` result. This makes the summary a self-contained audit record. It
validates against `schemas/suite_result.schema.json`.
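
As a rough illustration of how a consumer might sanity-check a summary, the sketch below re-tallies the per-scenario `result` values and compares them with `counts`. The function name `counts_are_consistent` is hypothetical, and the check that `total` equals the number of `scenarios` entries is an assumption drawn from the description above rather than something the schema enforces.

```python
from collections import Counter

STATUSES = ("pass", "fail", "error", "not_run")


def counts_are_consistent(summary: dict) -> bool:
    """Hypothetical check: do `counts` agree with the `scenarios` entries?"""
    entries = summary["scenarios"]
    counts = summary["counts"]
    # Assumption: `total` is the number of scenario entries in the summary.
    if counts["total"] != len(entries):
        return False
    tally = Counter(entry["result"] for entry in entries)
    return all(counts[status] == tally.get(status, 0) for status in STATUSES)
```

A summary written via `--out-dir` could be loaded with `json.loads(Path("results/summary.json").read_text(encoding="utf-8"))` and passed to this function.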

### Resilience and gating

The suite never lets one broken input hide the rest. A scenario that cannot run
is recorded as a per-scenario `error` (with an `error_reason`) and the suite
continues:

| `error_reason` | Cause |
|----------------|-------|
| `missing_trace` | No `<scenario_id>.json` under `--trace-dir` |
| `invalid_trace` | The trace file exists but is malformed JSON |
| `invalid_scenario` | The scenario YAML failed validation |
| `duplicate_scenario_id` | Two discovered scenarios share an id |

Exit behavior composes with CI the same way as `run`:

- Without `--exit-on-fail`, `suite` exits 0 once the scenarios have run,
whatever their results, and the summary JSON is the source of truth.
- With `--exit-on-fail`, `suite` exits 1 if **any** scenario is `fail` or
`error` — so a missing trace mapping or an unparseable scenario fails the
build rather than silently reducing coverage.
- If the scenario arguments match nothing, or `--trace-dir` does not exist,
`suite` exits 1 immediately. An empty match is treated as an error, not a
vacuous pass.

A suite where every scenario comes back `not_run` (for example, only
recognized-but-unimplemented assertions) aggregates to `not_run` and does **not**
fail under `--exit-on-fail`. Watch the `not_run` count in the summary so that a
green build does not hide a suite that tested nothing.
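
One way to act on the `not_run` count is a small post-scan gate over the summary. The sketch below is a hypothetical helper (`gate_exit_code` and its `fail_if_nothing_ran` parameter are not part of the harness) that relies only on the documented `counts` fields; the stricter all-`not_run` rule is an optional policy layered on top of what `--exit-on-fail` does.

```python
def gate_exit_code(summary: dict, fail_if_nothing_ran: bool = True) -> int:
    """Hypothetical post-scan gate: return a process exit code for CI."""
    counts = summary["counts"]
    # Same rule as --exit-on-fail: any fail or error fails the build.
    if counts["fail"] or counts["error"]:
        return 1
    # Optional stricter policy: a suite in which nothing ran is not a pass.
    if fail_if_nothing_ran and counts["total"] > 0 and counts["not_run"] == counts["total"]:
        return 1
    return 0
```

A CI step could load `summary.json` and call `sys.exit(gate_exit_code(summary))` after running `suite` without `--exit-on-fail`.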

## A note on `not_run`

Some assertions are recognized by the harness but not fully implemented yet.
3 changes: 2 additions & 1 deletion schemas/scenario.schema.json
@@ -18,7 +18,8 @@
"properties": {
"id": {
"type": "string",
- "minLength": 1
+ "minLength": 1,
+ "pattern": "^[A-Za-z0-9._-]+$"
},
"title": {
"type": "string",
111 changes: 111 additions & 0 deletions schemas/suite_result.schema.json
@@ -0,0 +1,111 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://owasp.org/schemas/agent-security-regression-harness/suite_result.schema.json",
"title": "OWASP Agent Security Regression Harness Suite Result",
"version": 1,
"type": "object",
"required": [
"result",
"counts",
"scenarios"
],
"additionalProperties": false,
"properties": {
"result": {
"type": "string",
"enum": [
"pass",
"fail",
"error",
"not_run"
]
},
"counts": {
"type": "object",
"required": [
"total",
"pass",
"fail",
"error",
"not_run"
],
"additionalProperties": false,
"properties": {
"total": {
"type": "integer",
"minimum": 0
},
"pass": {
"type": "integer",
"minimum": 0
},
"fail": {
"type": "integer",
"minimum": 0
},
"error": {
"type": "integer",
"minimum": 0
},
"not_run": {
"type": "integer",
"minimum": 0
}
}
},
"scenarios": {
"type": "array",
"items": {
"type": "object",
"required": [
"scenario_path",
"result"
],
"additionalProperties": false,
"properties": {
"scenario_path": {
"type": "string",
"minLength": 1
},
"scenario_id": {
"type": "string",
"minLength": 1
},
"category": {
"type": "string"
},
"severity": {
"type": "string"
},
"trace_path": {
"type": "string"
},
"result": {
"type": "string",
"enum": [
"pass",
"fail",
"error",
"not_run"
]
},
"error_reason": {
"type": "string",
"enum": [
"missing_trace",
"invalid_scenario",
"invalid_trace",
"duplicate_scenario_id"
]
},
"evidence": {
"type": "string"
},
"detail": {
"type": "object"
}
}
}
}
}
}
82 changes: 82 additions & 0 deletions src/agent_harness/cli.py
@@ -19,6 +19,7 @@
    run_scenario_with_openai_agent,
    run_scenario_with_python_target,
    run_scenario_with_trace,
    run_suite,
)
from agent_harness.scenario import ScenarioValidationError, load_scenario
from agent_harness.trace import TraceValidationError, load_trace
@@ -103,6 +104,40 @@ def build_parser() -> argparse.ArgumentParser:
        help="Scenario YAML file, directory, or glob pattern to validate.",
    )

    suite_parser = subparsers.add_parser(
        "suite",
        help="Run a directory of scenarios against trace files and aggregate results.",
    )
    suite_parser.add_argument(
        "scenario_paths",
        nargs="+",
        help="Scenario YAML files, directories, or glob patterns to run.",
    )
    suite_parser.add_argument(
        "--trace-dir",
        required=True,
        help=(
            "Directory of trace JSON files. Each scenario is matched to "
            "'<trace-dir>/<scenario_id>.json'."
        ),
    )
    suite_parser.add_argument(
        "--out-dir",
        help=(
            "Optional directory to write per-scenario result JSON "
            "('<scenario_id>.json') plus an aggregate 'summary.json'."
        ),
    )
    suite_parser.add_argument(
        "--exit-on-fail",
        action="store_true",
        help=(
            "Exit with code 1 if any scenario's result is 'fail' or 'error' "
            "(including missing trace mappings). Without this flag, 'suite' "
            "exits 0 and the aggregate summary JSON is the source of truth."
        ),
    )

    run_parser = subparsers.add_parser(
        "run",
        help="Run a scenario file.",
@@ -248,6 +283,53 @@ def main() -> int:
        print(f"summary: {valid_count} valid, {invalid_count} invalid")
        return 1 if invalid_count else 0

    if args.command == "suite":
        scenario_files = _discover_scenario_files(args.scenario_paths)
        if not scenario_files:
            print("invalid: no scenario files matched", file=sys.stderr)
            return 1

        trace_dir = Path(args.trace_dir)
        if not trace_dir.is_dir():
            print(
                f"invalid: trace directory does not exist: {trace_dir}",
                file=sys.stderr,
            )
            return 1

        suite_result = run_suite(scenario_files, trace_dir)

        if args.out_dir:
            out_dir = Path(args.out_dir)
            out_dir.mkdir(parents=True, exist_ok=True)
            for entry in suite_result.entries:
                if entry.scenario_id is None or entry.detail is None:
                    continue
                result_path = out_dir / f"{entry.scenario_id}.json"
                result_path.write_text(
                    entry.detail.to_json() + "\n", encoding="utf-8"
                )
            (out_dir / "summary.json").write_text(
                suite_result.to_json() + "\n", encoding="utf-8"
            )

        print(suite_result.to_json())

        counts = suite_result.counts
        print(
            "summary: "
            f"{counts['total']} scenarios, "
            f"{counts['pass']} pass, "
            f"{counts['fail']} fail, "
            f"{counts['error']} error, "
            f"{counts['not_run']} not_run",
            file=sys.stderr,
        )

        if args.exit_on_fail and suite_result.result in {"fail", "error"}:
            return 1

        return 0

    if args.command == "run":
        selected_modes = [