| 1 | +# CI with GitHub Actions |
| 2 | + |
| 3 | +When copied into `.github/workflows/security-regression.yml`, this workflow runs |
| 4 | +on every push to `main` and on every pull request targeting |
| 5 | +`main`. It installs the harness, validates scenario files, runs assertions |
| 6 | +against pre-recorded traces, and explicitly checks the result JSON to fail |
| 7 | +the job if any regression was detected. |
| 8 | + |
| 9 | +## Where the example workflow lives |
| 10 | + |
| 11 | +The example workflow is at: |
| 12 | + |
| 13 | +``` |
| 14 | +docs/examples/github-actions/security-regression.yml |
| 15 | +``` |
| 16 | + |
| 17 | +## How pass and fail actually work |
| 18 | + |
| 19 | +`agent-harness run` writes machine-readable result JSON to the path you give |
| 20 | +`--out`. The `result` field in that JSON will be `pass`, `fail`, `not_run`, |
| 21 | +or `error`. |
| 22 | + |
| 23 | +The workflow turns that field into a job status with an explicit |
| 24 | +result-checking step after all the `agent-harness run` steps. It reads every |
| 25 | +JSON file in `results/`, looks for `"result": "fail"` or `"result": "error"`, |
| 26 | +and calls `sys.exit(1)` if any are found. That is what actually fails the job. |
| 27 | + |
| 28 | +A result of `"error"` means the harness did not complete the regression check |
| 29 | +correctly, so this example treats it as a CI failure. |
| 30 | + |
| 31 | +``` |
| 32 | +harness writes JSON → result-checking step reads JSON → step exits 1 → job fails |
| 33 | +``` |
| 34 | + |
| 35 | +## A note on `not_run` |
| 36 | + |
| 37 | +Some assertions are recognized by the harness but not fully implemented yet. |
| 38 | +`no_secret_disclosure` is one example. When an assertion has no implementation, |
| 39 | +it comes back as `not_run` rather than `pass` or `fail`. |
| 40 | + |
| 41 | +The basic goal-hijack scenario includes `no_secret_disclosure`, so you will |
| 42 | +see `not_run` in that result. This is expected. The result-checking step treats |
| 43 | +`"result": "fail"` and `"result": "error"` as CI failures, but allows |
| 44 | +`"not_run"` so recognized-but-unimplemented assertions do not break the build. |
| 45 | +The README documents which assertions are currently implemented. |
| 46 | + |
| 47 | +## Run mode |
| 48 | + |
| 49 | +This workflow uses `--trace-file` mode. It evaluates assertions against a |
| 50 | +pre-recorded JSON trace without starting a live agent. |
| 51 | + |
| 52 | +That makes it a good fit for CI: no server to start, no API keys required, |
| 53 | +and the same input always produces the same result. |
| 54 | + |
| 55 | +To test against a live agent instead, see `--live` mode in the README. You |
| 56 | +would need an HTTP agent server running before the harness step runs. |
| 57 | + |
| 58 | +## What the workflow does |
| 59 | + |
| 60 | +1. Check out the repository |
| 61 | +2. Set up Python 3.11 |
| 62 | +3. Install the harness with `python -m pip install -e .` |
| 63 | +4. Create the `results/` directory |
| 64 | +5. Validate each scenario file with `agent-harness validate` |
| 65 | +6. Run each scenario with `agent-harness run --trace-file ... --out results/....json` |
| 66 | +7. Read every file in `results/` and exit 1 if any has `"result": "fail"` or `"result": "error"` |
| 67 | +8. Upload result JSON files as artifacts (runs even when step 7 fails, because of `if: always()`) |
| 68 | + |
| 69 | +## Viewing results |
| 70 | + |
| 71 | +Result JSON files are uploaded as a workflow artifact named |
| 72 | +`regression-results` after every run. The artifact upload step has |
| 73 | +`if: always()`, so you get the files whether the job passed or failed. |
| 74 | + |
| 75 | +To find them: |
| 76 | + |
| 77 | +1. Go to the Actions tab in your repository |
| 78 | +2. Click the workflow run |
| 79 | +3. Scroll to Artifacts |
| 80 | +4. Download `regression-results` |
| 81 | + |
| 82 | +Each file contains the scenario ID, result status, which assertions ran, and |
| 83 | +the evidence for any failure. |
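
After downloading and unpacking the artifact, a short script can give a quick overview of the statuses. The following is a minimal sketch: `summarize_results` is a hypothetical helper, the `"missing"` label is an illustrative placeholder, and the only field it relies on is the top-level `result` field.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(results_dir):
    """Count result files by the value of their top-level `result` field."""
    counts = Counter()
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Files without a "result" field are counted under a placeholder label.
        counts[data.get("result", "missing")] += 1
    return counts
```

For example, `print(dict(summarize_results("regression-results")))` prints one count per status, assuming the artifact was unpacked into a directory named `regression-results`.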
| 84 | + |
| 85 | +## Adapting this for your own project |
| 86 | + |
| 87 | +1. Copy `docs/examples/github-actions/security-regression.yml` into |
| 88 | + `.github/workflows/security-regression.yml` in your repository |
| 89 | +2. Put your scenario files in a `scenarios/` directory |
| 90 | +3. Put your trace files in `examples/traces/` |
| 91 | +4. Update the `agent-harness validate` and `agent-harness run` commands to |
| 92 | + point to your files |
| 93 | +5. Add one `agent-harness run` step per scenario |
| 94 | + |
| 95 | +The result-checking step at the end works across however many scenarios you |
| 96 | +add. It globs `results/*.json`, so you do not need to update it when you add |
| 97 | +new scenarios. |
| 98 | + |
| 99 | +## Adding a new scenario |
| 100 | + |
| 101 | +When you write a new scenario, add two things to the workflow: |
| 102 | + |
| 103 | +```yaml |
| 104 | +- name: Validate my new scenario |
| 105 | +  run: agent-harness validate scenarios/my_category/my_scenario.yaml |
| 106 | + |
| 107 | +- name: Run my new scenario |
| 108 | +  run: | |
| 109 | +    agent-harness run scenarios/my_category/my_scenario.yaml \ |
| 110 | +      --trace-file examples/traces/my_trace.json \ |
| 111 | +      --out results/my_scenario.json |
| 112 | +``` |
| 113 | + |
| 114 | +The result-checking step picks up the new output file automatically. |
| 115 | + |
| 116 | +## Related |
| 117 | + |
| 118 | +- [Trace format](trace-format.md) |
| 119 | +- [README](../README.md) |