Guidance for AI coding agents (Copilot, Claude, Codex, etc.) working in this
repository. Humans should read README.md first; this file documents the
conventions and operational details that an agent needs in order to make safe,
useful changes without breaking the end-to-end test harness or the live
GitHub repository it talks to.
gh-aw-test is the live integration test bed for
github/gh-aw — the GitHub CLI extension
that compiles natural-language "agentic workflows" (.md files with YAML
frontmatter) into real GitHub Actions workflows (.lock.yml).
This repo is not a library and produces no artifacts. Its only purpose is
to exercise gh-aw against a real GitHub repository, with real AI engines
(Copilot, Claude, Codex), real issues, real pull requests, real discussions,
real branches, and real code scanning alerts. Test runs mutate the live
githubnext/gh-aw-test repository on GitHub.
Treat every change as if it will ship to production tonight: the nightly
matrix in .github/workflows/e2e.yml will exercise it against three different
gh-aw refs (main, latest pre-release, latest stable) at 03:00 UTC.
e2e.sh # The test runner. ~3.6k lines of bash.
clean.sh # Closes/deletes stray issues/PRs/branches.
README.md # Human-facing usage + coverage matrix.
AGENTS.md # This file.
fails.txt # Persistent list of currently-failing tests.
.github/workflows/
e2e.yml # Nightly matrix runner (CI mode).
cleaner.yml # Periodic cleanup of stray resources.
permissions.yml # Permissions audit workflow.
copilot-setup-steps.yml # Bootstraps the Copilot engine in CI.
agentics-maintenance.yml # gh-aw self-maintenance workflow.
mcp-lockdown-mode-proof.yml # MCP lockdown demonstration.
test-<engine>-<feature>.md # 88 agentic workflow source files.
test-<engine>-<feature>.lock.yml # Generated lockfiles — DO NOT hand-edit.
shared/mcp/ # Reusable MCP server definitions.
trials/ # Scratch space, ignored by tests.
e2e-test-*.log # Per-run logs (gitignored).
*.md status files # CONSOLIDATION_COMPLETE.md, etc. — historical
# design notes, safe to ignore unless asked.
e2e.sh discovers tests by globbing .github/workflows/test-*.md and parses
the filename to decide engine, variant, and expected behaviour. The schema is:
test-<engine>[-<variant>]-<feature>.md
- <engine> is one of claude, codex, copilot. This is parsed by extract_ai_type in e2e.sh and drives which labels, title prefix, and expected outputs the harness asserts.
- <variant> is optional and is one of nosandbox, siderepo. These variants run the same feature under different sandboxing/network configurations.
- <feature> matches a gh-aw safe-output name (create-issue, add-comment, push-to-pull-request-branch, ...) or a higher-level capability (mcp, gh-steps, command, custom-safe-outputs).
If you add a workflow that does not match this pattern, the runner will silently skip it. If you rename an engine substring, you will break the label-matching and pass/fail detection.
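
The naming schema above can be illustrated with a small parser. This is a minimal sketch, not the real extract_ai_type from e2e.sh: the function name parse_test_name and its output format are hypothetical, and it assumes a variant, when present, always sits directly after the engine.

```shell
# Hypothetical helper, NOT the real extract_ai_type from e2e.sh.
# Splits test-<engine>[-<variant>]-<feature>[.md] into "engine variant feature".
parse_test_name() {
  local name engine variant="" feature
  name="$(basename "$1" .md)"
  # Reject anything that does not start with test-<known engine>-
  [[ "$name" =~ ^test-(claude|codex|copilot)-(.+)$ ]] || return 1
  engine="${BASH_REMATCH[1]}"
  feature="${BASH_REMATCH[2]}"
  # Peel off an optional variant prefix (assumed to precede the feature).
  case "$feature" in
    nosandbox-*|siderepo-*)
      variant="${feature%%-*}"
      feature="${feature#*-}"
      ;;
  esac
  printf '%s %s %s\n' "$engine" "${variant:-none}" "$feature"
}

parse_test_name .github/workflows/test-copilot-create-issue.md
parse_test_name test-claude-nosandbox-add-comment.md
parse_test_name test-gemini-create-issue.md || echo "skipped: unknown engine"
```

A name with an unknown engine fails the match, which mirrors the "silently skipped" behaviour described above.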
Every test has two files:
- test-foo.md: the gh-aw source. Edit this.
- test-foo.lock.yml: the compiled Actions workflow. Generated.
To regenerate after editing a .md:
gh aw compile .github/workflows/test-foo.md

To recompile the entire suite (the standard pre-PR step):

gh aw compile .github/workflows/

Never hand-edit a .lock.yml. The nightly matrix recompiles them against
multiple gh-aw refs and will overwrite any local edits. If a lockfile looks
wrong, the bug is either in the .md source or in gh-aw itself.
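
Before recompiling, it can help to see which sources look out of date. The sketch below is a hypothetical helper (stale_lockfiles is not part of e2e.sh or gh-aw) that lists sources whose lockfile is missing or older than the source. It compares modification times, which is only a heuristic: a fresh clone or checkout can reset timestamps, so treat the output as a hint and recompile when in doubt.

```shell
# Hypothetical check: print sources whose lockfile is missing or older
# than the .md source. Uses mtimes only, so it is a heuristic.
stale_lockfiles() {
  local dir="${1:-.github/workflows}" src lock
  for src in "$dir"/test-*.md; do
    [[ -e "$src" ]] || continue          # glob matched nothing
    lock="${src%.md}.lock.yml"
    if [[ ! -e "$lock" || "$src" -nt "$lock" ]]; then
      basename "$src"
    fi
  done
}

stale_lockfiles .github/workflows
```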
Most workflows declare a samples: block under safe-outputs: (see
test-copilot-create-issue.md for a canonical example). When e2e.sh is run
with --use-samples, gh-aw uses the declared sample instead of calling the
AI engine, which is faster, free, and reproducible.
Inventory of which workflows do/do not have samples:
ls .github/workflows/test-*.md | wc -l # total
grep -l "samples:" .github/workflows/test-*.md | wc -l # with samples
for f in .github/workflows/test-*.md; do
grep -q "samples:" "$f" || basename "$f"
done # without samples

The only workflows intentionally without samples: are
test-copilot-custom-safe-outputs and test-copilot-dispatch-workflow,
because their purpose is to exercise the live engine path.
Prerequisites: gh CLI authenticated, push access to githubnext/gh-aw-test,
and a local clone of github/gh-aw if you plan to use --gh-aw-ref.
./e2e.sh # Everything.
./e2e.sh --dry-run # See what would run.
./e2e.sh --workflow-dispatch-only # Skip issue/PR/command-triggered tests.
./e2e.sh --use-samples # Deterministic; no engine calls.
./e2e.sh --batch-size 5 # Default is 10 parallel.
./e2e.sh --no-parallel # Serial; easier to read logs.
./e2e.sh test-copilot-create-issue # Single test.
./e2e.sh 'test-copilot-*' # Glob patterns.
./e2e.sh rerun # Re-run everything in fails.txt.
./e2e.sh report # File GitHub issues for fails.txt.
./e2e.sh --gh-aw-ref main # Build ../gh-aw at <ref> and use it.

The runner writes a timestamped e2e-test-YYYYMMDD-HHMMSS.log and updates
fails.txt in place. Both are gitignored as *.log / fails.txt.
fails.txt is a plain-text list of currently-failing test names, one per
line, optionally followed by space-separated GitHub Actions run IDs. The
runner mutates it as it goes:
- record_test_pass removes a test from fails.txt.
- record_test_fail appends the test name and the failing run ID.
When you fix a test, do not edit fails.txt by hand — run the test, let the
runner remove it. When triaging, use ./e2e.sh rerun to re-run only what is
in fails.txt, and ./e2e.sh report to open issues for each entry.
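
To make the file format concrete, here is a minimal sketch of the two operations. It is an illustration only, not the real record_test_pass / record_test_fail from e2e.sh: the function names, the FAILS_FILE variable, and the choice to append new run IDs to an existing line are all assumptions.

```shell
FAILS_FILE="${FAILS_FILE:-fails.txt}"   # assumed variable name

# Sketch: add "<test> <run-id>", or append the run ID if the test is listed.
sketch_record_fail() {
  local test="$1" run_id="${2:-}" tmp
  touch "$FAILS_FILE"
  tmp="$(mktemp)"
  awk -v t="$test" -v id="$run_id" '
    $1 == t { found = 1; if (id != "") $0 = $0 " " id }
    { print }
    END {
      if (!found) {
        if (id != "") print t " " id; else print t
      }
    }
  ' "$FAILS_FILE" > "$tmp" && mv "$tmp" "$FAILS_FILE"
}

# Sketch: drop the line whose first field is the test name.
sketch_record_pass() {
  local test="$1" tmp
  [[ -f "$FAILS_FILE" ]] || return 0
  tmp="$(mktemp)"
  awk -v t="$test" '$1 != t' "$FAILS_FILE" > "$tmp" && mv "$tmp" "$FAILS_FILE"
}
```

Matching on the first field rather than a substring keeps test-copilot-create-issue from also matching a longer name that begins with the same text.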
.github/workflows/e2e.yml runs ./e2e.sh --gh-aw-ref <ref> --workflow-dispatch-only for main, the latest pre-release, and the latest
stable release of gh-aw, every night at 03:00 UTC. Because GitHub Actions
exports CI=true, the runner enters CI mode:
- It does not commit/push recompiled lockfiles.
- It does not mutate the repository's TEMP_USER_PAT secret.
- It does not run issue/comment/PR-triggered tests (only workflow_dispatch).
The practical effect: the nightly matrix validates that gh aw compile
succeeds against every .md for every gh-aw ref, and that the dispatch
tests currently on main still pass. Cross-trigger behaviour is only
exercised locally or via manual workflow_dispatch of e2e.yml.
If you change e2e.sh, sanity-check the CI-mode branches by running
CI=true ./e2e.sh --dry-run before pushing.
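
The CI-mode guard can be pictured as a single predicate that each side-effecting step consults. This is a sketch under assumptions: is_ci and maybe_push_lockfiles are hypothetical names, and the echo lines stand in for whatever e2e.sh actually does.

```shell
# Hypothetical predicate: GitHub Actions sets CI=true.
is_ci() { [[ "${CI:-}" == "true" ]]; }

# Hypothetical step showing the guard; the echo lines stand in for real work.
maybe_push_lockfiles() {
  if is_ci; then
    echo "CI mode: skipping lockfile commit/push"
    return 0
  fi
  echo "would commit and push recompiled lockfiles"
}

CI=true maybe_push_lockfiles
CI=false maybe_push_lockfiles
```

Using "${CI:-}" keeps the check safe under set -u when the variable is unset on a developer machine.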
The runner uses bash arrays plus disk locks for parallel execution:
- BATCH_SIZE=10 controls fan-out; tune with --batch-size N.
- RESULTS_LOCK=/tmp/e2e-results-$$.lock serialises updates to PASSED_TESTS/FAILED_TESTS.
- GLOBAL_WORKFLOWS_LOCK=/tmp/e2e-workflows-$$.lock serialises the enable/disable list used by the exit trap.
- RESULTS_FILE=/tmp/e2e-results-$$.txt aggregates child-process results.
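
As an illustration of why the lock matters, the sketch below serialises appends from parallel children to a shared results file. The locking mechanism here (an atomic mkdir on the lock path) is an assumption chosen for portability; this document does not say which mechanism e2e.sh uses, and with_results_lock and append_result are hypothetical names.

```shell
RESULTS_LOCK="${RESULTS_LOCK:-/tmp/e2e-results-$$.lock}"
RESULTS_FILE="${RESULTS_FILE:-/tmp/e2e-results-$$.txt}"

# Hypothetical wrapper: mkdir is atomic, so only one child holds the lock.
with_results_lock() {
  until mkdir "$RESULTS_LOCK" 2>/dev/null; do sleep 0.05; done
  "$@"
  local rc=$?
  rmdir "$RESULTS_LOCK"
  return "$rc"
}

# Hypothetical result writer: one "<test> <status>" line per call.
append_result() { printf '%s %s\n' "$1" "$2" >> "$RESULTS_FILE"; }

# Five parallel children each record one result under the lock.
for i in 1 2 3 4 5; do
  with_results_lock append_result "test-$i" PASS &
done
wait
```

A child killed while holding the lock would leave the lock directory behind, so a real implementation needs a cleanup path for stale locks.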
The exit trap in cleanup_on_exit disables every workflow recorded in
GLOBAL_WORKFLOWS_TO_DISABLE, even on Ctrl-C. If you add a new code path
that enables a workflow, also push its name into that array, or it will be
left enabled and consume scheduled runs.
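
The enable-then-queue rule can be sketched as follows. Only the array name GLOBAL_WORKFLOWS_TO_DISABLE and the idea of an exit trap come from this document; enable_workflow_tracked and sketch_cleanup_on_exit are hypothetical, and the echo lines stand in for the real enable/disable calls.

```shell
GLOBAL_WORKFLOWS_TO_DISABLE=()

# Hypothetical helper: queue the name BEFORE enabling, so the trap still
# sees it if the script is interrupted halfway through.
enable_workflow_tracked() {
  local wf="$1"
  GLOBAL_WORKFLOWS_TO_DISABLE+=("$wf")
  echo "enable $wf"      # stand-in for the real enable call
}

# Hypothetical trap body: disable everything that was queued.
sketch_cleanup_on_exit() {
  local wf
  for wf in "${GLOBAL_WORKFLOWS_TO_DISABLE[@]}"; do
    echo "disable $wf"   # stand-in for the real disable call
  done
}
trap sketch_cleanup_on_exit EXIT

enable_workflow_tracked test-copilot-create-issue
```

Routing every enable through one helper makes it harder to add a code path that enables a workflow and forgets to queue it.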
See PARALLEL_FIXES_IMPLEMENTED.md and PARALLEL_TEST_CONFLICT_ASSESSMENT.md
for the prior incidents this design avoids — read them before increasing
BATCH_SIZE past 10 or removing locks.
- set -uo pipefail is intentional. Do not add -e: individual test failures must not abort the suite.
- Every user-visible message goes through log/info/success/warning/error/progress. They tee to both stdout and LOG_FILE.
- Wrap commands that may fail (network, gh API) in safe_run.
- Functions live in e2e.sh; do not split into sourced files without updating the trap and lock paths.
- bash -n e2e.sh must pass before any commit that touches it.
- Quote every variable ("$var"), prefer [[ ]] over [ ], prefer $(...) over backticks.
clean.sh closes open issues, closes open PRs, closes discussions, and
deletes test branches. It supports --dry-run. Run it any time the live
repo accumulates noise, and always run ./clean.sh --dry-run first if
you have changed it — a bug here can mass-close real work.
Cleanup is best-effort. If the GraphQL discussions API rejects a node ID, the script logs and continues; it does not fail the run.
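
The dry-run and best-effort behaviour described above can be sketched as two small helpers. This is not the code in clean.sh: run_or_print, cleanup_each, and the DRY_RUN variable are hypothetical names, and the command passed in is a stand-in for the real close/delete calls.

```shell
set -uo pipefail   # no -e, so one failed item does not stop the loop

DRY_RUN="${DRY_RUN:-false}"   # assumed variable name

# Hypothetical wrapper: print the command in dry-run mode, run it otherwise.
run_or_print() {
  if [[ "$DRY_RUN" == "true" ]]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Best-effort loop: log a failure and continue with the next item.
# $1 is a stand-in command; the remaining arguments are the items to clean up.
cleanup_each() {
  local cmd="$1" item failed=0
  shift
  for item in "$@"; do
    if ! run_or_print "$cmd" "$item"; then
      echo "warning: cleanup failed for $item, continuing" >&2
      failed=$((failed + 1))
    fi
  done
  echo "done, $failed failure(s)"
}

DRY_RUN=true cleanup_each rm /tmp/example-stray-file
```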
- GH_AW_TEST_PAT: repository secret; PAT used by CI for all operations except Copilot-engine requests.
- TEMP_USER_PAT: set/unset by e2e.sh during local runs only, never in CI (guarded by CI=true). Used to exercise cross-repo flows (siderepo variants).
- Engine credentials (Copilot, Claude, Codex) are configured at the GitHub app / repository level; agents should never write engine keys to disk.
Do not echo tokens. Do not add set -x to e2e.sh without first scrubbing
secret-handling sections.
- Pick a name: test-<engine>-<feature>.md. Match an existing pattern.
- Copy the closest existing .md (e.g. test-copilot-create-issue.md) and edit the frontmatter and prompt.
- Add a samples: block so the test can run deterministically with --use-samples.
- Run gh aw compile .github/workflows/test-<engine>-<feature>.md to produce the lockfile. Commit both.
- Run ./e2e.sh test-<engine>-<feature> locally and confirm pass.
- Update the coverage matrix in README.md (the [ ]/[x] checklist).
- Do not edit fails.txt. If the test fails the first time, leave it for the runner to record.
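
Several of the steps above can be checked mechanically before running the test. The following is a minimal sketch of such a preflight, assuming a hypothetical function preflight_new_test. It only checks the filename pattern, the presence of a samples: line, and the existence of a lockfile. It does not validate the workflow itself.

```shell
# Hypothetical preflight for a new test source. Returns the number of
# problems found (0 means all three checks passed).
preflight_new_test() {
  local src="$1" name problems=0
  name="$(basename "$src")"
  if [[ ! "$name" =~ ^test-(claude|codex|copilot)-.+\.md$ ]]; then
    echo "name does not match test-<engine>-<feature>.md: $name"
    problems=$((problems + 1))
  fi
  if ! grep -q "samples:" "$src" 2>/dev/null; then
    echo "no samples: block: $name"
    problems=$((problems + 1))
  fi
  if [[ ! -f "${src%.md}.lock.yml" ]]; then
    echo "missing lockfile: ${name%.md}.lock.yml"
    problems=$((problems + 1))
  fi
  return "$problems"
}
```

For example, preflight_new_test .github/workflows/test-copilot-create-issue.md should print nothing and return 0 if that file follows the conventions in this document.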
- Do not commit e2e-test-*.log or fails.txt.bak (gitignored, but occasionally untracked copies sneak in).
- Do not delete e2e.sh.backup or e2e.sh.before-full-consolidation without confirming with the maintainer; they are the record of the consolidation described in CONSOLIDATION_COMPLETE.md.
- Do not change REPO_OWNER/REPO_NAME in e2e.sh. They are hard-coded to githubnext/gh-aw-test on purpose.
- Do not enable a workflow without queueing its name for disable in the exit trap.
- Do not bypass the lock files; bash array writes from parallel children will corrupt results otherwise.
- Do not generate URLs in code or docs that you have not verified. Live GitHub run IDs in fails.txt are written by the runner from real gh run output.
The following files are historical design notes left in the repo root. They are not consumed by any tool — feel free to skim, but do not treat them as current spec:
- CONSOLIDATION_COMPLETE.md: the e2e.sh consolidation that produced the current single-file runner.
- PARALLEL_FIXES_IMPLEMENTED.md: what changed when parallel batching landed.
- PARALLEL_TEST_CONFLICT_ASSESSMENT.md: assessment that justifies the current locking scheme.
- WORKFLOW_STATE_MANAGEMENT.md: design of enable/disable + trap.
- ERROR_HANDLING_ASSESSMENT.md: why set -e is deliberately absent.
When these notes contradict the code, the code wins.
bash -n e2e.sh # syntax
bash -n clean.sh
./e2e.sh --dry-run # exercises discovery + parsing
gh aw compile .github/workflows/ # all sources compile
git diff --stat .github/workflows/*.lock.yml # confirm only intended lockfile changes

If any step fails, fix it before pushing. The nightly matrix is the last line of defence, not the first.