Understand how the ControlFlow repository is tested: what npm test checks, how scenarios work, and how to add a new check after making a change.
- Eval harness: a set of offline checks in `evals/` that do not call live agents.
- Scenario: a JSON file in `evals/scenarios/` that pairs an input with an expected output.
- Drift check: a test that verifies agent files haven't gone out of sync with contracts and governance files.
- Companion rule: a `_must_contain` assertion about which sections must be present in an agent file.
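To make the companion-rule idea concrete, here is a minimal sketch of a `_must_contain` assertion. The rule shape, the function name `checkMustContain`, and the `## Prompt` style headings are assumptions made for illustration; the actual rules in the repository may be structured differently.

```javascript
// Hypothetical rule shape; the real companion rules may differ.
// A rule lists strings that must all appear in the text of an agent file.
function checkMustContain(rule, fileText) {
  const missing = rule._must_contain.filter(
    (needle) => !fileText.includes(needle)
  );
  return { passed: missing.length === 0, missing };
}

// Usage with made-up heading strings.
const rule = { _must_contain: ["## Prompt", "## Tools"] };
const report = checkMustContain(rule, "## Prompt\n...\n## Resources\n");
console.log(report); // { passed: false, missing: [ '## Tools' ] }
```

Returning the list of missing strings, rather than a bare boolean, lets a failure message name exactly what is absent.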
The eval harness is a Node.js test runner in `evals/`. It runs completely offline.
Key properties:
- No network — no live agents, no LLM calls.
- Offline only — runs in CI without credentials.
- Deterministic — same input always produces the same pass/fail.
```
evals/
  package.json          # scripts: test, test:structural, test:behavior
  validate.mjs          # main structural validator (Passes 1–13)
  drift-checks.mjs      # drift detection helpers
  tests/                # behavior test files (.test.mjs)
  scenarios/            # JSON scenario fixtures
    <agent-name>/       # folder per agent
      <scenario>.json   # individual scenario
```
| Command | What it runs | Speed |
|---|---|---|
| `cd evals && npm test` | Full suite (all 18 passes & behaviors) | Slower |
| `npm run test:structural` | `validate.mjs` structural passes | Fast |
| `npm run test:behavior` | Prompt-behavior + orchestration-handoff | Fast |
The current authoritative pass list enforced by validate.mjs:
- Each `schemas/*.json` is a valid JSON Schema (draft 2020-12).
- Validates `governance/runtime-policy.json` against `schemas/runtime-policy.schema.json` and the three fixtures under `evals/scenarios/runtime-policy/`.
- No syntax errors.
- Each `evals/scenarios/**/*.json` file is valid against its corresponding schema.
- Each agent file that mentions a schema has the correct path.
- `skill_references[]` values point to files that exist in `skills/patterns/`.
- Verifies that critical files like `plans/project-context.md` exist.
- Validates tool arrays against governance configs.
- Validates agent arrays against governance configs.
- Each `*.agent.md` has sections in this exact order: Prompt → Archive → Resources → Tools.
- Any missing or reordered section fails.
- Agent companion rules validating strict routing policy mentions.
- Validates the structural integrity of the `skills/` index mapping.
- Tests renaming files to ensure governance around drift isn't bypassed.
- Ensures agents reference the unified memory architecture.
- Verifies that agents correctly include instructions for memory cleanups, hygiene rules, and persistent storage boundaries.
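The reference-integrity idea behind the `skill_references[]` check can be sketched as a pure function. This is an illustration only, not the code in `validate.mjs`: the name `findMissingSkillReferences` and the input shapes are assumptions, and the file listing is passed in rather than read from disk so the example stays offline and deterministic.

```javascript
// Hypothetical helper, not taken from validate.mjs.
// Returns every skill reference that does not match a known file path.
function findMissingSkillReferences(skillReferences, existingFiles) {
  const known = new Set(existingFiles);
  return skillReferences.filter((ref) => !known.has(ref));
}

// Usage with made-up file names.
const existing = ["skills/patterns/a.md", "skills/patterns/b.md"];
const missing = findMissingSkillReferences(
  ["skills/patterns/a.md", "skills/patterns/c.md"],
  existing
);
console.log(missing); // [ 'skills/patterns/c.md' ]
```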
In placeholder mode (the current default, with `_status: "placeholder"` in `evals/scenarios/tutorial-parity/allowlist.json`), Pass 7c only logs that the parity check is installed and skips validation. Activation flips `_status` to `"active"` in a follow-up phase, after which `validateTutorialParity` runs and emits per-chapter-pair pass/fail by comparing level-2 heading sets between `docs/tutorial-en/` and `docs/tutorial-ru/`.
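The placeholder/active switch described above could look roughly like the following. This is a sketch under assumptions: `runParityPass` and its return shape are hypothetical, the `validate` callback stands in for whatever `validateTutorialParity` actually does, and treating an unknown `_status` as an error is a choice made only for this example.

```javascript
// Hypothetical wrapper around a parity check; the real Pass 7c code may differ.
// `allowlist` is the parsed allowlist.json; `validate` is the check to run
// once the pass has been activated.
function runParityPass(allowlist, validate) {
  if (allowlist._status === "placeholder") {
    // Placeholder mode: the check is installed but validates nothing.
    return { skipped: true, results: [] };
  }
  if (allowlist._status === "active") {
    return { skipped: false, results: validate() };
  }
  // Any other value is treated as a configuration error in this sketch.
  throw new Error(`Unknown _status: ${allowlist._status}`);
}

const outcome = runParityPass({ _status: "placeholder" }, () => [
  { chapter: "01", ok: true },
]);
console.log(outcome.skipped); // true
```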
- Verifies that the agent roster in `plans/project-context.md` and the `executor_agent` enum in `schemas/planner.plan.schema.json` stay in sync in both directions.
- For every schema referenced from an agent's `Resources` section, verifies the schema file actually exists.
- Detects accidental file-list overlap across active plans in `plans/`.
- Asserts invariants on `governance/runtime-policy.json` and related governance files (review pipeline by tier, retry budgets, approval gate thresholds).
- Verifies that `review_scope: "final"` references in Orchestrator and CodeReviewer agent prompts are coupled in both directions and reference the same fields.
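Several of these passes check that two lists stay in sync "in both directions". A minimal sketch of that idea, using the roster/enum pair as the example, is shown here. The helper name `diffBothWays` and the sample values are assumptions; how the real drift checks extract the roster and the enum is not shown.

```javascript
// Hypothetical helper illustrating a two-way sync check.
// `roster` stands for agent names listed in plans/project-context.md,
// `enumValues` for the executor_agent enum in the planner schema.
function diffBothWays(roster, enumValues) {
  const rosterSet = new Set(roster);
  const enumSet = new Set(enumValues);
  return {
    missingFromEnum: roster.filter((name) => !enumSet.has(name)),
    missingFromRoster: enumValues.filter((name) => !rosterSet.has(name)),
  };
}

// Sample values only; they do not reflect the actual roster.
const drift = diffBothWays(
  ["Planner", "CodeReviewer"],
  ["Planner", "BrowserTester"]
);
console.log(drift.missingFromEnum); // [ 'CodeReviewer' ]
console.log(drift.missingFromRoster); // [ 'BrowserTester' ]
```

Checking only one direction would miss an agent that was added to the schema but never registered in the roster, which is why both differences are reported.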
A scenario is a JSON fixture describing an input/output pair. It is used for two purposes:
- Schema validation — verifies the structure.
- Regression testing — verifies behavior doesn't change unexpectedly.
Examples of scenario types:
| Scenario | Folder | Checked against |
|---|---|---|
| Planner plan with 5 phases | `scenarios/planner/` | `planner.plan.schema.json` |
| PlanAuditor APPROVED verdict | `scenarios/plan-auditor/` | `plan-auditor.plan-audit.schema.json` |
| CoreImplementer NEEDS_INPUT | `scenarios/core-implementer/` | `core-implementer.execution-report.schema.json` |
| Orchestrator gate event | `scenarios/orchestrator/` | `orchestrator.gate-event.schema.json` |
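The folder-to-schema pairing in the table can be pictured as a simple lookup. The sketch below copies the four pairs from the table; the lookup function `schemaForScenario` and the example file name are hypothetical, and the real resolver in `validate.mjs` may work differently (for instance, an agent could have more than one schema).

```javascript
// Folder-to-schema pairs copied from the table above.
const SCHEMA_BY_FOLDER = new Map([
  ["planner", "planner.plan.schema.json"],
  ["plan-auditor", "plan-auditor.plan-audit.schema.json"],
  ["core-implementer", "core-implementer.execution-report.schema.json"],
  ["orchestrator", "orchestrator.gate-event.schema.json"],
]);

// Hypothetical lookup: takes a path such as "scenarios/planner/example.json"
// and returns the schema file name for its folder, or null when the folder
// is not recognized.
function schemaForScenario(scenarioPath) {
  const parts = scenarioPath.split("/");
  const folder = parts[parts.length - 2];
  return SCHEMA_BY_FOLDER.get(folder) ?? null;
}

console.log(schemaForScenario("scenarios/planner/example.json"));
// planner.plan.schema.json
console.log(schemaForScenario("scenarios/unknown-agent/example.json"));
// null
```

A `null` result corresponds to the "schema not found" situation that arises when a scenario file is placed in a folder that does not follow the naming convention.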
A typical npm test result now includes all 18 passes:
```
Pass 1: Schema Validity — OK
Pass 2: Scenario Integrity — OK
Pass 3: Reference Integrity — OK
...
Pass 7c: Tutorial Parity — OK
Pass 13: Drift Detection — review_scope=final Bidirectional Coupling — OK
Total: All checks passed
```
If a check fails:
```
FAIL Pass 4 — P.A.R.T. order
  CoreImplementer-subagent.agent.md:
    Section order is [Prompt, Resources, Archive, Tools]
    Expected [Prompt, Archive, Resources, Tools]
```
The error tells you exactly what file, what check, and what the diff is.
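A check that produces this kind of message could be sketched as follows. This is an illustration under assumptions, not the validator's actual code: `checkPartOrder` is a hypothetical name, and the section names are passed in as an array because the way the real validator extracts them from `*.agent.md` files is not described here.

```javascript
const EXPECTED_ORDER = ["Prompt", "Archive", "Resources", "Tools"];

// Hypothetical check: `sections` is the list of P.A.R.T. section names in the
// order they appear in an agent file. A missing or reordered section fails.
function checkPartOrder(fileName, sections) {
  const ok =
    sections.length === EXPECTED_ORDER.length &&
    sections.every((name, i) => name === EXPECTED_ORDER[i]);
  if (ok) return { ok: true, message: "" };
  return {
    ok: false,
    message:
      `${fileName}:\n` +
      `  Section order is [${sections.join(", ")}]\n` +
      `  Expected [${EXPECTED_ORDER.join(", ")}]`,
  };
}

const result = checkPartOrder("CoreImplementer-subagent.agent.md", [
  "Prompt",
  "Resources",
  "Archive",
  "Tools",
]);
console.log(result.ok); // false
console.log(result.message);
```

Including both the observed and the expected order in the message is what makes the failure readable as a diff.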
```mermaid
flowchart TD
    New[Need to add a scenario] --> Which[Which schema does it cover?]
    Which --> Folder[Create file in\nevals/scenarios/<agent-name>/]
    Folder --> Write[Write valid JSON\nagainst the schema]
    Write --> Run[cd evals && npm test]
    Run -->|passed| Done[Scenario added]
    Run -->|failed| Fix[Fix JSON or schema]
    Fix --> Run
```
- Create the agent file: `<Name>.agent.md` (P.A.R.T. order).
- Create the schema: `schemas/<name>.schema.json`.
- Add eval scenarios: at least one scenario in `evals/scenarios/<name>/`.
- Register the agent in `plans/project-context.md`.
After each step, run npm test to verify nothing is broken.
- Does the agent solve the task correctly? — Not verified; that's a human review.
- Does the LLM follow behavioral invariants at runtime? — Not verified at eval time (only at code review).
- Network dependencies — no live tools, no API calls.
- UI rendering — no visual output.
`.github/workflows/ci.yml`:

```yaml
- run: cd evals && npm test
  env:
    NODE_ENV: test
```

The CI gate requires all checks to pass. No partial passes.
- Running `npm test` from the repo root instead of `evals/`. The command works only from `evals/`.
- Adding a scenario file but forgetting the folder (wrong naming convention → schema not found).
- Changing `agents:` frontmatter but not updating `plans/project-context.md`: the companion rule fails.
- Reordering P.A.R.T. sections: Pass 4 fails immediately.
- Treating eval failures as "optional": CI uses the same command, so a local failure is a CI failure.
- (beginner) Run `cd evals && npm test`. How many checks pass? Optionally redirect the output to a local file (`npm test > out.txt`, which is gitignored) to review the last run.
- (beginner) Open `evals/scenarios/`. How many agent folders are there?
- (intermediate) Add an `ABSTAIN` verdict scenario for PlanAuditor. What JSON fields are required?
- (intermediate) What companion rule exists for `Orchestrator.agent.md`? Find the rule in `drift-checks.mjs`.
- (advanced) Write a test for a new `needs_replan` scenario for BrowserTester. What fields are required in the schema?
- How many checks does the full eval suite run?
- Can the eval harness make LLM calls?
- What does Pass 4 check?
- How many steps are needed to add a new agent to the repo?
- What command do you run before declaring a change "done"?