Hypercart-Dev-Tools · noelsaw1 · Jun 8, 2026 · Jun 8, 2026 · Jun 8, 2026
diff --git a/.mcp.json b/.mcp.json
@@ -3,6 +3,10 @@
     "ai-ddtk": {
       "command": "bash",
       "args": ["tools/mcp-server/start.sh"]
+    },
+    "tick": {
+      "command": "node",
+      "args": ["experiments/coordination-layer/mcp/tick-mcp.js"]
     }
   }
 }
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,6 +13,30 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 - Do not edit a version block that has already been committed and pushed
 -->
 
+## [2.3.1] - 2026-06-08
+
+### Added
+- **`trial mcp-doctor`** — preflight health check for the tick MCP server: spawns it, completes the MCP handshake, lists its tools, and round-trips a non-mutating `tick_analyze` call. Defaults to a throwaway `.tick/` state; `--repo-root .` checks the repo's real state read-only.
+- **tick MCP wired into `.mcp.json`** — the `tick` server is now registered at the repo root (resolving to the repo's `.tick/`), so MCP clients (Claude Code) can coordinate via the `tick_*` tools directly without extra setup.
+
+### Changed
+- **`mcp/client.js`** — extracted the tiny MCP stdio JSON-RPC client shared by `mcp-doctor` and `mcp-smoke.js` (removes the duplicated client in the smoke test).
+
+## [2.3.0] - 2026-06-08
+
+### Added
+- **tick MCP server** (`experiments/coordination-layer/mcp/tick-mcp.js`) — exposes all 13 coordination verbs as typed MCP tools (`tick_take`, `tick_done`, `tick_break`, …) as a drop-in alternative to the `tick` CLI. Zero-dependency JSON-RPC-over-stdio; every tool calls the same `src/` modules as the CLI, so CLI and MCP are interchangeable fronts on one `.tick/` state. Includes wiring docs, an example `.mcp.json`, and a self-contained `test/mcp-smoke.js` that drives the server over real JSON-RPC (8/8 green).
+- **`--transport cli|mcp`** for the trial harness — selects whether agents coordinate via the `./tick` shim or the tick MCP tools (with an MCP-flavored agent prompt and an auto-generated workspace `.mcp.json` bound to the run's isolated state).
+- **CLI-orchestration confirmation** (`harness/test/confirm-cli-orchestration.sh` + `test/fake-cli/`) — proves the harness can execute and monitor the real `gemini`/`codex` CLI invocations (`gemini --yolo`, `codex exec --full-auto -`) end-to-end via stand-in binaries that honor the headless contract: prompt on stdin, identity from the prompt, coordinate through `tick`. Asserts real command shapes, stdin delivery, clean monitored exits, and collision-free path-routing (7/7 green).
+
+## [2.2.0] - 2026-06-08
+
+### Added
+- **Trinity trial harness** (`experiments/coordination-layer/harness/`) — drives automated, headless multi-agent coordination trials from the system CLI, replacing the manual "paste prompts into VS Code chat panels and babysit" workflow used in Trinity Runs 1–3. `trial run <spec>` parses a structured project spec, runs a preflight question round with a human gate, seeds an isolated `.tick/` backlog, spawns each agent CLI (Gemini/Codex/Claude) concurrently and headlessly, then scores the run with `tick analyze`. Full observability per run: structured `run.jsonl` spine, per-agent transcripts, and a `report/SUMMARY.md`.
+- **Driver abstraction + mock driver** (`harness/src/drivers.js`, `mock-agent.js`) — headless invocation specs for `gemini --yolo`, `codex exec --full-auto`, and `claude -p`, each env-overridable; a deterministic mock driver speaks the full `tick take → done|break` protocol so the entire battery runs (and the harness is validated) with no API keys.
+- **Deterministic spec parser** (`harness/src/spec.js`) — implements stage 1 of the previously-scaffolded ingestion pipeline: parses `PROJECT-SPEC` markdown into validated task events, hard-failing on duplicate IDs, empty scopes, non-numeric priority, and dependency cycles.
+- **Trial battery** (`harness/trials/`) — two build scenarios (Todo API, URL shortener) and two debug scenarios (seeded-bug fix-iterate, poisoned-task circuit-breaker) with fixtures, plus a `test/smoke.sh` that runs all four on the mock driver (11/11 green).
+
 ## [2.1.6] - 2026-04-23
 
 ### Changed

diff --git a/PROJECT/2-WORKING/P1-TRINITY-ROUND2.md b/PROJECT/2-WORKING/P1-TRINITY-ROUND2.md
@@ -150,6 +150,27 @@ and remove the `tick claim` instruction.
 3. Coordinator integrates the two halves and boots the app.
 4. Append a Run 3 section to `RECAP.md` and update this doc's status.
 
+### Run 3 automation — CLI trial harness (2026-06-08)
+
+Run 3 no longer requires manual chat-panel coordination. The
+[`experiments/coordination-layer/harness/`](../../experiments/coordination-layer/harness/README.md)
+harness drives the whole trial from the system CLI: it parses a structured spec,
+runs a preflight question round with a human gate, seeds the backlog, spawns
+**Gemini CLI and Codex CLI headlessly and concurrently** (`gemini --yolo`,
+`codex exec --full-auto`), captures full transcripts, and runs `tick analyze` —
+producing the concurrent-claim-time metric automatically. To run Run 3:
+
+```bash
+cd experiments/coordination-layer/harness
+node bin/trial doctor                    # confirm gemini + codex are installed with keys
+node bin/trial run build-todo-api        # the Run 2/3 6-task split, now automated
+```
+
+The mock driver (`--agents gemini:mock,codex:mock`) validates the harness with no
+keys; `bash test/smoke.sh` runs the full build+debug battery. The
+load-bearing open question is unchanged — only a **real-CLI** run answers whether
+Gemini/Codex actually comply with the integration prompt.
+
 ---
 
 ## Open questions for Run 3+

diff --git a/experiments/coordination-layer/harness/.gitignore b/experiments/coordination-layer/harness/.gitignore
@@ -0,0 +1,3 @@
+# Per-run trial artifacts (workspaces, transcripts, reports). Regenerated on
+# every `trial run`; never commit them.
+runs/
diff --git a/experiments/coordination-layer/harness/README.md b/experiments/coordination-layer/harness/README.md
@@ -0,0 +1,166 @@
+# Trinity trial harness — automated multi-agent runs from the CLI
+
+**Question this answers:** *Can the Trinity coordination layer use Gemini CLI and
+Codex CLI to run automated trials, instead of a human manually coordinating agent
+chat sessions in VS Code?*
+
+**Answer: yes.** The coordination layer (`../bin/tick`) was already fully
+headless. The only manual part of Runs 1–3 was *driving the agents* — pasting
+prompts into VS Code chat panels and babysitting. This harness automates exactly
+that: it seeds a backlog from a structured spec, runs a preflight question round,
+spawns each agent CLI **headlessly and concurrently**, captures everything, and
+scores the run with `tick analyze`. No human in the loop during the run.
+
+```
+spec (.md) ─▶ parse ─▶ preflight Q&A ─▶ [human gate] ─▶ seed .tick/ ─▶ spawn agents ─▶ analyze ─▶ report
+              stage 1    agents ask        review &        backlog       gemini/codex/…   metrics    SUMMARY.md
+              (deterministic) clarifying    answer          (isolated)    in parallel
+                              questions
+```
+
+## Quick start
+
+```bash
+cd experiments/coordination-layer/harness
+
+node bin/trial list                       # the battery of trial specs
+node bin/trial doctor                     # which agent CLIs are installed
+node bin/trial mcp-doctor                 # health-check the tick MCP server
+node bin/trial validate debug-calc-bugs   # parse + validate a spec
+
+# Real run (needs gemini + codex installed with API keys):
+node bin/trial run build-todo-api
+
+# Harness self-test with the deterministic mock driver (no keys needed):
+node bin/trial run build-todo-api --agents gemini:mock,codex:mock --auto
+bash test/smoke.sh                        # runs the whole battery on the mock
+```
+
+## How a run works
+
+1. **Parse + validate** the spec (`src/spec.js`) — this is the deterministic
+   "stage 1" the [ingestion scaffold](../ingestion/README.md) described:
+   duplicate IDs, empty scopes, non-numeric priority, dependency cycles all
+   hard-fail before anything spawns.
+2. **Isolated workspace** — each run gets `runs/<spec>-<ts>/workspace/`, its own
+   throwaway git repo with its own `.tick/` state (`TICK_REPO_ROOT` points at
+   it). The real repo's `.tick/` is never touched, so trials are safe to run
+   repeatedly and in parallel.
+3. **Preflight + human gate** — every agent gets a read-only prompt and may emit
+   `QUESTION:` lines (or `NO_QUESTIONS`). Questions are collected to
+   `preflight/QUESTIONS.md` and the run **pauses** for a human to review/answer
+   by editing the spec. `--auto` skips the gate; `--skip-preflight` skips the
+   round entirely.
+4. **Seed** the backlog — one `tick log task.created` per task.
+5. **Spawn agents concurrently** — each driver runs its CLI headlessly with the
+   integration prompt (`prompts/agent-loop.md`), looping
+   `tick take → work → tick done|break`.
+6. **Verify + analyze** — runs the project `Verify` command, then `tick analyze`,
+   and writes `report/SUMMARY.md`.
+
+## Observability
+
+Everything a run does is recorded so it's fully reconstructable:
+
+| Artifact | What's in it |
+|---|---|
+| `run.jsonl` | structured spine — one JSON line per harness event (`seed.task`, `agent.spawn`, `agent.exit`, `verify.result`, `trial.end`, …) |
+| `logs/<agent>.log` | raw stdout+stderr transcript of each agent process |
+| `prompt-<agent>.md` | the exact prompt each agent received |
+| `preflight/QUESTIONS.md` | clarifying questions per agent |
+| `report/analyze.json` / `.md` | full `tick analyze` output (claims, dones, breaks, concurrent-claim time) |
+| `report/SUMMARY.md` | human-readable verdict + per-agent table |
+| `spec.json` | the parsed, resolved spec |
+
+## The battery
+
+| Spec | Kind | Tasks | Exercises |
+|---|---|---|---|
+| `build-todo-api` | build | 6 | two-half path routing (HTTP / store), claim cap |
+| `build-url-shortener` | build | 4 | three-concern routing (codec / store / http), contract deps |
+| `debug-calc-bugs` | debug | 2 | fix-iterate: seeded failing tests → green |
+| `debug-poisoned-task` | debug | 2 | circuit-breaker: one fixable bug + one unsatisfiable task |
+
+Debug specs carry a **Fixture** (seeded-bug code under `trials/fixtures/`) and a
+per-task **Verify** command. The fixtures include a `.solutions/` dir and a
+`Mock-solution:` mapping so the mock driver can drive red→green deterministically
+— **real drivers ignore that field and actually debug.**
+
+## Drivers
+
+Headless invocation per agent (`src/drivers.js`). Flags are overridable with env
+vars so you can tune them without editing code (CLIs move fast):
+
+| Driver | Run invocation | Override env |
+|---|---|---|
+| `gemini` | `gemini --yolo` (prompt on stdin) | `GEMINI_CMD`, `GEMINI_ARGS` |
+| `codex` | `codex exec --full-auto -` | `CODEX_CMD`, `CODEX_ARGS` |
+| `claude` | `claude -p --permission-mode acceptEdits` | `CLAUDE_CMD`, `CLAUDE_ARGS` |
+| `mock` | deterministic in-process stand-in | — |
+
+`--agents id:driver,id:driver` overrides the roster, e.g.
+`--agents gemini:gemini,codex:codex` for a real run, or
+`--agents a:mock,b:mock` to validate the harness anywhere.
+
+## CLI vs MCP transport
+
+Agents can coordinate two ways, selectable with `--transport`:
+
+- `--transport cli` (default) — agents call the `./tick` shim. Simplest for any
+  agent that can run shell commands.
+- `--transport mcp` — agents call the **`tick` MCP server**
+  ([`../mcp/`](../mcp/README.md)) tools (`tick_take`, `tick_done`, …). The run
+  drops a ready `.mcp.json` into the workspace bound to that run's isolated
+  state, and agents get the MCP-flavored prompt (`prompts/agent-loop-mcp.md`).
+
+Both fronts drive the same engine on the same `.tick/` state, so they're
+interchangeable and can even be mixed across agents.
+
+## Confirming Claude Code can execute + monitor the real CLIs
+
+`test/confirm-cli-orchestration.sh` proves the execute-and-monitor path without
+the real binaries: it points the `gemini`/`codex` drivers at stand-in binaries
+(`test/fake-cli/`) that honor the exact headless contract (prompt on stdin,
+identity from the prompt, coordinate via `./tick`), then runs a real trial with
+the `gemini`/`codex` driver names. It asserts the harness spawned the real
+command shapes (`gemini --yolo`, `codex exec --full-auto -`), delivered the
+prompt on stdin, monitored both processes to a clean exit, and that path-routing
+split the work (HTTP half vs store half) with no collisions. Swap the stand-ins
+for the real binaries (with keys) and nothing else changes.
+
+## Running for real (gemini + codex)
+
+1. Install both CLIs and set their API keys:
+   - Gemini CLI — `GEMINI_API_KEY` (or its configured auth).
+   - Codex CLI — `OPENAI_API_KEY` (or its configured auth).
+2. `node bin/trial doctor` — confirm both show ✅.
+3. `node bin/trial run build-todo-api` — preflight pauses at the gate; review
+   `preflight/QUESTIONS.md`, edit the spec if needed, then re-run with
+   `--skip-preflight` (or `--auto`).
+4. Read `report/SUMMARY.md`. The load-bearing metric is **concurrent-claim time**
+   (target ≥ 50%, per the Run 3 success criterion in
+   [`P1-TRINITY-ROUND2.md`](../../../PROJECT/2-WORKING/P1-TRINITY-ROUND2.md)).
+
+## Adding a trial
+
+Copy any `trials/*.project.md`, following
+[`../ingestion/PROJECT-SPEC.template.md`](../ingestion/PROJECT-SPEC.template.md).
+Project metadata (`**Key:** value`) goes above the first `##` heading; one
+`### TASK-<ID>` block per task. For a debug trial, add a `Fixture:` dir and a
+per-task `Verify:` command (plus `Mock-solution:` if you want the mock to drive
+it green). `node bin/trial validate <spec>` checks it before you run.
+
+## Limitations / honest notes
+
+- **The mock is not a real agent.** It proves the *harness* — protocol calls,
+  concurrency, observability, red→green, circuit-break — not that Gemini/Codex
+  will comply with the prompt. That's still the load-bearing open question from
+  the Run 1/2 retros; only a real-CLI run answers it.
+- **Mock concurrent-claim time runs ~30–40%** because simulated work is fast and
+  tasks are few. Real agents (minutes per task) should overlap far more; the
+  metric is computed identically either way.
+- **The claim lock is shared across agents** — simultaneous `tick take` calls
+  collide and the loser must retry. The mock retries with backoff; the real
+  agent prompt instructs the same. If real runs show this as friction, the
+  Phase-2 fix is a per-agent lock or a queue.
+- Single branch / single host only, same as the underlying `tick` protocol.