Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .mcp.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,10 @@
"ai-ddtk": {
"command": "bash",
"args": ["tools/mcp-server/start.sh"]
},
"tick": {
"command": "node",
"args": ["experiments/coordination-layer/mcp/tick-mcp.js"]
}
}
}
24 changes: 24 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,30 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
- Do not edit a version block that has already been committed and pushed
-->

## [2.3.1] - 2026-06-08

### Added
- **`trial mcp-doctor`** — preflight health check for the tick MCP server: spawns it, completes the MCP handshake, lists its tools, and round-trips a non-mutating `tick_analyze` call. Defaults to a throwaway `.tick/` state; `--repo-root .` checks the repo's real state read-only.
- **tick MCP wired into `.mcp.json`** — the `tick` server is now registered at the repo root (resolving to the repo's `.tick/`), so MCP clients (Claude Code) can coordinate via the `tick_*` tools directly without extra setup.

### Changed
- **`mcp/client.js`** — extracted the tiny MCP stdio JSON-RPC client shared by `mcp-doctor` and `mcp-smoke.js` (removes the duplicated client in the smoke test).

## [2.3.0] - 2026-06-08

### Added
- **tick MCP server** (`experiments/coordination-layer/mcp/tick-mcp.js`) — exposes all 13 coordination verbs as typed MCP tools (`tick_take`, `tick_done`, `tick_break`, …) as a drop-in alternative to the `tick` CLI. Zero-dependency JSON-RPC-over-stdio; every tool calls the same `src/` modules as the CLI, so CLI and MCP are interchangeable fronts on one `.tick/` state. Includes wiring docs, an example `.mcp.json`, and a self-contained `test/mcp-smoke.js` that drives the server over real JSON-RPC (8/8 green).
- **`--transport cli|mcp`** for the trial harness — selects whether agents coordinate via the `./tick` shim or the tick MCP tools (with an MCP-flavored agent prompt and an auto-generated workspace `.mcp.json` bound to the run's isolated state).
- **CLI-orchestration confirmation** (`harness/test/confirm-cli-orchestration.sh` + `test/fake-cli/`) — proves the harness can execute and monitor the real `gemini`/`codex` CLI invocations (`gemini --yolo`, `codex exec --full-auto -`) end-to-end via stand-in binaries that honor the headless contract: prompt on stdin, identity from the prompt, coordinate through `tick`. Asserts real command shapes, stdin delivery, clean monitored exits, and collision-free path-routing (7/7 green).

## [2.2.0] - 2026-06-08

### Added
- **Trinity trial harness** (`experiments/coordination-layer/harness/`) — drives automated, headless multi-agent coordination trials from the system CLI, replacing the manual "paste prompts into VS Code chat panels and babysit" workflow used in Trinity Runs 1–3. `trial run <spec>` parses a structured project spec, runs a preflight question round with a human gate, seeds an isolated `.tick/` backlog, spawns each agent CLI (Gemini/Codex/Claude) concurrently and headlessly, then scores the run with `tick analyze`. Full observability per run: structured `run.jsonl` spine, per-agent transcripts, and a `report/SUMMARY.md`.
- **Driver abstraction + mock driver** (`harness/src/drivers.js`, `mock-agent.js`) — headless invocation specs for `gemini --yolo`, `codex exec --full-auto`, and `claude -p`, each env-overridable; a deterministic mock driver speaks the full `tick take → done|break` protocol so the entire battery runs (and the harness is validated) with no API keys.
- **Deterministic spec parser** (`harness/src/spec.js`) — implements stage 1 of the previously-scaffolded ingestion pipeline: parses `PROJECT-SPEC` markdown into validated task events, hard-failing on duplicate IDs, empty scopes, non-numeric priority, and dependency cycles.
- **Trial battery** (`harness/trials/`) — two build scenarios (Todo API, URL shortener) and two debug scenarios (seeded-bug fix-iterate, poisoned-task circuit-breaker) with fixtures, plus a `test/smoke.sh` that runs all four on the mock driver (11/11 green).

## [2.1.6] - 2026-04-23

### Changed
Expand Down
21 changes: 21 additions & 0 deletions PROJECT/2-WORKING/P1-TRINITY-ROUND2.md
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,27 @@ and remove the `tick claim` instruction.
3. Coordinator integrates the two halves and boots the app.
4. Append a Run 3 section to `RECAP.md` and update this doc's status.

### Run 3 automation — CLI trial harness (2026-06-08)

Run 3 no longer requires manual chat-panel coordination. The
[`experiments/coordination-layer/harness/`](../../experiments/coordination-layer/harness/README.md)
harness drives the whole trial from the system CLI: it parses a structured spec,
runs a preflight question round with a human gate, seeds the backlog, spawns
**Gemini CLI and Codex CLI headlessly and concurrently** (`gemini --yolo`,
`codex exec --full-auto`), captures full transcripts, and runs `tick analyze` —
producing the concurrent-claim-time metric automatically. To run Run 3:

```bash
cd experiments/coordination-layer/harness
node bin/trial doctor # confirm gemini + codex are installed with keys
node bin/trial run build-todo-api # the Run 2/3 6-task split, now automated
```

The mock driver (`--agents gemini:mock,codex:mock`) validates the harness with no
keys; `bash test/smoke.sh` runs the full build+debug battery. The
load-bearing open question is unchanged — only a **real-CLI** run answers whether
Gemini/Codex actually comply with the integration prompt.

---

## Open questions for Run 3+
Expand Down
3 changes: 3 additions & 0 deletions experiments/coordination-layer/harness/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Per-run trial artifacts (workspaces, transcripts, reports). Regenerated on
# every `trial run`; never commit them.
runs/
166 changes: 166 additions & 0 deletions experiments/coordination-layer/harness/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,166 @@
# Trinity trial harness — automated multi-agent runs from the CLI

**Question this answers:** *Can the Trinity coordination layer use Gemini CLI and
Codex CLI to run automated trials, instead of a human manually coordinating agent
chat sessions in VS Code?*

**Answer: yes.** The coordination layer (`../bin/tick`) was already fully
headless. The only manual part of Runs 1–3 was *driving the agents* — pasting
prompts into VS Code chat panels and babysitting. This harness automates exactly
that: it seeds a backlog from a structured spec, runs a preflight question round,
spawns each agent CLI **headlessly and concurrently**, captures everything, and
scores the run with `tick analyze`. No human in the loop during the run.

```
spec (.md) ─▶ parse ─▶ preflight Q&A ─▶ [human gate] ─▶ seed .tick/ ─▶ spawn agents ─▶ analyze ─▶ report
stage 1 agents ask review & backlog gemini/codex/… metrics SUMMARY.md
(deterministic) clarifying answer (isolated) in parallel
questions
```

## Quick start

```bash
cd experiments/coordination-layer/harness

node bin/trial list # the battery of trial specs
node bin/trial doctor # which agent CLIs are installed
node bin/trial mcp-doctor # health-check the tick MCP server
node bin/trial validate debug-calc-bugs # parse + validate a spec

# Real run (needs gemini + codex installed with API keys):
node bin/trial run build-todo-api

# Harness self-test with the deterministic mock driver (no keys needed):
node bin/trial run build-todo-api --agents gemini:mock,codex:mock --auto
bash test/smoke.sh # runs the whole battery on the mock
```

## How a run works

1. **Parse + validate** the spec (`src/spec.js`) — this is the deterministic
"stage 1" the [ingestion scaffold](../ingestion/README.md) described:
duplicate IDs, empty scopes, non-numeric priority, dependency cycles all
hard-fail before anything spawns.
2. **Isolated workspace** — each run gets `runs/<spec>-<ts>/workspace/`, its own
throwaway git repo with its own `.tick/` state (`TICK_REPO_ROOT` points at
it). The real repo's `.tick/` is never touched, so trials are safe to run
repeatedly and in parallel.
3. **Preflight + human gate** — every agent gets a read-only prompt and may emit
`QUESTION:` lines (or `NO_QUESTIONS`). Questions are collected to
`preflight/QUESTIONS.md` and the run **pauses** for a human to review/answer
by editing the spec. `--auto` skips the gate; `--skip-preflight` skips the
round entirely.
4. **Seed** the backlog — one `tick log task.created` per task.
5. **Spawn agents concurrently** — each driver runs its CLI headlessly with the
integration prompt (`prompts/agent-loop.md`), looping
`tick take → work → tick done|break`.
6. **Verify + analyze** — runs the project `Verify` command, then `tick analyze`,
and writes `report/SUMMARY.md`.

## Observability

Everything a run does is recorded so it's fully reconstructable:

| Artifact | What's in it |
|---|---|
| `run.jsonl` | structured spine — one JSON line per harness event (`seed.task`, `agent.spawn`, `agent.exit`, `verify.result`, `trial.end`, …) |
| `logs/<agent>.log` | raw stdout+stderr transcript of each agent process |
| `prompt-<agent>.md` | the exact prompt each agent received |
| `preflight/QUESTIONS.md` | clarifying questions per agent |
| `report/analyze.json` / `.md` | full `tick analyze` output (claims, dones, breaks, concurrent-claim time) |
| `report/SUMMARY.md` | human-readable verdict + per-agent table |
| `spec.json` | the parsed, resolved spec |

## The battery

| Spec | Kind | Tasks | Exercises |
|---|---|---|---|
| `build-todo-api` | build | 6 | two-half path routing (HTTP / store), claim cap |
| `build-url-shortener` | build | 4 | three-concern routing (codec / store / http), contract deps |
| `debug-calc-bugs` | debug | 2 | fix-iterate: seeded failing tests → green |
| `debug-poisoned-task` | debug | 2 | circuit-breaker: one fixable bug + one unsatisfiable task |

Debug specs carry a **Fixture** (seeded-bug code under `trials/fixtures/`) and a
per-task **Verify** command. The fixtures include a `.solutions/` dir and a
`Mock-solution:` mapping so the mock driver can drive red→green deterministically
— **real drivers ignore that field and actually debug.**

## Drivers

Headless invocation per agent (`src/drivers.js`). Flags are overridable with env
vars so you can tune them without editing code (CLIs move fast):

| Driver | Run invocation | Override env |
|---|---|---|
| `gemini` | `gemini --yolo` (prompt on stdin) | `GEMINI_CMD`, `GEMINI_ARGS` |
| `codex` | `codex exec --full-auto -` | `CODEX_CMD`, `CODEX_ARGS` |
| `claude` | `claude -p --permission-mode acceptEdits` | `CLAUDE_CMD`, `CLAUDE_ARGS` |
| `mock` | deterministic in-process stand-in | — |

`--agents id:driver,id:driver` overrides the roster, e.g.
`--agents gemini:gemini,codex:codex` for a real run, or
`--agents a:mock,b:mock` to validate the harness anywhere.

## CLI vs MCP transport

Agents can coordinate two ways, selectable with `--transport`:

- `--transport cli` (default) — agents call the `./tick` shim. Simplest for any
agent that can run shell commands.
- `--transport mcp` — agents call the **`tick` MCP server**
([`../mcp/`](../mcp/README.md)) tools (`tick_take`, `tick_done`, …). The run
drops a ready `.mcp.json` into the workspace bound to that run's isolated
state, and agents get the MCP-flavored prompt (`prompts/agent-loop-mcp.md`).

Both fronts drive the same engine on the same `.tick/` state, so they're
interchangeable and can even be mixed across agents.

## Confirming Claude Code can execute + monitor the real CLIs

`test/confirm-cli-orchestration.sh` proves the execute-and-monitor path without
the real binaries: it points the `gemini`/`codex` drivers at stand-in binaries
(`test/fake-cli/`) that honor the exact headless contract (prompt on stdin,
identity from the prompt, coordinate via `./tick`), then runs a real trial with
the `gemini`/`codex` driver names. It asserts the harness spawned the real
command shapes (`gemini --yolo`, `codex exec --full-auto -`), delivered the
prompt on stdin, monitored both processes to a clean exit, and that path-routing
split the work (HTTP half vs store half) with no collisions. Swap the stand-ins
for the real binaries (with keys) and nothing else changes.

## Running for real (gemini + codex)

1. Install both CLIs and set their API keys:
- Gemini CLI — `GEMINI_API_KEY` (or its configured auth).
- Codex CLI — `OPENAI_API_KEY` (or its configured auth).
2. `node bin/trial doctor` — confirm both show ✅.
3. `node bin/trial run build-todo-api` — preflight pauses at the gate; review
`preflight/QUESTIONS.md`, edit the spec if needed, then re-run with
`--skip-preflight` (or `--auto`).
4. Read `report/SUMMARY.md`. The load-bearing metric is **concurrent-claim time**
(target ≥ 50%, per the Run 3 success criterion in
[`P1-TRINITY-ROUND2.md`](../../../PROJECT/2-WORKING/P1-TRINITY-ROUND2.md)).

## Adding a trial

Copy any `trials/*.project.md`, following
[`../ingestion/PROJECT-SPEC.template.md`](../ingestion/PROJECT-SPEC.template.md).
Project metadata (`**Key:** value`) goes above the first `##` heading; one
`### TASK-<ID>` block per task. For a debug trial, add a `Fixture:` dir and a
per-task `Verify:` command (plus `Mock-solution:` if you want the mock to drive
it green). `node bin/trial validate <spec>` checks it before you run.

## Limitations / honest notes

- **The mock is not a real agent.** It proves the *harness* — protocol calls,
concurrency, observability, red→green, circuit-break — not that Gemini/Codex
will comply with the prompt. That's still the load-bearing open question from
the Run 1/2 retros; only a real-CLI run answers it.
- **Mock concurrent-claim time runs ~30–40%** because simulated work is fast and
tasks are few. Real agents (minutes per task) should overlap far more; the
metric is computed identically either way.
- **The claim lock is shared across agents** — simultaneous `tick take` calls
collide and the loser must retry. The mock retries with backoff; the real
agent prompt instructs the same. If real runs show this as friction, the
Phase-2 fix is a per-agent lock or a queue.
- Single branch / single host only, same as the underlying `tick` protocol.
Loading