diff --git a/README.md b/README.md index 7b9c5f8..5abb877 100644 --- a/README.md +++ b/README.md @@ -8,11 +8,11 @@ Package metadata lives in [prpm.json](prpm.json). The repo currently publishes ` | Skill | Version | Description | |-------|---------|-------------| -| [choosing-swarm-patterns](skills/choosing-swarm-patterns/SKILL.md) | 1.1.3 | Pick the right Agent Relay orchestration pattern across the 10 core swarm patterns plus specialized patterns. | +| [choosing-swarm-patterns](skills/choosing-swarm-patterns/SKILL.md) | 1.1.4 | Pick the right Agent Relay orchestration pattern across the 10 core swarm patterns plus specialized patterns. | | [writing-agent-relay-workflows](skills/writing-agent-relay-workflows/SKILL.md) | 1.6.17 | Build multi-agent workflows with WorkflowBuilder, DAG dependencies, verification gates, review-depth review/fix loops with test hardening, channels, and chat-native coordination recipes. | | [setting-up-relayfile](skills/setting-up-relayfile/SKILL.md) | 1.1.0 | Set up Relayfile mounts and writeback for provider files through local filesystem access. | -| [using-agent-relay](skills/using-agent-relay/SKILL.md) | 1.3.0 | Participant-side MCP reference for a **registered** relay agent (spawned worker / registered lead): messaging, channels, threads, reactions, search, webhooks. Counterpart to `orchestrating-agent-relay`. | -| [orchestrating-agent-relay](skills/orchestrating-agent-relay/SKILL.md) | 2.1.2 | The canonical way to run agent-relay: self-bootstrap the broker and autonomously spawn, monitor, and coordinate a worker team without human intervention. | +| [using-agent-relay](skills/using-agent-relay/SKILL.md) | 1.4.0 | Participant-side MCP reference for a **registered** relay agent (spawned worker / registered lead): messaging, channels, threads, reactions, and search. Counterpart to `orchestrating-agent-relay`. 
| +| [orchestrating-agent-relay](skills/orchestrating-agent-relay/SKILL.md) | 2.2.0 | The canonical way to run agent-relay: self-bootstrap the broker and autonomously spawn, monitor, and coordinate a worker team without human intervention. | | [relay-80-100-workflow](skills/relay-80-100-workflow/SKILL.md) | 1.0.8 | Author workflows that close the 80-to-100 validation gap with repair-aware test, verify, review-depth review/fix with test hardening, and commit gates. | | [review-fix-signoff-loop](skills/review-fix-signoff-loop/SKILL.md) | 1.0.2 | Loop review, repair, validation, and fresh-context dual-agent signoff until independent reviewers both satisfy the verdict contract. | | [activity-summary](skills/activity-summary/SKILL.md) | 1.0.0 | Answer "what did I work on yesterday" questions by reading `digests/yesterday.md` first instead of crawling provider directories. | @@ -20,14 +20,14 @@ Package metadata lives in [prpm.json](prpm.json). The repo currently publishes ` | [writeback-as-files](skills/writeback-as-files/SKILL.md) | 1.0.0 | File-creation writeback contract — drop a JSON file at the canonical path and relayfile delivers the mutation, with dead-letter recovery. | | [workspace-layout](skills/workspace-layout/SKILL.md) | 1.0.0 | Navigate a relayfile mount via root and per-provider `LAYOUT.md` files plus `by-*` alias indexes instead of `find`/`grep -r`. | | [adding-swarm-patterns](skills/adding-swarm-patterns/SKILL.md) | 1.0.0 | Checklist for extending agent-relay with a new swarm pattern — TypeScript types, JSON schema, YAML template, and pattern/template docs. | -| [openclaw-orchestrator](skills/openclaw-orchestrator/SKILL.md) | 1.0.0 | Run headless multi-agent orchestration sessions via Agent Relay — spawn teams across Claude/Codex/Gemini/Pi/Droid, create channels, and manage agent lifecycle. 
| +| [openclaw-orchestrator](skills/openclaw-orchestrator/SKILL.md) | 1.1.0 | Run headless multi-agent orchestration sessions via Agent Relay — spawn teams across Claude/Codex/Gemini/Pi/Droid, create channels, and manage agent lifecycle. | ## Slash Commands | Command | Version | Description | |---------|---------|-------------| | [/create-workflow](commands/create-workflow.md) | 1.0.4 | Scaffold a model-agnostic Agent Relay workflow using the workflow and swarm-pattern skills, including selected review-depth review/fix loops with test hardening. | -| [/spawn](commands/spawn.md) | 1.0.0 | Bootstrap the broker and spawn a worker for `claude`, `codex`, `opencode`, `droid`, `gemini`, or `pi`. | +| [/spawn](commands/spawn.md) | 1.0.1 | Bootstrap the broker and spawn a worker for `claude`, `codex`, `opencode`, `droid`, `gemini`, or `pi`. | | [/review-loop](commands/review-loop.md) | 1.0.1 | Run a dual-reviewer code-review loop with repair and fresh-context signoff. | ## Claude Relay Plugin diff --git a/commands/spawn.md b/commands/spawn.md index ac90b47..23158e0 100644 --- a/commands/spawn.md +++ b/commands/spawn.md @@ -21,26 +21,26 @@ Bootstrap the agent-relay broker (if not already running) and spawn a worker on - Optional `--task ""` — the task prompt for the spawned worker. If omitted, prompt the user for the task before spawning. 3. **Bootstrap the broker (idempotent).** Per the orchestrator skill: - - Run `agent-relay status` first. If broker is up, skip startup. - - If down, ensure a workspace key is available (`$RELAYCAST_WORKSPACE_KEY` or prompt the user). Then `agent-relay up --workspace-key $KEY --background --no-spawn`. + - Run `agent-relay local status` first. If broker is up, skip startup. + - If down, ensure a workspace key is available (`$RELAYCAST_WORKSPACE_KEY` or `$AGENT_RELAY_WORKSPACE_KEY`, or prompt the user). Then `agent-relay local up --workspace-key $KEY --background --no-spawn`. - Verify broker came up cleanly before spawning. -4. 
**Ensure a coordination channel exists.** Default to `#orchestrator` unless the user specified one. Create it via `mcporter call relaycast create_channel` if missing, then join it. +4. **Ensure a coordination channel exists.** Default to `orchestrator` unless the user specified one. Create it via `mcporter call agent-relay create_channel` if missing, then join it with `mcporter call agent-relay join_channel`. 5. **Spawn the worker.** Construct the spawn command from parsed args: ```text - agent-relay spawn $1 [--model ] [--team orchestrator] "" + agent-relay local agent spawn $1 --name [--model ] --channels orchestrator --task "" ``` - `` should be unique per run (e.g., `worker-`) to avoid 409 conflicts. - Inject the standard task-prompt template from the orchestrator skill (channel posting, inbox checks, completion event) so the worker can communicate. -6. **Report back.** Print the spawned agent name, the channel it joined, and the tail command (`agent-relay agents:logs `) so the user can monitor it. +6. **Report back.** Print the spawned agent name, the channel it joined, the relay message commands for coordination, and the debug tail command (`agent-relay local tail --agent `). ## Output Contract - One-line confirmation: broker state, agent name, harness, model, channel. -- Monitoring commands (logs, channel messages, kill). -- If something failed (workspace key missing, harness unsupported, broker won't start), surface the exact error and the orchestrator-skill fix from its "Gotchas" table — do not silently continue. +- Monitoring commands (relay messages, liveness, debug tail, release). +- If something failed (workspace key missing, harness unsupported, broker won't start), surface the exact error and the orchestrator-skill fix from its "Common Mistakes" table — do not silently continue. 
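As a rough illustration of step 5, the spawn invocation could be assembled from the parsed arguments as shown here. This is a sketch only: `buildSpawnArgv`, `SpawnArgs`, and `SUPPORTED_HARNESSES` are hypothetical names that do not exist in agent-relay, and the flags are limited to the ones the command template above shows (`--name`, `--model`, `--channels`, `--task`).

```typescript
// Hypothetical helper for illustration; not part of the agent-relay CLI or SDK.
interface SpawnArgs {
  harness: string; // one of the harnesses the /spawn command lists
  name: string; // should be unique per run to avoid 409 conflicts
  task: string; // task prompt for the spawned worker
  model?: string; // optional, maps to --model
  channel?: string; // optional, defaults to "orchestrator" as in step 4
}

// Harness list taken from the /spawn command description.
const SUPPORTED_HARNESSES: string[] = [
  "claude",
  "codex",
  "opencode",
  "droid",
  "gemini",
  "pi",
];

// Builds the argv array for `agent-relay local agent spawn ...`.
function buildSpawnArgv(args: SpawnArgs): string[] {
  if (SUPPORTED_HARNESSES.indexOf(args.harness) === -1) {
    // Surface the failure instead of silently continuing.
    throw new Error(`unsupported harness: ${args.harness}`);
  }
  const spawnArgv: string[] = [
    "agent-relay",
    "local",
    "agent",
    "spawn",
    args.harness,
    "--name",
    args.name,
  ];
  if (args.model !== undefined) {
    spawnArgv.push("--model", args.model);
  }
  const channel = args.channel !== undefined ? args.channel : "orchestrator";
  spawnArgv.push("--channels", channel, "--task", args.task);
  return spawnArgv;
}
```

Returning an argv array rather than one command string keeps the task text as a single argument, so no shell quoting is needed when the array is handed to a process-spawning API.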
## Constraints diff --git a/prpm.json b/prpm.json index 503e96f..95c93f7 100644 --- a/prpm.json +++ b/prpm.json @@ -10,7 +10,7 @@ "packages": [ { "name": "choosing-swarm-patterns", - "version": "1.1.3", + "version": "1.1.4", "description": "Use when coordinating multiple AI agents with Agent Relay's workflow engine and need to pick the right orchestration pattern - covers the 10 core patterns (fan-out, pipeline, hub-spoke, consensus, mesh, handoff, cascade, dag, debate, hierarchical) plus 14 specialized ones, with decision framework and accurate SDK/YAML examples.", "format": "claude", "subtype": "skill", @@ -71,8 +71,8 @@ }, { "name": "using-agent-relay", - "version": "1.3.0", - "description": "Use when you are a registered relay agent (a spawned worker, or a lead that called agent_register) coordinating with peers in real time over Relaycast MCP tools - messaging, channels, threads, reactions, search, webhooks. Participant-side reference; the counterpart for driving a team from outside is orchestrating-agent-relay.", + "version": "1.4.0", + "description": "Use when you are a registered relay agent (a spawned worker, or a lead that called register_agent) coordinating with peers in real time over Agent Relay MCP tools - messaging, channels, threads, reactions, and search. Participant-side reference; the counterpart for driving a team from outside is orchestrating-agent-relay.", "format": "claude", "subtype": "skill", "tags": [ @@ -89,8 +89,8 @@ }, { "name": "orchestrating-agent-relay", - "version": "2.1.2", - "description": "The canonical way to run agent-relay - self-bootstrap the broker and autonomously spawn, monitor, and coordinate a team of worker agents without human intervention. 
Covers infrastructure startup, agent spawning, lifecycle monitoring, CLI-first reading, and team coordination.", + "version": "2.2.0", + "description": "The canonical way to run agent-relay - self-bootstrap the broker and autonomously spawn, monitor, and coordinate a team of worker agents without human intervention. Covers infrastructure startup, agent spawning, lifecycle monitoring, relay-mediated messaging, and team coordination.", "format": "claude", "subtype": "skill", "tags": [ @@ -249,7 +249,7 @@ }, { "name": "spawn", - "version": "1.0.0", + "version": "1.0.1", "description": "Slash command for relay-orchestrator that bootstraps the agent-relay broker and spawns a worker on the chosen harness (claude, codex, opencode, droid, gemini, pi) with optional --model and --task - loads the orchestrator skill so infrastructure, channels, and lifecycle just work", "format": "claude", "subtype": "slash-command", @@ -309,7 +309,7 @@ }, { "name": "openclaw-orchestrator", - "version": "1.0.0", + "version": "1.1.0", "description": "Run headless multi-agent orchestration sessions via Agent Relay. Use when spawning teams of agents, creating channels for coordination, managing agent lifecycle, and running parallel workloads across Claude/Codex/Gemini/Pi/Droid agents.", "format": "claude", "subtype": "skill", diff --git a/skills/choosing-swarm-patterns/SKILL.md b/skills/choosing-swarm-patterns/SKILL.md index c6b017d..20d3358 100644 --- a/skills/choosing-swarm-patterns/SKILL.md +++ b/skills/choosing-swarm-patterns/SKILL.md @@ -186,7 +186,7 @@ await workflow("api-build") .step("review", { agent: "lead", task: "Review everything", dependsOn: ["routes"] }) .run(); ``` -Hub (picked via `role: lead` or first agent) stays on the channel and direct-messages interactive workers via `mcp__relaycast__message_dm_send`. +Hub (picked via `role: lead` or first agent) stays on the channel and direct-messages interactive workers via `mcp__agent-relay__send_dm`. 
> **Don't set `interactive: false` on a hub-spoke worker** if you want it to receive coordination DMs — `resolveTopology` strips non-interactive agents from the message graph (`coordinator.ts:218-237`). Use `interactive: false` only when the worker is a one-shot subprocess whose stdout you collect via `{{steps.X.output}}` without any mid-run coordination. @@ -333,21 +333,21 @@ Conventional signals baked into the adapter (`relay-adapter.ts:29-36`): The runner captures PTY chunks as step output and also records channel posts + file changes as `StepCompletionEvidence`. Legacy fallback: a file at `.relay/summaries/{stepName}.md` is read if PTY output is empty (`runner.ts:6607`). -## Relaycast MCP — Correct Tool Names +## Agent Relay MCP — Correct Tool Names -The skill previously referenced `mcp__relaycast__send` / `mcp__relaycast__dm` — those names are wrong. The real tools (the first three are cited in the workflow convention-injection at `relay-adapter.ts:31-35`; the rest are exposed by the live `relaycast` MCP server): +The skill previously referenced `mcp__agent-relay__send` / `mcp__agent-relay__dm` — those names are wrong. 
The real tools (the first three are cited in the workflow convention-injection at `relay-adapter.ts:31-35`; the rest are exposed by the live `agent-relay` MCP server): | Purpose | Tool | Source | |---------|------|--------| -| Send DM to another agent | `mcp__relaycast__message_dm_send` | `relay-adapter.ts:31` | -| Check inbox | `mcp__relaycast__message_inbox_check` | `relay-adapter.ts:35` | -| List agents | `mcp__relaycast__agent_list` | `relay-adapter.ts:35` | -| Post to a channel | `mcp__relaycast__message_post` | relaycast MCP server | -| Reply in a thread | `mcp__relaycast__message_reply` | relaycast MCP server | -| Spawn sub-agent | `mcp__relaycast__agent_add` | relaycast MCP server | -| Remove sub-agent | `mcp__relaycast__agent_remove` | relaycast MCP server | +| Send DM to another agent | `mcp__agent-relay__send_dm` | `relay-adapter.ts:31` | +| Check inbox | `mcp__agent-relay__check_inbox` | `relay-adapter.ts:35` | +| List agents | `mcp__agent-relay__list_agents` | `relay-adapter.ts:35` | +| Post to a channel | `mcp__agent-relay__post_message` | agent-relay MCP server | +| Reply in a thread | `mcp__agent-relay__reply_to_thread` | agent-relay MCP server | +| Spawn sub-agent | `mcp__agent-relay__add_agent` | agent-relay MCP server | +| Remove sub-agent | `mcp__agent-relay__remove_agent` | agent-relay MCP server | -> `interactive: false` agents run as non-interactive subprocesses with no relay connection — they must NOT call any `mcp__relaycast__*` tool (validator warns on this at `validator.ts:138-150`, check `NONINTERACTIVE_RELAY`). +> `interactive: false` agents run as non-interactive subprocesses with no relay connection — they must NOT call any `mcp__agent-relay__*` tool (validator warns on this at `validator.ts:138-150`, check `NONINTERACTIVE_RELAY`). 
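The renames in the table above are mechanical, so prompts or docs that still use the old names could be migrated with a small lookup. The sketch below assumes the old-to-new pairs are exactly the rows changed in this hunk; `TOOL_RENAMES` and `migrateToolNames` are hypothetical names, not part of any agent-relay package.

```typescript
// Old -> new tool names, copied from the rows changed in the table above.
const TOOL_RENAMES = new Map<string, string>([
  ["mcp__relaycast__message_dm_send", "mcp__agent-relay__send_dm"],
  ["mcp__relaycast__message_inbox_check", "mcp__agent-relay__check_inbox"],
  ["mcp__relaycast__agent_list", "mcp__agent-relay__list_agents"],
  ["mcp__relaycast__message_post", "mcp__agent-relay__post_message"],
  ["mcp__relaycast__message_reply", "mcp__agent-relay__reply_to_thread"],
  ["mcp__relaycast__agent_add", "mcp__agent-relay__add_agent"],
  ["mcp__relaycast__agent_remove", "mcp__agent-relay__remove_agent"],
]);

// Rewrites every known old-style tool name found in `text`.
// Names that are not in the map (for example the stale `mcp__relaycast__send`)
// are left untouched so they can be reviewed by hand.
function migrateToolNames(text: string): string {
  return text.replace(/mcp__relaycast__[a-z_]+/g, (toolName: string): string => {
    const renamed = TOOL_RENAMES.get(toolName);
    return renamed !== undefined ? renamed : toolName;
  });
}
```

Leaving unknown names unchanged is deliberate: a name with no entry in the table has no confirmed replacement, so guessing one would risk introducing another wrong tool name.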
## Reflection (Trajectories) @@ -383,7 +383,7 @@ For a first-class critic loop, use the `reflection` **pattern** (agents with `ro | Relying on `reflectOnBarriers` | Config flag exists but runner never calls it | Use `reflectOnConverge` for convergence reflection; use `reflection` pattern for critic loops | | `interactive: false` agent calling MCP | Non-interactive subprocess has no relay | Use `interactive: true` (default) or emit output on stdout | | Relying on multi-level `hierarchical` | Topology is single-level hub in current impl | Use pattern for naming; model levels via `dependsOn` graph | -| Writing `mcp__relaycast__send(...)` | Wrong tool name | Use `mcp__relaycast__message_post` or `message_dm_send` | +| Writing `mcp__agent-relay__send(...)` | Wrong tool name | Use `mcp__agent-relay__post_message` or `mcp__agent-relay__send_dm` | ## Resume & Re-run @@ -397,7 +397,7 @@ await runWorkflow("feature-dev.yaml", { previousRunId: "", }); ``` -Cached outputs live in `.agent-relay/step-outputs/`; runs in `.agent-relay/workflow-runs.jsonl`. Env vars `RESUME_RUN_ID`, `START_FROM`, `PREVIOUS_RUN_ID` are auto-detected. +Cached outputs live in `.agentworkforce/relay/step-outputs/`; runs in `.agentworkforce/relay/workflow-runs.jsonl`. Env vars `RESUME_RUN_ID`, `START_FROM`, `PREVIOUS_RUN_ID` are auto-detected. ## Complete YAML Example diff --git a/skills/openclaw-orchestrator/SKILL.md b/skills/openclaw-orchestrator/SKILL.md index 76c267d..5d62c54 100644 --- a/skills/openclaw-orchestrator/SKILL.md +++ b/skills/openclaw-orchestrator/SKILL.md @@ -1,6 +1,6 @@ --- -name: agent-relay-orchestrator -version: 1.0.0 +name: openclaw-orchestrator +version: 1.1.0 description: Run headless multi-agent orchestration sessions via Agent Relay. Use when spawning teams of agents, creating channels for coordination, managing agent lifecycle, and running parallel workloads across Claude/Codex/Gemini/Pi/Droid agents. 
homepage: https://agentrelay.com/openclaw metadata: { 'category': 'orchestration', 'requires': 'agent-relay' } @@ -13,24 +13,29 @@ Run headless multi-agent sessions: start infrastructure, join a workspace, creat ## Prerequisites - `agent-relay` CLI installed (`npm i -g agent-relay`) -- Relaycast workspace key (`rk_live_...`) — get one at https://agentrelay.com/openclaw or run `agent-relay up` to auto-create +- Agent Relay workspace key (`rk_live_...`) — get one at https://agentrelay.com/openclaw or run `agent-relay local up` to auto-create - For Claude agents: `ANTHROPIC_API_KEY` or `claude auth login` ## Quick Reference | Action | Command | |--------|---------| -| Start broker | `agent-relay up --workspace-key rk_live_KEY --no-spawn` | -| Start broker (background) | `agent-relay up --workspace-key rk_live_KEY --background --no-spawn` | -| Check status | `agent-relay status` | -| Spawn agent | `agent-relay spawn NAME CLI "task"` | -| Spawn with team | `agent-relay spawn NAME CLI --team TEAM "task"` | -| List agents | `agent-relay agents` | -| View logs | `agent-relay agents:logs NAME` | -| Send to channel | `agent-relay send '#channel' 'message'` | -| Send DM | `agent-relay send AGENT 'message'` | -| Kill agent | `agent-relay agents:kill NAME` | -| Stop broker | `agent-relay down` | +| Start broker | `agent-relay local up --workspace-key rk_live_KEY --no-spawn` | +| Start broker (background) | `agent-relay local up --workspace-key rk_live_KEY --background --no-spawn` | +| Check status | `agent-relay local status` | +| Spawn agent | `agent-relay local agent spawn CLI --name NAME --task "task"` | +| Spawn into a shared channel | `agent-relay local agent spawn CLI --name NAME --channels TEAM --task "task"` | +| List agents | `agent-relay local agent list` | +| View logs (debug) | `agent-relay local tail --agent NAME` | +| Send to channel (via relay) | `agent-relay message post channel 'message'` | +| Send DM (via relay) | `agent-relay message dm send AGENT 'message'` 
| +| Release agent | `agent-relay local agent release NAME` | +| Stop broker | `agent-relay local down` | + +> Lifecycle (start/stop, spawn/release, list) is the `agent-relay local …` group. +> Messaging always goes through relay — the `agent-relay message …` group (which +> needs an agent token) or the `mcp__agent-relay__*` tools. Don't read worker +> replies off `local tail`; that's raw broker output for debugging. ## Setup Flow @@ -45,47 +50,49 @@ This registers you on the workspace and configures mcporter for channel/DM tools ### 2. Start broker with workspace key ```bash -agent-relay up --workspace-key rk_live_YOUR_KEY --no-spawn +agent-relay local up --workspace-key rk_live_YOUR_KEY --no-spawn ``` -**Critical**: Pass `--workspace-key` so spawned agents inherit the workspace connection. Without it, agents can't communicate via Relaycast channels. +**Critical**: Pass `--workspace-key` so spawned agents inherit the workspace connection. Without it, agents can't communicate via Agent Relay channels. ### 3. Create channels for coordination ```bash -mcporter call relaycast create_channel name=my-project topic="Project coordination" -mcporter call relaycast join_channel channel=my-project +mcporter call agent-relay create_channel name=my-project topic="Project coordination" +mcporter call agent-relay join_channel channel=my-project ``` ### 4. Spawn agents ```bash -agent-relay spawn architect claude --team my-team "Your task..." -agent-relay spawn developer claude --team my-team "Your task..." -agent-relay spawn tester claude --team my-team "Your task..." +agent-relay local agent spawn claude --name architect --channels my-project --task "Your task..." +agent-relay local agent spawn claude --name developer --channels my-project --task "Your task..." +agent-relay local agent spawn claude --name tester --channels my-project --task "Your task..." ``` +Agents that share a channel (`--channels my-project`) coordinate in it. 
+ ## Agent Communication -Spawned agents communicate through the broker's workspace connection. +Spawned agents are registered relay participants — they message through relay's MCP tools (see the **using-agent-relay** skill). ### From spawned agents (in their task prompt) ``` # Post to channel -agent-relay send '#channel-name' 'your message' +mcp__agent-relay__post_message(channel: "channel-name", text: "your message") -# DM another agent -agent-relay send agent-name 'your message' +# DM another agent +mcp__agent-relay__send_dm(to: "agent-name", text: "your message") # Check inbox -agent-relay inbox +mcp__agent-relay__check_inbox() ``` -### From orchestrator (via mcporter) +### From the orchestrator (via mcporter) ```bash -mcporter call relaycast post_message channel=my-project text="Status update" -mcporter call relaycast get_messages channel=my-project limit=20 -mcporter call relaycast send_dm to=architect text="Review the design" +mcporter call agent-relay post_message channel=my-project text="Status update" +mcporter call agent-relay list_messages channel=my-project limit=20 +mcporter call agent-relay send_dm to=architect text="Review the design" ``` ## Agent Types @@ -105,9 +112,9 @@ Include communication instructions in every agent's task: You are ROLE on the TEAM team. ## Communication -Post updates to #channel: agent-relay send '#channel' 'your message' -Check for messages: agent-relay inbox -DM a teammate: agent-relay send teammate-name 'message' +Post updates to a channel: mcp__agent-relay__post_message(channel: "channel", text: "your message") +Check for messages: mcp__agent-relay__check_inbox() +DM a teammate: mcp__agent-relay__send_dm(to: "teammate-name", text: "message") ## Your Team - agent-a (role) — does X @@ -115,42 +122,42 @@ DM a teammate: agent-relay send teammate-name 'message' ## Tasks 1. ... -2. Post progress to #channel +2. Post progress to the channel 3. 
When done: openclaw system event --text 'Done: description' --mode now ``` ## Monitoring ```bash -# Check all agents -agent-relay agents +# List all agents (JSON: pid, status, uptime) +agent-relay local agent list -# Tail an agent's output -agent-relay agents:logs NAME -n 500 +# Tail an agent's raw output (debug only) +agent-relay local tail --agent NAME -# Check channel conversation -mcporter call relaycast get_messages channel=my-project limit=20 +# Check channel conversation (via relay) +mcporter call agent-relay list_messages channel=my-project limit=20 -# Check who's online -mcporter call relaycast list_agents status=online +# Check who's online (via relay) +mcporter call agent-relay list_agents status=online ``` ## Lifecycle Management ```bash -# Kill a stuck agent -agent-relay agents:kill NAME +# Release a stuck agent (graceful stop) +agent-relay local agent release NAME -# Kill all agents in a team -agent-relay agents | grep TEAM | awk '{print $1}' | xargs -I{} agent-relay agents:kill {} +# Release several by name +for a in architect developer tester; do agent-relay local agent release "$a"; done # Stop everything -agent-relay down +agent-relay local down ``` ## Rate Limiting -- Add 15s gaps between sequential spawns to avoid Relaycast 429 errors +- Add 15s gaps between sequential spawns to avoid Agent Relay 429 errors - Use unique agent names per run (append UUID suffix) to avoid 409 conflicts - The SDK uses `registerOrRotate` pattern: on 409, rotates the agent token @@ -179,7 +186,7 @@ All agents join same channel, post updates, read each other's work on a shared g | Agents can't message | Broker must have `--workspace-key` | | Droid stuck at approval | Don't use `--cwd` with droid agents | | Agent name conflict (409) | Use unique names or let SDK `registerOrRotate` handle it | -| Channel not found | Create it first via `mcporter call relaycast create_channel` | -| Agent idle but no output | Check `agent-relay agents:logs NAME` for errors | +| Channel not 
found | Create it first via `mcporter call agent-relay create_channel` | +| Agent idle but no output | Check `agent-relay local tail --agent NAME` for errors | | npx setup fails in spawned agent | Agents inherit broker's workspace — no setup needed | -| `agent-relay send` fails for DM | Spawned agents can broadcast to channels but DMs may not work for non-Relaycast-registered agents | +| DM fails for an agent | DMs require a registered identity; broadcast to a channel with `post_message` if the recipient isn't registered | diff --git a/skills/orchestrating-agent-relay/SKILL.md b/skills/orchestrating-agent-relay/SKILL.md index 8be7191..36f5ab8 100644 --- a/skills/orchestrating-agent-relay/SKILL.md +++ b/skills/orchestrating-agent-relay/SKILL.md @@ -1,6 +1,6 @@ --- name: orchestrating-agent-relay -description: The canonical way to run agent-relay - self-bootstrap the broker and autonomously spawn, monitor, and coordinate a team of worker agents without human intervention. Covers infrastructure startup, agent spawning, lifecycle monitoring, CLI-first reading, and team coordination. +description: The canonical way to run agent-relay - self-bootstrap the broker and autonomously spawn, monitor, and coordinate a team of worker agents without human intervention. Covers infrastructure startup, agent spawning, lifecycle monitoring, relay-mediated messaging, and team coordination. --- # Orchestrating Agent Relay @@ -11,41 +11,60 @@ Self-bootstrap agent-relay infrastructure and manage a team of agents autonomous A headless orchestrator is an agent that: -1. Starts the relay infrastructure itself (`agent-relay up`) -2. Spawns and manages worker agents -3. Monitors agent lifecycle events +1. Starts the relay infrastructure itself (`agent-relay local up`) +2. Spawns and manages worker agents (`agent-relay local agent …`) +3. Monitors agent liveness via the broker (`agent-relay local agent list`) and reads worker replies through relay (`agent-relay message inbox check`) 4. 
Coordinates work without human intervention -The orchestrator drives the team **from outside** and is **not** a -registered relay agent, so it reads/sends/lists via the `agent-relay` CLI -(MCP `mcp__relaycast__message_*` tools require a registered identity). The -workers it spawns _are_ registered participants — their peer-messaging -reference is the **`using-agent-relay`** skill. +The CLI has two surfaces, and the split is the thing to memorize: + +- **`agent-relay local …`** — **lifecycle only**: start/stop the local broker + and spawn/release/list the agents it runs. No token required; it talks to the + local broker via `.agentworkforce/relay/connection.json`. **Never use it to + read or send messages.** +- **`agent-relay message … / channel … / agent …`** — **all messaging goes + through relay** (the Agent Relay service at `gateway.relaycast.dev`). These are + **token-gated** (`--token` / `RELAY_AGENT_TOKEN`). Register once for an agent + token (see Step 3), then send and read every coordination message here — or + use the equivalent relay MCP tools (`mcp__agent-relay__*`). + +**Always go through relay for messaging — never contact the broker directly to +read worker output.** Worker ACKs, replies, and DONE signals arrive as relay +messages: read them with `agent-relay message inbox check` / +`message dm list `, not by tailing the broker. (`local tail` is +a low-level broker/TTY debugging aid only.) + +The orchestrator drives the team **from outside** but is itself a registered +relay agent — that is what lets it message through relay. The workers it spawns +are registered participants too; their peer-messaging reference is the +**`using-agent-relay`** skill. 
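The two-surface split described above can be expressed as a small classifier, which may help when linting task prompts or scripts for messaging attempts through `local`. This is a sketch under assumptions: `classifySurface`, `Surface`, and `RELAY_GROUPS` are hypothetical names, and the group list is only the one given in the added text (`local` for lifecycle; `message`, `channel`, `agent` for relay).

```typescript
// Hypothetical classifier for illustration; not part of the agent-relay CLI.
type Surface = "lifecycle" | "relay" | "unknown";

// Command groups the text above describes as going through relay.
const RELAY_GROUPS: string[] = ["message", "channel", "agent"];

// Decides which CLI surface a command line belongs to.
// Only the first command group matters: `local agent spawn` is lifecycle
// even though the word "agent" appears later in the command.
function classifySurface(commandArgv: string[]): Surface {
  // Skip the binary name when it is included.
  const group = commandArgv[0] === "agent-relay" ? commandArgv[1] : commandArgv[0];
  if (group === "local") {
    return "lifecycle";
  }
  if (group !== undefined && RELAY_GROUPS.indexOf(group) !== -1) {
    return "relay";
  }
  return "unknown";
}
```

For example, `classifySurface(["agent-relay", "local", "tail", "--agent", "Worker1"])` yields `"lifecycle"`, which is a hint that the command should not be used to read worker replies.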
 ## When to Use
 
 - Agent needs full control over its worker team
-- No human available to run `agent-relay up` manually
+- No human available to run `agent-relay local up` manually
 - Agent should manage agent lifecycle autonomously
 - Building self-contained multi-agent systems
 
 ## Quick Reference
 
-| Step | Command/Tool |
-| ---------------------------------- | ------------------------------------------------------- |
-| Verify installation | `command -v agent-relay` or `npx agent-relay --version` |
-| Verify Node runtime if shim fails | `node --version` or fix mise/asdf first |
-| Start infrastructure | `agent-relay up --no-dashboard --verbose` |
-| Check status | `agent-relay status --wait-for=10` |
-| Spawn worker | `agent-relay spawn Worker1 claude "task"` |
-| List workers | `agent-relay who` |
-| View worker logs | `agent-relay agents:logs Worker1` |
-| Send DM to worker | `agent-relay send Worker1 "message"` |
-| Post to channel | `agent-relay send '#general' "message"` |
-| Read worker DM replies (full text) | `agent-relay replies Worker1` (add `--json` to parse) |
-| Read full DM conversation history | `agent-relay history --to Worker1` |
-| Release worker | `agent-relay release Worker1` |
-| Stop infrastructure | `agent-relay down` |
+| Step | Command/Tool |
+| --------------------------------- | ------------------------------------------------------------- |
+| Verify installation | `command -v agent-relay` or `npx agent-relay --version` |
+| Verify Node runtime if shim fails | `node --version` or fix mise/asdf first |
+| Start infrastructure | `agent-relay local up --no-dashboard --verbose` |
+| Check broker readiness | `agent-relay local status --wait-for=10` |
+| Spawn worker | `agent-relay local agent spawn claude --name Worker1 --task "…"` |
+| List workers | `agent-relay local agent list` |
+| Resource usage | `agent-relay local metrics` |
+| Register for a messaging token | `agent-relay agent register Lead` (prints token; then `export 
RELAY_AGENT_TOKEN=`) | +| DM a worker (via relay) | `agent-relay message dm send Worker1 "…"` | +| Post to a channel (via relay) | `agent-relay message post general "…"` | +| Read a worker's replies (via relay) | `agent-relay message dm list ` | +| Check inbox (via relay) | `agent-relay message inbox check` | +| Debug raw worker output (not messaging) | `agent-relay local tail --agent Worker1` | +| Release worker | `agent-relay local agent release Worker1` | +| Stop infrastructure | `agent-relay local down` | ## Bootstrap Flow @@ -69,330 +88,299 @@ npx agent-relay --version ### Step 1: Start Infrastructure ```bash -# Starts a detached broker in headless mode and returns after API readiness -agent-relay up --no-dashboard --verbose +# Start the local broker in headless mode +agent-relay local up --no-dashboard --verbose ``` Verify broker readiness before spawning any workers: ```bash -# Must show "RUNNING" before you spawn workers -agent-relay status --wait-for=10 +# Polls until the broker reports RUNNING (or times out after 10s) +agent-relay local status --wait-for=10 ``` +The broker: + +- Provisions an Agent Relay workspace when none is configured +- Removes `CLAUDECODE` env var when spawning (fixes nested session error) +- Persists state to `.agentworkforce/relay/` (connection files, etc.) + When verifying from a source checkout or throwaway git worktree, run these commands from the project/worktree root. The CLI writes runtime state to -`.agent-relay/` and may create `.mcp.json`; clean those files after validation -if the worktree should remain clean. +`.agentworkforce/relay/` and may create `.mcp.json`; clean those files after +validation if the worktree should remain clean. Pass `--state-dir ` to +relocate broker state. 
-The broker: +### Step 2: Spawn Workers -- Auto-creates a Relaycast workspace if `RELAY_API_KEY` not set -- Removes `CLAUDECODE` env var when spawning (fixes nested session error) -- Persists state to `.agent-relay/` +```bash +# provider is positional; --name defaults to the provider; --channels defaults to "general" +agent-relay local agent spawn claude --name Worker1 --task "Implement the authentication module following the existing patterns" +``` -### Step 2: Spawn Workers via MCP +MCP equivalent (works once the orchestrator is registered — see Step 3): ```text -mcp__relaycast__agent_add( +mcp__agent-relay__add_agent( name: "Worker1", cli: "claude", task: "Implement the authentication module following the existing patterns" ) ``` -CLI equivalent: +### Step 3: Register, then Coordinate Through Relay + +Register once for an agent token so every message — sent or read — goes through +relay: ```bash -agent-relay spawn Worker1 claude "Implement the authentication module following the existing patterns" +# Prints a registration JSON that includes the agent token +agent-relay agent register Lead +# Copy the "token" value from the output: +export RELAY_AGENT_TOKEN= ``` -> **Expect a 30–60s gap between spawn and the first ACK.** A worker shows -> `online` in `who --json` within ~5s (the process is up), but the underlying -> CLI (claude/codex) is still cold-starting and won't send its ACK DM until it -> finishes booting — typically 30–45s, occasionally longer, after `online`. -> `online` means "process alive," **not** "agent responsive." Don't treat -> ACK silence in the first minute as a stuck worker; size ACK-wait loops for -> at least 60s (e.g. a 30-iteration poll) before escalating to troubleshooting. 
-
-### Step 3: Monitor and Coordinate
+Now do **all** coordination through the `message` group (or the equivalent
+`mcp__agent-relay__*` tools):
 ```bash
-# Read Worker1's DM replies (chronological, full text, untruncated)
-agent-relay replies Worker1
-
-# Machine-readable: full text + direction, safe to parse in a loop
-agent-relay replies Worker1 --json
+agent-relay message dm send Worker1 "Also add unit tests" # targeted DM
+agent-relay message post general "All workers: wrap up" # channel broadcast (bare name, no #)
+agent-relay message dm list # read a worker's replies
+agent-relay message inbox check # unread across conversations
+```
-# Send a targeted DM to a specific worker
-agent-relay send Worker1 "Also add unit tests"
+Use `--json` when sending or checking the inbox if you need to script follow-up
+reads: the DM send response includes the conversation ID, and inbox entries can
+also carry the conversation ID for the thread to pass to `message dm list`.
-# Broadcast to all agents on a channel
-agent-relay send '#general' "All workers: wrap up current task"
+Track which workers are alive with the lifecycle command (not a messaging
+channel):
-# List active workers (structured status for polling)
-agent-relay who --json
+```bash
+agent-relay local agent list # pid, status, uptime — JSON, ideal for polling
 ```
-> **The spawning orchestrator is not a registered relaycast agent.**
-> The `mcp__relaycast__message_*` / `agent_list` MCP tools require a
-> registered identity and fail for you with the error
-> `Not registered. Call agent.register first.`
-> Use the `agent-relay` CLI for all reading, sending, and listing, and add
-> `--json` to any read command (`replies`, `history`, `who`) when you need
-> full, untruncated, parseable output.
+> **Read worker replies through relay, never from the broker.** ACKs, replies,
+> and DONE signals are relay messages — read them with `message inbox check` /
+> `message dm list`.
Do not use `local tail` to "read" worker responses; it
+> streams the broker's raw TTY output and is only a low-level debugging aid.
+>
+> **Messaging requires a registered agent identity.** The `message`, `channel`,
+> and `dm` groups (and the `mcp__agent-relay__*` tools) reject unregistered
+> callers with `Not registered. Call register_agent first.` Run
+> `agent-relay agent register ` and set `RELAY_AGENT_TOKEN` (or pass
+> `--token ` per call).
 ### Step 4: Release Workers
-```text
-mcp__relaycast__agent_remove(name: "Worker1")
+```bash
+agent-relay local agent release Worker1
+# MCP: mcp__agent-relay__remove_agent(name: "Worker1")
 ```
 ### Step 5: Shutdown (optional)
 ```bash
-agent-relay down
+agent-relay local down
 ```
 ## CLI Commands for Orchestration
-**Use the `agent-relay` CLI extensively for monitoring and managing workers.** The CLI provides essential visibility into agent activity.
+Two namespaces — keep the split straight.
-### Channel vs DM — When to Use Each
+### Local broker & agents — lifecycle only (no token)
-**DM** — targeted, private, for responses you need to read back:
+Use these to start/stop the broker and manage the agent processes. **Not for
+messaging** — never read or send messages here.
-- `agent-relay send Worker1 "message"` — sends a DM to Worker1
-- `mcp__relaycast__message_dm_send(to: "Worker1", text: "...")` — same via MCP
-- Worker replies arrive as DMs back to the sender
+```bash
+agent-relay local up [--no-dashboard] [--verbose] [--no-spawn] [--background] [--state-dir ]
+agent-relay local down [--force] [--all]
+agent-relay local status [--wait-for ] # broker readiness
+agent-relay local metrics [--agent ] # resource usage
+agent-relay local agent list # running agents (JSON)
+agent-relay local agent spawn --name --task "" [--channels ] [--model ]
+agent-relay local agent new … # spawn + attach to its TUI
+agent-relay local agent release # graceful stop
+agent-relay local agent set-model # switch a running agent's model
+agent-relay local agent attach --mode view|drive|passthrough
+agent-relay local tail [--agent ] # raw broker/TTY output — DEBUG ONLY, not message reading
+```
-**Channel post** — broadcast, visible to all agents on that channel:
+### Messaging & registry — always through relay (token-gated)
-- `agent-relay send '#general' "message"` — posts to #general (`#` prefix required)
-- `mcp__relaycast__message_post(channel: "general", text: "...")` — same via MCP
-- Use for coordination messages, status updates, announcements
+Every coordination message goes through relay here. All accept `--token `
+(or `RELAY_AGENT_TOKEN`), `--workspace-key`, and `--base-url`.
-**`agent-relay replies ` is the canonical command for reading worker
-DM replies** — it returns full text, sender-attributed, in chronological
-order, with no truncation. Add `--json` for machine-readable output.
-
-`inbox --agent ` is legacy unread-only behavior; once read, entries
-disappear. Prefer `replies` for a persistent, complete view.
-
-#### `replies --json` schema (read this before writing a monitor)
-
-Verified against the agent-relay CLI source (`replies` command).
When there
-**is** a conversation, `--json` prints a JSON array of message objects:
-
-```json
-[
-  {
-    "id": "01J...",
-    "from": "Implementer",
-    "to": "orchestrator",
-    "text": "ACK — starting on the auth module",
-    "createdAt": "2026-05-19T14:02:11.000Z",
-    "direction": "inbound"
-  }
-]
+```bash
+agent-relay agent register # print an agent token, then export RELAY_AGENT_TOKEN
+agent-relay agent list [--status ] # workspace agent registry
+
+agent-relay message post # channel broadcast (bare channel name)
+agent-relay message list [--limit ] # channel history
+agent-relay message dm send # DM a worker
+agent-relay message dm list [--limit ] # read a DM thread
+agent-relay message dm send_group --to --to # group DM
+agent-relay message reply # threaded reply
+agent-relay message get_thread # full thread
+agent-relay message search [--channel ] [--from ] [--limit ]
+agent-relay message inbox check [--limit ] # unread messages
+agent-relay message inbox mark_read
+agent-relay message reaction add|remove
+
+agent-relay channel create|list|join|leave|invite|set_topic|archive …
 ```
-`unread` (boolean) and/or `unread_state: "unknown"` may also be present
-depending on read-state availability. Footguns that will silently break a
-naive monitor:
-
-- **The timestamp field is `createdAt`, not `ts`/`timestamp`.** It is an
-  ISO-8601 string.
-- **In `replies --json`, `direction` is always the literal `"inbound"`** — it
-  is hard-coded, because `replies` only ever returns messages _from_ the
-  named agent. It is never `"incoming"`, `"from"`, `"in"`, nor `"outbound"`.
-  Filtering on `direction == "inbound"` is harmless but redundant; filtering
-  on any other literal yields a monitor that runs forever and never sees the
-  ACK or DONE. (`"outbound"` only appears in `history --to --json`,
-  which includes messages you sent — see below.)
-- **The empty state is a plain string, not `[]`.** When there is _no
-  conversation at all_, the command prints the literal line
-  `No DM conversation with .` (exit 0) — not JSON. (If a conversation
-  exists but no messages match the filters, `--json` does emit a valid `[]`.)
-  Piping the no-conversation case straight into `jq` errors out. Guard for it:
-
-  ```bash
-  out=$(agent-relay replies Implementer --json)
-  case "$out" in
-    "No DM conversation with"*|"") echo "no replies yet" ;;
-    *) echo "$out" | jq -r '.[] | "\(.createdAt) \(.direction) \(.text)"' ;;
-  esac
-  ```
-
-- **Build monitors defensively: emit-all, then eyeball.** Print every entry
-  with its `direction` and `createdAt` rather than hard-filtering inside
-  `jq`. A monitor that shows everything beats one that silently drops the
-  message you were waiting for because an assumption about the schema was
-  wrong.
-
-`history --to --json` uses the same object shape (`id`, `from`, `to`,
-`text`, `createdAt`, `direction`) but `direction` is computed:
-`"outbound"` for messages you (the reader identity) sent, `"inbound"` for the
-agent's. Use it when you need both sides of the thread, not just the agent's
-replies.
+### Channel vs DM — When to Use Each
-```bash
-# WRONG — history (no flags) will not show DM replies from workers
-agent-relay history
+**DM** — targeted, private, for responses you need to read back:
-# RIGHT — read a worker's DM replies (full text, chronological)
-agent-relay replies Worker1
+- `agent-relay message dm send Worker1 "message"` — sends a DM to Worker1
+- `mcp__agent-relay__send_dm(to: "Worker1", text: "...")` — same via MCP
+- Read a worker's thread with `agent-relay message dm list `
-# Machine-readable: full text + direction, safe to parse in a loop
-agent-relay replies Worker1 --json
+**Channel post** — broadcast, visible to all agents on that channel:
-# Full DM conversation history with a worker (read + unread)
-agent-relay history --to Worker1
+- `agent-relay message post general "message"` — posts to the `general` channel
+  (bare name — no `#` prefix in the new `message post` command)
+- `mcp__agent-relay__post_message(channel: "general", text: "...")` — same via MCP
+- Use for coordination messages, status updates, announcements
-# Channel evidence (diffs, grep counts, GO/NO-GO) — full text,
-# untruncated, chronological; add --json to parse it programmatically
-agent-relay history --to '#general' --json
-```
+### Monitoring Workers (Essential)
-(Reading via MCP `message_*` tools fails for you — see the "not a registered
-relaycast agent" callout under Bootstrap Step 3.)
+Read worker progress and replies **through relay**; use the broker only for
+liveness/health.
-### Monitoring Workers (Essential)
+```bash
+# Worker replies, ACKs, DONE signals — read these through relay
+agent-relay message inbox check # unread across conversations
+agent-relay message dm list # a specific worker's thread
+
+# Liveness/health only (lifecycle, not messaging)
+agent-relay local agent list # running agents (pid, status, uptime)
+agent-relay local metrics # resource usage
-Spawn/send/release commands are in the Quick Reference and Bootstrap Step 3 —
-not repeated here. For monitoring specifically: poll `agent-relay who --json`
-for structured liveness (pid, uptimeSecs, status) instead of scraping the
-worker TTY, and use `agent-relay agents:logs ` to watch real-time output
-when debugging.
-
-> **Harness note: don't poll with a bare foreground `sleep`.** Many harnesses
-> (Claude Code included) block a foreground `sleep` used to wait for ACK/DONE
-> — e.g. `sleep 25; agent-relay replies ...` is rejected with a directive to
-> use a backgrounded loop or a Monitor/until-loop instead. The inline
-> `sleep`-based snippets shown elsewhere in this skill are illustrative of the
-> *logic*; in a harnessed environment, run the wait loop with
-> `run_in_background` (or the harness's Monitor + until-loop), polling
-> `agent-relay replies --json` and `agent-relay who --json` from inside
-> the backgrounded loop rather than blocking the foreground on `sleep`.
+# Last resort: raw broker/TTY output for debugging a wedged worker.
+# This is NOT how you read a worker's messages.
+agent-relay local tail --agent Worker1
+```
 ### Troubleshooting
 ```bash
-# Kill unresponsive worker
-agent-relay agents:kill Worker1
+# Gracefully stop an unresponsive worker
+agent-relay local agent release Worker1
+
+# Reset the broker if it is wedged
+agent-relay local down --force
 # Re-check broker status
-agent-relay status
+agent-relay local status
-# If a worker looks stuck, inspect its logs first
-agent-relay agents:logs Worker1
+# If a worker looks stuck, inspect its output first
+agent-relay local tail --agent Worker1
 ```
-**Tip:** Run `agent-relay agents:logs ` frequently to monitor worker progress and catch errors early.
+**Tip:** Read worker progress through relay (`agent-relay message inbox check`)
+and poll `agent-relay local agent list` for liveness. Reach for
+`agent-relay local tail` only to debug a wedged worker's raw output.
 ## Orchestrator Instructions Template
-Give your lead agent these instructions. The bootstrap/spawn/monitor commands
-are in the Bootstrap Flow and Quick Reference above — the paste-worthy part is
-the **Protocol**, the ruleset a lead agent can't infer from the command list:
+Give your lead agent these instructions:
 ```text
-You are an autonomous orchestrator. Bootstrap the relay infrastructure
-(Bootstrap Flow Steps 0–2), then spawn and manage workers per the
-Quick Reference. Then enforce this protocol:
+You are an autonomous orchestrator. Bootstrap the relay infrastructure and manage a team of workers.
-## Protocol
-- Workers will ACK when they receive tasks — but expect a 30–60s cold-start
-  gap after spawn: `who --json` shows `online` (~5s) well before the CLI is
-  booted enough to send its first ACK.
Don't troubleshoot a "stuck" fresh
-  worker until at least 60s has passed
-- Workers will send DONE when complete
-- In a harnessed environment, never wait with a bare foreground `sleep`
-  (it is blocked) — run ACK/DONE poll loops with run_in_background or a
-  Monitor/until-loop, polling `replies --json` and `who --json` from inside it
-- **ACK/DONE target: `orchestrator` (the auto-registered spawning identity) or
-  the `#general` channel — NEVER `broker`.** `broker` is the broker's internal
-  routing self-name, not a spawnable/DM-able agent: a worker DM to `broker` (and
-  `agent-relay send broker`) fails with `Agent "broker" not found`. Write the
-  worker task prompt to DM `orchestrator` (or post `#general`) — never "DM the
-  broker"
-- Tell every worker explicitly: do NOT self-remove/release after DONE — stay
-  alive and idle so you can DM them review findings to fix
-- After DONE, run a reviewer; on NO-GO, DM the findings back to the SAME
-  worker. If the worker is gone, spawn a fresh one and re-inject branch +
-  commit SHA + the full verdict
-- Parse `replies --json` defensively: `direction` is always `"inbound"`,
-  timestamp is `createdAt` (not `ts`), and the no-conversation state is a
-  plain string, not `[]`
-- Poll `agent-relay who --json` for worker liveness; set a wall-clock fallback
-  so a silently-dead worker can't hang the loop
-- Read worker DM replies with `agent-relay replies ` (`--json` to parse);
-  plain `agent-relay history` shows channel posts only, never DM replies.
See
-  the "Channel vs DM" section for the full reading model
-```
+## Step 1: Verify Installation
+Run: command -v agent-relay || npx agent-relay --version
+If you hit a mise/asdf shim error: verify Node first with `node --version`, then fix the runtime manager
+If not found: npm install -g agent-relay
-## Multi-Round Review Loops (DONE → NO-GO → fix → re-review)
+## Step 2: Start Infrastructure
+Run: agent-relay local up --no-dashboard --verbose
+Verify: agent-relay local status --wait-for=10 (should report RUNNING)
-Spawning, monitoring, and releasing a worker is the easy path. The hard part
-the basic flow does **not** cover: a worker reports DONE, a reviewer comes
-back NO-GO, and now the work has to go back. Plan for this topology before you
-spawn anything.
+## Step 3: Manage Your Team
-### Workers must not self-remove until you tell them
+Spawn workers (provider is positional, --name/--task are flags):
+  agent-relay local agent spawn claude --name Worker1 --task "Task description"
-A worker's natural hygiene instinct is to call `agent.remove` on itself right
-after reporting DONE. That **kills the review→fix→re-review loop**: when the
-reviewer returns NO-GO there is no agent left to send the findings to, so you
-are forced to spawn a fresh worker and re-inject the entire context (branch,
-commit, full verdict) instead of just DMing the existing one.
+Register once so all messaging goes through relay:
+  agent-relay agent register Lead # prints a token
+  export RELAY_AGENT_TOKEN=
-**Put this in every implementer/worker task prompt explicitly:**
+Coordinate ENTIRELY through relay (send and read every message here):
+  agent-relay message dm send Worker1 "Additional instructions" # targeted DM
+  agent-relay message post general "All workers: prioritize auth" # broadcast
+  agent-relay message dm list # read a worker's replies
+  agent-relay message inbox check # unread across conversations
-```text
-Do NOT call agent.remove / agent-relay release on yourself.
Report DONE and
-stay alive and idle. The orchestrator will send you review findings to fix,
-or release you when the work is fully accepted. Self-removing before then
-breaks the fix loop.
+Check liveness only (lifecycle, not messaging):
+  agent-relay local agent list # running workers + status
+
+Release when done:
+  agent-relay local agent release Worker1
+
+## Protocol
+- Workers ACK when they receive tasks and send DONE when complete — both arrive as relay messages
+- Read replies through relay: `agent-relay message inbox check` / `message dm list ` (requires RELAY_AGENT_TOKEN)
+- NEVER read worker responses with `agent-relay local tail` — that is broker-direct raw output, not relay messaging (use it only to debug a wedged worker)
+- Poll `agent-relay local agent list` for liveness; do all messaging through the `message`/`channel` groups
 ```
-The "release when done" guidance elsewhere in this skill applies to the
-**orchestrator** releasing workers — never to a worker releasing itself
-mid-loop.
+## Multi-Round Review Loops
+
+The first DONE is not the end of a serious workflow. Plan for review, fixes,
+and re-review before you spawn implementers.
-### The respawn-with-full-context fallback
+### Keep workers alive for fixes
-If a worker did self-remove (or died), you cannot just DM it. Spawn a fresh
-worker and re-inject everything it needs to act with no prior memory:
+Tell every implementer not to release itself after DONE:
+
+```text
+Do NOT call remove_agent or release yourself after DONE. Report DONE and stay
+alive and idle. The orchestrator will send review findings to fix, or release
+you when the work is fully accepted.
+```
+
+If a worker exits before review finishes, you cannot DM it. Spawn a replacement
+and give it the full continuation context:
 ```bash
-agent-relay spawn Implementer2 codex "Continuation of prior work. \
-Branch: feature/auth. Last commit: . \
-The reviewer returned NO-GO with these findings: .
\
-Check out the branch, address every finding, re-run tests, report DONE. \
-Do NOT self-remove — stay alive for re-review."
+agent-relay local agent spawn codex --name Implementer2 --task "Continuation of prior work.
+Branch: feature/auth. Last commit: .
+Reviewer returned NO-GO:
+
+Address every finding, rerun tests, report DONE, and stay alive for re-review."
 ```
-Always pass branch + commit SHA + the **complete** reviewer verdict. A fresh
-worker has none of the loop's history; a summarized verdict loses the
-specifics it needs to fix.
+Always include the branch, commit SHA, and complete reviewer verdict. A fresh
+worker has none of the previous round's memory, and a summary often loses the
+specific failing assertion or file path it needs.
-### Detecting a silently-dead worker
+### Monitor both messages and liveness
-Monitors fire on **DMs only**. A worker that exits or self-removes produces no
-DM, so the monitor just goes quiet — indistinguishable from a worker still
-thinking. Defenses:
+Review loops can hang if you wait only for a relay message from a worker that
+has died. Poll both surfaces:
-- Poll `agent-relay who --json` for liveness instead of inferring it from DM
-  silence. A worker that vanishes from `who` is gone.
-- `agent-relay agents:logs ` will show a self-issued `agent.remove` /
-  release call — but it is noisy TTY scraping, a last resort, not a signal.
-- Always set a wall-clock fallback (e.g. a ScheduleWakeup ~30 min out) so a
-  silently-dead worker can't hang the loop forever waiting on a DM that will
-  never arrive.
+- Relay messages for ACK, DONE, and review responses:
+  `agent-relay message inbox check` and `agent-relay message dm list `
+- Local lifecycle for process health:
+  `agent-relay local agent list`
+- Raw output only when debugging:
+  `agent-relay local tail --agent `
+
+Set a wall-clock fallback for long loops so silence cannot block the
+orchestrator forever.
In harnesses that reject a bare foreground `sleep`, run
+the polling loop in the background or use the harness's monitor/until-loop
+primitive.
 ## Lifecycle Events
-The broker emits these events (available via SDK subscriptions):
+The broker emits these events (available via SDK subscriptions and
+`agent-relay local tail`):
 | Event | When |
 | ------------------------ | --------------------------- |
@@ -404,33 +392,24 @@ The broker emits these events (available via SDK subscriptions):
 ## Common Mistakes
-| Mistake | Fix |
-| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `agent-relay: command not found` or mise/asdf shim error | Ensure Node is available first (`node --version`); if a shim is broken, fix the runtime manager, then install/use `agent-relay` |
-| "Nested session" error | Broker handles this automatically; if running manually, unset `CLAUDECODE` env var |
-| Broker not starting | Try `agent-relay down` first, then `agent-relay up --no-dashboard --verbose` and `agent-relay status --wait-for=10` |
-| Broker shows STARTING after `status --wait-for` | The process is alive but the broker API is not ready; inspect logs, retry readiness, or restart with `agent-relay down --force` if it remains stuck |
-| Broker shows STOPPED immediately after start | Check `ps aux \| grep agent-relay-broker` and `.agent-relay/connection.json`; if the process is alive but status is STOPPED, rerun status from the project root or pass `--state-dir` |
-| Half-started broker: process alive but `status` says STOPPED and `Failed to read broker connection metadata` | `up` spawned a broker that never finished writing connection metadata (readiness timed out) and was not cleaned up. Do NOT just retry `up` — it won't reap the orphan.
`pkill -f agent-relay-broker` (or `agent-relay down --force`), delete `.agent-relay/`, then `agent-relay up` clean and `agent-relay status --wait-for=30`. `agent-relay doctor` flags this orphaned/half-started state |
-| Worktree verification leaves git status dirty | Run `agent-relay down --force`, then remove generated `.agent-relay/` and `.mcp.json` from throwaway validation worktrees before committing |
-| Spawn fails with `internal reply dropped` | Broker likely is not fully ready yet; wait for readiness, then spawn one worker first |
-| Workers not connecting | Ensure broker started; check `agent-relay who` and worker logs |
-| Not monitoring workers | Use `agent-relay agents:logs ` frequently to track progress |
-| Workers seem stuck | Check logs with `agent-relay agents:logs ` for errors |
-| Messages not delivered | Check `agent-relay history --to '#general' --json` for channel messages; use `agent-relay replies --json` for DMs |
-| Worker replies not showing in history | Expected — plain `history` only shows channel posts. Use `agent-relay replies ` (full text, chronological) or `agent-relay history --to ` (full thread) to read DM replies |
-| Need to see unread DM content | `inbox_check` / `inbox --agent` only return counts or clear on read, and the MCP `message_dm_list` tool requires a registered identity you don't have. Use `agent-relay replies --json` |
-| Re-reading already-read replies | `agent-relay replies ` is a persistent view (not unread-only); use `--since