From 7fc3d4e18f71a2d88d6fd0d715f5c0cb42fae552 Mon Sep 17 00:00:00 2001
From: Will Washburn
Date: Wed, 3 Jun 2026 06:41:33 -0400
Subject: [PATCH 1/4] orchestrating-agent-relay: update for v8 CLI surface
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rewrite the skill for relay main: broker/agent verbs under
`agent-relay local …`, all messaging through relay (the `message`/
`channel` groups, never the broker directly), and `mcp__agent-relay__*`
tool names. Bump to 2.2.0.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 prpm.json                                 |   2 +-
 skills/orchestrating-agent-relay/SKILL.md | 514 ++++++++++------------
 2 files changed, 224 insertions(+), 292 deletions(-)

diff --git a/prpm.json b/prpm.json
index 503e96f..4e81996 100644
--- a/prpm.json
+++ b/prpm.json
@@ -89,7 +89,7 @@
  },
  {
  "name": "orchestrating-agent-relay",
- "version": "2.1.2",
+ "version": "2.2.0",
  "description": "The canonical way to run agent-relay - self-bootstrap the broker and autonomously spawn, monitor, and coordinate a team of worker agents without human intervention. Covers infrastructure startup, agent spawning, lifecycle monitoring, CLI-first reading, and team coordination.",
  "format": "claude",
  "subtype": "skill",
diff --git a/skills/orchestrating-agent-relay/SKILL.md b/skills/orchestrating-agent-relay/SKILL.md
index 8be7191..af1fc0e 100644
--- a/skills/orchestrating-agent-relay/SKILL.md
+++ b/skills/orchestrating-agent-relay/SKILL.md
@@ -11,41 +11,60 @@ Self-bootstrap agent-relay infrastructure and manage a team of agents autonomous

 A headless orchestrator is an agent that:

-1. Starts the relay infrastructure itself (`agent-relay up`)
-2. Spawns and manages worker agents
-3. Monitors agent lifecycle events
+1. Starts the relay infrastructure itself (`agent-relay local up`)
+2. Spawns and manages worker agents (`agent-relay local agent …`)
+3. Monitors agent liveness via the broker (`agent-relay local agent list`) and reads worker replies through relay (`agent-relay message inbox check`)
 4. Coordinates work without human intervention

-The orchestrator drives the team **from outside** and is **not** a
-registered relay agent, so it reads/sends/lists via the `agent-relay` CLI
-(MCP `mcp__relaycast__message_*` tools require a registered identity). The
-workers it spawns _are_ registered participants — their peer-messaging
-reference is the **`using-agent-relay`** skill.
+The CLI has two surfaces, and the split is the thing to memorize:
+
+- **`agent-relay local …`** — **lifecycle only**: start/stop the local broker
+  and spawn/release/list the agents it runs. No token required; it talks to the
+  local broker via `.agentworkforce/relay/connection.json`. **Never use it to
+  read or send messages.**
+- **`agent-relay message … / channel … / agent …`** — **all messaging goes
+  through relay** (the Relaycast service at `gateway.relaycast.dev`). These are
+  **token-gated** (`--token` / `RELAY_AGENT_TOKEN`). Register once for an agent
+  token (see Step 3), then send and read every coordination message here — or
+  use the equivalent relay MCP tools (`mcp__agent-relay__*`).
+
+**Always go through relay for messaging — never contact the broker directly to
+read worker output.** Worker ACKs, replies, and DONE signals arrive as relay
+messages: read them with `agent-relay message inbox check` /
+`message dm list `, not by tailing the broker. (`local tail` is
+a low-level broker/TTY debugging aid only.)
+
+The orchestrator drives the team **from outside** but is itself a registered
+relay agent — that is what lets it message through relay. The workers it spawns
+are registered participants too; their peer-messaging reference is the
+**`using-agent-relay`** skill.

 ## When to Use

 - Agent needs full control over its worker team
-- No human available to run `agent-relay up` manually
+- No human available to run `agent-relay local up` manually
 - Agent should manage agent lifecycle autonomously
 - Building self-contained multi-agent systems

 ## Quick Reference

-| Step | Command/Tool |
-| ---------------------------------- | ------------------------------------------------------- |
-| Verify installation | `command -v agent-relay` or `npx agent-relay --version` |
-| Verify Node runtime if shim fails | `node --version` or fix mise/asdf first |
-| Start infrastructure | `agent-relay up --no-dashboard --verbose` |
-| Check status | `agent-relay status --wait-for=10` |
-| Spawn worker | `agent-relay spawn Worker1 claude "task"` |
-| List workers | `agent-relay who` |
-| View worker logs | `agent-relay agents:logs Worker1` |
-| Send DM to worker | `agent-relay send Worker1 "message"` |
-| Post to channel | `agent-relay send '#general' "message"` |
-| Read worker DM replies (full text) | `agent-relay replies Worker1` (add `--json` to parse) |
-| Read full DM conversation history | `agent-relay history --to Worker1` |
-| Release worker | `agent-relay release Worker1` |
-| Stop infrastructure | `agent-relay down` |
+| Step | Command/Tool |
+| --------------------------------- | ------------------------------------------------------------- |
+| Verify installation | `command -v agent-relay` or `npx agent-relay --version` |
+| Verify Node runtime if shim fails | `node --version` or fix mise/asdf first |
+| Start infrastructure | `agent-relay local up --no-dashboard --verbose` |
+| Check broker readiness | `agent-relay local status --wait-for=10` |
+| Spawn worker | `agent-relay local agent spawn claude --name Worker1 --task "…"` |
+| List workers | `agent-relay local agent list` |
+| Resource usage | `agent-relay local metrics` |
+| Register for a messaging token | `agent-relay agent register Lead` (sets up `RELAY_AGENT_TOKEN`) |
+| DM a worker (via relay) | `agent-relay message dm send Worker1 "…"` |
+| Post to a channel (via relay) | `agent-relay message post general "…"` |
+| Read a worker's replies (via relay) | `agent-relay message dm list ` |
+| Check inbox (via relay) | `agent-relay message inbox check` |
+| Debug raw worker output (not messaging) | `agent-relay local tail --agent Worker1` |
+| Release worker | `agent-relay local agent release Worker1` |
+| Stop infrastructure | `agent-relay local down` |

 ## Bootstrap Flow

@@ -69,330 +88,248 @@ npx agent-relay --version
 ### Step 1: Start Infrastructure

 ```bash
-# Starts a detached broker in headless mode and returns after API readiness
-agent-relay up --no-dashboard --verbose
+# Start the local broker in headless mode
+agent-relay local up --no-dashboard --verbose
 ```

 Verify broker readiness before spawning any workers:

 ```bash
-# Must show "RUNNING" before you spawn workers
-agent-relay status --wait-for=10
+# Polls until the broker reports RUNNING (or times out after 10s)
+agent-relay local status --wait-for=10
 ```

+The broker:
+
+- Provisions a Relaycast workspace when none is configured
+- Removes `CLAUDECODE` env var when spawning (fixes nested session error)
+- Persists state to `.agentworkforce/relay/` (connection files, etc.)
+
 When verifying from a source checkout or throwaway git worktree, run these
 commands from the project/worktree root. The CLI writes runtime state to
-`.agent-relay/` and may create `.mcp.json`; clean those files after validation
-if the worktree should remain clean.
+`.agentworkforce/relay/` and may create `.mcp.json`; clean those files after
+validation if the worktree should remain clean. Pass `--state-dir ` to
+relocate broker state.

-The broker:
+### Step 2: Spawn Workers

-- Auto-creates a Relaycast workspace if `RELAY_API_KEY` not set
-- Removes `CLAUDECODE` env var when spawning (fixes nested session error)
-- Persists state to `.agent-relay/`
+```bash
+# provider is positional; --name defaults to the provider; --channels defaults to "general"
+agent-relay local agent spawn claude --name Worker1 --task "Implement the authentication module following the existing patterns"
+```

-### Step 2: Spawn Workers via MCP
+MCP equivalent (works once the orchestrator is registered — see Step 3):

 ```text
-mcp__relaycast__agent_add(
+mcp__agent-relay__add_agent(
   name: "Worker1",
   cli: "claude",
   task: "Implement the authentication module following the existing patterns"
 )
 ```

-CLI equivalent:
+### Step 3: Register, then Coordinate Through Relay
+
+Register once for an agent token so every message — sent or read — goes through
+relay:

 ```bash
-agent-relay spawn Worker1 claude "Implement the authentication module following the existing patterns"
+# Prints a registration JSON that includes the agent token
+agent-relay agent register Lead
+# Copy the "token" value from the output:
+export RELAY_AGENT_TOKEN=
 ```

-> **Expect a 30–60s gap between spawn and the first ACK.** A worker shows
-> `online` in `who --json` within ~5s (the process is up), but the underlying
-> CLI (claude/codex) is still cold-starting and won't send its ACK DM until it
-> finishes booting — typically 30–45s, occasionally longer, after `online`.
-> `online` means "process alive," **not** "agent responsive." Don't treat
-> ACK silence in the first minute as a stuck worker; size ACK-wait loops for
-> at least 60s (e.g. a 30-iteration poll) before escalating to troubleshooting.
-
-### Step 3: Monitor and Coordinate
+Now do **all** coordination through the `message` group (or the equivalent
+`mcp__agent-relay__*` tools):

 ```bash
-# Read Worker1's DM replies (chronological, full text, untruncated)
-agent-relay replies Worker1
-
-# Machine-readable: full text + direction, safe to parse in a loop
-agent-relay replies Worker1 --json
-
-# Send a targeted DM to a specific worker
-agent-relay send Worker1 "Also add unit tests"
+agent-relay message dm send Worker1 "Also add unit tests" # targeted DM
+agent-relay message post general "All workers: wrap up" # channel broadcast (bare name, no #)
+agent-relay message dm list # read a worker's replies
+agent-relay message inbox check # unread across conversations
+```

-# Broadcast to all agents on a channel
-agent-relay send '#general' "All workers: wrap up current task"
+Track which workers are alive with the lifecycle command (not a messaging
+channel):

-# List active workers (structured status for polling)
-agent-relay who --json
+```bash
+agent-relay local agent list # pid, status, uptime — JSON, ideal for polling
 ```

-> **The spawning orchestrator is not a registered relaycast agent.**
-> The `mcp__relaycast__message_*` / `agent_list` MCP tools require a
-> registered identity and fail for you with the error
-> `Not registered. Call agent.register first.`
-> Use the `agent-relay` CLI for all reading, sending, and listing, and add
-> `--json` to any read command (`replies`, `history`, `who`) when you need
-> full, untruncated, parseable output.
+> **Read worker replies through relay, never from the broker.** ACKs, replies,
+> and DONE signals are relay messages — read them with `message inbox check` /
+> `message dm list`. Do not use `local tail` to "read" worker responses; it
+> streams the broker's raw TTY output and is only a low-level debugging aid.
+>
+> **Messaging requires a registered agent identity.** The `message`, `channel`,
+> and `dm` groups (and the `mcp__agent-relay__*` tools) reject unregistered
+> callers with `Not registered. Call agent.register first.` Run
+> `agent-relay agent register ` and set `RELAY_AGENT_TOKEN` (or pass
+> `--token ` per call).

 ### Step 4: Release Workers

-```text
-mcp__relaycast__agent_remove(name: "Worker1")
+```bash
+agent-relay local agent release Worker1
+# MCP: mcp__agent-relay__remove_agent(name: "Worker1")
 ```

 ### Step 5: Shutdown (optional)

 ```bash
-agent-relay down
+agent-relay local down
 ```

 ## CLI Commands for Orchestration

-**Use the `agent-relay` CLI extensively for monitoring and managing workers.** The CLI provides essential visibility into agent activity.
+Two namespaces — keep the split straight.

-### Channel vs DM — When to Use Each
+### Local broker & agents — lifecycle only (no token)

-**DM** — targeted, private, for responses you need to read back:
+Use these to start/stop the broker and manage the agent processes. **Not for
+messaging** — never read or send messages here.

-- `agent-relay send Worker1 "message"` — sends a DM to Worker1
-- `mcp__relaycast__message_dm_send(to: "Worker1", text: "...")` — same via MCP
-- Worker replies arrive as DMs back to the sender
+```bash
+agent-relay local up [--no-dashboard] [--verbose] [--no-spawn] [--background] [--state-dir ]
+agent-relay local down [--force] [--all]
+agent-relay local status [--wait-for ] # broker readiness
+agent-relay local metrics [--agent ] # resource usage
+agent-relay local agent list # running agents (JSON)
+agent-relay local agent spawn --name --task "" [--channels ] [--model ]
+agent-relay local agent new … # spawn + attach to its TUI
+agent-relay local agent release # graceful stop
+agent-relay local agent set-model # switch a running agent's model
+agent-relay local agent attach --mode view|drive|passthrough
+agent-relay local tail [--agent ] # raw broker/TTY output — DEBUG ONLY, not message reading
+```

-**Channel post** — broadcast, visible to all agents on that channel:
+### Messaging & registry — always through relay (token-gated)

-- `agent-relay send '#general' "message"` — posts to #general (`#` prefix required)
-- `mcp__relaycast__message_post(channel: "general", text: "...")` — same via MCP
-- Use for coordination messages, status updates, announcements
+Every coordination message goes through relay here. All accept `--token `
+(or `RELAY_AGENT_TOKEN`), `--workspace-key`, and `--base-url`.

-**`agent-relay replies ` is the canonical command for reading worker
-DM replies** — it returns full text, sender-attributed, in chronological
-order, with no truncation. Add `--json` for machine-readable output.
-
-`inbox --agent ` is legacy unread-only behavior; once read, entries
-disappear. Prefer `replies` for a persistent, complete view.
-
-#### `replies --json` schema (read this before writing a monitor)
-
-Verified against the agent-relay CLI source (`replies` command). When there
-**is** a conversation, `--json` prints a JSON array of message objects:
-
-```json
-[
-  {
-    "id": "01J...",
-    "from": "Implementer",
-    "to": "orchestrator",
-    "text": "ACK — starting on the auth module",
-    "createdAt": "2026-05-19T14:02:11.000Z",
-    "direction": "inbound"
-  }
-]
+```bash
+agent-relay agent register # print an agent token, then export RELAY_AGENT_TOKEN
+agent-relay agent list [--status ] # workspace agent registry
+
+agent-relay message post # channel broadcast (bare channel name)
+agent-relay message list [--limit ] # channel history
+agent-relay message dm send # DM a worker
+agent-relay message dm list [--limit ] # read a DM thread
+agent-relay message dm send_group # group DM
+agent-relay message reply # threaded reply
+agent-relay message get_thread # full thread
+agent-relay message search [--channel ] [--from ] [--limit ]
+agent-relay message inbox check [--limit ] # unread messages
+agent-relay message inbox mark_read
+agent-relay message reaction add|remove
+
+agent-relay channel create|list|join|leave|invite|set_topic|archive …
 ```

-`unread` (boolean) and/or `unread_state: "unknown"` may also be present
-depending on read-state availability. Footguns that will silently break a
-naive monitor:
-
-- **The timestamp field is `createdAt`, not `ts`/`timestamp`.** It is an
-  ISO-8601 string.
-- **In `replies --json`, `direction` is always the literal `"inbound"`** — it
-  is hard-coded, because `replies` only ever returns messages _from_ the
-  named agent. It is never `"incoming"`, `"from"`, `"in"`, nor `"outbound"`.
-  Filtering on `direction == "inbound"` is harmless but redundant; filtering
-  on any other literal yields a monitor that runs forever and never sees the
-  ACK or DONE. (`"outbound"` only appears in `history --to --json`,
-  which includes messages you sent — see below.)
-- **The empty state is a plain string, not `[]`.** When there is _no
-  conversation at all_, the command prints the literal line
-  `No DM conversation with .` (exit 0) — not JSON. (If a conversation
-  exists but no messages match the filters, `--json` does emit a valid `[]`.)
-  Piping the no-conversation case straight into `jq` errors out. Guard for it:
-
-  ```bash
-  out=$(agent-relay replies Implementer --json)
-  case "$out" in
-    "No DM conversation with"*|"") echo "no replies yet" ;;
-    *) echo "$out" | jq -r '.[] | "\(.createdAt) \(.direction) \(.text)"' ;;
-  esac
-  ```
-
-- **Build monitors defensively: emit-all, then eyeball.** Print every entry
-  with its `direction` and `createdAt` rather than hard-filtering inside
-  `jq`. A monitor that shows everything beats one that silently drops the
-  message you were waiting for because an assumption about the schema was
-  wrong.
-
-`history --to --json` uses the same object shape (`id`, `from`, `to`,
-`text`, `createdAt`, `direction`) but `direction` is computed:
-`"outbound"` for messages you (the reader identity) sent, `"inbound"` for the
-agent's. Use it when you need both sides of the thread, not just the agent's
-replies.
+### Channel vs DM — When to Use Each
+
+**DM** — targeted, private, for responses you need to read back:

-```bash
-# WRONG — history (no flags) will not show DM replies from workers
-agent-relay history
+- `agent-relay message dm send Worker1 "message"` — sends a DM to Worker1
+- `mcp__agent-relay__send_dm(to: "Worker1", text: "...")` — same via MCP
+- Read a worker's thread with `agent-relay message dm list `

-# RIGHT — read a worker's DM replies (full text, chronological)
-agent-relay replies Worker1
+**Channel post** — broadcast, visible to all agents on that channel:

-# Machine-readable: full text + direction, safe to parse in a loop
-agent-relay replies Worker1 --json
+- `agent-relay message post general "message"` — posts to the `general` channel
+  (bare name — no `#` prefix in the new `message post` command)
+- `mcp__agent-relay__post_message(channel: "general", text: "...")` — same via MCP
+- Use for coordination messages, status updates, announcements

-# Full DM conversation history with a worker (read + unread)
-agent-relay history --to Worker1
+### Monitoring Workers (Essential)

-# Channel evidence (diffs, grep counts, GO/NO-GO) — full text,
-# untruncated, chronological; add --json to parse it programmatically
-agent-relay history --to '#general' --json
-```
+Read worker progress and replies **through relay**; use the broker only for
+liveness/health.

-(Reading via MCP `message_*` tools fails for you — see the "not a registered
-relaycast agent" callout under Bootstrap Step 3.)
+```bash
+# Worker replies, ACKs, DONE signals — read these through relay
+agent-relay message inbox check # unread across conversations
+agent-relay message dm list # a specific worker's thread

-### Monitoring Workers (Essential)
+# Liveness/health only (lifecycle, not messaging)
+agent-relay local agent list # running agents (pid, status, uptime)
+agent-relay local metrics # resource usage

-Spawn/send/release commands are in the Quick Reference and Bootstrap Step 3 —
-not repeated here. For monitoring specifically: poll `agent-relay who --json`
-for structured liveness (pid, uptimeSecs, status) instead of scraping the
-worker TTY, and use `agent-relay agents:logs ` to watch real-time output
-when debugging.
-
-> **Harness note: don't poll with a bare foreground `sleep`.** Many harnesses
-> (Claude Code included) block a foreground `sleep` used to wait for ACK/DONE
-> — e.g. `sleep 25; agent-relay replies ...` is rejected with a directive to
-> use a backgrounded loop or a Monitor/until-loop instead. The inline
-> `sleep`-based snippets shown elsewhere in this skill are illustrative of the
-> *logic*; in a harnessed environment, run the wait loop with
-> `run_in_background` (or the harness's Monitor + until-loop), polling
-> `agent-relay replies --json` and `agent-relay who --json` from inside
-> the backgrounded loop rather than blocking the foreground on `sleep`.
+# Last resort: raw broker/TTY output for debugging a wedged worker.
+# This is NOT how you read a worker's messages.
+agent-relay local tail --agent Worker1
+```

 ### Troubleshooting

 ```bash
-# Kill unresponsive worker
-agent-relay agents:kill Worker1
+# Gracefully stop an unresponsive worker
+agent-relay local agent release Worker1
+
+# Reset the broker if it is wedged
+agent-relay local down --force

 # Re-check broker status
-agent-relay status
+agent-relay local status

-# If a worker looks stuck, inspect its logs first
-agent-relay agents:logs Worker1
+# If a worker looks stuck, inspect its output first
+agent-relay local tail --agent Worker1
 ```

-**Tip:** Run `agent-relay agents:logs ` frequently to monitor worker progress and catch errors early.
+**Tip:** Read worker progress through relay (`agent-relay message inbox check`)
+and poll `agent-relay local agent list` for liveness. Reach for
+`agent-relay local tail` only to debug a wedged worker's raw output.

 ## Orchestrator Instructions Template

-Give your lead agent these instructions. The bootstrap/spawn/monitor commands
-are in the Bootstrap Flow and Quick Reference above — the paste-worthy part is
-the **Protocol**, the ruleset a lead agent can't infer from the command list:
+Give your lead agent these instructions:

 ```text
-You are an autonomous orchestrator. Bootstrap the relay infrastructure
-(Bootstrap Flow Steps 0–2), then spawn and manage workers per the
-Quick Reference. Then enforce this protocol:
-
-## Protocol
-- Workers will ACK when they receive tasks — but expect a 30–60s cold-start
-  gap after spawn: `who --json` shows `online` (~5s) well before the CLI is
-  booted enough to send its first ACK. Don't troubleshoot a "stuck" fresh
-  worker until at least 60s has passed
-- Workers will send DONE when complete
-- In a harnessed environment, never wait with a bare foreground `sleep`
-  (it is blocked) — run ACK/DONE poll loops with run_in_background or a
-  Monitor/until-loop, polling `replies --json` and `who --json` from inside it
-- **ACK/DONE target: `orchestrator` (the auto-registered spawning identity) or
-  the `#general` channel — NEVER `broker`.** `broker` is the broker's internal
-  routing self-name, not a spawnable/DM-able agent: a worker DM to `broker` (and
-  `agent-relay send broker`) fails with `Agent "broker" not found`. Write the
-  worker task prompt to DM `orchestrator` (or post `#general`) — never "DM the
-  broker"
-- Tell every worker explicitly: do NOT self-remove/release after DONE — stay
-  alive and idle so you can DM them review findings to fix
-- After DONE, run a reviewer; on NO-GO, DM the findings back to the SAME
-  worker. If the worker is gone, spawn a fresh one and re-inject branch +
-  commit SHA + the full verdict
-- Parse `replies --json` defensively: `direction` is always `"inbound"`,
-  timestamp is `createdAt` (not `ts`), and the no-conversation state is a
-  plain string, not `[]`
-- Poll `agent-relay who --json` for worker liveness; set a wall-clock fallback
-  so a silently-dead worker can't hang the loop
-- Read worker DM replies with `agent-relay replies ` (`--json` to parse);
-  plain `agent-relay history` shows channel posts only, never DM replies. See
-  the "Channel vs DM" section for the full reading model
-```
-
-## Multi-Round Review Loops (DONE → NO-GO → fix → re-review)
+You are an autonomous orchestrator. Bootstrap the relay infrastructure and manage a team of workers.

-Spawning, monitoring, and releasing a worker is the easy path. The hard part
-the basic flow does **not** cover: a worker reports DONE, a reviewer comes
-back NO-GO, and now the work has to go back. Plan for this topology before you
-spawn anything.
+## Step 1: Verify Installation
+Run: command -v agent-relay || npx agent-relay --version
+If you hit a mise/asdf shim error: verify Node first with `node --version`, then fix the runtime manager
+If not found: npm install -g agent-relay

-### Workers must not self-remove until you tell them
+## Step 2: Start Infrastructure
+Run: agent-relay local up --no-dashboard --verbose
+Verify: agent-relay local status --wait-for=10 (should report RUNNING)

-A worker's natural hygiene instinct is to call `agent.remove` on itself right
-after reporting DONE. That **kills the review→fix→re-review loop**: when the
-reviewer returns NO-GO there is no agent left to send the findings to, so you
-are forced to spawn a fresh worker and re-inject the entire context (branch,
-commit, full verdict) instead of just DMing the existing one.
+## Step 3: Manage Your Team

-**Put this in every implementer/worker task prompt explicitly:**
+Spawn workers (provider is positional, --name/--task are flags):
+  agent-relay local agent spawn claude --name Worker1 --task "Task description"

-```text
-Do NOT call agent.remove / agent-relay release on yourself. Report DONE and
-stay alive and idle. The orchestrator will send you review findings to fix,
-or release you when the work is fully accepted. Self-removing before then
-breaks the fix loop.
-```
+Register once so all messaging goes through relay:
+  agent-relay agent register Lead # prints a token
+  export RELAY_AGENT_TOKEN=

-The "release when done" guidance elsewhere in this skill applies to the
-**orchestrator** releasing workers — never to a worker releasing itself
-mid-loop.
+Coordinate ENTIRELY through relay (send and read every message here):
+  agent-relay message dm send Worker1 "Additional instructions" # targeted DM
+  agent-relay message post general "All workers: prioritize auth" # broadcast
+  agent-relay message dm list # read a worker's replies
+  agent-relay message inbox check # unread across conversations

-### The respawn-with-full-context fallback
+Check liveness only (lifecycle, not messaging):
+  agent-relay local agent list # running workers + status

-If a worker did self-remove (or died), you cannot just DM it. Spawn a fresh
-worker and re-inject everything it needs to act with no prior memory:
+Release when done:
+  agent-relay local agent release Worker1

-```bash
-agent-relay spawn Implementer2 codex "Continuation of prior work. \
-Branch: feature/auth. Last commit: . \
-The reviewer returned NO-GO with these findings: . \
-Check out the branch, address every finding, re-run tests, report DONE. \
-Do NOT self-remove — stay alive for re-review."
+## Protocol
+- Workers ACK when they receive tasks and send DONE when complete — both arrive as relay messages
+- Read replies through relay: `agent-relay message inbox check` / `message dm list ` (requires RELAY_AGENT_TOKEN)
+- NEVER read worker responses with `agent-relay local tail` — that is broker-direct raw output, not relay messaging (use it only to debug a wedged worker)
+- Poll `agent-relay local agent list` for liveness; do all messaging through the `message`/`channel` groups
 ```

-Always pass branch + commit SHA + the **complete** reviewer verdict. A fresh
-worker has none of the loop's history; a summarized verdict loses the
-specifics it needs to fix.
-
-### Detecting a silently-dead worker
-
-Monitors fire on **DMs only**. A worker that exits or self-removes produces no
-DM, so the monitor just goes quiet — indistinguishable from a worker still
-thinking. Defenses:
-
-- Poll `agent-relay who --json` for liveness instead of inferring it from DM
-  silence. A worker that vanishes from `who` is gone.
-- `agent-relay agents:logs ` will show a self-issued `agent.remove` /
-  release call — but it is noisy TTY scraping, a last resort, not a signal.
-- Always set a wall-clock fallback (e.g. a ScheduleWakeup ~30 min out) so a
-  silently-dead worker can't hang the loop forever waiting on a DM that will
-  never arrive.
-
 ## Lifecycle Events

-The broker emits these events (available via SDK subscriptions):
+The broker emits these events (available via SDK subscriptions and
+`agent-relay local tail`):

 | Event | When |
 | ------------------------ | --------------------------- |
@@ -404,33 +341,23 @@ The broker emits these events (available via SDK subscriptions):

 ## Common Mistakes

-| Mistake | Fix |
-| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `agent-relay: command not found` or mise/asdf shim error | Ensure Node is available first (`node --version`); if a shim is broken, fix the runtime manager, then install/use `agent-relay` |
-| "Nested session" error | Broker handles this automatically; if running manually, unset `CLAUDECODE` env var |
-| Broker not starting | Try `agent-relay down` first, then `agent-relay up --no-dashboard --verbose` and `agent-relay status --wait-for=10` |
-| Broker shows STARTING after `status --wait-for` | The process is alive but the broker API is not ready; inspect logs, retry readiness, or restart with `agent-relay down --force` if it remains stuck |
-| Broker shows STOPPED immediately after start | Check `ps aux \| grep agent-relay-broker` and `.agent-relay/connection.json`; if the process is alive but status is STOPPED, rerun status from the project root or pass `--state-dir` |
-| Half-started broker: process alive but `status` says STOPPED and `Failed to read broker connection metadata` | `up` spawned a broker that never finished writing connection metadata (readiness timed out) and was not cleaned up. Do NOT just retry `up` — it won't reap the orphan. `pkill -f agent-relay-broker` (or `agent-relay down --force`), delete `.agent-relay/`, then `agent-relay up` clean and `agent-relay status --wait-for=30`. `agent-relay doctor` flags this orphaned/half-started state |
-| Worktree verification leaves git status dirty | Run `agent-relay down --force`, then remove generated `.agent-relay/` and `.mcp.json` from throwaway validation worktrees before committing |
-| Spawn fails with `internal reply dropped` | Broker likely is not fully ready yet; wait for readiness, then spawn one worker first |
-| Workers not connecting | Ensure broker started; check `agent-relay who` and worker logs |
-| Not monitoring workers | Use `agent-relay agents:logs ` frequently to track progress |
-| Workers seem stuck | Check logs with `agent-relay agents:logs ` for errors |
-| Messages not delivered | Check `agent-relay history --to '#general' --json` for channel messages; use `agent-relay replies --json` for DMs |
-| Worker replies not showing in history | Expected — plain `history` only shows channel posts. Use `agent-relay replies ` (full text, chronological) or `agent-relay history --to ` (full thread) to read DM replies |
-| Need to see unread DM content | `inbox_check` / `inbox --agent` only return counts or clear on read, and the MCP `message_dm_list` tool requires a registered identity you don't have. Use `agent-relay replies --json` |
-| Re-reading already-read replies | `agent-relay replies ` is a persistent view (not unread-only); use `--since