CryptoLabInc · esifea · Jun 17, 2026 · Jun 9, 2026
diff --git a/.github/workflows/pr-tests.yml b/.github/workflows/pr-tests.yml
@@ -4,9 +4,9 @@ name: PR Tests
 
 on:
   pull_request:
-    branches: [main]
+    branches: [main, "release/**"]
   push:
-    branches: [main]
+    branches: [main, "release/**"]
 
 permissions:
   contents: read

diff --git a/AGENT_INTEGRATION.md b/AGENT_INTEGRATION.md
@@ -1,19 +1,23 @@
 # Agent Integration Guide
 
 Rune works with all major AI agents via native MCP (Model Context Protocol)
-support. In v0.4 the MCP server is a single Go binary
-(`bin/rune-mcp`) that the host CLI auto-spawns over stdio — no Python
-runtime, no `pip install`, no manual `mcp add` for the supported CLIs.
+support. In v0.4 the MCP server is a single Go binary (`rune-mcp`) that the
+host CLI auto-spawns over stdio through the committed bash wrapper
+`bin/rune mcp-server` — no Python runtime, no `pip install`, no manual
+`mcp add` for the supported CLIs.
 
 ## Integration Principles
 
 ### Cross-agent common (single source of truth)
-- The Go binary at `cmd/rune-mcp/` is the only MCP server entry point.
-  Plugin / extension manifests point each CLI at the same binary.
-- Runtime preparation happens at install time (the binary is already
-  built and shipped with the plugin tarball — see Task #30 for the
-  release pipeline). Nothing needs to be (re)bootstrapped at session
-  start.
+- The CLI entry point is `cmd/rune/` (the `rune` binary). Plugin /
+  extension manifests point each CLI at the committed bash wrapper
+  `bin/rune` invoked as `rune mcp-server`, which execs the downloaded
+  `rune-mcp` MCP server.
+- Runtime preparation happens on the first MCP spawn, not at plugin
+  install: the wrapper self-installs the `rune` CLI and downloads the
+  pinned `rune-mcp` binary (per `.release-pins.yaml`) into `~/.rune/bin/`,
+  then execs it — so the server comes online in the same session with no
+  manual `/mcp` reconnect or restart.
 
 ### Agent-specific adapters (thin layer only)
 - Codex-only tasks: `codex mcp add/remove/list` registration flows
@@ -48,10 +52,12 @@ $ claude plugin install rune
 > /plugin install rune
 ```
 
-The plugin manifest (`.claude-plugin/plugin.json`) declares the binary
-path; Claude Code spawns `${CLAUDE_PLUGIN_ROOT}/bin/rune-mcp` via stdio
-on session start. enVector Cloud credentials are delivered automatically
-via the Vault bundle — you never set `ENVECTOR_*` env vars directly.
+The plugin manifest (`.claude-plugin/plugin.json`) declares the wrapper
+path; Claude Code spawns `${CLAUDE_PLUGIN_ROOT}/bin/rune mcp-server` via
+stdio on session start (on a fresh install the wrapper self-installs
+rune-mcp first, then execs it). enVector Cloud credentials are delivered
+automatically via the Vault bundle — you never set `ENVECTOR_*` env vars
+directly.
 
 ### Configure credentials
 

diff --git a/SKILL.md b/SKILL.md
@@ -109,12 +109,15 @@ If in Active state but operations fail:
 
    Note: enVector credentials are delivered automatically via the Vault bundle — no user input needed.
 
-4. Create `~/.rune/config.json` with `state: "active"` and the values
-   above (`mkdir -p ~/.rune && chmod 700 ~/.rune`, then `chmod 600` the
-   file).
-5. Call the `reload_pipelines` MCP tool. The MCP server's boot loop
-   dials Vault, fetches the agent manifest (EncKey + envector
-   creds), connects to enVector, and transitions to Active.
+4. Call the `configure` MCP tool with the collected values
+   (`endpoint`, `token`, `ca_cert_path`, `tls_disable`). The server does
+   the atomic 0600 write to `~/.rune/config.json`, sets `state: "active"`,
+   refreshes `metadata.lastUpdated`, and runs a best-effort Vault probe.
+   The agent never writes the config file itself.
+5. Call the `activate` MCP tool to bring pipelines online. It runs the
+   prereq checks server-side and drives the boot loop: dials Vault,
+   fetches the agent manifest (EncKey + enVector creds), connects to
+   enVector, and transitions to Active.
 6. Confirm health by calling `diagnostics` and applying the
    **Boot Failure — Fast-Fail Rule** (see section below). If
    `vault.last_boot_error` is present, surface its `hint` verbatim
@@ -195,28 +198,37 @@ Recommendations:
 
 **Note**: In most cases, simply asking naturally ("Why did we choose PostgreSQL?") triggers Retriever automatically — no command needed.
 
-### `/rune:activate` (or `/rune:wakeup`)
+### `/rune:activate`
 (or `$rune activate` for Codex CLI)
 
 **Purpose**: Attempt to activate plugin after infrastructure is ready
 
 **Use Case**: Infrastructure was not ready during configure, but now it's deployed and running.
 
 **Steps**:
-1. Check if config exists
-   - NO → Redirect to `/rune:configure` (or `$rune configure` for Codex CLI)
-   - YES → Continue
-2. If `state` is already `"active"`, skip to step 4 (just verify health).
-3. If `state` is `"dormant"`, set it to `"active"` and clear any
-   `dormant_reason` / `dormant_since` fields.
-4. Call the `reload_pipelines` MCP tool. From a terminal Dormant the
-   boot loop is re-spawned; from Active it is a no-op.
-5. Call `diagnostics` and apply the **Boot Failure — Fast-Fail Rule**
-   (section below).
-6. If `vault.last_boot_error` is present: surface its `hint` verbatim,
-   suggest the matching recovery action, and stop. Do NOT loop on
-   `reload_pipelines` or probe with shell tools — the classifier has
-   already done that work. Otherwise render the per-subsystem snapshot.
+1. Call the `activate` MCP tool — no Read, no Edit, no manual state
+   inspection. It runs the prereq checks server-side (config present,
+   runed socket reachable + Health probe) and only triggers the boot
+   loop when everything is ready. It returns a `status`:
+   `configure_required` | `install_pending` | `waiting_for_bootstrap` |
+   `active` | `waiting_for_vault` | `dormant`.
+2. Branch on `status`:
+   - `configure_required` → redirect to `/rune:configure`; use the `hint`
+     verbatim and stop.
+   - `install_pending` → invoke the recovery in `hint` (the agent runs
+     `rune install`, never the user), then retry `/rune:activate` once.
+   - `waiting_for_bootstrap` → runed is still downloading llama-server /
+     the embedding model; summarize `.bootstrap` progress, tell the user
+     no further action is needed, and stop (do NOT poll).
+   - `active` → optionally call `diagnostics` once and render the
+     per-subsystem snapshot.
+   - `waiting_for_vault` / `dormant` → apply the **Boot Failure —
+     Fast-Fail Rule** (below): surface `reload.last_boot_error.hint`
+     verbatim, suggest one recovery, and stop.
+
+(Older rune-mcp binaries without the `activate` tool fall back to the
+legacy flow: set `state: "active"`, call `reload_pipelines` directly, and
+branch on `diagnostics.vault.last_boot_error`.)
 
 ### `/rune:reset`
 (or `$rune reset` for Codex CLI)
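
Review note on the `/rune:activate` steps in the SKILL.md hunk above: the branching on `status` is a flat dispatch with one outcome per value. A minimal sketch, assuming the six status strings listed in the hunk; the helper name `next_step` and the wording of each action are hypothetical and only summarize the documented steps.

```shell
#!/bin/sh
# Hypothetical helper (not part of rune): map the `status` returned by the
# `activate` MCP tool to the follow-up described in SKILL.md.
next_step() {
  case "$1" in
    configure_required)        echo "redirect to /rune:configure, show hint, stop" ;;
    install_pending)           echo "run recovery from hint, retry /rune:activate once" ;;
    waiting_for_bootstrap)     echo "summarize .bootstrap progress, stop without polling" ;;
    active)                    echo "optionally call diagnostics once, render snapshot" ;;
    waiting_for_vault|dormant) echo "apply Fast-Fail Rule, show hint, stop" ;;
    *)                         echo "unknown status: $1" >&2; return 1 ;;
  esac
}

next_step install_pending
```

Every branch either stops or retries at most once, which matches the document's instruction not to poll or loop on the tool.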

diff --git a/bin/rune b/bin/rune
@@ -44,30 +44,135 @@ mkdir -p "$RUNE_HOME"
 LOCK_DIR="$RUNE_HOME/bootstrap.lock.d"
 TMP=""
 SUMS=""
+OWNER_TOKEN=""
 cleanup() {
   [ -n "$TMP" ] && rm -f "$TMP"
   [ -n "$SUMS" ] && rm -f "$SUMS"
-  rmdir "$LOCK_DIR" 2>/dev/null || true
+  # Release the lock only if we still own it (the owner token still matches)
+  if [ -n "$OWNER_TOKEN" ] && [ "$(cat "$LOCK_DIR/owner" 2>/dev/null || true)" = "$OWNER_TOKEN" ]; then
+    rm -f "$LOCK_DIR/owner" 2>/dev/null || true
+    rmdir "$LOCK_DIR" 2>/dev/null || true
+  fi
+}
+
+# Network time budget
+# - `mcp-server`: MCP entrypoint spawned by a Claude Code session with a ~30s timeout.
+#                 A SIGKILL on timeout skips cleanup and leaves the bootstrap lock unreleased,
+#                 so the overall time must stay under 30s.
+#                 Worst case: API resolve up to 7s + binary download up to 13s + checksum up to 7s (27s total)
+# - other: matched to each download's deadline
+if [ "${1:-}" = mcp-server ]; then
+  NET_RETRY=3; NET_RETRY_DELAY=1; NET_RETRY_MAXTIME=3
+  NET_API_MAXTIME=4; NET_BIN_MAXTIME=10; NET_CHECKSUM_MAXTIME=4
+else
+  NET_RETRY=3; NET_RETRY_DELAY=2; NET_RETRY_MAXTIME=60
+  NET_API_MAXTIME=20; NET_BIN_MAXTIME=120; NET_CHECKSUM_MAXTIME=30
+fi
+
+# Retries only fast transient errors such as GitHub CDN failures (504, timeouts);
+# slow/hung requests are intentionally not retried to stay within the spawn budget.
+# Callers pass the step's --max-time (NET_{API|BIN|CHECKSUM}_MAXTIME) on each call.
+fetch() {
+  curl --fail --silent --show-error --location --connect-timeout 5 \
+       --retry "$NET_RETRY" --retry-delay "$NET_RETRY_DELAY" \
+       --retry-max-time "$NET_RETRY_MAXTIME" "$@"
+}
+
+# Lock wait budget: give up before the Claude Code MCP spawn timeout (~30s)
+LOCK_WAIT_BUDGET="${RUNE_LOCK_WAIT_BUDGET:-20}"
+# Maximum wall-clock age of a lock, used to reclaim it from a holder that is alive but stuck
+# Worst case: sum over the three steps of NET_RETRY_MAXTIME + NET_{API|BIN|CHECKSUM}_MAXTIME (about 350s) when !mcp-server
+LOCK_STALE_AFTER="${RUNE_LOCK_STALE_AFTER:-360}"
+
+# Atomically take over a stale lock (via rename), then remove it
+clear_stale_lock() {
+  if mv "$LOCK_DIR" "$LOCK_DIR.reclaim.$$" 2>/dev/null; then
+    rm -rf "$LOCK_DIR.reclaim.$$" 2>/dev/null || true
+  fi
+  return 0
 }
 
 waited=0
-while ! mkdir "$LOCK_DIR" 2>/dev/null; do # another session hold lock
+wait_count=0
+while true; do
+  # Claim lock atomically
+  if mkdir "$LOCK_DIR" 2>/dev/null; then
+    OWNER_TOKEN="$$ $(date +%s)" # "<pid> <timestamp>"
+    if ( set -C; printf '%s\n' "$OWNER_TOKEN" > "$LOCK_DIR/owner" ) 2>/dev/null; then
+      trap cleanup EXIT INT TERM
+      # Re-read the owner file in case the lock was reclaimed between mkdir and the write
+      if [ "$(cat "$LOCK_DIR/owner" 2>/dev/null || true)" = "$OWNER_TOKEN" ]; then
+        break
+      fi
+      trap - EXIT INT TERM
+      OWNER_TOKEN=""
+      continue
+    fi
+
+    OWNER_TOKEN=""
+    if [ ! -d "$LOCK_DIR" ]; then
+      continue   # lock is cleared, retry claim
+    fi
+
+    # Real write error (disk full, permission, or others)
+    if [ ! -e "$LOCK_DIR/owner" ]; then
+      echo "rune: cannot record install bootstrap lock owner (file write failed)" >&2
+      exit 1
+    fi
+  fi
+
+  # Failed to claim the lock; wait for the process that holds it
   if [ -x "$TARGET" ]; then
     exec "$TARGET" "$@"   # bootstrap finished
   fi
 
+  # Validate owner
+  owner="$(cat "$LOCK_DIR/owner" 2>/dev/null || true)"
+  pid="${owner%% *}"
+  case "$owner" in
+    *" "*) ts="${owner##* }" ;;
+    *)     ts="" ;;
+  esac
+
+  if [ -z "$owner" ]; then
+    # Dir exists but has no owner yet; the holder is mid-claim or died before writing it
+    wait_count=$((wait_count + 1))
+    if [ "$wait_count" -ge 5 ]; then
+      echo "rune: bootstrap lock not claimed for ${wait_count}s; reclaiming" >&2
+      clear_stale_lock; wait_count=0; continue
+    fi
+  else
+    wait_count=0
+    if [ -n "$pid" ] && ! kill -0 "$pid" 2>/dev/null; then
+      # Holder process not found; lock is leaked
+      echo "rune: bootstrap lock holder (pid $pid) is not found; reclaiming" >&2
+      clear_stale_lock; continue
+    fi
+
+    # Check wall-clock age
+    case "$ts" in
+      ''|*[!0-9]*) age=0 ;;
+      *)           age=$(( $(date +%s) - ts )) ;;
+    esac
+
+    if [ "$age" -ge "$LOCK_STALE_AFTER" ]; then
+      echo "rune: bootstrap lock stale (${age}s); reclaiming" >&2
+      clear_stale_lock; continue
+    fi
+
+    if [ "$waited" -ge "$LOCK_WAIT_BUDGET" ]; then
+      echo "rune: another rune bootstrap is still in progress; wait exceeded the MCP spawn budget." >&2
+      echo "      Retry in a moment, or run it out-of-band:" >&2
+      echo "        bash -c \"\${CLAUDE_PLUGIN_ROOT:-.}/bin/rune install\"" >&2
+      exit 1
+    fi
+  fi
+
   sleep 1
   waited=$((waited + 1))
-  if [ "$waited" -ge 120 ]; then
-    echo "rune: bootstrap lock held >120s, reclaiming" >&2
-    rmdir "$LOCK_DIR" 2>/dev/null || true
-    waited=0
-  fi
 done
 
-# Check error
-trap cleanup EXIT INT TERM
-
+# Double-check: another process may have completed the install just before we won the lock
 if [ -x "$TARGET" ]; then
   cleanup
   trap - EXIT INT TERM
@@ -85,12 +190,9 @@ if [ -z "$RUNE_VERSION" ]; then
   # Use token if exist
   token="${GITHUB_TOKEN:-${GH_TOKEN:-}}"
   if [ -n "$token" ]; then
-    body="$(curl --fail --silent --show-error --location --connect-timeout 10 --max-time 20 \
-      --retry 3 --retry-delay 2 \
-      --header "Authorization: Bearer $token" "$api" || true)"
+    body="$(fetch --max-time "$NET_API_MAXTIME" --header "Authorization: Bearer $token" "$api" || true)"
   else
-    body="$(curl --fail --silent --show-error --location --connect-timeout 10 --max-time 20 \
-      --retry 3 --retry-delay 2 "$api" || true)"
+    body="$(fetch --max-time "$NET_API_MAXTIME" "$api" || true)"
   fi
 
   RUNE_VERSION="$(printf '%s' "$body" \
@@ -128,12 +230,22 @@ mkdir -p "$(dirname "$TARGET")"
 TMP="$(mktemp "$(dirname "$TARGET")/.rune-bootstrap-XXXXXX")"
 SUMS="$(mktemp -t rune-bootstrap-sums-XXXXXX)"
 
-# --retry rides out transient GitHub CDN failures (504, timeouts) instead
-# of aborting the whole bootstrap on the first blip.
-curl --fail --silent --show-error --location --connect-timeout 10 --max-time 120 --retry 3 --retry-delay 2 "$RELEASE_BASE/$ASSET"        -o "$TMP"
-curl --fail --silent --show-error --location --connect-timeout 10 --max-time 30  --retry 3 --retry-delay 2 "$RELEASE_BASE/checksums.txt" -o "$SUMS"
+if ! fetch --max-time "$NET_BIN_MAXTIME" "$RELEASE_BASE/$ASSET" -o "$TMP"; then
+  echo "rune: could not download $ASSET ($RUNE_VERSION) after retries." >&2
+  echo "      The release endpoint may be slow or temporarily unavailable (e.g. HTTP 504)." >&2
+  echo "      Recover out-of-band, then reconnect /mcp:" >&2
+  echo "        bash -c \"\${CLAUDE_PLUGIN_ROOT:-.}/bin/rune install\"" >&2
+  exit 1
+fi
+if ! fetch --max-time "$NET_CHECKSUM_MAXTIME" "$RELEASE_BASE/checksums.txt" -o "$SUMS"; then
+  echo "rune: could not download checksums.txt ($RUNE_VERSION) after retries." >&2
+  echo "      The release endpoint may be slow or temporarily unavailable (e.g. HTTP 504)." >&2
+  echo "      Recover out-of-band, then reconnect /mcp:" >&2
+  echo "        bash -c \"\${CLAUDE_PLUGIN_ROOT:-.}/bin/rune install\"" >&2
+  exit 1
+fi
 
-EXPECTED="$(grep " $ASSET\$" "$SUMS" | cut -d' ' -f1)"
+EXPECTED="$(grep " $ASSET\$" "$SUMS" | cut -d' ' -f1 || true)"
 if [ -z "$EXPECTED" ]; then
   echo "rune: $ASSET not listed in checksums.txt for $RUNE_VERSION" >&2
   exit 1
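
Review note on the lock loop in the `bin/rune` hunks above: the waiter-side checks (parse the `<pid> <timestamp>` owner token, probe the holder with `kill -0`, then compare the lock's age against `LOCK_STALE_AFTER`) can be read as one classification step. This is a minimal sketch rather than the wrapper's code: `classify_lock_owner` and its result labels are hypothetical names, and the current time is passed in as an argument so the behavior is easy to check.

```shell
#!/bin/sh
# Hypothetical helper (not part of bin/rune).
# Usage: classify_lock_owner "<pid> <timestamp>" <now_epoch> <stale_after_seconds>
classify_lock_owner() {
  owner=$1; now=$2; stale_after=$3

  # Empty owner file: the holder is mid-claim or died before writing it.
  if [ -z "$owner" ]; then echo unclaimed; return 0; fi

  # Same parsing as the diff: pid is the first field, timestamp the last.
  pid="${owner%% *}"
  case "$owner" in
    *" "*) ts="${owner##* }" ;;
    *)     ts="" ;;
  esac

  # A holder that no longer exists means the lock was leaked.
  if [ -n "$pid" ] && ! kill -0 "$pid" 2>/dev/null; then echo dead; return 0; fi

  # A missing or non-numeric timestamp is treated as age 0, as in the diff.
  case "$ts" in
    ''|*[!0-9]*) age=0 ;;
    *)           age=$(( now - ts )) ;;
  esac

  if [ "$age" -ge "$stale_after" ]; then echo stale; else echo held; fi
}

# The current shell is alive and the token is fresh, so this prints "held".
classify_lock_owner "$$ $(date +%s)" "$(date +%s)" 360
```

In the diff, `unclaimed` corresponds to the branch that waits up to five polls before reclaiming, while `dead` and `stale` correspond to the two immediate `clear_stale_lock` paths; only `held` counts against `LOCK_WAIT_BUDGET`.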

diff --git a/cmd/rune/install.go b/cmd/rune/install.go
@@ -6,6 +6,7 @@ import (
 	"flag"
 	"fmt"
 	"io"
+	"os"
 
 	"github.com/CryptoLabInc/rune-cli/internal/bootstrap"
 )
@@ -20,8 +21,21 @@ func runInstall(ctx context.Context, args []string, stdout, stderr io.Writer) in
 		return 2
 	}
 
+	// Fall back to the RUNE_MANIFEST environment variable before failing
 	if *manifest == "" {
-		fmt.Fprintln(stderr, "rune install: no manifest URL configured (set --manifest-url or RUNE_MANIFEST)")
+		if env := os.Getenv("RUNE_MANIFEST"); env != "" {
+			*manifest = env
+		}
+	}
+
+	if *manifest == "" {
+		const msg = "no manifest URL configured (set --manifest-url or RUNE_MANIFEST)"
+		if *jsonOut {
+			_ = json.NewEncoder(stdout).Encode(jsonEvent{Event: "summary", Error: msg})
+		} else {
+			fmt.Fprintln(stderr, "rune install: "+msg)
+		}
+
 		return 2
 	}