feat(chat): resilient hot-reload recovery for decopilot runs #2711

@vibegui

Plan: Resilient Hot-Reload Recovery for Decopilot Runs

Context

When the dev server hot-reloads (bun --hot) mid-stream, the Claude Code agent child process is killed, the in-memory RunRegistry is wiped, and the memory-only NATS JetStream buffer is lost. The thread nevertheless stays in_progress in the DB, because stopAll() fires FORCE_FAIL as fire-and-forget and the async reactor may not complete before the process is replaced.

The frontend detects isRunInProgress and tries /attach, which returns 204 (no run in registry). It retries 3 times, then gives up. The user sees "No response was generated" plus a "Run in progress" indicator that stays stuck forever; the only escape is to click cancel manually.

Answer to the question: No, the agent is NOT still running after hot reload. The child process is killed. But the Claude Code SDK stores conversation history on disk (~/.claude/projects/), so thread context survives restarts. We can leverage this by sending a "continue" message with context.

Approach: Detect Ghost + Auto-Continue

  1. Server startup: Sweep DB for ghost threads (in_progress with no run in registry) and mark them as failed
  2. Frontend: When ghost detected, replace "No response was generated" with a "Continue" button that sends a contextual resume message

Changes

1. Add listByStatus() to thread storage

File: apps/mesh/src/storage/threads.ts + apps/mesh/src/storage/ports.ts

Add method to find all ghost threads on startup:

listByStatus(status: string): Promise<Array<{ id: string; organization_id: string }>>
// SELECT id, organization_id FROM thread WHERE status = $1
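
A minimal in-memory sketch of the same contract, which could stand in for the real storage when unit-testing the sweep. ThreadRow, ThreadRef, selectByStatus, and InMemoryThreadStorage are hypothetical names, not part of the codebase; the real implementation would issue the SQL query above.

```typescript
// Hypothetical row shape; the real thread table has more columns.
interface ThreadRow {
  id: string;
  organization_id: string;
  status: string;
}

type ThreadRef = { id: string; organization_id: string };

// Pure equivalent of: SELECT id, organization_id FROM thread WHERE status = $1
function selectByStatus(rows: ThreadRow[], status: string): ThreadRef[] {
  return rows
    .filter((row) => row.status === status)
    .map(({ id, organization_id }) => ({ id, organization_id }));
}

// In-memory stand-in exposing the same async signature as the proposed port method.
class InMemoryThreadStorage {
  private readonly rows: ThreadRow[];

  constructor(rows: ThreadRow[]) {
    this.rows = rows;
  }

  listByStatus(status: string): Promise<ThreadRef[]> {
    return Promise.resolve(selectByStatus(this.rows, status));
  }
}
```

Only id and organization_id are projected, which is all the sweep needs to update the thread and address the SSE event.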

2. Server startup ghost-run sweep

File: apps/mesh/src/api/app.ts (~line 318, after RunRegistry creation)

After creating the RunRegistry, run an async sweep:

// Fire-and-forget: clean up any threads left in_progress by the previous process
threadStorage.listByStatus("in_progress").then(async (ghosts) => {
  for (const ghost of ghosts) {
    try {
      await threadStorage.update(ghost.id, ghost.organization_id, { status: "failed" });
      sseHub.emit(ghost.organization_id, createDecopilotThreadStatusEvent(ghost.id, "failed"));
      sseHub.emit(ghost.organization_id, createDecopilotFinishEvent(ghost.id, "failed"));
      console.warn("[decopilot] Cleaned up ghost run", { threadId: ghost.id });
    } catch (err) {
      // One failed update should not leave the remaining ghosts stuck
      console.error("[decopilot] Failed to clean up ghost run", { threadId: ghost.id }, err);
    }
  }
}).catch((err) => console.error("[decopilot] Ghost sweep failed", err));

This runs once on startup and does not block it. The RunRegistry is empty at that point, so every thread still marked in_progress has no corresponding run and is a ghost.
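
The sweep writes "failed" unconditionally, so a thread that changed state between the listByStatus() read and the update() write (for example, because the client already cancelled it) would be failed again and its events re-emitted. One way to guard against that is a conditional transition, sketched here against an in-memory map. failIfInProgress is a hypothetical helper, not an existing storage method; in SQL the same idea would be an UPDATE that also checks status = 'in_progress' in its WHERE clause.

```typescript
// Moves a thread to "failed" only if it is still "in_progress".
// Returns true when the transition happened, so the caller knows
// whether it should emit the SSE status and finish events.
function failIfInProgress(statuses: Map<string, string>, threadId: string): boolean {
  if (statuses.get(threadId) !== "in_progress") {
    return false; // already resolved elsewhere, or unknown thread: nothing to do
  }
  statuses.set(threadId, "failed");
  return true;
}
```

Calling it twice for the same thread changes state only once, which is the idempotency the edge-case list relies on.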

3. Frontend: auto-cancel on resume failure (fast ghost resolution)

File: apps/mesh/src/web/components/chat/chat-provider.tsx (TaskStreamManager, line ~129)

When tryResumeStream fails (which means /attach returned 204), instead of retrying 3 times with 30s polling, immediately call the cancel endpoint on the first failure:

// In the .catch handler after resume fails:
chatStore.cancelRun(); // triggers ghost detection server-side (routes.ts:391-413)

The cancel endpoint already has ghost detection that force-fails the thread and emits SSE events.
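
The decision in the resume handler can be sketched as a small pure function. decideAfterAttach and the outcome names are hypothetical; the only behavior taken from this plan is that /attach answering 204 means no run is in the registry. Treating other 2xx responses as a live stream and everything else as retryable are assumptions.

```typescript
type AttachOutcome = "stream" | "cancel-ghost" | "retry";

function decideAfterAttach(httpStatus: number): AttachOutcome {
  // 204: no run in the registry, so resolve the ghost immediately via cancel
  if (httpStatus === 204) return "cancel-ghost";
  // Assumed: any other 2xx means there is a live run to attach to
  if (httpStatus >= 200 && httpStatus < 300) return "stream";
  // Assumed: anything else is transient and keeps the existing retry path
  return "retry";
}
```

Keeping the 204 case separate from genuine errors means a flaky network still gets retries, while a ghost is resolved on the first attempt.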

4. "Continue" button in EmptyAssistantState

File: apps/mesh/src/web/components/chat/message/assistant.tsx (line 370)

Replace the static EmptyAssistantState with a component that shows a "Continue" button when the thread was interrupted. The button sends a contextual message like:

"The previous run was interrupted by a server restart. Please continue where you left off. Here's a brief summary of what was being done: [last user message content]"

Implementation:

  • EmptyAssistantState needs access to: whether this is the last pair, the thread status (failed), and the user's last message
  • Pass isLast and the user message from MessagePair props down to MessageAssistant
  • When isLast && message === null && !isLoading && thread.status === "failed":
    • Show "Run was interrupted" text
    • Render a "Continue" button that calls chatStore.sendMessage() with a pre-built continuation prompt
    • The prompt includes the last user message text for context

function EmptyAssistantState({ isLast, userMessage }: { isLast: boolean; userMessage?: ChatMessage }) {
  const threadStatus = useChatStore(s => {
    const thread = s.threads.find(t => t.id === s.activeThreadId);
    return thread?.status;
  });

  // Ghost/interrupted run — show continue button
  if (isLast && threadStatus === "failed" && userMessage) {
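    // flatMap narrows each part to the text variant and skips the rest;
    // the ?? [] fallback keeps userText a string even when parts is missing
    const userText = (userMessage.parts ?? [])
      .flatMap(p => (p.type === "text" ? [p.text] : []))
      .join(" ")
      .slice(0, 200);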

    return (
      <div className="flex flex-col gap-2 py-2">
        <div className="text-[14px] text-muted-foreground/60">
          Run was interrupted by a server restart
        </div>
        <button
          className="text-[13px] text-primary hover:underline self-start"
          onClick={() => {
            chatStore.sendMessage({
              parts: [{ type: "text", text: `The previous run was interrupted. Please continue where you left off. The original request was: "${userText}"` }],
            });
          }}
        >
          Continue conversation
        </button>
      </div>
    );
  }

  return (
    <div className="text-[14px] text-muted-foreground/60 py-2">
      No response was generated
    </div>
  );
}
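
The prompt construction inside the click handler could be pulled out into a pure function so it can be tested without rendering the component. This is a sketch under assumptions: buildContinuationPrompt and the MessagePart union are hypothetical, and the "tool-call" variant only stands in for whatever non-text parts the real ChatMessage type carries.

```typescript
// Minimal part shape for illustration; the real ChatMessage part union is wider.
type MessagePart =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolName: string };

function buildContinuationPrompt(parts: MessagePart[] | undefined, maxChars = 200): string {
  // Keep only text parts, join them, and cap the length of the quoted request
  const userText = (parts ?? [])
    .flatMap((part) => (part.type === "text" ? [part.text] : []))
    .join(" ")
    .trim()
    .slice(0, maxChars);

  const base = "The previous run was interrupted. Please continue where you left off.";
  // Omit the quote when there is no text, rather than embedding "undefined" or ""
  return userText ? `${base} The original request was: "${userText}"` : base;
}
```

The onClick handler would then pass the result as the text part of chatStore.sendMessage().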

Prop threading:

  • MessagePair component (pair.tsx:59) already has pair.user — pass it to MessageAssistant
  • MessageAssistant passes it to EmptyAssistantState when rendering the empty state

5. Pass user message through component tree

File: apps/mesh/src/web/components/chat/message/pair.tsx (line 89)

Add userMessage prop to MessageAssistant:

<MessageAssistant
  message={pair.assistant}
  userMessage={pair.user}  // NEW
  status={status}
  isLast={isLastPair}
  isPlanMode={isPlanMode}
/>

File: apps/mesh/src/web/components/chat/message/assistant.tsx

Add userMessage to MessageAssistant props and pass it to EmptyAssistantState.

Files to modify

File                                                     Change
apps/mesh/src/storage/ports.ts                           Add listByStatus() to ThreadStoragePort
apps/mesh/src/storage/threads.ts                         Implement listByStatus() query
apps/mesh/src/api/app.ts                                 Add startup ghost sweep (~line 318)
apps/mesh/src/web/components/chat/chat-provider.tsx      Auto-cancel on first resume failure
apps/mesh/src/web/components/chat/message/assistant.tsx  "Continue" button in EmptyAssistantState
apps/mesh/src/web/components/chat/message/pair.tsx       Pass userMessage to MessageAssistant

Edge cases

  • Multiple ghosts: Startup sweep handles all in one pass
  • Concurrent hot reloads: Force-fail is idempotent (in_progress -> failed transition only)
  • SSE reconnect: EventSource auto-reconnects after restart; ghost sweep SSE events emit after hub is ready
  • Partial messages: Any messages saved at 5-step checkpoints survive; the gap between last checkpoint and crash is lost (acceptable for dev)
  • Non-interrupted failures: The "Continue" button only shows when isLast && message === null && threadStatus === "failed" — regular failures with partial responses won't trigger it (they have content)
  • Claude Code memory: The SDK stores session history at ~/.claude/projects/, so when the user sends the continue message, the new agent instance can load thread history from both our DB and the SDK's session files
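
The display rule from the "Non-interrupted failures" case can be written as a single predicate, which makes the combinations easy to test. shouldShowContinue and ContinueButtonInput are hypothetical names; the condition itself is the one stated in this plan (isLast, no assistant message, not loading, thread status "failed").

```typescript
interface ContinueButtonInput {
  isLast: boolean;
  hasAssistantMessage: boolean; // false when message === null
  isLoading: boolean;
  threadStatus: string | undefined;
}

function shouldShowContinue(input: ContinueButtonInput): boolean {
  return (
    input.isLast &&
    !input.hasAssistantMessage &&
    !input.isLoading &&
    input.threadStatus === "failed"
  );
}
```

A failed run that produced partial content has hasAssistantMessage set to true, so it falls through to the regular rendering path.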

Verification

  1. Start a Claude Code run that takes time (e.g., "search the codebase for all TODO comments")
  2. While streaming, save a file to trigger hot reload
  3. Expected: within 1-2s, the thread transitions to "failed"
  4. UI shows "Run was interrupted by a server restart" + "Continue conversation" button
  5. Click "Continue conversation"; this sends a message with context, and the agent picks up where it left off

Future: True Resume (out of scope for now)

The Claude Agent SDK supports resume: sessionId + resumeSessionAt: messageUuid. A future enhancement could:

  • Store a unique session UUID per thread (instead of session_id: "chat")
  • On restart, re-spawn the agent with resume to continue from where it left off
  • Re-stream the resumed output to the client

This is complex (duplicate content detection, partial tool state, session file integrity) and better suited as a production feature with proper testing.
