
Multi-pod HITL: no recovery mechanism when pod crashes during can_use_tool wait #871

@Smizr

Description

We're building a production multi-pod service using ClaudeSDKClient with a Human-in-the-Loop approval flow. We've read through the existing issues (#96, #304) and examples, and we have a working approach for the happy path — but we're stuck on the pod crash/restart scenario.

Our setup

  • Multi-pod Kubernetes deployment (any pod can serve any request)
  • SessionStore (v0.1.66) for transcript durability across pods
  • can_use_tool callback as the HITL gate (no timeout — per Permission tool #96)

What works (but we don't want)

We can block inside can_use_tool with an asyncio.Future, coordinated across pods via Postgres LISTEN/NOTIFY. When the user approves, the future resolves and the agent continues in the same stream.
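
For reference, a minimal sketch of that approach. `persist_pending_request` and the `hitl_decisions` channel/payload shape are our own application conventions (not SDK features); the `PermissionResult` types are from the SDK as we understand them:

```python
import asyncio
import json
import uuid

import asyncpg
from claude_agent_sdk import PermissionResultAllow, PermissionResultDeny

# Futures for approvals being awaited on *this* pod, keyed by request id.
_pending: dict[str, asyncio.Future] = {}

async def can_use_tool(tool_name, input_data, context):
    """HITL gate: record the request, then block until any pod NOTIFYs a decision."""
    request_id = str(uuid.uuid4())
    # persist_pending_request is our own table write (hypothetical helper),
    # so the approval UI on any pod can see and resolve the request.
    await persist_pending_request(request_id, tool_name, input_data)
    fut = asyncio.get_running_loop().create_future()
    _pending[request_id] = fut
    decision = await fut  # resolved by the LISTEN task below; can take hours
    if decision == "allow":
        return PermissionResultAllow()
    return PermissionResultDeny(message="Declined by user")

async def listen_for_decisions(dsn: str) -> None:
    """One long-lived task per pod: resolve local futures when NOTIFY arrives."""
    conn = await asyncpg.connect(dsn)  # kept alive for the pod lifetime; reconnect logic elided

    def on_notify(connection, pid, channel, payload):
        msg = json.loads(payload)  # we NOTIFY {"request_id": ..., "decision": "allow"|"deny"}
        fut = _pending.pop(msg["request_id"], None)
        if fut is not None and not fut.done():
            fut.set_result(msg["decision"])

    await conn.add_listener("hitl_decisions", on_notify)
```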

However, we don't want to keep the CLI subprocess alive while waiting for the user. User decisions can take minutes or hours. In a multi-tenant service with many concurrent sessions, holding a live subprocess per waiting session (memory, CPU, connection) doesn't scale. We'd prefer to stop the process, persist the state, and resume when the user decides.

What breaks: stop process + resume

When the pod running the agent dies while can_use_tool is blocking, the stream and CLI subprocess are lost. There is no SDK mechanism to recover a pending can_use_tool gate on a different pod. The only option is to start a new session via --resume, but:

  • No built-in HITL recovery on resume — after a pod crash, the application must manage the full HITL state externally (persist pending decisions, detect already-made decisions on the resumed session, return them from can_use_tool, etc.). The SDK has no awareness of interrupted HITL gates; our current recovery approach is sketched after this list.
  • Non-transcript state is lost: materialize_resume_session() only restores transcripts and auth files. Plan files and other config-dir state are not included (see separate issue).
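
To make the external recovery concrete, here is the shape of our resume path. This is a sketch: `decision_store` is our own Postgres-backed table with hypothetical method names; `ClaudeAgentOptions(resume=...)`, `can_use_tool`, and the `PermissionResult` types are SDK features as we understand them; and the pattern relies on the model re-requesting the gated tool call after resume, which is not guaranteed:

```python
from claude_agent_sdk import (
    ClaudeAgentOptions,
    ClaudeSDKClient,
    PermissionResultAllow,
    PermissionResultDeny,
)

async def resume_session(session_id: str, decision_store) -> None:
    """Re-attach to a crashed session on any pod once the user has decided."""

    async def gated(tool_name, input_data, context):
        # decision_store is our own table; lookup/record_pending are hypothetical.
        decision = await decision_store.lookup(session_id, tool_name)
        if decision == "allow":
            return PermissionResultAllow()
        if decision == "deny":
            return PermissionResultDeny(message="Declined by user")
        # Nothing on file: treat this as a fresh gate; record it and end the
        # turn instead of holding the subprocess open again.
        await decision_store.record_pending(session_id, tool_name, input_data)
        return PermissionResultDeny(message="Awaiting human approval")

    options = ClaudeAgentOptions(can_use_tool=gated, resume=session_id)
    async with ClaudeSDKClient(options=options) as client:
        await client.query("Continue with the previously requested tool call.")
        async for message in client.receive_response():
            ...  # normal stream handling
```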

Questions

  1. Is there a recommended pattern for HITL in multi-pod deployments that survives pod restarts? The wait-in-callback approach works great until the pod dies. The two-request resume approach works but loses non-transcript state.

  2. Has the team considered a built-in mechanism for durable HITL gates? For example, a way to persist pending can_use_tool decisions in SessionStore so that on resume the SDK can automatically resolve them, without the application needing to re-implement the coordination logic. (A strawman sketch of what we have in mind follows these questions.)

  3. Is there a way to stop/pause the CLI subprocess while waiting for the user's answer? While can_use_tool blocks, the CLI subprocess stays alive consuming resources (memory, connection). In a multi-tenant service with many concurrent sessions waiting for user approval, this doesn't scale well. Is there a way to gracefully stop the process during the wait and resume it later without losing state?

  4. How should the user's HITL decision appear in the chat history? When a user approves/declines a tool call, we want that choice to show as a message in the conversation (e.g., "Approved"). can_use_tool doesn't write to the transcript, and sdk_get_messages() only returns user/assistant messages. What's the recommended way to persist a HITL decision as a visible message at the correct position?

  5. Are there other teams running multi-pod HITL with ClaudeSDKClient? We'd love to learn about patterns others have found for this use case.
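
For question 2, a strawman of the kind of API we have in mind. Every SDK-looking name here is imagined, not a current API; `lookup_user_decision` and `gate_callback` stand in for our own application logic:

```python
async def recover_gates(store, session_id: str) -> ClaudeAgentOptions:
    """Strawman only: get_pending_permissions / resolve_permission do not exist today."""
    # After a crash, a new pod asks the store for gates that never resolved:
    pending = await store.get_pending_permissions(session_id)          # imagined
    for gate in pending:
        decision = await lookup_user_decision(gate.request_id)         # our app logic
        if decision is not None:
            await store.resolve_permission(gate.request_id, decision)  # imagined
    # Resuming would then feed resolved decisions back into can_use_tool
    # automatically, re-raising only still-unresolved gates.
    return ClaudeAgentOptions(resume=session_id, can_use_tool=gate_callback)
```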

Environment

  • claude-agent-sdk v0.1.66
  • Python 3.14
  • Kubernetes, Postgres (asyncpg), FastAPI
