We're building a production multi-pod service using ClaudeSDKClient with a Human-in-the-Loop approval flow. We've read through the existing issues (#96, #304) and examples, and we have a working approach for the happy path — but we're stuck on the pod crash/restart scenario.
Our setup
- Multi-pod Kubernetes deployment (any pod can serve any request)
- SessionStore (v0.1.66) for transcript durability across pods
- can_use_tool callback as the HITL gate (no timeout — per Permission tool #96)
What works (but we don't want)
We can block inside can_use_tool with an asyncio.Future, coordinated across pods via Postgres LISTEN/NOTIFY. When the user approves, the future resolves and the agent continues in the same stream.
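A minimal sketch of this wait-in-callback gate. SDK types are replaced with plain dicts for illustration, the Postgres LISTEN/NOTIFY wiring is stubbed out, and the names `pending_approvals` / `resolve_approval` are our own, not SDK API:

```python
import asyncio
import uuid

# request_id -> Future resolved when the user's decision arrives
pending_approvals: dict[str, asyncio.Future] = {}

async def can_use_tool(tool_name: str, tool_input: dict) -> dict:
    """HITL gate: block until the user's decision arrives.

    In the real service the decision may be made on another pod and is
    relayed via Postgres NOTIFY; here resolve_approval() stands in for
    the LISTEN handler.
    """
    request_id = str(uuid.uuid4())
    fut = asyncio.get_running_loop().create_future()
    pending_approvals[request_id] = fut
    try:
        # ...persist (request_id, tool_name, tool_input) and notify the UI...
        decision = await fut  # blocks for minutes or hours
    finally:
        del pending_approvals[request_id]
    # Shape loosely mirrors the SDK's allow/deny result, simplified to a dict.
    return {"behavior": "allow" if decision else "deny"}

def resolve_approval(request_id: str, approved: bool) -> None:
    """Called by the Postgres LISTEN handler when a NOTIFY arrives."""
    fut = pending_approvals.get(request_id)
    if fut is not None and not fut.done():
        fut.set_result(approved)
```

This is the shape that works today: the subprocess stays alive the whole time the future is pending, which is exactly the cost we want to avoid.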
However, we don't want to keep the CLI subprocess alive while waiting for the user. User decisions can take minutes or hours. In a multi-tenant service with many concurrent sessions, holding a live subprocess per waiting session (memory, CPU, connection) doesn't scale. We'd prefer to stop the process, persist the state, and resume when the user decides.
What breaks: stop process + resume
When the pod running the agent dies while can_use_tool is blocking, the stream and CLI subprocess are lost. There is no SDK mechanism to recover a pending can_use_tool gate on a different pod. The only option is to start a new session via --resume, but:
- No built-in HITL recovery on resume — after a pod crash, the application must manage the full HITL state externally (persist pending decisions, detect already-made decisions on the resumed session, return them from can_use_tool, etc.). The SDK has no awareness of interrupted HITL gates.
- Non-transcript state is lost — materialize_resume_session() only restores transcripts and auth files. Plan files and other config-dir state are not included (see separate issue).
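To make "manage the full HITL state externally" concrete: on the resumed session, our gate first consults a persisted decision before blocking again. A rough sketch — the `decision_store` and the `(session_id, tool_name, input_hash)` key are our own convention, not an SDK feature:

```python
import hashlib
import json

# Stands in for a Postgres table of decisions the user made while the
# original pod was down: key -> "allow" | "deny"
decision_store: dict[str, str] = {}

def decision_key(session_id: str, tool_name: str, tool_input: dict) -> str:
    """Identify a tool call across restarts. The SDK gives us no stable
    id on resume, so we key on a hash of the call itself (our own
    convention; it collides if the agent repeats an identical call)."""
    digest = hashlib.sha256(
        json.dumps(tool_input, sort_keys=True).encode()
    ).hexdigest()
    return f"{session_id}:{tool_name}:{digest}"

async def can_use_tool_with_recovery(session_id, tool_name, tool_input):
    key = decision_key(session_id, tool_name, tool_input)
    prior = decision_store.get(key)
    if prior is not None:
        # Decision arrived while the pod was down: replay it instead of
        # asking the user again.
        return {"behavior": prior}
    # ...otherwise fall through to the blocking wait-in-callback gate...
    return {"behavior": "deny"}  # placeholder for the blocking path
```

The hash-based key is fragile by design; a stable tool_use_id surfaced by the SDK across resume would remove the collision caveat entirely.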
Questions
Is there a recommended pattern for HITL in multi-pod deployments that survives pod restarts? The wait-in-callback approach works great until the pod dies. The two-request resume approach works but loses non-transcript state.
Has the team considered a built-in mechanism for durable HITL gates? For example, a way to persist pending can_use_tool decisions in SessionStore so that on resume the SDK can automatically resolve them without the application needing to re-implement the coordination logic.
Is there a way to stop/pause the CLI subprocess while waiting for the user's answer? While can_use_tool blocks, the CLI subprocess stays alive consuming resources (memory, connection). In a multi-tenant service with many concurrent sessions waiting for user approval, this doesn't scale well. Is there a way to gracefully stop the process during the wait and resume it later without losing state?
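The workaround we can build today is a two-request flow: instead of blocking, the gate parks the request and the app tears down the subprocess; a later `--resume` session replays the stored decision. A sketch of the application-side state machine (all names are ours; the actual teardown and resume calls on ClaudeSDKClient are elided):

```python
import enum
from typing import Optional

class GateState(enum.Enum):
    PENDING = "pending"   # waiting for the user, subprocess stopped
    ALLOWED = "allowed"
    DENIED = "denied"

# Stands in for a Postgres table of parked tool calls.
parked: dict[str, GateState] = {}

def park_request(request_id: str) -> None:
    """First request: record the pending gate; the caller then stops the
    CLI subprocess instead of keeping it alive for hours."""
    parked[request_id] = GateState.PENDING

def record_decision(request_id: str, approved: bool) -> None:
    """User decided (possibly handled by another pod)."""
    parked[request_id] = GateState.ALLOWED if approved else GateState.DENIED

def decision_for_resume(request_id: str) -> Optional[str]:
    """Second request: on --resume, can_use_tool consults this before
    blocking; returns 'allow'/'deny', or None if still pending."""
    state = parked.get(request_id)
    if state is GateState.ALLOWED:
        return "allow"
    if state is GateState.DENIED:
        return "deny"
    return None
```

This is what we mean by "stop the process, persist the state, and resume" — it works, but every team has to reinvent it, which is why a built-in pause/resume around can_use_tool would help.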
How should the user's HITL decision appear in the chat history? When a user approves/declines a tool call, we want that choice to show as a message in the conversation (e.g., "Approved"). can_use_tool doesn't write to the transcript, and sdk_get_messages() only returns user/assistant messages. What's the recommended way to persist a HITL decision as a visible message at the correct position?
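Our current workaround for this is an application-side overlay: HITL decisions are stored with a timestamp and merged into the transcript at render time, since the SDK transcript never sees them. A sketch — the `ChatEntry` shape is our own, not the output of `sdk_get_messages()`:

```python
from dataclasses import dataclass

@dataclass
class ChatEntry:
    ts: float   # ordering key shared with transcript messages
    role: str   # "user" | "assistant" | "hitl"
    text: str

def merge_transcript(messages: list[ChatEntry],
                     decisions: list[ChatEntry]) -> list[ChatEntry]:
    """Interleave HITL decision records into the transcript by timestamp
    so 'Approved'/'Declined' renders next to the tool call it gated."""
    return sorted(messages + decisions, key=lambda e: e.ts)
```

Timestamp ordering is brittle if the tool call and the decision land in the same instant, which is why we'd prefer a first-class way to anchor a decision to its tool_use position in the transcript.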
Are there other teams running multi-pod HITL with ClaudeSDKClient? We'd love to learn about patterns others have found for this use case.
Environment

claude-agent-sdk v0.1.66