
Multi-pod HITL: no recovery mechanism when pod crashes during can_use_tool wait #871

@Smizr

Description

We're building a production multi-pod service using ClaudeSDKClient with a Human-in-the-Loop approval flow. We've read through the existing issues (#96, #304) and examples, and we have a working approach for the happy path — but we're stuck on the pod crash/restart scenario.

Our setup

  • Multi-pod Kubernetes deployment (any pod can serve any request)
  • SessionStore (v0.1.66) for transcript durability across pods
  • can_use_tool callback as the HITL gate (no timeout — per Permission tool #96)

What works (but we don't want)

We can block inside can_use_tool with an asyncio.Future, coordinated across pods via Postgres LISTEN/NOTIFY. When the user approves, the future resolves and the agent continues in the same stream.
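
For reference, a minimal sketch of that approach. `persist_pending_request` and the `hitl_decisions` channel/payload shape are our own application conventions (not SDK features); the `PermissionResult` types are from the SDK as we understand them:

```python
import asyncio
import json
import uuid

import asyncpg
from claude_agent_sdk import PermissionResultAllow, PermissionResultDeny

# Futures for approvals being awaited on *this* pod, keyed by request id.
_pending: dict[str, asyncio.Future] = {}

async def can_use_tool(tool_name, input_data, context):
    """HITL gate: record the request, then block until any pod NOTIFYs a decision."""
    request_id = str(uuid.uuid4())
    # persist_pending_request is our own table write (hypothetical helper),
    # so the approval UI on any pod can see and resolve the request.
    await persist_pending_request(request_id, tool_name, input_data)
    fut = asyncio.get_running_loop().create_future()
    _pending[request_id] = fut
    decision = await fut  # resolved by the LISTEN task below; can take hours
    if decision == "allow":
        return PermissionResultAllow()
    return PermissionResultDeny(message="Declined by user")

async def listen_for_decisions(dsn: str) -> None:
    """One long-lived task per pod: resolve local futures when NOTIFY arrives."""
    conn = await asyncpg.connect(dsn)  # kept alive for the pod lifetime; reconnect logic elided

    def on_notify(connection, pid, channel, payload):
        msg = json.loads(payload)  # we NOTIFY {"request_id": ..., "decision": "allow"|"deny"}
        fut = _pending.pop(msg["request_id"], None)
        if fut is not None and not fut.done():
            fut.set_result(msg["decision"])

    await conn.add_listener("hitl_decisions", on_notify)
```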

However, we don't want to keep the CLI subprocess alive while waiting for the user. User decisions can take minutes or hours. In a multi-tenant service with many concurrent sessions, holding a live subprocess per waiting session (memory, CPU, connection) doesn't scale. We'd prefer to stop the process, persist the state, and resume when the user decides.

What breaks: stop process + resume

When the pod running the agent dies while can_use_tool is blocking, the stream and CLI subprocess are lost. There is no SDK mechanism to recover a pending can_use_tool gate on a different pod. The only option is to start a new session via --resume, but:

  • No built-in HITL recovery on resume — after a pod crash, the application must manage the full HITL state externally (persist pending decisions, detect already-made decisions on the resumed session, return them from can_use_tool, etc.). The SDK has no awareness of interrupted HITL gates; our current recovery approach is sketched after this list.
  • Non-transcript state is lost: materialize_resume_session() only restores transcripts and auth files. Plan files and other config-dir state are not included (see separate issue).
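
To make the external recovery concrete, here is the shape of our resume path. This is a sketch: `decision_store` is our own Postgres-backed table with hypothetical method names; `ClaudeAgentOptions(resume=...)`, `can_use_tool`, and the `PermissionResult` types are SDK features as we understand them; and the pattern relies on the model re-requesting the gated tool call after resume, which is not guaranteed:

```python
from claude_agent_sdk import (
    ClaudeAgentOptions,
    ClaudeSDKClient,
    PermissionResultAllow,
    PermissionResultDeny,
)

async def resume_session(session_id: str, decision_store) -> None:
    """Re-attach to a crashed session on any pod once the user has decided."""

    async def gated(tool_name, input_data, context):
        # decision_store is our own table; lookup/record_pending are hypothetical.
        decision = await decision_store.lookup(session_id, tool_name)
        if decision == "allow":
            return PermissionResultAllow()
        if decision == "deny":
            return PermissionResultDeny(message="Declined by user")
        # Nothing on file: treat this as a fresh gate; record it and end the
        # turn instead of holding the subprocess open again.
        await decision_store.record_pending(session_id, tool_name, input_data)
        return PermissionResultDeny(message="Awaiting human approval")

    options = ClaudeAgentOptions(can_use_tool=gated, resume=session_id)
    async with ClaudeSDKClient(options=options) as client:
        await client.query("Continue with the previously requested tool call.")
        async for message in client.receive_response():
            ...  # normal stream handling
```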

Questions

  1. Is there a recommended pattern for HITL in multi-pod deployments that survives pod restarts? The wait-in-callback approach works great until the pod dies. The two-request resume approach works but loses non-transcript state.

  2. Has the team considered a built-in mechanism for durable HITL gates? For example, a way to persist pending can_use_tool decisions in SessionStore so that on resume the SDK can automatically resolve them, without the application needing to re-implement the coordination logic. (A strawman sketch of what we have in mind follows these questions.)

  3. Is there a way to stop/pause the CLI subprocess while waiting for the user's answer? While can_use_tool blocks, the CLI subprocess stays alive consuming resources (memory, connection). In a multi-tenant service with many concurrent sessions waiting for user approval, this doesn't scale well. Is there a way to gracefully stop the process during the wait and resume it later without losing state?

  4. How should the user's HITL decision appear in the chat history? When a user approves/declines a tool call, we want that choice to show as a message in the conversation (e.g., "Approved"). can_use_tool doesn't write to the transcript, and sdk_get_messages() only returns user/assistant messages. What's the recommended way to persist a HITL decision as a visible message at the correct position?

  5. Are there other teams running multi-pod HITL with ClaudeSDKClient? We'd love to learn about patterns others have found for this use case.
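
For question 2, a strawman of the kind of API we have in mind. Every SDK-looking name here is imagined, not a current API; `lookup_user_decision` and `gate_callback` stand in for our own application logic:

```python
async def recover_gates(store, session_id: str) -> ClaudeAgentOptions:
    """Strawman only: get_pending_permissions / resolve_permission do not exist today."""
    # After a crash, a new pod asks the store for gates that never resolved:
    pending = await store.get_pending_permissions(session_id)          # imagined
    for gate in pending:
        decision = await lookup_user_decision(gate.request_id)         # our app logic
        if decision is not None:
            await store.resolve_permission(gate.request_id, decision)  # imagined
    # Resuming would then feed resolved decisions back into can_use_tool
    # automatically, re-raising only still-unresolved gates.
    return ClaudeAgentOptions(resume=session_id, can_use_tool=gate_callback)
```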

Environment

  • claude-agent-sdk v0.1.66
  • Python 3.14
  • Kubernetes, Postgres (asyncpg), FastAPI
