fix(sdks): raise typed, actionable errors when sandbox dies mid-request #1419
When a sandbox is killed or expires while a request is in flight, the JS SDK threw a cryptic 'SandboxError: 2: [unknown] terminated' and the Python SDK leaked a raw httpcore.RemoteProtocolError. Both now raise a typed error explaining the sandbox was likely killed and pointing to isRunning()/is_running().
- JS: map ConnectError Code.Unknown with rawMessage 'terminated' to an actionable SandboxError; translate undici 'terminated' TypeError on the envd HTTP path (files.read/write)
- Python: wrap httpcore/httpx transport errors in SandboxException in handle_rpc_exception and the new handle_envd_api_transport_exception
- Remove never-firing @_retry decorators from streaming methods in e2b_connect (generator functions defer execution past the try/except; a working mid-stream retry would replay delivered events)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
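The point about never-firing retry decorators can be illustrated with a minimal, self-contained sketch. It is not the e2b_connect code: `retry`, `stream_events`, `astream_events`, and the local `RemoteProtocolError` class are hypothetical stand-ins. It shows that calling a decorated generator function only builds the generator, so the wrapper's try/except has already returned by the time the body raises.

```python
import functools
import inspect


class RemoteProtocolError(Exception):
    """Stand-in for httpcore.RemoteProtocolError (illustration only)."""


def retry(exc_type, attempts):
    """Hypothetical retry decorator: re-calls func when exc_type is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    # For a generator function this only creates the generator
                    # object; none of the body runs inside this try block.
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts - 1:
                        raise
        return wrapper
    return decorator


calls = []


@retry(RemoteProtocolError, 3)
def stream_events():
    calls.append("body started")
    yield "event-1"
    raise RemoteProtocolError("stream reset")


async def astream_events():
    # Async generator function: not a coroutine function.
    yield "event-1"


gen = stream_events()            # wrapper has already returned here
body_ran_at_call = bool(calls)   # the body has not started yet

received = []
outcome = "no error"
try:
    for event in gen:
        received.append(event)
except RemoteProtocolError:
    # Raised during iteration, outside the wrapper, so no retry happens.
    outcome = "error reached the caller"
```

In this sketch the body runs exactly once even though three attempts were requested, and one event has already been delivered when the error surfaces, which is why a retry that did fire mid-stream would replay it.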
🦋 Changeset detected
Latest commit: 5520ce5
The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
PR Summary
Medium Risk
Overview
Health-check plumbing is wired through Python
Unit and live-sandbox integration tests cover confirmed-kill vs passthrough behavior.
Reviewed by Cursor Bugbot for commit 5520ce5.
Package Artifacts
Built from d5f2929. Download artifacts from this workflow run.
JS SDK: npm install ./e2b-2.29.2-mishushakov-sandbox-kill-stream-errors.0.tgz
CLI: npm install ./e2b-cli-2.11.2-mishushakov-sandbox-kill-stream-errors.0.tgz
Python SDK: pip install ./e2b-2.28.2+mishushakov.sandbox.kill.stream.errors-py3-none-any.whl
A dropped connection mid-request (HTTP/2 stream reset) can come from the sandbox dying or from an intermediary (load balancer, network). On the termination signature, the SDKs now probe envd's /health endpoint: 502 confirms the sandbox was killed or reached its end of life; otherwise the hedged message pointing to isRunning()/is_running() is kept. The health-check closure is plumbed into Commands/Pty/Filesystem and the command/watch handles in JS and sync/async Python. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 32365fe.
Only a health probe that confirms the sandbox is gone produces the 'killed or reached its end of life' message; otherwise the error is the plain typed SandboxError/SandboxException without guessing at the cause. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…allback Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Reattach the handleEnvdApiError JSDoc that the new checkSandboxHealth was inserted under, and hoist HEALTH_CHECK_TIMEOUT_MS above its use
- Guard the Python health probe call so a throwing probe falls back to the generic terminated-connection message (parity with JS)
- Narrow the filesystem read/write except clauses from httpx.TransportError to (httpx.ProtocolError, httpx.NetworkError), exactly the types the transport handler wraps
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
When a connection drops mid-request, raise TimeoutError (JS) / TimeoutException (Python) only when the health probe confirms the sandbox is no longer running — consistent with how requests to an already-dead sandbox surface (502 / Code.Unavailable mappings). In all other cases the original error now propagates unchanged: no generic terminated-connection wrapping and no SandboxException wrapping of other transport errors. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ents Main moved the sync SDK to per-thread RPC/HTTP clients (#1422, #1425). Rebuild the sandbox health probe on top of that model: sync Commands and Pty now lazily create a thread-local envd HTTP client (mirroring Filesystem) and expose _check_health as a method resolving the calling thread's client, instead of capturing a single shared httpx.Client at construction time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
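A rough sketch of the per-thread client idea described in this commit, under stated assumptions: `Commands` and `_check_health` are names taken from the message, while `FakeHttpClient`, `client_factory`, `_client`, and `get_status` are hypothetical stand-ins so the example runs without a network. The real SDK uses an httpx client here.

```python
import threading


class FakeHttpClient:
    """Hypothetical stand-in for an envd HTTP client (no network access)."""

    def __init__(self):
        # Remember which thread created this client, for demonstration.
        self.owner_thread = threading.get_ident()

    def get_status(self, path):
        # Pretend every request succeeds with 204.
        return 204


class Commands:
    def __init__(self, client_factory):
        # Store a factory instead of a shared client instance.
        self._client_factory = client_factory
        self._local = threading.local()

    def _client(self):
        # Lazily create one client per calling thread.
        client = getattr(self._local, "client", None)
        if client is None:
            client = self._client_factory()
            self._local.client = client
        return client

    def _check_health(self):
        # Resolves the calling thread's client at call time.
        return self._client().get_status("/health")


commands = Commands(FakeHttpClient)
main_client = commands._client()
same_client_again = commands._client()

other_thread_clients = []
worker = threading.Thread(
    target=lambda: other_thread_clients.append(commands._client())
)
worker.start()
worker.join()
```

Because `_check_health` looks up the client when it is called rather than capturing one at construction time, a probe issued from a worker thread never touches a client that belongs to another thread.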
Consolidate the two-line catch handler into a single formatRequestError call that returns the error to throw, matching the main SDK pattern (e2b-dev/E2B#1419). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Align with e2b-dev/E2B#1419, which names the health-check-aware error wrappers handle*Error / handle_*_exception:
- JS: formatRequestError -> handleRequestError
- Python: _raise_if_sandbox_killed -> _handle_connection_error
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Problem
When a sandbox is killed (or reaches its end of life) while a request is in flight, both SDKs surfaced unusable errors:
- JS: `SandboxError: 2: [unknown] terminated`, which is typed but cryptic and says nothing about the sandbox being killed.
- Python: a raw `httpcore.RemoteProtocolError: <StreamReset stream_id:1, error_code:2, remote_reset:True>`.
This affected the whole envd streaming family (`commands.run`, PTY sessions, `files.watchDir`/`watch_dir`) and the `files.read`/`write` HTTP transfers.
The stream-reset signature alone can't distinguish the sandbox dying from an intermediary (load balancer, network) dropping the connection, so the SDKs now actively check and only transform the error when the sandbox is confirmed gone.
Fix
Health-check disambiguation. When the connection-terminated signature appears (JS: `ConnectError` `Code.Unknown` + `terminated` or Undici `TypeError: terminated`; Python: `httpcore`/`httpx` `RemoteProtocolError`), the SDK probes envd's `/health` endpoint:
- Probe confirms the sandbox is gone: `TimeoutError` (JS) / `TimeoutException` (Python): "The sandbox was killed or reached its end of life while the request was in flight." This matches how requests to an already-dead sandbox surface today (the 502 / `Code.Unavailable` mappings raise the timeout error type), so the exception type no longer depends on whether the sandbox died just before or just during the request.
The probe (5s timeout) runs only on the termination signature, never on the happy path or for other errors. A health-check closure is plumbed into `Commands`/`Pty`/`Filesystem` and the command/watch handles in JS and sync/async Python; `Commands`/`Pty` now receive the envd API client in their constructors (internal signature change).
Cleanup (`e2b_connect/client.py`): removed the `@_retry(RemoteProtocolError, 3)` decorators from `call_server_stream`/`acall_server_stream`. They never executed: `inspect.iscoroutinefunction` is false for (async) generator functions, and calling a generator function doesn't run its body, so the wrapper's `try`/`except` could never fire. A working mid-stream retry would be wrong anyway (it would replay already-delivered events). Unary retries are unchanged.
Before / after
If the health probe does not confirm the sandbox is gone (e.g. a load balancer dropped the connection, or local envd in debug mode), the original error propagates unchanged — the SDK only makes a claim when it has verified it.
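The decision logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the SDK's implementation: `handle_connection_error`, `check_health`, and the local exception classes are hypothetical names, and the probe is modeled as a callable returning the HTTP status of envd's `/health` endpoint, with 502 treated as the confirmed-gone signal as the commit messages describe.

```python
class RemoteProtocolError(Exception):
    """Stand-in for httpcore.RemoteProtocolError (illustration only)."""


class TimeoutException(Exception):
    """Stand-in for the SDK's TimeoutException (illustration only)."""


KILLED_MESSAGE = (
    "The sandbox was killed or reached its end of life "
    "while the request was in flight."
)


def handle_connection_error(err, check_health):
    """Return the exception the caller should raise for `err`.

    `check_health` is a hypothetical zero-argument callable returning the
    HTTP status code of the /health probe; it may raise on failure.
    """
    # Only the termination signature triggers a probe.
    if not isinstance(err, RemoteProtocolError):
        return err
    try:
        status = check_health()
    except Exception:
        # A failing probe proves nothing, so keep the original error.
        return err
    if status == 502:
        # Confirmed gone: surface the same type as an already-dead sandbox.
        exc = TimeoutException(KILLED_MESSAGE)
        exc.__cause__ = err
        return exc
    # Sandbox still running or state unknown: make no claim.
    return err


def failing_probe():
    raise OSError("probe could not connect")


reset = RemoteProtocolError("stream reset")
confirmed = handle_connection_error(reset, lambda: 502)
still_running = handle_connection_error(reset, lambda: 204)
probe_failed = handle_connection_error(reset, failing_probe)

probe_calls = []
unrelated = ValueError("bad argument")
passthrough = handle_connection_error(
    unrelated, lambda: probe_calls.append(1) or 502
)
```

Returning the exception rather than raising it inside the helper keeps the call site a single `raise handle_connection_error(...)` statement, which is one way to read the "returns the error to throw" pattern mentioned in the later commits.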
Notes
- `format: 'stream'` body after `files.read` returns (JS `ReadableStream` consumption happens in user code). Python is fully covered since httpx buffers non-streaming responses inside the request call.
Tests
- `TimeoutError`/`TimeoutException`, raw-error passthrough for running/unknown/probe-failure, health check skipped for unrelated errors: 28 Python + 25 JS assertions pass.
- `sleep 60`, kill the sandbox, assert `wait()` raises `TimeoutError`/`TimeoutException` with the confirmed kill message: JS, sync Python, and async Python.
🤖 Generated with Claude Code