fix(sdks): raise typed, actionable errors when sandbox dies mid-request #1419

Merged: mishushakov merged 8 commits into main from mishushakov/sandbox-kill-stream-errors on Jun 15, 2026
@mishushakov mishushakov commented Jun 11, 2026

Problem

When a sandbox is killed (or reaches its end of life) while a request is in flight, both SDKs surfaced unusable errors:

  • JS: SandboxError: 2: [unknown] terminated — typed, but cryptic and says nothing about the sandbox being killed.
  • Python: leaked a completely raw httpcore.RemoteProtocolError: <StreamReset stream_id:1, error_code:2, remote_reset:True>.

This affected the whole envd streaming family (commands.run, PTY sessions, files.watchDir/watch_dir) and the files.read/write HTTP transfers.

The stream-reset signature alone can't distinguish the sandbox dying from an intermediary (load balancer, network) dropping the connection — so the SDKs now actively check, and only transform the error when the sandbox is confirmed gone.

Fix

Health-check disambiguation. When the connection-terminated signature appears (JS: ConnectError Code.Unknown + terminated or Undici TypeError: terminated; Python: httpcore/httpx RemoteProtocolError), the SDK probes envd's /health endpoint:

  • 502 (sandbox confirmed gone) → TimeoutError (JS) / TimeoutException (Python): "The sandbox was killed or reached its end of life while the request was in flight." This matches how requests to an already-dead sandbox surface today (the 502 / Code.Unavailable mappings raise the timeout error type), so the exception type no longer depends on whether the sandbox died just before or just during the request.
  • Anything else (still running, or probe inconclusive) → the original error propagates unchanged, exactly as before this PR.

The probe (5s timeout) runs only on the termination signature, never on the happy path or for other errors. A health-check closure is plumbed into Commands/Pty/Filesystem and the command/watch handles in JS and sync/async Python; Commands/Pty now receive the envd API client in their constructors (internal signature change).
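
As a rough illustration of the disambiguation described above, the decision can be sketched in Python. The exception classes, `handle_connection_error`, and the `probe_health` callable are hypothetical stand-ins, not the SDK's actual internals; the probe is assumed to return the HTTP status of envd's /health, or None when it is inconclusive.

```python
from typing import Callable, Optional


class RemoteProtocolError(Exception):
    """Hypothetical stand-in for the httpcore/httpx RemoteProtocolError."""


class TimeoutException(Exception):
    """Hypothetical stand-in for e2b.exceptions.TimeoutException."""


KILLED_MESSAGE = (
    "The sandbox was killed or reached its end of life "
    "while the request was in flight."
)


def handle_connection_error(
    err: Exception,
    probe_health: Callable[[], Optional[int]],
) -> Exception:
    """Return the exception the caller should raise for `err`."""
    # Errors without the termination signature never trigger a probe.
    if not isinstance(err, RemoteProtocolError):
        return err
    # 502 from /health is taken as confirmation that the sandbox is gone.
    if probe_health() == 502:
        return TimeoutException(f"{err}: {KILLED_MESSAGE}")
    # Still running, or probe inconclusive: keep the original error.
    return err
```

Returning the error instead of raising it inside the helper keeps the `raise` at the call site, so the traceback still points at the request that failed.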

Cleanup (e2b_connect/client.py): removed the @_retry(RemoteProtocolError, 3) decorators from call_server_stream/acall_server_stream. They never executed — inspect.iscoroutinefunction is false for (async) generator functions, and calling a generator function doesn't run its body, so the wrapper's try/except could never fire. A working mid-stream retry would be wrong anyway (it would replay already-delivered events). Unary retries are unchanged.
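
The reason the decorators never fired can be reproduced with the standard library alone. The `retry` decorator below is a simplified, hypothetical stand-in for `_retry` (the real helper may differ); the point is only that calling a generator function returns a generator without running its body, so the wrapper's try/except has already exited by the time the stream fails.

```python
import functools
import inspect


def retry(exc_type, attempts):
    """Simplified retry decorator (hypothetical stand-in for _retry)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    # For a generator function this only creates the generator.
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts - 1:
                        raise
        return wrapper
    return decorator


body_runs = []


@retry(ConnectionError, 3)
def stream_events():
    body_runs.append(1)  # records each time the body actually starts
    yield "event-1"
    raise ConnectionError("stream reset")


async def astream_events():
    yield "event-1"


gen = stream_events()
body_ran_at_call = bool(body_runs)  # False: the body has not started yet

received = []
try:
    for event in gen:
        received.append(event)
except ConnectionError:
    # The error surfaces here, in the consumer, after "event-1" was delivered.
    pass

# An async generator function is not a coroutine function.
is_coroutine_fn = inspect.iscoroutinefunction(astream_events)
is_async_gen_fn = inspect.isasyncgenfunction(astream_events)
```

After this runs, `body_runs` holds a single entry: the body executed once and was never retried, and a retry that did work would have delivered "event-1" a second time.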

Before / after

```ts
const sandbox = await Sandbox.create()
const cmd = await sandbox.commands.run('sleep 60', { background: true })
await sandbox.kill() // e.g. from another process
await cmd.wait()
// before: SandboxError: 2: [unknown] terminated
// after:  TimeoutError: [unknown] terminated: The sandbox was killed or reached
//         its end of life while the request was in flight.
```

```python
sandbox = Sandbox.create()
cmd = sandbox.commands.run("sleep 60", background=True)
sandbox.kill()
cmd.wait()
# before: httpcore.RemoteProtocolError: <StreamReset stream_id:1, error_code:2, remote_reset:True>  (not an e2b type!)
# after:  e2b.exceptions.TimeoutException: <StreamReset ...>: The sandbox was killed
#         or reached its end of life while the request was in flight.
```

If the health probe does not confirm the sandbox is gone (e.g. a load balancer dropped the connection, or local envd in debug mode), the original error propagates unchanged — the SDK only makes a claim when it has verified it.

Notes

  • Not covered: errors raised while consuming a format: 'stream' body after files.read returns (JS ReadableStream consumption happens in user code). Python is fully covered since httpx buffers non-streaming responses inside the request call.

Tests

  • Unit: confirmed-kill → TimeoutError/TimeoutException, raw-error passthrough for running/unknown/probe-failure, health check skipped for unrelated errors — 28 Python + 25 JS assertions pass.
  • Integration (run against live sandboxes, all passing): start sleep 60, kill the sandbox, assert wait() raises TimeoutError/TimeoutException with the confirmed kill message — JS, sync Python, and async Python.

🤖 Generated with Claude Code

When a sandbox is killed or expires while a request is in flight, the JS SDK
threw a cryptic 'SandboxError: 2: [unknown] terminated' and the Python SDK
leaked a raw httpcore.RemoteProtocolError. Both now raise a typed error
explaining the sandbox was likely killed and pointing to isRunning()/is_running().

- JS: map ConnectError Code.Unknown with rawMessage 'terminated' to an
  actionable SandboxError; translate undici 'terminated' TypeError on the
  envd HTTP path (files.read/write)
- Python: wrap httpcore/httpx transport errors in SandboxException in
  handle_rpc_exception and the new handle_envd_api_transport_exception
- Remove never-firing @_retry decorators from streaming methods in
  e2b_connect (generator functions defer execution past the try/except;
  a working mid-stream retry would replay delivered events)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

changeset-bot Bot commented Jun 11, 2026

🦋 Changeset detected

Latest commit: 5520ce5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name              Type
e2b               Patch
@e2b/python-sdk   Patch

cursor Bot commented Jun 11, 2026

PR Summary

Medium Risk
Broad error-path changes across streaming commands, PTY, filesystem, and watch in both SDKs; behavior only changes when health confirms the sandbox is dead, but mis-probes could affect error typing.

Overview
When a connection drops mid-request (streaming RPC for commands, PTY, directory watch, and HTTP filesystem read/write), both JS and Python SDKs now probe envd /health (5s timeout) only for the known “terminated” / stream-reset signatures. If the probe returns 502 (sandbox gone), callers get TimeoutError / TimeoutException with a message that the sandbox was killed or reached end of life—aligned with errors for requests to an already-dead sandbox. If the sandbox is still up or the probe is inconclusive, the original error is unchanged.

Health-check plumbing is wired through Commands, Pty, Filesystem, and command/watch handles; internal constructors now take the envd API client where needed for probes. Python sync Commands/Pty use thread-local httpx clients for health checks.

Python e2b_connect: removed non-functional @_retry on server-stream RPC helpers (generators never hit the wrapper); added a comment that mid-stream retry would replay events.

Unit and live-sandbox integration tests cover confirmed-kill vs passthrough behavior.

Reviewed by Cursor Bugbot for commit 5520ce5.

github-actions Bot commented Jun 11, 2026

Package Artifacts

Built from d5f2929. Download artifacts from this workflow run.

JS SDK (e2b@2.29.2-mishushakov-sandbox-kill-stream-errors.0):

npm install ./e2b-2.29.2-mishushakov-sandbox-kill-stream-errors.0.tgz

CLI (@e2b/cli@2.11.2-mishushakov-sandbox-kill-stream-errors.0):

npm install ./e2b-cli-2.11.2-mishushakov-sandbox-kill-stream-errors.0.tgz

Python SDK (e2b==2.28.2+mishushakov-sandbox-kill-stream-errors):

pip install ./e2b-2.28.2+mishushakov.sandbox.kill.stream.errors-py3-none-any.whl

@mishushakov mishushakov marked this pull request as draft June 11, 2026 12:19
A dropped connection mid-request (HTTP/2 stream reset) can come from the
sandbox dying or from an intermediary (load balancer, network). On the
termination signature, the SDKs now probe envd's /health endpoint:
502 confirms the sandbox was killed or reached its end of life; otherwise
the hedged message pointing to isRunning()/is_running() is kept. The
health-check closure is plumbed into Commands/Pty/Filesystem and the
command/watch handles in JS and sync/async Python.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 32365fe.

Comment thread packages/js-sdk/src/envd/rpc.ts
mishushakov and others added 3 commits June 11, 2026 17:07
Only a health probe that confirms the sandbox is gone produces the
'killed or reached its end of life' message; otherwise the error is the
plain typed SandboxError/SandboxException without guessing at the cause.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…allback

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Reattach the handleEnvdApiError JSDoc that the new checkSandboxHealth
  was inserted under, and hoist HEALTH_CHECK_TIMEOUT_MS above its use
- Guard the Python health probe call so a throwing probe falls back to
  the generic terminated-connection message (parity with JS)
- Narrow the filesystem read/write except clauses from
  httpx.TransportError to (httpx.ProtocolError, httpx.NetworkError) —
  exactly the types the transport handler wraps

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
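
The probe guard from the commit above might look roughly like the following. `guarded_probe` and the shape of the `probe` callable are assumptions made for illustration; only the 5s timeout comes from the PR description.

```python
from typing import Callable, Optional

# 5 seconds, per the PR description; the constant name is hypothetical.
HEALTH_CHECK_TIMEOUT_S = 5.0


def guarded_probe(probe: Callable[[float], int]) -> Optional[int]:
    """Run the health probe; any failure of the probe itself is inconclusive."""
    try:
        return probe(HEALTH_CHECK_TIMEOUT_S)
    except Exception:
        # A throwing probe must not mask the original connection error,
        # so the caller sees None and takes its fallback path.
        return None
```

Swallowing the probe's own exception here is deliberate: the caller already holds the error that matters, and the probe is only a hint about its cause.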
@mishushakov mishushakov marked this pull request as ready for review June 11, 2026 15:31
mishushakov and others added 3 commits June 11, 2026 17:53
When a connection drops mid-request, raise TimeoutError (JS) /
TimeoutException (Python) only when the health probe confirms the
sandbox is no longer running — consistent with how requests to an
already-dead sandbox surface (502 / Code.Unavailable mappings).
In all other cases the original error now propagates unchanged:
no generic terminated-connection wrapping and no SandboxException
wrapping of other transport errors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ents

Main moved the sync SDK to per-thread RPC/HTTP clients (#1422, #1425).
Rebuild the sandbox health probe on top of that model: sync Commands and
Pty now lazily create a thread-local envd HTTP client (mirroring
Filesystem) and expose _check_health as a method resolving the calling
thread's client, instead of capturing a single shared httpx.Client at
construction time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
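
A minimal sketch of the per-thread model described in the commit above, using only `threading.local`. `FakeHttpClient` and the method bodies are hypothetical stand-ins (the real SDK uses httpx clients and may structure this differently); the names `Commands` and `_check_health` are taken from the commit message.

```python
import threading


class FakeHttpClient:
    """Hypothetical stand-in for a per-thread envd HTTP client."""

    def get_status(self, path: str) -> int:
        return 200  # pretend envd answered; a real client would do I/O


class Commands:
    def __init__(self) -> None:
        # Nothing is captured at construction time except the thread-local slot.
        self._local = threading.local()

    def _envd_client(self) -> FakeHttpClient:
        # Lazily create one client per calling thread.
        client = getattr(self._local, "client", None)
        if client is None:
            client = FakeHttpClient()
            self._local.client = client
        return client

    def _check_health(self) -> int:
        # Resolves the calling thread's client on every call.
        return self._envd_client().get_status("/health")


commands = Commands()
clients = {}


def worker(name: str) -> None:
    commands._check_health()
    # Keep a reference so the clients can be compared after the threads exit.
    clients[name] = commands._envd_client()


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker ends up with its own client object, while repeated calls from one thread reuse the same one.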
@mishushakov mishushakov enabled auto-merge (squash) June 15, 2026 18:55
@mishushakov mishushakov merged commit 82add5b into main Jun 15, 2026
36 of 37 checks passed
@mishushakov mishushakov deleted the mishushakov/sandbox-kill-stream-errors branch June 15, 2026 19:01
mishushakov added a commit to e2b-dev/code-interpreter that referenced this pull request Jun 16, 2026
Consolidate the two-line catch handler into a single
formatRequestError call that returns the error to throw, matching
the main SDK pattern (e2b-dev/E2B#1419).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mishushakov added a commit to e2b-dev/code-interpreter that referenced this pull request Jun 16, 2026
Align with e2b-dev/E2B#1419, which names the health-check-aware
error wrappers handle*Error / handle_*_exception:
- JS: formatRequestError -> handleRequestError
- Python: _raise_if_sandbox_killed -> _handle_connection_error

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>