Commit 82add5b

mishushakov and claude authored
fix(sdks): raise typed, actionable errors when sandbox dies mid-request (#1419)
## Problem

When a sandbox is killed (or reaches its end of life) while a request is in flight, both SDKs surfaced unusable errors:

- **JS**: `SandboxError: 2: [unknown] terminated`, which is typed but cryptic and says nothing about the sandbox being killed.
- **Python**: leaked a completely raw `httpcore.RemoteProtocolError: <StreamReset stream_id:1, error_code:2, remote_reset:True>`.

This affected the whole envd streaming family (`commands.run`, PTY sessions, `files.watchDir`/`watch_dir`) and the `files.read`/`write` HTTP transfers.

The stream-reset signature alone can't distinguish the sandbox dying from an intermediary (load balancer, network) dropping the connection, so the SDKs now actively check, and only transform the error when the sandbox is confirmed gone.

## Fix

**Health-check disambiguation.** When the connection-terminated signature appears (JS: `ConnectError` `Code.Unknown` + `terminated` or Undici `TypeError: terminated`; Python: `httpcore`/`httpx` `RemoteProtocolError`), the SDK probes envd's `/health` endpoint:

- **502 (sandbox confirmed gone)** → `TimeoutError` (JS) / `TimeoutException` (Python): "The sandbox was killed or reached its end of life while the request was in flight." This matches how requests to an *already-dead* sandbox surface today (the 502 / `Code.Unavailable` mappings raise the timeout error type), so the exception type no longer depends on whether the sandbox died just before or just during the request.
- **Anything else** (still running, or probe inconclusive) → the original error propagates unchanged, exactly as before this PR.

The probe (5s timeout) runs only on the termination signature, never on the happy path or for other errors.

A health-check closure is plumbed into `Commands`/`Pty`/`Filesystem` and the command/watch handles in JS and sync/async Python; `Commands`/`Pty` now receive the envd API client in their constructors (internal signature change).

**Cleanup** (`e2b_connect/client.py`): removed the `@_retry(RemoteProtocolError, 3)` decorators from `call_server_stream`/`acall_server_stream`. They never executed: `inspect.iscoroutinefunction` is false for (async) generator functions, and calling a generator function doesn't run its body, so the wrapper's `try/except` could never fire. A *working* mid-stream retry would be wrong anyway (it would replay already-delivered events). Unary retries are unchanged.

## Before / after

```ts
const sandbox = await Sandbox.create()
const cmd = await sandbox.commands.run('sleep 60', { background: true })
await sandbox.kill() // e.g. from another process
await cmd.wait()
// before: SandboxError: 2: [unknown] terminated
// after:  TimeoutError: [unknown] terminated: The sandbox was killed or reached
//         its end of life while the request was in flight.
```

```python
sandbox = Sandbox.create()
cmd = sandbox.commands.run("sleep 60", background=True)
sandbox.kill()
cmd.wait()
# before: httpcore.RemoteProtocolError: <StreamReset stream_id:1, error_code:2, remote_reset:True>  (not an e2b type!)
# after:  e2b.exceptions.TimeoutException: <StreamReset ...>: The sandbox was killed
#         or reached its end of life while the request was in flight.
```

If the health probe does not confirm the sandbox is gone (e.g. a load balancer dropped the connection, or local envd in debug mode), the original error propagates unchanged; the SDK only makes a claim when it has verified it.

## Notes

- Not covered: errors raised while consuming a `format: 'stream'` body **after** `files.read` returns (JS `ReadableStream` consumption happens in user code). Python is fully covered since httpx buffers non-streaming responses inside the request call.

## Tests

- Unit: confirmed-kill → `TimeoutError`/`TimeoutException`, raw-error passthrough for running/unknown/probe-failure, health check skipped for unrelated errors. 28 Python + 25 JS assertions pass.
- Integration (run against live sandboxes, all passing): start `sleep 60`, kill the sandbox, assert `wait()` raises `TimeoutError`/`TimeoutException` with the *confirmed* kill message (JS, sync Python, and async Python).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
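
The cleanup note above says the retry decorator could never fire on generator functions. A minimal sketch of why, assuming a simplified `retry` decorator written for illustration (it is a hypothetical stand-in, not the SDK's `_retry`, and it only has a synchronous wrapper):

```python
import inspect


def retry(exc_type, attempts):
    """Hypothetical stand-in for the removed decorator (not the SDK's code)."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    # For a generator function this call only builds the
                    # generator object; the body has not started running.
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts - 1:
                        raise

        return wrapper

    return decorator


calls = []


@retry(ConnectionError, 3)
def stream_events():
    calls.append("body started")
    yield "event-1"
    raise ConnectionError("stream reset")


async def astream_events():
    yield "event-1"


stream = stream_events()        # returns immediately, nothing raised
body_ran_on_call = bool(calls)  # the body has not been entered yet

received = []
escaped = None
try:
    for event in stream:
        received.append(event)
except ConnectionError as err:
    escaped = err               # raised during iteration, outside the wrapper

# Async generator functions are not reported as coroutine functions.
is_coroutine_fn = inspect.iscoroutinefunction(astream_events)
```

The exception surfaces while the caller iterates, long after the wrapper's `try/except` has returned, so the body runs exactly once and no retry happens. The sketch also shows the replay concern: by the time the error appears, `event-1` has already been delivered.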
1 parent a6c801d commit 82add5b

30 files changed

Lines changed: 922 additions & 207 deletions

.changeset/spotty-llamas-shout.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+---
+'e2b': patch
+'@e2b/python-sdk': patch
+---
+
+Raise a typed, actionable error when the sandbox dies while a request is in flight. When the connection is dropped mid-request (streaming RPC calls — commands, PTY, directory watch — and filesystem read/write), the SDKs now probe the sandbox health endpoint: if the sandbox is confirmed gone, a `TimeoutError` (JS) / `TimeoutException` (Python) is raised stating the sandbox was killed or reached its end of life — consistent with how requests to an already-dead sandbox surface. In all other cases the original error propagates unchanged.

packages/js-sdk/src/envd/api.ts

Lines changed: 60 additions & 0 deletions
@@ -12,10 +12,12 @@ import {
   formatSandboxTimeoutError,
   AuthenticationError,
   RateLimitError,
+  TimeoutError,
 } from '../errors'
 import { StartResponse, ConnectResponse } from './process/process_pb'
 import { Code, ConnectError } from '@connectrpc/connect'
 import { WatchDirResponse } from './filesystem/filesystem_pb'
+import { SandboxHealthCheck } from './rpc'
 
 type ApiError = { message?: string } | string
 
@@ -29,6 +31,64 @@ const DEFAULT_ERROR_MAP: Record<number, (message: string) => Error> = {
   507: (message) => new NotEnoughSpaceError(message),
 }
 
+const HEALTH_CHECK_TIMEOUT_MS = 5_000
+
+/**
+ * Probes the sandbox's envd health endpoint.
+ *
+ * @param envdApi - The envd API client of the sandbox.
+ * @returns `true` if the sandbox is running, `false` if it is not, `undefined` if its state could not be determined.
+ */
+export async function checkSandboxHealth(
+  envdApi: EnvdApiClient
+): Promise<boolean | undefined> {
+  try {
+    const res = await envdApi.api.GET('/health', {
+      signal: AbortSignal.timeout(HEALTH_CHECK_TIMEOUT_MS),
+    })
+
+    if (res.response.status === 502) {
+      return false
+    }
+    if (res.response.ok) {
+      return true
+    }
+
+    return undefined
+  } catch {
+    return undefined
+  }
+}
+
+/**
+ * Handles transport-level fetch failures from envd API calls. When the connection was
+ * dropped mid-request, probes the sandbox health to tell apart the sandbox being killed
+ * from a transient network failure (e.g. a load balancer dropping the connection).
+ *
+ * @param err - The caught error, expected to be a fetch transport failure.
+ * @param checkHealth - Probe returning whether the sandbox is running, or `undefined` when unknown.
+ * @returns A `TimeoutError` when the connection was terminated mid-request and the sandbox is confirmed gone, or the original error otherwise.
+ */
+export async function handleEnvdApiFetchError(
+  err: unknown,
+  checkHealth?: SandboxHealthCheck
+): Promise<Error> {
+  // Undici surfaces a connection dropped mid-body as a TypeError with the message 'terminated'
+  if (err instanceof TypeError && err.message === 'terminated') {
+    const running = checkHealth
+      ? await checkHealth().catch(() => undefined)
+      : undefined
+
+    if (running === false) {
+      return new TimeoutError(
+        `${err.message}: The sandbox was killed or reached its end of life while the request was in flight.`
+      )
+    }
+  }
+
+  return err as Error
+}
+
 /**
  * Handles errors from envd API responses by mapping HTTP status codes to specific error types.
  *
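
The three-valued result of `checkSandboxHealth` is the core of the disambiguation: only a 502 counts as "confirmed gone", a 2xx counts as "running", and everything else (including a failed probe) is "unknown". A minimal Python sketch of that mapping follows. The names `classify_health`, `check_sandbox_health`, and `probe` are hypothetical and chosen for illustration; this is not the Python SDK's implementation, which is not shown in this section.

```python
from typing import Callable, Optional


def classify_health(status_code: int) -> Optional[bool]:
    """Map a /health status code to running (True), gone (False), or unknown (None)."""
    if status_code == 502:
        return False  # the proxy reports that the sandbox is no longer there
    if 200 <= status_code < 300:
        return True   # envd answered, so the sandbox is running
    return None       # any other status is treated as inconclusive


def check_sandbox_health(probe: Callable[[], int]) -> Optional[bool]:
    """Run a probe (hypothetical callable returning a status code); never raise."""
    try:
        return classify_health(probe())
    except Exception:
        # A timeout or network failure of the probe itself proves nothing.
        return None


def failing_probe() -> int:
    raise OSError("probe timed out")


gone = check_sandbox_health(lambda: 502)
running = check_sandbox_health(lambda: 204)
unknown_status = check_sandbox_health(lambda: 500)
unknown_failure = check_sandbox_health(failing_probe)
```

Keeping "unknown" distinct from "gone" is what lets callers leave the original error untouched whenever the probe cannot confirm anything.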

packages/js-sdk/src/envd/rpc.ts

Lines changed: 49 additions & 0 deletions
@@ -14,6 +14,25 @@ import {
 } from '../errors'
 import { ENVD_DEFAULT_USER } from './versions'
 
+/**
+ * Result of a sandbox health probe: `true` if the sandbox is running, `false` if it is not,
+ * `undefined` if its state could not be determined.
+ */
+export type SandboxHealthCheck = () => Promise<boolean | undefined>
+
+/**
+ * Checks whether the error is the signature of the connection to the sandbox being
+ * dropped mid-request — an HTTP/2 stream reset surfaced by connect as `Code.Unknown`
+ * with the message 'terminated'.
+ */
+export function isConnectionTerminatedError(err: unknown): boolean {
+  return (
+    err instanceof ConnectError &&
+    err.code === Code.Unknown &&
+    err.rawMessage === 'terminated'
+  )
+}
+
 const DEFAULT_ERROR_MAP: Partial<Record<Code, (message: string) => Error>> = {
   [Code.InvalidArgument]: (message) => new InvalidArgumentError(message),
   [Code.Unauthenticated]: (message) => new AuthenticationError(message),
@@ -62,6 +81,36 @@ export function handleRpcError(
   return err as Error
 }
 
+/**
+ * Like {@link handleRpcError}, but when the connection to the sandbox was dropped
+ * mid-request it probes the sandbox health to tell apart the sandbox being killed
+ * from a transient network failure (e.g. a load balancer dropping the connection).
+ * When the probe confirms the sandbox is gone, a `TimeoutError` is returned —
+ * consistent with how requests to an already-dead sandbox surface.
+ *
+ * @param err - The caught error, expected to be a `ConnectError` from the gRPC transport.
+ * @param checkHealth - Probe returning whether the sandbox is running, or `undefined` when unknown.
+ * @param errorMap - Optional map of gRPC `Code` values to error factory functions that override the defaults.
+ * @returns The corresponding `Error` instance.
+ */
+export async function handleRpcErrorWithHealthCheck(
+  err: unknown,
+  checkHealth?: SandboxHealthCheck,
+  errorMap?: Partial<Record<Code, (message: string) => Error>>
+): Promise<Error> {
+  if (isConnectionTerminatedError(err) && checkHealth) {
+    const running = await checkHealth().catch(() => undefined)
+
+    if (running === false) {
+      return new TimeoutError(
+        `${(err as ConnectError).message}: The sandbox was killed or reached its end of life while the request was in flight.`
+      )
+    }
+  }
+
+  return handleRpcError(err, errorMap)
+}
+
 function encode64(value: string): string {
   switch (runtime) {
     case 'deno':
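
The decision in `handleRpcErrorWithHealthCheck` can be sketched in Python as follows. This is an illustration under stated assumptions, not the SDK's code: `ConnectionTerminated` and `SandboxTimeout` are hypothetical stand-ins for the termination signature and the SDK's timeout exception, and the fallthrough simply returns the original error instead of applying the regular code-to-error mapping that `handleRpcError` performs.

```python
from typing import Callable, Optional

KILLED_MESSAGE = (
    "The sandbox was killed or reached its end of life "
    "while the request was in flight."
)


class ConnectionTerminated(Exception):
    """Hypothetical stand-in for the connection-terminated signature."""


class SandboxTimeout(Exception):
    """Hypothetical stand-in for the SDK's timeout exception type."""


def translate_error(
    err: Exception,
    check_health: Optional[Callable[[], Optional[bool]]] = None,
) -> Exception:
    # Probe only when the error carries the termination signature.
    if isinstance(err, ConnectionTerminated) and check_health is not None:
        try:
            running = check_health()
        except Exception:
            running = None  # a failed probe is inconclusive
        # Transform only when the probe positively confirms the sandbox is gone.
        if running is False:
            return SandboxTimeout(f"{err}: {KILLED_MESSAGE}")
    return err


probe_calls = []


def sandbox_gone() -> bool:
    probe_calls.append("probe")
    return False


def sandbox_running() -> bool:
    probe_calls.append("probe")
    return True


def probe_fails() -> bool:
    raise OSError("probe timed out")


terminated = ConnectionTerminated("terminated")
unrelated = ValueError("bad argument")

confirmed = translate_error(terminated, sandbox_gone)
still_running = translate_error(terminated, sandbox_running)
inconclusive = translate_error(terminated, probe_fails)
untouched = translate_error(unrelated, sandbox_gone)
```

The unrelated error never triggers a probe, which mirrors the commit's statement that the health check runs only on the termination signature.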

packages/js-sdk/src/sandbox/commands/commandHandle.ts

Lines changed: 10 additions & 3 deletions
@@ -1,4 +1,7 @@
-import { handleRpcError } from '../../envd/rpc'
+import {
+  handleRpcErrorWithHealthCheck,
+  SandboxHealthCheck,
+} from '../../envd/rpc'
 import { SandboxError } from '../../errors'
 import { ConnectResponse, StartResponse } from '../../envd/process/process_pb'
 import type { CommandRequestOpts } from '.'
@@ -112,7 +115,8 @@ export class CommandHandle
     ) => Promise<void>,
     private readonly handleCloseStdin?: (
       opts?: CommandRequestOpts
-    ) => Promise<void>
+    ) => Promise<void>,
+    private readonly checkHealth?: SandboxHealthCheck
   ) {
     this._wait = this.handleEvents()
   }
@@ -281,7 +285,10 @@ export class CommandHandle
         }
       }
     } catch (e) {
-      this.iterationError = handleRpcError(e)
+      this.iterationError = await handleRpcErrorWithHealthCheck(
+        e,
+        this.checkHealth
+      )
     } finally {
       this.handleDisconnect()
     }
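
In `CommandHandle`, the error is translated once inside the event loop and stored, then surfaced later to whoever waits on the handle. A rough synchronous Python sketch of that pattern follows. `CommandHandleSketch`, `broken_stream`, and `translate` are hypothetical names, the built-in `TimeoutError` stands in for the SDK's timeout type, and the real JS handle consumes events asynchronously rather than eagerly in the constructor.

```python
from typing import Callable, Iterable, List, Optional


class CommandHandleSketch:
    """Consumes an event stream, keeping any translated failure for wait()."""

    def __init__(
        self,
        events: Iterable[str],
        translate: Callable[[Exception], Exception],
    ) -> None:
        self._events = events
        self._translate = translate
        self._iteration_error: Optional[Exception] = None
        self.stdout: List[str] = []
        self._handle_events()

    def _handle_events(self) -> None:
        try:
            for chunk in self._events:
                self.stdout.append(chunk)
        except Exception as err:
            # Translate once, at the point where the stream broke.
            self._iteration_error = self._translate(err)

    def wait(self) -> str:
        # The stored (already translated) error is raised to the caller here.
        if self._iteration_error is not None:
            raise self._iteration_error
        return "".join(self.stdout)


def broken_stream():
    yield "partial output\n"
    raise ConnectionError("terminated")


def translate(err: Exception) -> Exception:
    # Stand-in for a health-check-aware translation step.
    return TimeoutError(f"{err}: sandbox confirmed gone")


handle = CommandHandleSketch(broken_stream(), translate)
```

Calling `handle.wait()` here raises the translated `TimeoutError`, while the output received before the stream broke remains available on `handle.stdout`.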

packages/js-sdk/src/sandbox/commands/index.ts

Lines changed: 25 additions & 15 deletions
@@ -15,12 +15,20 @@ import {
   setupRequestController,
   Username,
 } from '../../connectionConfig'
-import { handleProcessStartEvent } from '../../envd/api'
+import {
+  checkSandboxHealth,
+  EnvdApiClient,
+  handleProcessStartEvent,
+} from '../../envd/api'
 import {
   Process as ProcessService,
   Signal,
 } from '../../envd/process/process_pb'
-import { authenticationHeader, handleRpcError } from '../../envd/rpc'
+import {
+  authenticationHeader,
+  handleRpcErrorWithHealthCheck,
+  SandboxHealthCheck,
+} from '../../envd/rpc'
 import { ENVD_COMMANDS_STDIN, ENVD_ENVD_CLOSE } from '../../envd/versions'
 import { SandboxError } from '../../errors'
 import { CommandHandle, CommandResult } from './commandHandle'
@@ -129,16 +137,16 @@ export class Commands {
 
   private readonly defaultProcessConnectionTimeout = 60_000 // 60 seconds
   private readonly envdVersion: string
+  private readonly checkHealth: SandboxHealthCheck
 
   constructor(
     transport: Transport,
-    private readonly connectionConfig: ConnectionConfig,
-    metadata: {
-      version: string
-    }
+    private readonly envdApi: EnvdApiClient,
+    private readonly connectionConfig: ConnectionConfig
   ) {
     this.rpc = createClient(ProcessService, transport)
-    this.envdVersion = metadata.version
+    this.envdVersion = envdApi.version
+    this.checkHealth = () => checkSandboxHealth(this.envdApi)
   }
 
   /**
@@ -177,7 +185,7 @@ export class Commands {
         ...(p.config!.cwd && { cwd: p.config!.cwd }),
       }))
     } catch (err) {
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -220,7 +228,7 @@ export class Commands {
         }
       )
     } catch (err) {
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -257,7 +265,7 @@ export class Commands {
         }
       )
     } catch (err) {
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -298,7 +306,7 @@ export class Commands {
         }
       }
 
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -354,11 +362,12 @@ export class Commands {
         opts?.onStderr,
         undefined,
         (data, stdinOpts) => this.sendStdin(pid, data, stdinOpts),
-        (stdinOpts) => this.closeStdin(pid, stdinOpts)
+        (stdinOpts) => this.closeStdin(pid, stdinOpts),
+        this.checkHealth
       )
     } catch (err) {
       cleanup()
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -468,11 +477,12 @@ export class Commands {
         opts?.onStderr,
         undefined,
         (data, stdinOpts) => this.sendStdin(pid, data, stdinOpts),
-        (stdinOpts) => this.closeStdin(pid, stdinOpts)
+        (stdinOpts) => this.closeStdin(pid, stdinOpts),
+        this.checkHealth
       )
     } catch (err) {
       cleanup()
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 }

packages/js-sdk/src/sandbox/commands/pty.ts

Lines changed: 28 additions & 14 deletions
@@ -19,8 +19,16 @@ import {
   setupRequestController,
 } from '../../connectionConfig'
 import { CommandHandle } from './commandHandle'
-import { authenticationHeader, handleRpcError } from '../../envd/rpc'
-import { handleProcessStartEvent } from '../../envd/api'
+import {
+  authenticationHeader,
+  handleRpcErrorWithHealthCheck,
+  SandboxHealthCheck,
+} from '../../envd/rpc'
+import {
+  checkSandboxHealth,
+  EnvdApiClient,
+  handleProcessStartEvent,
+} from '../../envd/api'
 
 export interface PtyCreateOpts
   extends Pick<ConnectionOpts, 'requestTimeoutMs' | 'signal'> {
@@ -74,18 +82,18 @@ export type PtyConnectOpts = Pick<PtyCreateOpts, 'onData' | 'timeoutMs'> &
 export class Pty {
   private readonly rpc: Client<typeof ProcessService>
   private readonly envdVersion: string
+  private readonly checkHealth: SandboxHealthCheck
 
   private readonly defaultPtyConnectionTimeout = 60_000 // 60 seconds
 
   constructor(
     private readonly transport: Transport,
-    private readonly connectionConfig: ConnectionConfig,
-    metadata: {
-      version: string
-    }
+    private readonly envdApi: EnvdApiClient,
+    private readonly connectionConfig: ConnectionConfig
   ) {
     this.rpc = createClient(ProcessService, this.transport)
-    this.envdVersion = metadata.version
+    this.envdVersion = envdApi.version
+    this.checkHealth = () => checkSandboxHealth(this.envdApi)
   }
 
   /**
@@ -144,11 +152,14 @@ export class Pty {
         events,
         undefined,
         undefined,
-        opts.onData
+        opts.onData,
+        undefined,
+        undefined,
+        this.checkHealth
       )
     } catch (err) {
       cleanup()
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -198,11 +209,14 @@ export class Pty {
         events,
         undefined,
         undefined,
-        opts?.onData
+        opts?.onData,
+        undefined,
+        undefined,
+        this.checkHealth
       )
     } catch (err) {
       cleanup()
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -242,7 +256,7 @@ export class Pty {
         }
       )
     } catch (err) {
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -283,7 +297,7 @@ export class Pty {
         }
       )
     } catch (err) {
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 
@@ -327,7 +341,7 @@ export class Pty {
         }
       }
 
-      throw handleRpcError(err)
+      throw await handleRpcErrorWithHealthCheck(err, this.checkHealth)
     }
   }
 }
