feat(sdks/kotlin): add acquireMinRemainingTtl to pool to skip near-ex…#986
Conversation
…piry idles Closes opensandbox-group#983. When using the distributed sandbox warm pool, `pool.acquire()` could return an idle sandbox with only milliseconds of remaining TTL. The subsequent `checkReady` polling (up to 30s) would see the sandbox expire server-side, fail with READY_TIMEOUT, and never get a chance to call `renew`. Root cause: `tryTakeIdle` used a binary `expiresAt > now` check — a sandbox with 1ms remaining TTL still passed. Changes (Kotlin SDK only, additive and backward-compatible): * PoolStateStore: new overload `tryTakeIdle(poolName, minRemainingTtl)` with default impl that delegates to the existing single-arg method when the threshold is zero/negative, so custom user stores keep compiling. * InMemoryPoolStateStore: overrides the new method to discard entries whose remaining TTL is below the threshold (entries are removed from idle membership so reconcile can replenish). * RedisPoolStateStore: TAKE_IDLE_SCRIPT now reads the threshold from ARGV[1] and uses `expires_at > now_ms + min_remaining_ttl_ms`. The zero-arg path passes "0" so behavior is unchanged for callers. * PoolConfig: new `acquireMinRemainingTtl: Duration` (default `Duration.ZERO` — pre-existing binary-expiry behavior). Validated as non-negative. Wired into Builder + `withMaxIdle`. * SandboxPool.acquire: passes `config.acquireMinRemainingTtl` to the store. `releaseAllIdle` keeps the unfiltered call (drain-all semantics intentionally bypasses the TTL gate). Tests added: * InMemoryPoolStateStoreTest: zero/negative threshold falls back; entries below threshold are skipped and removed from idle; entries above pass. * RedisPoolStateStoreTest (gated on OPENSANDBOX_TEST_REDIS_URL): same three behaviors against real Redis. * PoolConfigTest: default value is `Duration.ZERO`; configured value round-trips; negative value is rejected. Python and other-language SDKs share the same pattern and will be addressed in follow-up PRs to keep this change reviewable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d921dfa779
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
| val poolName = config.poolName | ||
| val sandboxId = stateStore.tryTakeIdle(poolName) | ||
| val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl) |
There was a problem hiding this comment.
Expose acquireMinRemainingTtl on SandboxPool.Builder
When callers construct pools via SandboxPool.builder() (the public facade used in the README examples), there is no forwarding acquireMinRemainingTtl(...) method on that builder, so this new non-zero threshold can only be set by manually building a PoolConfig and passing it via config(...). Since acquire() now depends on config.acquireMinRemainingTtl here, users of the primary builder cannot enable the near-expiry filtering this change adds and will keep the old behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — addressed in 8a6e953. SandboxPool.Builder now has a forwarding acquireMinRemainingTtl(...) setter (next to acquireSkipHealthCheck). Verified in SandboxPoolTest.sandbox pool builder forwards acquire readiness settings into config.
| } | ||
| val poolName = config.poolName | ||
| val sandboxId = stateStore.tryTakeIdle(poolName) | ||
| val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl) |
There was a problem hiding this comment.
Reset sandbox TTL before recording idle expiry
In pools where sandbox creation/readiness or warmupSandboxPreparer consumes a meaningful part of idleTimeout, this check is based on the store expiry stamped by putIdle, but the server timeout was already started when Sandbox.builder().timeout(config.idleTimeout) created the sandbox before those warmup steps. That makes acquireMinRemainingTtl overestimate real server-side TTL by the warmup duration, so an idle sandbox can still pass this gate with less than the configured minimum remaining and expire during the subsequent ready check unless the sandbox is renewed immediately before it is recorded idle.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Real issue, thanks for flagging. The store stamps expiresAt = putIdle_time + idleTimeout, but the server-side TTL clock starts at Sandbox.builder().timeout(idleTimeout) — so any time spent in warmup / warmupSandboxPreparer makes the store overestimate the real server-side remaining TTL.
Fixing this correctly means reading server-truth expiry (e.g. via sandbox.getInfo()) instead of recomputing locally, which is a separate, broader change with cross-language impact. To keep this PR focused on the threshold logic from #983, I would prefer to file a follow-up issue and address it there. Happy to open that immediately if you agree.
The current 60s default still tolerates warmups well below a minute — which is the common case for the code-interpreter template — so it does not regress the user-reported scenario. For long warmups, opting out via Duration.ZERO or sizing acquireMinRemainingTtl against typical warmup duration is the documented escape hatch until the follow-up lands.
| require(!acquireHealthCheckPollingInterval.isNegative && !acquireHealthCheckPollingInterval.isZero) { | ||
| "acquireHealthCheckPollingInterval must be positive" | ||
| } | ||
| require(!acquireMinRemainingTtl.isNegative) { "acquireMinRemainingTtl must be non-negative" } |
There was a problem hiding this comment.
Reject thresholds that can never admit idle sandboxes
If acquireMinRemainingTtl is configured greater than or equal to idleTimeout, every warmed idle entry will fail the store's strict expiresAt > now + minRemainingTtl check, so each acquire drains the pool's idle IDs, leaves the live sandboxes to expire server-side, and forces direct creation while reconcile keeps replenishing. Please validate this value against idleTimeout (for example requiring it to be less) so a single config mistake cannot make the pool permanently discard all warm sandboxes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 8a6e953. PoolConfig.build() now rejects acquireMinRemainingTtl >= idleTimeout with a descriptive message (and the same guard exists in PoolConfig.__post_init__ on the Python side). New PoolConfigTest.build rejects acquireMinRemainingTtl greater than or equal to idleTimeout and a strict-less-than boundary test cover both sides.
…thon SDK Addresses bot review feedback on PR opensandbox-group#986 and the remaining acceptance criteria from opensandbox-group#983 that the first revision deferred. ### Changes vs the previous revision **Default flipped from `Duration.ZERO` to 60s** so the binary-expiry race described in the issue is fixed out of the box, matching the issue's first acceptance criterion. Users who deliberately want the legacy "any-non-expired-entry" behavior can opt out by setting `Duration.ZERO` (documented on the field). **Cross-validated against `idleTimeout`** (`acquireMinRemainingTtl < idleTimeout`). Without this guard a misconfigured threshold ≥ idleTimeout would reject every freshly warmed entry and starve the pool. Caught by the codex bot review — now refused at build() time with a descriptive message. **Exposed on the public `SandboxPool.Builder`** (forwards to the underlying `PoolConfig.Builder`). Without this, users of the primary builder could not override the new threshold without dropping to the lower-level config builder. Also caught by the codex bot review. **Reconcile-time eviction of near-expiry entries.** The issue's optional proposal: `reapExpiredIdle` now also reclaims entries whose remaining TTL is below the threshold so reconcile triggers replenishment proactively instead of waiting for full expiry. Implemented as a new `reapExpiredIdle(poolName, now, minRemainingTtl)` overload on `PoolStateStore` (default impl falls back to the strict-expiry sweep) so custom user stores keep compiling. Both the in-memory and Redis stores override it; the Redis Lua script accepts the threshold via `ARGV[1]` with a sensible default of 0. **Python SDK (sync + async, in-memory + Redis).** The issue labels listed both `sdk/python` and `sdk/java`; this revision implements the same change across all four Python store classes (`InMemoryPoolStateStore`, `InMemoryAsyncPoolStateStore`, `RedisPoolStateStore`, `AsyncRedisPoolStateStore`), both pool implementations, both reconcilers, and both `PoolConfig` / `AsyncPoolConfig`. User-defined Python stores that pre-date this change keep working: callers go through `try_take_idle_with_min_ttl` / `reap_expired_idle_with_min_ttl` helpers in `pool_types` that detect the new methods via `hasattr` and fall back to the binary-expiry path otherwise. ### Tests (all green locally) * Kotlin `:sandbox:test` and `:sandbox-pool-redis:test` — 13/13 in the Redis suite verified against a real Redis 7 instance via `OPENSANDBOX_TEST_REDIS_URL`. * Python `pytest` — 192 passed (37 added/touched in pool tests, 0 skipped). * New unit tests cover, for both stores in both languages: zero/negative threshold falls through; entries below threshold are skipped and removed from idle; entries above threshold are returned; new `reap_expired_idle_min_ttl` evicts near-expiry while preserving entries above the threshold. * New PoolConfig tests cover: default is 60s; negative rejected; `>= idleTimeout` rejected; just-below-`idleTimeout` accepted; explicit `Duration.ZERO` accepted (opt-out path). ### Deferred (separate follow-up) The codex bot also flagged that `idleTimeout` starts the server-side TTL clock at sandbox creation, while the store's `expiresAt` is stamped after `warmupSandboxPreparer` runs — so the store TTL can over-estimate the real server-side remaining TTL by the warmup duration. Fixing that correctly requires reading server-truth expiry (e.g. via `sandbox.getInfo()`) rather than recomputing locally; treating that as a separate PR keeps the scope of this one tight on the threshold logic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a6e9539f6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| sandbox_id = _try_take_idle_with_min_ttl( | ||
| self._state_store, | ||
| pool_name, | ||
| self._config.acquire_min_remaining_ttl, |
There was a problem hiding this comment.
Expose acquire_min_remaining_ttl on Python pool facades
When callers use the public SandboxPoolSync/SandboxPoolAsync constructors, there is still no constructor argument forwarded into PoolConfig/AsyncPoolConfig, but acquire now always reads self._config.acquire_min_remaining_ttl here. That means facade users cannot opt out or tune the new threshold; in particular, an existing pool configured with idle_timeout <= 60s now fails config validation because the hidden default threshold is 60s and there is no way to pass a smaller value through these constructors.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — addressed in 3f7dddf. Both SandboxPoolSync.__init__ and SandboxPoolAsync.__init__ now accept acquire_min_remaining_ttl (default 60s, matching PoolConfig) and forward it through to the underlying config. Two new facade tests in test_pool_config.py exercise the idle_timeout=30s + acquire_min_remaining_ttl=10s combination end-to-end so the previously broken construction path is now covered.
Codex bot caught that ``SandboxPoolSync`` and ``SandboxPoolAsync`` constructors had explicit kwargs forwarding every other ``PoolConfig`` field, but not the newly added ``acquire_min_remaining_ttl``. With the default 60s baked into ``PoolConfig`` validation, any pool configured with ``idle_timeout <= 60s`` would now fail at construction time and facade users had no way to override. Adds the kwarg (default 60s, matching ``PoolConfig``) to both constructors and forwards it through to the underlying config. Adds two facade tests: building a pool with ``idle_timeout=30s`` and ``acquire_min_remaining_ttl=10s`` succeeds and the values land in the config as written. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f7dddf9e6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| private const val DEFAULT_DEGRADED_THRESHOLD = 3 | ||
| private val DEFAULT_ACQUIRE_READY_TIMEOUT = Duration.ofSeconds(30) | ||
| private val DEFAULT_ACQUIRE_HEALTH_CHECK_POLLING_INTERVAL = Duration.ofMillis(200) | ||
| private val DEFAULT_ACQUIRE_MIN_REMAINING_TTL: Duration = Duration.ofSeconds(60) |
There was a problem hiding this comment.
Avoid rejecting existing short idle timeouts
When callers keep the new default but already configure idleTimeout to 60 seconds or less, this added 60s default immediately trips the validation below that requires acquireMinRemainingTtl < idleTimeout, so previously valid pools fail during config construction unless every caller learns to pass Duration.ZERO or a smaller value. This is a public SDK behavior break (the Python config adds the same 60s default), so the default should be derived from idleTimeout or kept opt-in to preserve existing short-idle configurations.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
You're right — addressed in 4e4f12c. The default is now derived from idleTimeout instead of being a fixed 60s:
- Builder field is nullable (
Duration?in Kotlin /timedelta | Nonein Python). nullresolves atbuild()time tomin(60s, idleTimeout / 2)— always strictly less thanidleTimeout, so existing pools with short idle timeouts keep working without code changes.- Pass
Duration.ZEROto opt out, or any explicit positive value to override. - The strict
< idleTimeoutvalidation still fires for explicit values, so misconfigurations are still caught.
New tests default scales for short idle_timeout / caps at 60s for long idle_timeout cover both branches in Kotlin and Python.
| val now = Instant.now() | ||
|
|
||
| stateStore.reapExpiredIdle(poolName, now) | ||
| stateStore.reapExpiredIdle(poolName, now, config.acquireMinRemainingTtl) |
There was a problem hiding this comment.
Kill live sandboxes reaped for min TTL
When acquireMinRemainingTtl is positive, this sweep now removes not-yet-expired idle entries from the store, but unlike the shrink path it does not return those IDs to onDiscardSandbox, so the live sandboxes are left running until their server TTL expires while reconcile immediately creates replacements. With larger configured thresholds this can keep up to a full idle buffer of discarded-but-still-running sandboxes around for a long time and temporarily exceed pool resource/quota expectations.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Same root cause as the acquire-path leak — addressed in 4e4f12c. reapExpiredIdle(name, now, minTtl) now returns the alive evicted IDs (excluding fully-expired ones, since the server has reaped them). PoolReconciler routes the returned list through the existing onDiscardSandbox callback, which is the same path shrinkExcessIdle uses to fire killSandbox.
The Redis Lua script returns the alive list explicitly so the round-trip survives. New tests cover reapExpiredIdle with minRemainingTtl returns alive evicted entries and excludes already-expired from the alive list, both against in-memory and real-Redis stores.
| } | ||
| val poolName = config.poolName | ||
| val sandboxId = stateStore.tryTakeIdle(poolName) | ||
| val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl) |
There was a problem hiding this comment.
Clean up idle sandboxes discarded during acquire
When an acquire runs with a positive acquireMinRemainingTtl, the new store variant consumes every idle ID that is below the threshold before returning an eligible ID or null, but this acquire path only receives the returned ID and has no chance to kill the skipped live sandboxes. In pools that are acquired after entries age past the threshold, those skipped sandboxes are dropped from pool accounting and replaced by direct creates/reconcile while still running until their server-side TTL expires, temporarily exceeding the intended pool size and resource usage.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Real bug, addressed in 4e4f12c. The store API now surfaces the discarded-alive IDs to the caller:
- New
TakeIdleResultcarryingsandboxId+discardedAliveSandboxIds.tryTakeIdle(name, minTtl)returns it. SandboxPool.acquirecalls a newkillDiscardedAlivehelper on the returned list (best-effort, errors logged and swallowed).- Already-expired entries are intentionally excluded from the list — the server has already reaped them, a kill round-trip would be wasted.
Verified by new unit tests in both Kotlin (tryTakeIdle surfaces alive entries below the threshold so callers can kill them / silently drops fully-expired entries) and Python equivalents, against both in-memory and real-Redis stores.
…ingTtl default Addresses three bot reviews on PR opensandbox-group#986 head (commit 3f7dddf). ### opensandbox-group#1 + opensandbox-group#2 — Resource leak: skipped near-expiry sandboxes are now killed Previously, when ``acquireMinRemainingTtl`` was positive, both ``tryTakeIdle`` and ``reapExpiredIdle`` silently dropped near-expiry idle entries from store membership but never told the pool, so the live sandboxes kept running on the server (consuming quota / cost) until their own TTL elapsed — temporarily exceeding the intended pool size. Redesigned the store API to surface those IDs to callers: * New ``TakeIdleResult`` value type carrying ``sandbox_id`` plus a ``discarded_alive_sandbox_ids`` list. ``tryTakeIdle(name, minTtl)`` now returns it. Already-expired entries (server has reaped them) are intentionally excluded — a kill round-trip would be wasted. * ``reapExpiredIdle(name, now, minTtl)`` now returns the alive evicted IDs (same exclusion rule). * Both Lua scripts now return arrays so the discarded-alive list survives the round-trip; the empty-pool fast-path still returns ``nil`` so clients can distinguish "nothing to do" cleanly. Wired into the pool: * ``SandboxPool.acquire`` (Kotlin) and ``SandboxPoolSync/Async.acquire`` (Python) call ``_kill_discarded_alive`` on the returned list. Failures are logged and swallowed — the primary acquire outcome must not depend on a janitor failure. * ``PoolReconciler.reapExpiredIdle`` routes the returned list through the existing ``onDiscardSandbox`` callback, which already triggers a best-effort kill (same path used by ``shrinkExcessIdle``). ### opensandbox-group#3 — Default no longer breaks existing users with short ``idleTimeout`` A 60s default would fail validation (``acquireMinRemainingTtl < idleTimeout``) for any pool configured with ``idleTimeout <= 60s``, silently breaking user code on upgrade. Builder field is now nullable: ``null`` resolves at ``build()`` time to ``min(60s, idleTimeout / 2)``. This is always strictly less than ``idleTimeout``, so no existing pool construction breaks. Pass ``Duration.ZERO`` to opt out, or any explicit positive value to override. The strict ``< idleTimeout`` validation is retained for explicit user values so misconfigurations (e.g. ``acquireMinRemainingTtl == idleTimeout``) are still caught. ### Tests * Kotlin ``:sandbox:test`` and ``:sandbox-pool-redis:test`` (with real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``) — 14/14 in the Redis suite. * Python ``pytest`` — 197 passed. * New cases verify, in both languages and both stores: alive entries below threshold are surfaced; fully-expired entries are silently dropped; ``reapExpiredIdle`` excludes already-expired from the alive list; the default scales to ``min(60s, idleTimeout/2)``; explicit ``ZERO`` opts out; explicit values still get validated against ``idleTimeout``. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pangjiping
left a comment
There was a problem hiding this comment.
Nice fix for a real race condition. A few findings — one bug, one perf concern, two nits.
Four review items from @Pangjiping on 4e4f12c. ### TakeIdleResult is now a data class Auto-generates ``equals``/``hashCode``/``toString``, so test assertions like ``assertEquals(expected, store.tryTakeIdle(...))`` work and debugger output is useful. The ``EMPTY`` singleton optimization is preserved (data classes permit secondary references just fine). ### PoolStateStore defaults: dead-if removed Both ``tryTakeIdle(name, minTtl)`` and ``reapExpiredIdle(name, now, minTtl)`` default impls had identical ``if (zero-or-negative) { ... } else { ... }`` branches. The branches collapsed to a single unconditional delegate. KDoc updated to explain why custom stores that predate the methods always see an empty discarded-alive list. ### Python kill log fires only on success ``_kill_sandbox_best_effort`` now returns ``bool``: - ``True`` on a confirmed kill - ``False`` if no manager is available, or if the kill raised (swallowed) ``_kill_discarded_alive`` only emits the ``"Killed near-expiry idle sandbox"`` debug line when the return value is ``True``. Mirrors the Kotlin pattern where the equivalent log is inside the ``try`` block after ``killSandbox``. The existing ``Callable[[str], None]`` reconciler callback continues to accept the method (PEP 484 ``None`` return ≈ "no return constraint"). ### Async pool: concurrent kills via asyncio.gather ``SandboxPoolAsync._kill_discarded_alive`` used to ``await`` each kill sequentially, so a batch of N near-expiry IDs blocked ``acquire()`` on N serial RPCs. Now uses ``asyncio.gather`` so the calls overlap, then maps the success bool back to the per-ID debug log. ### Tests added (test_pool_kill_discarded.py) - ``_kill_sandbox_best_effort`` returns True/False correctly across success / failure / missing-manager - ``_kill_discarded_alive`` debug log fires only for successful kills, verified with ``caplog`` - async kill batch finishes well under serial-time (5 × 50ms sleeps complete in <150ms) and ``max_in_flight >= 2`` ### Test results Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``): all green. Python ``pytest``: 203 passed (197 + 6 new). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 33ee6f3dcb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
| val poolName = config.poolName | ||
| val sandboxId = stateStore.tryTakeIdle(poolName) | ||
| val takeResult = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl) |
There was a problem hiding this comment.
Renew warm sandboxes before applying min-TTL gate
This min-TTL check uses the store expiry, but putIdle stamps that as put time + idleTimeout while the server TTL starts earlier in buildSandboxFromSpec() before readiness and warmupSandboxPreparer. Fresh evidence after the prior comment is that the current createOneSandbox() still returns the id after the preparer without renewing before the reconciler enqueues it, so when warmup takes time (for example 45s with idleTimeout=120s and acquireMinRemainingTtl=60s) the store can report >60s remaining while the server has far less, letting this gate still hand out a sandbox that expires during acquire/connect. Renew to idleTimeout or stamp the store expiry from server truth before putIdle.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. Fixed in a170c1a — createOneSandbox / _create_one_sandbox now calls sandbox.renew(config.idleTimeout) after the warmup preparer runs and before returning the id to the reconciler. The store stamp at putIdle is now consistent with the server-side TTL, so the min-TTL gate cannot overestimate remaining TTL by the warmup duration. Kotlin and both Python (sync + async) paths updated.
New tests (test_sync_create_one_sandbox_renews_before_returning_id / async) use a RenewTrackingSandbox fake that records every renew call and assert renewed == [idle_timeout] after warmup completes.
| val sandboxId = stateStore.tryTakeIdle(poolName) | ||
| val takeResult = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl) | ||
| val sandboxId = takeResult.sandboxId | ||
| killDiscardedAlive(poolName, takeResult.discardedAliveSandboxIds, source = "acquire") |
There was a problem hiding this comment.
Connect to the eligible idle before cleanup
When many FIFO entries are below acquireMinRemainingTtl but a later entry is eligible, this call kills every discarded ID synchronously before connecting to the eligible sandbox. In distributed pools with slow kill RPCs or a large idle buffer, that cleanup can consume the candidate's remaining TTL before connect()/renew() runs, so a sandbox that passed the min-TTL gate can still expire during acquisition. Defer this best-effort cleanup until after the selected sandbox is connected/renewed, or run it in the background.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Real bug. Fixed in a170c1a:
SandboxPool.acquirenow stashesdiscardedAliveSandboxIdsinpendingKilland only schedules the cleanup after the candidate is connected and renewed.- The cleanup is offloaded to
warmupExecutor.submit(...)so the caller does not wait for N kill RPCs (Python sync uses the same ThreadPoolExecutor; Python async usesasyncio.create_taskwith the task tracked on_warmup_tasksso shutdown can wait for it). - Fall-through paths (FAIL_FAST and DIRECT_CREATE) still schedule the cleanup so discarded-alive sandboxes never linger silently.
- New tests in
test_pool_kill_discarded.py(test_sync_schedule_kill_discarded_alive_does_not_block_caller/ async equivalent) submit a 100ms-per-call kill manager and assert scheduling latency stays under 50ms / 10ms while the kills still complete on the executor / event loop afterwards.
Pangjiping
left a comment
There was a problem hiding this comment.
LGTM. Well-structured fix for a real production race condition. Good backward-compat story and thorough test coverage across both Kotlin and Python SDKs.
Minor suggestions (non-blocking):
- PR description says default is
Duration.ZERObut code auto-derivesmin(60s, idleTimeout/2)— worth updating the description. - Kotlin
killDiscardedAliveruns serial kill calls insideacquire()— consider background/async execution like the Pythonasyncio.gatherpath. - Consider a changelog note about the default behavioral change (existing users now get 60s threshold instead of binary expiry).
None of these block merge.
…utIdle Two bot reviews on 33ee6f3. ### Fix A — connect to the eligible idle before cleanup Previously ``acquire`` killed every below-threshold-but-still-alive sandbox synchronously *before* connecting to the chosen candidate. With slow kill RPCs or a large discarded-alive list, that cleanup could consume the candidate's remaining TTL — so a sandbox that just passed ``acquireMinRemainingTtl`` could still expire during ``connect()`` / ``renew()``. The original race this PR was meant to fix. Reordered ``SandboxPool.acquire`` (Kotlin and Python sync+async) to: 1. Take the candidate from the store. 2. Stash ``discardedAliveSandboxIds`` in ``pendingKill``. 3. Connect, run any user health check, ``renew`` for ``sandboxTimeout``. 4. **Then** schedule the kill cleanup, fire-and-forget: - Kotlin: ``warmupExecutor.submit`` (falls back to inline if executor is missing — better to slow the caller than drop the cleanup). - Python sync: ``ThreadPoolExecutor.submit`` on the warmup executor (same fallback). - Python async: ``asyncio.create_task`` registered on the existing ``_warmup_tasks`` set so shutdown can wait on it. The same scheduling fires on the FAIL_FAST / DIRECT_CREATE fall-through paths so discarded-alive sandboxes never linger silently. ### Fix B — renew warm sandbox before recording idle expiry ``putIdle`` stamps store-side expiry as ``putIdle_time + idleTimeout``, but the server-side TTL has been ticking since ``buildSandboxFromSpec()`` /``_build_sandbox_from_spec()`` — through the readiness wait *and* any ``warmupSandboxPreparer`` work. With a long preparer (e.g. 45s with ``idleTimeout=120s`` and ``acquireMinRemainingTtl=60s``), the store can report >60s remaining while the server has far less, letting the min-TTL gate hand out a sandbox that still expires during acquire/connect. Added one ``sandbox.renew(config.idleTimeout)`` call right before returning the id from ``createOneSandbox`` / ``_create_one_sandbox``, so the server-side TTL is reset to match the store's stamp. Kotlin and Python sync+async all updated. ### Tests added (test_pool_kill_discarded.py) * ``_schedule_kill_discarded_alive`` returns immediately for both sync (executor-backed) and async (``create_task``); verified by submitting a 100ms-delay manager and asserting <50ms / <10ms scheduling latency. The kills still complete on the executor / event loop afterwards. * ``_create_one_sandbox`` calls ``renew(idle_timeout)`` exactly once before returning the id — covered for both sync and async pools using a ``RenewTracking*Sandbox`` fake that records ``renew`` calls. ### Test results Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``): all green. Python ``pytest``: 207 passed (203 + 4 new). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI's SDK Tests failed two quality jobs on the previous head:
- ``Kotlin SDK Quality And Tests (sandbox)`` — ``spotlessCheck`` (ktlint
parser) errored on a KDoc that contained an unbalanced literal
``{...}`` followed by ``"..."``. Rewrote the comment as plain prose so
the parser stops trying to resolve braces inside the docstring. Also
picked up the Spotless auto-format pass on a few unrelated lines in
``InMemoryPoolStateStore`` / ``TakeIdleResult`` (whitespace + line
break normalization).
- ``Python SDK Quality (sandbox)`` — ``ruff check`` flagged two
categories from this PR:
* ``I001`` import ordering in the new helpers in
``pool_types`` / sync+async reconcilers / sync+async pool. Fixed
via ``ruff --fix`` (split into two import groups, alphabetized).
* ``UP037`` quoted self-referential type annotations in the new
``_RenewTracking{Sync,Async}Sandbox`` test fakes. Quotes removed.
* ``B905`` ``zip()`` without ``strict=`` in the new async
``_kill_discarded_alive`` parallel kill loop. Now ``strict=True``.
No source-level behavior change. Existing Python suite still 207/207;
Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7
via ``OPENSANDBOX_TEST_REDIS_URL``) green; ``spotlessCheck`` and
``ruff check`` both clean locally.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the thorough review @Pangjiping! 🙏 Quick status on the three non-blocking suggestions: 1. PR description outdated — fixed (updated the body just now). Reflects the auto-derived default 2. Kotlin 3. Changelog note on default behavioral change — the project doesn't carry per-SDK CHANGELOG files yet (only Recap of all changes since the dismissed LGTM review
Local verification on
CI jobs on the new head are currently |
CI's Python SDK Quality (sandbox) job failed at the pyright step with 10 errors after I ignored that lint in earlier rounds. All on this PR's new code; tests and ruff are unaffected. ### Three categories **(1) ``timedelta | None`` not assignable to ``timedelta`` (4 errors)** The auto-default change in 4e4f12c made ``PoolConfig.acquire_min_remaining_ttl`` declared as ``timedelta | None`` (``None`` means "auto-derive from idle_timeout" at ``__post_init__``). Even though ``__post_init__`` always resolves it to a concrete ``timedelta`` via ``object.__setattr__``, pyright can't see through that, so passing it to the helper functions tripped. Fix: widen the helper signatures to accept ``timedelta | None`` and treat ``None`` as zero. This matches what ``__post_init__`` would do anyway and removes the need for ``cast`` at every call site. **(2) ``_kill_sandbox_best_effort`` returns ``bool`` but reconciler wants ``Callable[[str], None]`` (2 errors)** In 33ee6f3 ``_kill_sandbox_best_effort`` was changed to return ``bool`` so ``_kill_discarded_alive`` could fire the success debug log only on a confirmed kill. But the reconciler's ``on_discard_sandbox`` parameter is typed ``Callable[[str], None]``, and ``bool`` is not a subtype of ``None``. Fix: introduce a thin ``_discard_sandbox_callback`` adapter on both pool facades that calls ``_kill_sandbox_best_effort`` and drops the return. The reconciler now receives the adapter; the bool is still available to ``_kill_discarded_alive`` for the success-only log. **(3) ``getattr(store, ...)`` returns ``object`` (4 errors)** The Python ``hasattr``-based store-compat shim in ``pool_types.py`` uses ``getattr(store, "try_take_idle_min_ttl", None)``, which pyright types as ``object``. Calling that or awaiting it tripped four errors in the helpers. Fix: ``cast`` each ``getattr`` result to the actual signature (``Awaitable[TakeIdleResult]`` / ``Iterable[str] | None`` etc.). Behavior unchanged at runtime — ``callable(method)`` still gates the call. ### Verification - ``uv run pyright`` — 0 errors - ``uv run ruff check`` — clean - ``uv run pytest`` — 207 passed - Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7) + ``spotlessCheck`` — all green Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed All local checks pass on
@Pangjiping — would appreciate workflow approval when you have a moment so the new run can show green 🙏 |
Closes #983.
Summary
When using the distributed sandbox warm pool,
pool.acquire()could return an idle sandbox with only milliseconds of remaining TTL. The subsequentcheckReadypolling (up to 30s) would see the sandbox expire server-side, fail withREADY_TIMEOUT, and never get a chance to callrenew.Root cause:
tryTakeIdleused a binaryexpiresAt > nowcheck — a sandbox with 1ms remaining TTL still passed.What changed
Acquire-time near-expiry filtering
PoolConfig.acquireMinRemainingTtl: Duration. Idle entries whose remaining TTL is below this threshold are skipped on acquire and reaped on reconcile, so the chosen sandbox always has enough remaining TTL to survivecheckReady+renew.idleTimeout:min(60s, idleTimeout / 2)(always strictly less thanidleTimeout). Existing users with short idle timeouts are not silently broken; users with the default 24h idle timeout get a 60s threshold. PassDuration.ZEROto opt out and restore the pre-existing binary-expiry behavior.idleTimeoutfor explicit values (build()rejects>= idleTimeout).Discarded-alive cleanup (no resource leaks)
tryTakeIdle(name, minTtl)andreapExpiredIdle(name, now, minTtl)return discarded-alive IDs to the caller (TakeIdleResultin Kotlin, plain tuple in Python). Already-expired entries are silently dropped — the server has reaped them.SandboxPool.acquire(Kotlin and Python sync+async) defers the kill cleanup until after the chosen candidate is connected and renewed, so slow kill RPCs cannot consume the candidate's remaining TTL. The cleanup runs on the warmup executor (Kotlin/Python sync) or viaasyncio.create_task(Python async) so the caller does not block on N kill round-trips.PoolReconcilerroutes the reaped near-expiry IDs through the existingonDiscardSandboxcallback, which fireskillSandboxbest-effort — same path used byshrinkExcessIdle.Warmup TTL drift fix
createOneSandbox/_create_one_sandboxnow callssandbox.renew(idleTimeout)after the warmup preparer runs and before returning the id. The store stamp atputIdleis now consistent with the server-side TTL, so the min-TTL gate cannot overestimate remaining TTL by the warmup duration.Languages
sandbox,sandbox-pool-redis)Behavioral change worth noting
This PR changes the default behavior of
pool.acquire()for new opt-in users:expiresAt > now, even by 1msmin(60s, idleTimeout/2)remainingREADY_TIMEOUTrace possibleUsers who relied on the old behavior (e.g. tests with
idleTimeoutdeliberately short) can passacquireMinRemainingTtl(Duration.ZERO)to opt out. The auto-default is intentionally derived fromidleTimeoutso existing pools withidleTimeout <= 60sare not broken at construction time.Tests
Kotlin
:sandbox:test—InMemoryPoolStateStoreTest(alive vs expired discarded, threshold edge cases),PoolConfigTest(auto-default scaling, validation),SandboxPoolTest(builder forwarding).:sandbox-pool-redis:test(gated onOPENSANDBOX_TEST_REDIS_URL) — same three behaviors against real Redis 7.spotlessCheckclean.Python
pytest— 207 passed (192 prior + 15 added acrosstest_pool_store.py,test_pool_config.py,test_redis_pool_store.py,test_async_redis_pool_store.py,test_pool_kill_discarded.py).test_pool_kill_discarded.pycovers: scheduling latency for both sync (executor) and async (create_task) is non-blocking; kill log fires only on confirmed success; warmup pipeline callsrenew(idle_timeout)before returning the id.ruff checkclean.Testing
Breaking Changes
pool.acquire()default (see table above). Backward-compatible escape hatch:acquireMinRemainingTtl(Duration.ZERO). Validated to not break existing pool constructors with shortidleTimeout.PoolStateStoreship with default impls so custom user stores keep compiling.Checklist
Commits in this PR (chronological)
d921dfa7— initial Kotlin impl (opt-in, default ZERO)8a6e9539— align with Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes #983 spec, mirror to Python SDK, default 60s, cross-validation3f7dddf9— exposeacquire_min_remaining_ttlon Python pool facades4e4f12c8— kill discarded-alive sandboxes, auto-scale default tomin(60s, idleTimeout/2)33ee6f3d— address @Pangjiping review (data class, dead-if removal, success-only kill log, async parallel kills)a170c1af— defer discarded-alive kill until after candidate is connected/renewed; renew warm sandbox beforeputIdle8d97a64f— fix CI lint failures (ktlint KDoc parse, ruff I001/UP037/B905)