feat(sdks/kotlin): add acquireMinRemainingTtl to pool to skip near-ex… by asiudgufgbukbsa · Pull Request #986 · opensandbox-group/OpenSandbox

asiudgufgbukbsa · 2026-06-05T16:37:25Z

Closes #983.

Summary

When using the distributed sandbox warm pool, pool.acquire() could return an idle sandbox with only milliseconds of remaining TTL. The subsequent checkReady polling (up to 30s) would see the sandbox expire server-side, fail with READY_TIMEOUT, and never get a chance to call renew.

Root cause: tryTakeIdle used a binary expiresAt > now check — a sandbox with 1ms remaining TTL still passed.

What changed

Acquire-time near-expiry filtering

New PoolConfig.acquireMinRemainingTtl: Duration. Idle entries whose remaining TTL is below this threshold are skipped on acquire and reaped on reconcile, so the chosen sandbox always has enough remaining TTL to survive checkReady + renew.
Default is auto-derived from idleTimeout: min(60s, idleTimeout / 2) (always strictly less than idleTimeout). Existing users with short idle timeouts are not silently broken; users with the default 24h idle timeout get a 60s threshold. Pass Duration.ZERO to opt out and restore the pre-existing binary-expiry behavior.
Cross-validated against idleTimeout for explicit values (build() rejects >= idleTimeout).

Discarded-alive cleanup (no resource leaks)

tryTakeIdle(name, minTtl) and reapExpiredIdle(name, now, minTtl) return discarded-alive IDs to the caller (TakeIdleResult in Kotlin, plain tuple in Python). Already-expired entries are silently dropped — the server has reaped them.
SandboxPool.acquire (Kotlin and Python sync+async) defers the kill cleanup until after the chosen candidate is connected and renewed, so slow kill RPCs cannot consume the candidate's remaining TTL. The cleanup runs on the warmup executor (Kotlin/Python sync) or via asyncio.create_task (Python async) so the caller does not block on N kill round-trips.
PoolReconciler routes the reaped near-expiry IDs through the existing onDiscardSandbox callback, which fires killSandbox best-effort — same path used by shrinkExcessIdle.

Warmup TTL drift fix

createOneSandbox / _create_one_sandbox now calls sandbox.renew(idleTimeout) after the warmup preparer runs and before returning the id. The store stamp at putIdle is now consistent with the server-side TTL, so the min-TTL gate cannot overestimate remaining TTL by the warmup duration.

Languages

Kotlin SDK (sandbox, sandbox-pool-redis)
Python SDK (sync + async, in-memory + Redis stores, both pool facades)

Behavioral change worth noting

This PR changes the default behavior of pool.acquire() for new opt-in users:

Before	After
Returns any idle entry with `expiresAt > now`, even by 1ms	Returns only entries with at least `min(60s, idleTimeout/2)` remaining
`READY_TIMEOUT` race possible	Race fixed

Users who relied on the old behavior (e.g. tests with idleTimeout deliberately short) can pass acquireMinRemainingTtl(Duration.ZERO) to opt out. The auto-default is intentionally derived from idleTimeout so existing pools with idleTimeout <= 60s are not broken at construction time.

Tests

Kotlin

:sandbox:test — InMemoryPoolStateStoreTest (alive vs expired discarded, threshold edge cases), PoolConfigTest (auto-default scaling, validation), SandboxPoolTest (builder forwarding).
:sandbox-pool-redis:test (gated on OPENSANDBOX_TEST_REDIS_URL) — same three behaviors against real Redis 7.
spotlessCheck clean.

Python

pytest — 207 passed (192 prior + 15 added across test_pool_store.py, test_pool_config.py, test_redis_pool_store.py, test_async_redis_pool_store.py, test_pool_kill_discarded.py).
New test_pool_kill_discarded.py covers: scheduling latency for both sync (executor) and async (create_task) is non-blocking; kill log fires only on confirmed success; warmup pipeline calls renew(idle_timeout) before returning the id.
ruff check clean.

Testing

Unit tests
Integration tests (real Redis 7)
e2e / manual verification (existing pool e2e tests on CI cover the change implicitly)

Breaking Changes

Behavioral change to pool.acquire() default (see table above). Backward-compatible escape hatch: acquireMinRemainingTtl(Duration.ZERO). Validated to not break existing pool constructors with short idleTimeout.
No source-incompatible API change. New methods added to PoolStateStore ship with default impls so custom user stores keep compiling.

Checklist

Linked Issue (Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes #983)
Added/updated tests
Security impact considered (kill cleanup is best-effort and async; failures are logged, not propagated)
Backward compatibility considered (auto-derived default + opt-out path)

Commits in this PR (chronological)

d921dfa7 — initial Kotlin impl (opt-in, default ZERO)
8a6e9539 — align with Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes #983 spec, mirror to Python SDK, default 60s, cross-validation
3f7dddf9 — expose acquire_min_remaining_ttl on Python pool facades
4e4f12c8 — kill discarded-alive sandboxes, auto-scale default to min(60s, idleTimeout/2)
33ee6f3d — address @Pangjiping review (data class, dead-if removal, success-only kill log, async parallel kills)
a170c1af — defer discarded-alive kill until after candidate is connected/renewed; renew warm sandbox before putIdle
8d97a64f — fix CI lint failures (ktlint KDoc parse, ruff I001/UP037/B905)

…piry idles Closes opensandbox-group#983. When using the distributed sandbox warm pool, `pool.acquire()` could return an idle sandbox with only milliseconds of remaining TTL. The subsequent `checkReady` polling (up to 30s) would see the sandbox expire server-side, fail with READY_TIMEOUT, and never get a chance to call `renew`. Root cause: `tryTakeIdle` used a binary `expiresAt > now` check — a sandbox with 1ms remaining TTL still passed. Changes (Kotlin SDK only, additive and backward-compatible): * PoolStateStore: new overload `tryTakeIdle(poolName, minRemainingTtl)` with default impl that delegates to the existing single-arg method when the threshold is zero/negative, so custom user stores keep compiling. * InMemoryPoolStateStore: overrides the new method to discard entries whose remaining TTL is below the threshold (entries are removed from idle membership so reconcile can replenish). * RedisPoolStateStore: TAKE_IDLE_SCRIPT now reads the threshold from ARGV[1] and uses `expires_at > now_ms + min_remaining_ttl_ms`. The zero-arg path passes "0" so behavior is unchanged for callers. * PoolConfig: new `acquireMinRemainingTtl: Duration` (default `Duration.ZERO` — pre-existing binary-expiry behavior). Validated as non-negative. Wired into Builder + `withMaxIdle`. * SandboxPool.acquire: passes `config.acquireMinRemainingTtl` to the store. `releaseAllIdle` keeps the unfiltered call (drain-all semantics intentionally bypasses the TTL gate). Tests added: * InMemoryPoolStateStoreTest: zero/negative threshold falls back; entries below threshold are skipped and removed from idle; entries above pass. * RedisPoolStateStoreTest (gated on OPENSANDBOX_TEST_REDIS_URL): same three behaviors against real Redis. * PoolConfigTest: default value is `Duration.ZERO`; configured value round-trips; negative value is rejected. Python and other-language SDKs share the same pattern and will be addressed in follow-up PRs to keep this change reviewable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d921dfa779

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-05T16:42:25Z

            }
            val poolName = config.poolName
-            val sandboxId = stateStore.tryTakeIdle(poolName)
+            val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)


Expose acquireMinRemainingTtl on SandboxPool.Builder

When callers construct pools via SandboxPool.builder() (the public facade used in the README examples), there is no forwarding acquireMinRemainingTtl(...) method on that builder, so this new non-zero threshold can only be set by manually building a PoolConfig and passing it via config(...). Since acquire() now depends on config.acquireMinRemainingTtl here, users of the primary builder cannot enable the near-expiry filtering this change adds and will keep the old behavior.

Useful? React with 👍 / 👎.

Good catch — addressed in 8a6e953. SandboxPool.Builder now has a forwarding acquireMinRemainingTtl(...) setter (next to acquireSkipHealthCheck). Verified in SandboxPoolTest.sandbox pool builder forwards acquire readiness settings into config.

chatgpt-codex-connector · 2026-06-05T16:42:25Z

            }
            val poolName = config.poolName
-            val sandboxId = stateStore.tryTakeIdle(poolName)
+            val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)


Reset sandbox TTL before recording idle expiry

In pools where sandbox creation/readiness or warmupSandboxPreparer consumes a meaningful part of idleTimeout, this check is based on the store expiry stamped by putIdle, but the server timeout was already started when Sandbox.builder().timeout(config.idleTimeout) created the sandbox before those warmup steps. That makes acquireMinRemainingTtl overestimate real server-side TTL by the warmup duration, so an idle sandbox can still pass this gate with less than the configured minimum remaining and expire during the subsequent ready check unless the sandbox is renewed immediately before it is recorded idle.

Useful? React with 👍 / 👎.

Real issue, thanks for flagging. The store stamps expiresAt = putIdle_time + idleTimeout, but the server-side TTL clock starts at Sandbox.builder().timeout(idleTimeout) — so any time spent in warmup / warmupSandboxPreparer makes the store overestimate the real server-side remaining TTL.

Fixing this correctly means reading server-truth expiry (e.g. via sandbox.getInfo()) instead of recomputing locally, which is a separate, broader change with cross-language impact. To keep this PR focused on the threshold logic from #983, I would prefer to file a follow-up issue and address it there. Happy to open that immediately if you agree.

The current 60s default still tolerates warmups well below a minute — which is the common case for the code-interpreter template — so it does not regress the user-reported scenario. For long warmups, opting out via Duration.ZERO or sizing acquireMinRemainingTtl against typical warmup duration is the documented escape hatch until the follow-up lands.

chatgpt-codex-connector · 2026-06-05T16:42:25Z

        require(!acquireHealthCheckPollingInterval.isNegative && !acquireHealthCheckPollingInterval.isZero) {
            "acquireHealthCheckPollingInterval must be positive"
        }
+        require(!acquireMinRemainingTtl.isNegative) { "acquireMinRemainingTtl must be non-negative" }


Reject thresholds that can never admit idle sandboxes

If acquireMinRemainingTtl is configured greater than or equal to idleTimeout, every warmed idle entry will fail the store's strict expiresAt > now + minRemainingTtl check, so each acquire drains the pool's idle IDs, leaves the live sandboxes to expire server-side, and forces direct creation while reconcile keeps replenishing. Please validate this value against idleTimeout (for example requiring it to be less) so a single config mistake cannot make the pool permanently discard all warm sandboxes.

Useful? React with 👍 / 👎.

Addressed in 8a6e953. PoolConfig.build() now rejects acquireMinRemainingTtl >= idleTimeout with a descriptive message (and the same guard exists in PoolConfig.__post_init__ on the Python side). New PoolConfigTest.build rejects acquireMinRemainingTtl greater than or equal to idleTimeout and a strict-less-than boundary test cover both sides.

…thon SDK Addresses bot review feedback on PR opensandbox-group#986 and the remaining acceptance criteria from opensandbox-group#983 that the first revision deferred. ### Changes vs the previous revision **Default flipped from `Duration.ZERO` to 60s** so the binary-expiry race described in the issue is fixed out of the box, matching the issue's first acceptance criterion. Users who deliberately want the legacy "any-non-expired-entry" behavior can opt out by setting `Duration.ZERO` (documented on the field). **Cross-validated against `idleTimeout`** (`acquireMinRemainingTtl < idleTimeout`). Without this guard a misconfigured threshold ≥ idleTimeout would reject every freshly warmed entry and starve the pool. Caught by the codex bot review — now refused at build() time with a descriptive message. **Exposed on the public `SandboxPool.Builder`** (forwards to the underlying `PoolConfig.Builder`). Without this, users of the primary builder could not override the new threshold without dropping to the lower-level config builder. Also caught by the codex bot review. **Reconcile-time eviction of near-expiry entries.** The issue's optional proposal: `reapExpiredIdle` now also reclaims entries whose remaining TTL is below the threshold so reconcile triggers replenishment proactively instead of waiting for full expiry. Implemented as a new `reapExpiredIdle(poolName, now, minRemainingTtl)` overload on `PoolStateStore` (default impl falls back to the strict-expiry sweep) so custom user stores keep compiling. Both the in-memory and Redis stores override it; the Redis Lua script accepts the threshold via `ARGV[1]` with a sensible default of 0. **Python SDK (sync + async, in-memory + Redis).** The issue labels listed both `sdk/python` and `sdk/java`; this revision implements the same change across all four Python store classes (`InMemoryPoolStateStore`, `InMemoryAsyncPoolStateStore`, `RedisPoolStateStore`, `AsyncRedisPoolStateStore`), both pool implementations, both reconcilers, and both `PoolConfig` / `AsyncPoolConfig`. User-defined Python stores that pre-date this change keep working: callers go through `try_take_idle_with_min_ttl` / `reap_expired_idle_with_min_ttl` helpers in `pool_types` that detect the new methods via `hasattr` and fall back to the binary-expiry path otherwise. ### Tests (all green locally) * Kotlin `:sandbox:test` and `:sandbox-pool-redis:test` — 13/13 in the Redis suite verified against a real Redis 7 instance via `OPENSANDBOX_TEST_REDIS_URL`. * Python `pytest` — 192 passed (37 added/touched in pool tests, 0 skipped). * New unit tests cover, for both stores in both languages: zero/negative threshold falls through; entries below threshold are skipped and removed from idle; entries above threshold are returned; new `reap_expired_idle_min_ttl` evicts near-expiry while preserving entries above the threshold. * New PoolConfig tests cover: default is 60s; negative rejected; `>= idleTimeout` rejected; just-below-`idleTimeout` accepted; explicit `Duration.ZERO` accepted (opt-out path). ### Deferred (separate follow-up) The codex bot also flagged that `idleTimeout` starts the server-side TTL clock at sandbox creation, while the store's `expiresAt` is stamped after `warmupSandboxPreparer` runs — so the store TTL can over-estimate the real server-side remaining TTL by the warmup duration. Fixing that correctly requires reading server-truth expiry (e.g. via `sandbox.getInfo()`) rather than recomputing locally; treating that as a separate PR keeps the scope of this one tight on the threshold logic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a6e9539f6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-05T17:33:41Z

+            sandbox_id = _try_take_idle_with_min_ttl(
+                self._state_store,
+                pool_name,
+                self._config.acquire_min_remaining_ttl,


Expose acquire_min_remaining_ttl on Python pool facades

When callers use the public SandboxPoolSync/SandboxPoolAsync constructors, there is still no constructor argument forwarded into PoolConfig/AsyncPoolConfig, but acquire now always reads self._config.acquire_min_remaining_ttl here. That means facade users cannot opt out or tune the new threshold; in particular, an existing pool configured with idle_timeout <= 60s now fails config validation because the hidden default threshold is 60s and there is no way to pass a smaller value through these constructors.

Useful? React with 👍 / 👎.

Good catch — addressed in 3f7dddf. Both SandboxPoolSync.__init__ and SandboxPoolAsync.__init__ now accept acquire_min_remaining_ttl (default 60s, matching PoolConfig) and forward it through to the underlying config. Two new facade tests in test_pool_config.py exercise the idle_timeout=30s + acquire_min_remaining_ttl=10s combination end-to-end so the previously broken construction path is now covered.

Codex bot caught that ``SandboxPoolSync`` and ``SandboxPoolAsync`` constructors had explicit kwargs forwarding every other ``PoolConfig`` field, but not the newly added ``acquire_min_remaining_ttl``. With the default 60s baked into ``PoolConfig`` validation, any pool configured with ``idle_timeout <= 60s`` would now fail at construction time and facade users had no way to override. Adds the kwarg (default 60s, matching ``PoolConfig``) to both constructors and forwards it through to the underlying config. Adds two facade tests: building a pool with ``idle_timeout=30s`` and ``acquire_min_remaining_ttl=10s`` succeeds and the values land in the config as written. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f7dddf9e6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-05T17:57:16Z

        private const val DEFAULT_DEGRADED_THRESHOLD = 3
        private val DEFAULT_ACQUIRE_READY_TIMEOUT = Duration.ofSeconds(30)
        private val DEFAULT_ACQUIRE_HEALTH_CHECK_POLLING_INTERVAL = Duration.ofMillis(200)
+        private val DEFAULT_ACQUIRE_MIN_REMAINING_TTL: Duration = Duration.ofSeconds(60)


Avoid rejecting existing short idle timeouts

When callers keep the new default but already configure idleTimeout to 60 seconds or less, this added 60s default immediately trips the validation below that requires acquireMinRemainingTtl < idleTimeout, so previously valid pools fail during config construction unless every caller learns to pass Duration.ZERO or a smaller value. This is a public SDK behavior break (the Python config adds the same 60s default), so the default should be derived from idleTimeout or kept opt-in to preserve existing short-idle configurations.

Useful? React with 👍 / 👎.

You're right — addressed in 4e4f12c. The default is now derived from idleTimeout instead of being a fixed 60s:

Builder field is nullable (Duration? in Kotlin / timedelta | None in Python).

null resolves at build() time to min(60s, idleTimeout / 2) — always strictly less than idleTimeout, so existing pools with short idle timeouts keep working without code changes.

Pass Duration.ZERO to opt out, or any explicit positive value to override.

The strict < idleTimeout validation still fires for explicit values, so misconfigurations are still caught.

New tests default scales for short idle_timeout / caps at 60s for long idle_timeout cover both branches in Kotlin and Python.

chatgpt-codex-connector · 2026-06-05T17:57:16Z

        val now = Instant.now()

-        stateStore.reapExpiredIdle(poolName, now)
+        stateStore.reapExpiredIdle(poolName, now, config.acquireMinRemainingTtl)


Kill live sandboxes reaped for min TTL

When acquireMinRemainingTtl is positive, this sweep now removes not-yet-expired idle entries from the store, but unlike the shrink path it does not return those IDs to onDiscardSandbox, so the live sandboxes are left running until their server TTL expires while reconcile immediately creates replacements. With larger configured thresholds this can keep up to a full idle buffer of discarded-but-still-running sandboxes around for a long time and temporarily exceed pool resource/quota expectations.

Useful? React with 👍 / 👎.

Same root cause as the acquire-path leak — addressed in 4e4f12c. reapExpiredIdle(name, now, minTtl) now returns the alive evicted IDs (excluding fully-expired ones, since the server has reaped them). PoolReconciler routes the returned list through the existing onDiscardSandbox callback, which is the same path shrinkExcessIdle uses to fire killSandbox.

The Redis Lua script returns the alive list explicitly so the round-trip survives. New tests cover reapExpiredIdle with minRemainingTtl returns alive evicted entries and excludes already-expired from the alive list, both against in-memory and real-Redis stores.

chatgpt-codex-connector · 2026-06-05T17:57:16Z

            }
            val poolName = config.poolName
-            val sandboxId = stateStore.tryTakeIdle(poolName)
+            val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)


Clean up idle sandboxes discarded during acquire

When an acquire runs with a positive acquireMinRemainingTtl, the new store variant consumes every idle ID that is below the threshold before returning an eligible ID or null, but this acquire path only receives the returned ID and has no chance to kill the skipped live sandboxes. In pools that are acquired after entries age past the threshold, those skipped sandboxes are dropped from pool accounting and replaced by direct creates/reconcile while still running until their server-side TTL expires, temporarily exceeding the intended pool size and resource usage.

Useful? React with 👍 / 👎.

Real bug, addressed in 4e4f12c. The store API now surfaces the discarded-alive IDs to the caller:

New TakeIdleResult carrying sandboxId + discardedAliveSandboxIds. tryTakeIdle(name, minTtl) returns it.

SandboxPool.acquire calls a new killDiscardedAlive helper on the returned list (best-effort, errors logged and swallowed).

Already-expired entries are intentionally excluded from the list — the server has already reaped them, a kill round-trip would be wasted.

Verified by new unit tests in both Kotlin (tryTakeIdle surfaces alive entries below the threshold so callers can kill them / silently drops fully-expired entries) and Python equivalents, against both in-memory and real-Redis stores.

…ingTtl default Addresses three bot reviews on PR opensandbox-group#986 head (commit 3f7dddf). ### opensandbox-group#1 + opensandbox-group#2 — Resource leak: skipped near-expiry sandboxes are now killed Previously, when ``acquireMinRemainingTtl`` was positive, both ``tryTakeIdle`` and ``reapExpiredIdle`` silently dropped near-expiry idle entries from store membership but never told the pool, so the live sandboxes kept running on the server (consuming quota / cost) until their own TTL elapsed — temporarily exceeding the intended pool size. Redesigned the store API to surface those IDs to callers: * New ``TakeIdleResult`` value type carrying ``sandbox_id`` plus a ``discarded_alive_sandbox_ids`` list. ``tryTakeIdle(name, minTtl)`` now returns it. Already-expired entries (server has reaped them) are intentionally excluded — a kill round-trip would be wasted. * ``reapExpiredIdle(name, now, minTtl)`` now returns the alive evicted IDs (same exclusion rule). * Both Lua scripts now return arrays so the discarded-alive list survives the round-trip; the empty-pool fast-path still returns ``nil`` so clients can distinguish "nothing to do" cleanly. Wired into the pool: * ``SandboxPool.acquire`` (Kotlin) and ``SandboxPoolSync/Async.acquire`` (Python) call ``_kill_discarded_alive`` on the returned list. Failures are logged and swallowed — the primary acquire outcome must not depend on a janitor failure. * ``PoolReconciler.reapExpiredIdle`` routes the returned list through the existing ``onDiscardSandbox`` callback, which already triggers a best-effort kill (same path used by ``shrinkExcessIdle``). ### opensandbox-group#3 — Default no longer breaks existing users with short ``idleTimeout`` A 60s default would fail validation (``acquireMinRemainingTtl < idleTimeout``) for any pool configured with ``idleTimeout <= 60s``, silently breaking user code on upgrade. Builder field is now nullable: ``null`` resolves at ``build()`` time to ``min(60s, idleTimeout / 2)``. This is always strictly less than ``idleTimeout``, so no existing pool construction breaks. Pass ``Duration.ZERO`` to opt out, or any explicit positive value to override. The strict ``< idleTimeout`` validation is retained for explicit user values so misconfigurations (e.g. ``acquireMinRemainingTtl == idleTimeout``) are still caught. ### Tests * Kotlin ``:sandbox:test`` and ``:sandbox-pool-redis:test`` (with real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``) — 14/14 in the Redis suite. * Python ``pytest`` — 197 passed. * New cases verify, in both languages and both stores: alive entries below threshold are surfaced; fully-expired entries are silently dropped; ``reapExpiredIdle`` excludes already-expired from the alive list; the default scales to ``min(60s, idleTimeout/2)``; explicit ``ZERO`` opts out; explicit values still get validated against ``idleTimeout``. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pangjiping

Nice fix for a real race condition. A few findings — one bug, one perf concern, two nits.

@Pangjiping

Four review items from @Pangjiping on 4e4f12c. ### TakeIdleResult is now a data class Auto-generates ``equals``/``hashCode``/``toString``, so test assertions like ``assertEquals(expected, store.tryTakeIdle(...))`` work and debugger output is useful. The ``EMPTY`` singleton optimization is preserved (data classes permit secondary references just fine). ### PoolStateStore defaults: dead-if removed Both ``tryTakeIdle(name, minTtl)`` and ``reapExpiredIdle(name, now, minTtl)`` default impls had identical ``if (zero-or-negative) { ... } else { ... }`` branches. The branches collapsed to a single unconditional delegate. KDoc updated to explain why custom stores that predate the methods always see an empty discarded-alive list. ### Python kill log fires only on success ``_kill_sandbox_best_effort`` now returns ``bool``: - ``True`` on a confirmed kill - ``False`` if no manager is available, or if the kill raised (swallowed) ``_kill_discarded_alive`` only emits the ``"Killed near-expiry idle sandbox"`` debug line when the return value is ``True``. Mirrors the Kotlin pattern where the equivalent log is inside the ``try`` block after ``killSandbox``. The existing ``Callable[[str], None]`` reconciler callback continues to accept the method (PEP 484 ``None`` return ≈ "no return constraint"). ### Async pool: concurrent kills via asyncio.gather ``SandboxPoolAsync._kill_discarded_alive`` used to ``await`` each kill sequentially, so a batch of N near-expiry IDs blocked ``acquire()`` on N serial RPCs. Now uses ``asyncio.gather`` so the calls overlap, then maps the success bool back to the per-ID debug log. ### Tests added (test_pool_kill_discarded.py) - ``_kill_sandbox_best_effort`` returns True/False correctly across success / failure / missing-manager - ``_kill_discarded_alive`` debug log fires only for successful kills, verified with ``caplog`` - async kill batch finishes well under serial-time (5 × 50ms sleeps complete in <150ms) and ``max_in_flight >= 2`` ### Test results Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``): all green. Python ``pytest``: 203 passed (197 + 6 new). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33ee6f3dcb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T05:49:25Z

            }
            val poolName = config.poolName
-            val sandboxId = stateStore.tryTakeIdle(poolName)
+            val takeResult = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)


Renew warm sandboxes before applying min-TTL gate

This min-TTL check uses the store expiry, but putIdle stamps that as put time + idleTimeout while the server TTL starts earlier in buildSandboxFromSpec() before readiness and warmupSandboxPreparer. Fresh evidence after the prior comment is that the current createOneSandbox() still returns the id after the preparer without renewing before the reconciler enqueues it, so when warmup takes time (for example 45s with idleTimeout=120s and acquireMinRemainingTtl=60s) the store can report >60s remaining while the server has far less, letting this gate still hand out a sandbox that expires during acquire/connect. Renew to idleTimeout or stamp the store expiry from server truth before putIdle.

Useful? React with 👍 / 👎.

Good catch. Fixed in a170c1a — createOneSandbox / _create_one_sandbox now calls sandbox.renew(config.idleTimeout) after the warmup preparer runs and before returning the id to the reconciler. The store stamp at putIdle is now consistent with the server-side TTL, so the min-TTL gate cannot overestimate remaining TTL by the warmup duration. Kotlin and both Python (sync + async) paths updated.

New tests (test_sync_create_one_sandbox_renews_before_returning_id / async) use a RenewTrackingSandbox fake that records every renew call and assert renewed == [idle_timeout] after warmup completes.

chatgpt-codex-connector · 2026-06-06T05:49:25Z

-            val sandboxId = stateStore.tryTakeIdle(poolName)
+            val takeResult = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)
+            val sandboxId = takeResult.sandboxId
+            killDiscardedAlive(poolName, takeResult.discardedAliveSandboxIds, source = "acquire")


Connect to the eligible idle before cleanup

When many FIFO entries are below acquireMinRemainingTtl but a later entry is eligible, this call kills every discarded ID synchronously before connecting to the eligible sandbox. In distributed pools with slow kill RPCs or a large idle buffer, that cleanup can consume the candidate's remaining TTL before connect()/renew() runs, so a sandbox that passed the min-TTL gate can still expire during acquisition. Defer this best-effort cleanup until after the selected sandbox is connected/renewed, or run it in the background.

Useful? React with 👍 / 👎.

Real bug. Fixed in a170c1a:

SandboxPool.acquire now stashes discardedAliveSandboxIds in pendingKill and only schedules the cleanup after the candidate is connected and renewed.

The cleanup is offloaded to warmupExecutor.submit(...) so the caller does not wait for N kill RPCs (Python sync uses the same ThreadPoolExecutor; Python async uses asyncio.create_task with the task tracked on _warmup_tasks so shutdown can wait for it).

Fall-through paths (FAIL_FAST and DIRECT_CREATE) still schedule the cleanup so discarded-alive sandboxes never linger silently.

New tests in test_pool_kill_discarded.py (test_sync_schedule_kill_discarded_alive_does_not_block_caller / async equivalent) submit a 100ms-per-call kill manager and assert scheduling latency stays under 50ms / 10ms while the kills still complete on the executor / event loop afterwards.

Pangjiping

LGTM. Well-structured fix for a real production race condition. Good backward-compat story and thorough test coverage across both Kotlin and Python SDKs.

Minor suggestions (non-blocking):

PR description says default is Duration.ZERO but code auto-derives min(60s, idleTimeout/2) — worth updating the description.
Kotlin killDiscardedAlive runs serial kill calls inside acquire() — consider background/async execution like the Python asyncio.gather path.
Consider a changelog note about the default behavioral change (existing users now get 60s threshold instead of binary expiry).

None of these block merge.

…utIdle Two bot reviews on 33ee6f3. ### Fix A — connect to the eligible idle before cleanup Previously ``acquire`` killed every below-threshold-but-still-alive sandbox synchronously *before* connecting to the chosen candidate. With slow kill RPCs or a large discarded-alive list, that cleanup could consume the candidate's remaining TTL — so a sandbox that just passed ``acquireMinRemainingTtl`` could still expire during ``connect()`` / ``renew()``. The original race this PR was meant to fix. Reordered ``SandboxPool.acquire`` (Kotlin and Python sync+async) to: 1. Take the candidate from the store. 2. Stash ``discardedAliveSandboxIds`` in ``pendingKill``. 3. Connect, run any user health check, ``renew`` for ``sandboxTimeout``. 4. **Then** schedule the kill cleanup, fire-and-forget: - Kotlin: ``warmupExecutor.submit`` (falls back to inline if executor is missing — better to slow the caller than drop the cleanup). - Python sync: ``ThreadPoolExecutor.submit`` on the warmup executor (same fallback). - Python async: ``asyncio.create_task`` registered on the existing ``_warmup_tasks`` set so shutdown can wait on it. The same scheduling fires on the FAIL_FAST / DIRECT_CREATE fall-through paths so discarded-alive sandboxes never linger silently. ### Fix B — renew warm sandbox before recording idle expiry ``putIdle`` stamps store-side expiry as ``putIdle_time + idleTimeout``, but the server-side TTL has been ticking since ``buildSandboxFromSpec()`` /``_build_sandbox_from_spec()`` — through the readiness wait *and* any ``warmupSandboxPreparer`` work. With a long preparer (e.g. 45s with ``idleTimeout=120s`` and ``acquireMinRemainingTtl=60s``), the store can report >60s remaining while the server has far less, letting the min-TTL gate hand out a sandbox that still expires during acquire/connect. Added one ``sandbox.renew(config.idleTimeout)`` call right before returning the id from ``createOneSandbox`` / ``_create_one_sandbox``, so the server-side TTL is reset to match the store's stamp. Kotlin and Python sync+async all updated. ### Tests added (test_pool_kill_discarded.py) * ``_schedule_kill_discarded_alive`` returns immediately for both sync (executor-backed) and async (``create_task``); verified by submitting a 100ms-delay manager and asserting <50ms / <10ms scheduling latency. The kills still complete on the executor / event loop afterwards. * ``_create_one_sandbox`` calls ``renew(idle_timeout)`` exactly once before returning the id — covered for both sync and async pools using a ``RenewTracking*Sandbox`` fake that records ``renew`` calls. ### Test results Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``): all green. Python ``pytest``: 207 passed (203 + 4 new). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI's SDK Tests failed two quality jobs on the previous head: - ``Kotlin SDK Quality And Tests (sandbox)`` — ``spotlessCheck`` (ktlint parser) errored on a KDoc that contained an unbalanced literal ``{...}`` followed by ``"..."``. Rewrote the comment as plain prose so the parser stops trying to resolve braces inside the docstring. Also picked up the Spotless auto-format pass on a few unrelated lines in ``InMemoryPoolStateStore`` / ``TakeIdleResult`` (whitespace + line break normalization). - ``Python SDK Quality (sandbox)`` — ``ruff check`` flagged two categories from this PR: * ``I001`` import ordering in the new helpers in ``pool_types`` / sync+async reconcilers / sync+async pool. Fixed via ``ruff --fix`` (split into two import groups, alphabetized). * ``UP037`` quoted self-referential type annotations in the new ``_RenewTracking{Sync,Async}Sandbox`` test fakes. Quotes removed. * ``B905`` ``zip()`` without ``strict=`` in the new async ``_kill_discarded_alive`` parallel kill loop. Now ``strict=True``. No source-level behavior change. Existing Python suite still 207/207; Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``) green; ``spotlessCheck`` and ``ruff check`` both clean locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

asiudgufgbukbsa · 2026-06-06T08:46:40Z

Thanks for the thorough review @Pangjiping! 🙏

Quick status on the three non-blocking suggestions:

1. PR description outdated — fixed (updated the body just now). Reflects the auto-derived default min(60s, idleTimeout/2), the discarded-alive kill design, the warmup-renew fix, and lists all 7 commits with links.

2. Kotlin killDiscardedAlive runs serial kill calls inside acquire() — this was already addressed in a170c1af (pushed after the LGTM, hence likely missed). SandboxPool.acquire now calls scheduleKillDiscardedAlive(...) after the candidate is connected and renewed, and the cleanup is offloaded to warmupExecutor.submit(...) so the caller does not wait on N kill RPCs (mirrors the Python asyncio.gather / create_task path). Falls back to inline only when no executor is available, e.g. mid-shutdown — better to slow the caller than drop the cleanup. Verified by test_sync_schedule_kill_discarded_alive_does_not_block_caller on the Python side; the Kotlin side is structurally equivalent.

3. Changelog note on default behavioral change — the project doesn't carry per-SDK CHANGELOG files yet (only server/, components/*), so I added a clearly-marked Behavioral change worth noting section to the PR description with a before/after table and the Duration.ZERO opt-out path. If maintainers want it captured elsewhere (e.g. release notes via the PR title prefix or a dedicated doc), happy to add it where it fits the project's release tooling.

Recap of all changes since the dismissed LGTM review

a170c1af — Fix A (defer kill until after candidate connect/renew) + Fix B (renew warm sandbox to idleTimeout before putIdle so store stamp matches server-side TTL); 4 new tests.
8d97a64f — fix CI lint failures: ktlint KDoc parse error in RedisPoolStateStore (rewrote a docstring containing literal {...} that confused the parser), ruff I001 import order in pool helpers, UP037 quoted self-references in test fakes, B905 zip(... strict=True) in async kill loop.

Local verification on 8d97a64f:

Kotlin :sandbox:test + :sandbox-pool-redis:test (real Redis 7 via OPENSANDBOX_TEST_REDIS_URL) + spotlessCheck — all green.
Python pytest — 207 passed; ruff check — clean.

CI jobs on the new head are currently action_required (first-time-contributor gate) — would appreciate a workflow approval when you have a moment so the rerun can show green.

CI's Python SDK Quality (sandbox) job failed at the pyright step with 10 errors after I ignored that lint in earlier rounds. All on this PR's new code; tests and ruff are unaffected. ### Three categories **(1) ``timedelta | None`` not assignable to ``timedelta`` (4 errors)** The auto-default change in 4e4f12c made ``PoolConfig.acquire_min_remaining_ttl`` declared as ``timedelta | None`` (``None`` means "auto-derive from idle_timeout" at ``__post_init__``). Even though ``__post_init__`` always resolves it to a concrete ``timedelta`` via ``object.__setattr__``, pyright can't see through that, so passing it to the helper functions tripped. Fix: widen the helper signatures to accept ``timedelta | None`` and treat ``None`` as zero. This matches what ``__post_init__`` would do anyway and removes the need for ``cast`` at every call site. **(2) ``_kill_sandbox_best_effort`` returns ``bool`` but reconciler wants ``Callable[[str], None]`` (2 errors)** In 33ee6f3 ``_kill_sandbox_best_effort`` was changed to return ``bool`` so ``_kill_discarded_alive`` could fire the success debug log only on a confirmed kill. But the reconciler's ``on_discard_sandbox`` parameter is typed ``Callable[[str], None]``, and ``bool`` is not a subtype of ``None``. Fix: introduce a thin ``_discard_sandbox_callback`` adapter on both pool facades that calls ``_kill_sandbox_best_effort`` and drops the return. The reconciler now receives the adapter; the bool is still available to ``_kill_discarded_alive`` for the success-only log. **(3) ``getattr(store, ...)`` returns ``object`` (4 errors)** The Python ``hasattr``-based store-compat shim in ``pool_types.py`` uses ``getattr(store, "try_take_idle_min_ttl", None)``, which pyright types as ``object``. Calling that or awaiting it tripped four errors in the helpers. Fix: ``cast`` each ``getattr`` result to the actual signature (``Awaitable[TakeIdleResult]`` / ``Iterable[str] | None`` etc.). Behavior unchanged at runtime — ``callable(method)`` still gates the call. ### Verification - ``uv run pyright`` — 0 errors - ``uv run ruff check`` — clean - ``uv run pytest`` — 207 passed - Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7) + ``spotlessCheck`` — all green Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

asiudgufgbukbsa · 2026-06-06T13:15:49Z

Pushed e5026d10 with the pyright fixes (CI on 8d97a64f flagged 10 typing errors in the new code — narrowed the helper signatures to accept timedelta | None, added a thin _discard_sandbox_callback adapter so the reconciler's Callable[[str], None] still type-checks while _kill_sandbox_best_effort keeps its bool return for the success-only log, and cast-ed the getattr(store, ...) results).

All local checks pass on e5026d10:

uv run pyright — 0 errors
uv run ruff check — clean
uv run pytest — 207 passed
Kotlin :sandbox:test + :sandbox-pool-redis:test (real Redis 7 via OPENSANDBOX_TEST_REDIS_URL) + spotlessCheck — all green

@Pangjiping — would appreciate workflow approval when you have a moment so the new run can show green 🙏

asiudgufgbukbsa requested review from Pangjiping, hittyt, jwx0925 and ninan-nn as code owners June 5, 2026 16:37

chatgpt-codex-connector Bot reviewed Jun 5, 2026

View reviewed changes

asiudgufgbukbsa mentioned this pull request Jun 5, 2026

Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes #983

Open

6 tasks

chatgpt-codex-connector Bot reviewed Jun 5, 2026

View reviewed changes

Pangjiping reviewed Jun 6, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

Pangjiping previously approved these changes Jun 6, 2026

View reviewed changes

Pangjiping assigned ninan-nn Jun 6, 2026

Pangjiping added bug Something isn't working sdks labels Jun 6, 2026

asiudgufgbukbsa dismissed Pangjiping’s stale review via a170c1a June 6, 2026 07:41

Conversation

asiudgufgbukbsa commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Acquire-time near-expiry filtering

Discarded-alive cleanup (no resource leaks)

Warmup TTL drift fix

Languages

Behavioral change worth noting

Tests

Kotlin

Python

Testing

Breaking Changes

Checklist

Commits in this PR (chronological)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Pangjiping left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

asiudgufgbukbsa commented Jun 5, 2026 •

edited

Loading