Skip to content

feat(sdks/kotlin): add acquireMinRemainingTtl to pool to skip near-ex…#986

Open
asiudgufgbukbsa wants to merge 8 commits into
opensandbox-group:mainfrom
asiudgufgbukbsa:feat/pool-acquire-min-remaining-ttl
Open

feat(sdks/kotlin): add acquireMinRemainingTtl to pool to skip near-ex…#986
asiudgufgbukbsa wants to merge 8 commits into
opensandbox-group:mainfrom
asiudgufgbukbsa:feat/pool-acquire-min-remaining-ttl

Conversation

@asiudgufgbukbsa
Copy link
Copy Markdown
Contributor

@asiudgufgbukbsa asiudgufgbukbsa commented Jun 5, 2026

Closes #983.

Summary

When using the distributed sandbox warm pool, pool.acquire() could return an idle sandbox with only milliseconds of remaining TTL. The subsequent checkReady polling (up to 30s) would see the sandbox expire server-side, fail with READY_TIMEOUT, and never get a chance to call renew.

Root cause: tryTakeIdle used a binary expiresAt > now check — a sandbox with 1ms remaining TTL still passed.

What changed

Acquire-time near-expiry filtering

  • New PoolConfig.acquireMinRemainingTtl: Duration. Idle entries whose remaining TTL is below this threshold are skipped on acquire and reaped on reconcile, so the chosen sandbox always has enough remaining TTL to survive checkReady + renew.
  • Default is auto-derived from idleTimeout: min(60s, idleTimeout / 2) (always strictly less than idleTimeout). Existing users with short idle timeouts are not silently broken; users with the default 24h idle timeout get a 60s threshold. Pass Duration.ZERO to opt out and restore the pre-existing binary-expiry behavior.
  • Cross-validated against idleTimeout for explicit values (build() rejects >= idleTimeout).

Discarded-alive cleanup (no resource leaks)

  • tryTakeIdle(name, minTtl) and reapExpiredIdle(name, now, minTtl) return discarded-alive IDs to the caller (TakeIdleResult in Kotlin, plain tuple in Python). Already-expired entries are silently dropped — the server has reaped them.
  • SandboxPool.acquire (Kotlin and Python sync+async) defers the kill cleanup until after the chosen candidate is connected and renewed, so slow kill RPCs cannot consume the candidate's remaining TTL. The cleanup runs on the warmup executor (Kotlin/Python sync) or via asyncio.create_task (Python async) so the caller does not block on N kill round-trips.
  • PoolReconciler routes the reaped near-expiry IDs through the existing onDiscardSandbox callback, which fires killSandbox best-effort — same path used by shrinkExcessIdle.

Warmup TTL drift fix

  • createOneSandbox / _create_one_sandbox now calls sandbox.renew(idleTimeout) after the warmup preparer runs and before returning the id. The store stamp at putIdle is now consistent with the server-side TTL, so the min-TTL gate cannot overestimate remaining TTL by the warmup duration.

Languages

  • Kotlin SDK (sandbox, sandbox-pool-redis)
  • Python SDK (sync + async, in-memory + Redis stores, both pool facades)

Behavioral change worth noting

This PR changes the default behavior of pool.acquire() for new opt-in users:

Before After
Returns any idle entry with expiresAt > now, even by 1ms Returns only entries with at least min(60s, idleTimeout/2) remaining
READY_TIMEOUT race possible Race fixed

Users who relied on the old behavior (e.g. tests with idleTimeout deliberately short) can pass acquireMinRemainingTtl(Duration.ZERO) to opt out. The auto-default is intentionally derived from idleTimeout so existing pools with idleTimeout <= 60s are not broken at construction time.

Tests

Kotlin

  • :sandbox:testInMemoryPoolStateStoreTest (alive vs expired discarded, threshold edge cases), PoolConfigTest (auto-default scaling, validation), SandboxPoolTest (builder forwarding).
  • :sandbox-pool-redis:test (gated on OPENSANDBOX_TEST_REDIS_URL) — same three behaviors against real Redis 7.
  • spotlessCheck clean.

Python

  • pytest — 207 passed (192 prior + 15 added across test_pool_store.py, test_pool_config.py, test_redis_pool_store.py, test_async_redis_pool_store.py, test_pool_kill_discarded.py).
  • New test_pool_kill_discarded.py covers: scheduling latency for both sync (executor) and async (create_task) is non-blocking; kill log fires only on confirmed success; warmup pipeline calls renew(idle_timeout) before returning the id.
  • ruff check clean.

Testing

  • Unit tests
  • Integration tests (real Redis 7)
  • e2e / manual verification (existing pool e2e tests on CI cover the change implicitly)

Breaking Changes

  • Behavioral change to pool.acquire() default (see table above). Backward-compatible escape hatch: acquireMinRemainingTtl(Duration.ZERO). Validated to not break existing pool constructors with short idleTimeout.
  • No source-incompatible API change. New methods added to PoolStateStore ship with default impls so custom user stores keep compiling.

Checklist

Commits in this PR (chronological)

  1. d921dfa7 — initial Kotlin impl (opt-in, default ZERO)
  2. 8a6e9539 — align with Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes #983 spec, mirror to Python SDK, default 60s, cross-validation
  3. 3f7dddf9 — expose acquire_min_remaining_ttl on Python pool facades
  4. 4e4f12c8 — kill discarded-alive sandboxes, auto-scale default to min(60s, idleTimeout/2)
  5. 33ee6f3d — address @Pangjiping review (data class, dead-if removal, success-only kill log, async parallel kills)
  6. a170c1af — defer discarded-alive kill until after candidate is connected/renewed; renew warm sandbox before putIdle
  7. 8d97a64f — fix CI lint failures (ktlint KDoc parse, ruff I001/UP037/B905)

…piry idles

Closes opensandbox-group#983.

When using the distributed sandbox warm pool, `pool.acquire()` could return
an idle sandbox with only milliseconds of remaining TTL. The subsequent
`checkReady` polling (up to 30s) would see the sandbox expire server-side,
fail with READY_TIMEOUT, and never get a chance to call `renew`.

Root cause: `tryTakeIdle` used a binary `expiresAt > now` check — a sandbox
with 1ms remaining TTL still passed.

Changes (Kotlin SDK only, additive and backward-compatible):

* PoolStateStore: new overload `tryTakeIdle(poolName, minRemainingTtl)`
  with default impl that delegates to the existing single-arg method when
  the threshold is zero/negative, so custom user stores keep compiling.
* InMemoryPoolStateStore: overrides the new method to discard entries
  whose remaining TTL is below the threshold (entries are removed from
  idle membership so reconcile can replenish).
* RedisPoolStateStore: TAKE_IDLE_SCRIPT now reads the threshold from
  ARGV[1] and uses `expires_at > now_ms + min_remaining_ttl_ms`. The
  zero-arg path passes "0" so behavior is unchanged for callers.
* PoolConfig: new `acquireMinRemainingTtl: Duration` (default
  `Duration.ZERO` — pre-existing binary-expiry behavior). Validated as
  non-negative. Wired into Builder + `withMaxIdle`.
* SandboxPool.acquire: passes `config.acquireMinRemainingTtl` to the
  store. `releaseAllIdle` keeps the unfiltered call (drain-all
  semantics intentionally bypasses the TTL gate).

Tests added:

* InMemoryPoolStateStoreTest: zero/negative threshold falls back; entries
  below threshold are skipped and removed from idle; entries above pass.
* RedisPoolStateStoreTest (gated on OPENSANDBOX_TEST_REDIS_URL): same
  three behaviors against real Redis.
* PoolConfigTest: default value is `Duration.ZERO`; configured value
  round-trips; negative value is rejected.

Python and other-language SDKs share the same pattern and will be
addressed in follow-up PRs to keep this change reviewable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d921dfa779

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
val poolName = config.poolName
val sandboxId = stateStore.tryTakeIdle(poolName)
val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expose acquireMinRemainingTtl on SandboxPool.Builder

When callers construct pools via SandboxPool.builder() (the public facade used in the README examples), there is no forwarding acquireMinRemainingTtl(...) method on that builder, so this new non-zero threshold can only be set by manually building a PoolConfig and passing it via config(...). Since acquire() now depends on config.acquireMinRemainingTtl here, users of the primary builder cannot enable the near-expiry filtering this change adds and will keep the old behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in 8a6e953. SandboxPool.Builder now has a forwarding acquireMinRemainingTtl(...) setter (next to acquireSkipHealthCheck). Verified in SandboxPoolTest.sandbox pool builder forwards acquire readiness settings into config.

}
val poolName = config.poolName
val sandboxId = stateStore.tryTakeIdle(poolName)
val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reset sandbox TTL before recording idle expiry

In pools where sandbox creation/readiness or warmupSandboxPreparer consumes a meaningful part of idleTimeout, this check is based on the store expiry stamped by putIdle, but the server timeout was already started when Sandbox.builder().timeout(config.idleTimeout) created the sandbox before those warmup steps. That makes acquireMinRemainingTtl overestimate real server-side TTL by the warmup duration, so an idle sandbox can still pass this gate with less than the configured minimum remaining and expire during the subsequent ready check unless the sandbox is renewed immediately before it is recorded idle.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real issue, thanks for flagging. The store stamps expiresAt = putIdle_time + idleTimeout, but the server-side TTL clock starts at Sandbox.builder().timeout(idleTimeout) — so any time spent in warmup / warmupSandboxPreparer makes the store overestimate the real server-side remaining TTL.

Fixing this correctly means reading server-truth expiry (e.g. via sandbox.getInfo()) instead of recomputing locally, which is a separate, broader change with cross-language impact. To keep this PR focused on the threshold logic from #983, I would prefer to file a follow-up issue and address it there. Happy to open that immediately if you agree.

The current 60s default still tolerates warmups well below a minute — which is the common case for the code-interpreter template — so it does not regress the user-reported scenario. For long warmups, opting out via Duration.ZERO or sizing acquireMinRemainingTtl against typical warmup duration is the documented escape hatch until the follow-up lands.

require(!acquireHealthCheckPollingInterval.isNegative && !acquireHealthCheckPollingInterval.isZero) {
"acquireHealthCheckPollingInterval must be positive"
}
require(!acquireMinRemainingTtl.isNegative) { "acquireMinRemainingTtl must be non-negative" }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject thresholds that can never admit idle sandboxes

If acquireMinRemainingTtl is configured greater than or equal to idleTimeout, every warmed idle entry will fail the store's strict expiresAt > now + minRemainingTtl check, so each acquire drains the pool's idle IDs, leaves the live sandboxes to expire server-side, and forces direct creation while reconcile keeps replenishing. Please validate this value against idleTimeout (for example requiring it to be less) so a single config mistake cannot make the pool permanently discard all warm sandboxes.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8a6e953. PoolConfig.build() now rejects acquireMinRemainingTtl >= idleTimeout with a descriptive message (and the same guard exists in PoolConfig.__post_init__ on the Python side). New PoolConfigTest.build rejects acquireMinRemainingTtl greater than or equal to idleTimeout and a strict-less-than boundary test cover both sides.

…thon SDK

Addresses bot review feedback on PR opensandbox-group#986 and the remaining acceptance criteria
from opensandbox-group#983 that the first revision deferred.

### Changes vs the previous revision

**Default flipped from `Duration.ZERO` to 60s** so the binary-expiry race
described in the issue is fixed out of the box, matching the issue's first
acceptance criterion. Users who deliberately want the legacy
"any-non-expired-entry" behavior can opt out by setting `Duration.ZERO`
(documented on the field).

**Cross-validated against `idleTimeout`** (`acquireMinRemainingTtl <
idleTimeout`). Without this guard a misconfigured threshold ≥ idleTimeout
would reject every freshly warmed entry and starve the pool. Caught by the
codex bot review — now refused at build() time with a descriptive message.

**Exposed on the public `SandboxPool.Builder`** (forwards to the underlying
`PoolConfig.Builder`). Without this, users of the primary builder could not
override the new threshold without dropping to the lower-level config
builder. Also caught by the codex bot review.

**Reconcile-time eviction of near-expiry entries.** The issue's optional
proposal: `reapExpiredIdle` now also reclaims entries whose remaining TTL
is below the threshold so reconcile triggers replenishment proactively
instead of waiting for full expiry. Implemented as a new
`reapExpiredIdle(poolName, now, minRemainingTtl)` overload on
`PoolStateStore` (default impl falls back to the strict-expiry sweep) so
custom user stores keep compiling. Both the in-memory and Redis stores
override it; the Redis Lua script accepts the threshold via `ARGV[1]`
with a sensible default of 0.

**Python SDK (sync + async, in-memory + Redis).** The issue labels listed
both `sdk/python` and `sdk/java`; this revision implements the same change
across all four Python store classes (`InMemoryPoolStateStore`,
`InMemoryAsyncPoolStateStore`, `RedisPoolStateStore`,
`AsyncRedisPoolStateStore`), both pool implementations, both reconcilers,
and both `PoolConfig` / `AsyncPoolConfig`.

User-defined Python stores that pre-date this change keep working: callers
go through `try_take_idle_with_min_ttl` / `reap_expired_idle_with_min_ttl`
helpers in `pool_types` that detect the new methods via `hasattr` and fall
back to the binary-expiry path otherwise.

### Tests (all green locally)

* Kotlin `:sandbox:test` and `:sandbox-pool-redis:test` — 13/13 in the
  Redis suite verified against a real Redis 7 instance via
  `OPENSANDBOX_TEST_REDIS_URL`.
* Python `pytest` — 192 passed (37 added/touched in pool tests, 0 skipped).
* New unit tests cover, for both stores in both languages: zero/negative
  threshold falls through; entries below threshold are skipped and
  removed from idle; entries above threshold are returned; new
  `reap_expired_idle_min_ttl` evicts near-expiry while preserving entries
  above the threshold.
* New PoolConfig tests cover: default is 60s; negative rejected;
  `>= idleTimeout` rejected; just-below-`idleTimeout` accepted; explicit
  `Duration.ZERO` accepted (opt-out path).

### Deferred (separate follow-up)

The codex bot also flagged that `idleTimeout` starts the server-side TTL
clock at sandbox creation, while the store's `expiresAt` is stamped after
`warmupSandboxPreparer` runs — so the store TTL can over-estimate the
real server-side remaining TTL by the warmup duration. Fixing that
correctly requires reading server-truth expiry (e.g. via
`sandbox.getInfo()`) rather than recomputing locally; treating that as a
separate PR keeps the scope of this one tight on the threshold logic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a6e9539f6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +177 to +180
sandbox_id = _try_take_idle_with_min_ttl(
self._state_store,
pool_name,
self._config.acquire_min_remaining_ttl,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expose acquire_min_remaining_ttl on Python pool facades

When callers use the public SandboxPoolSync/SandboxPoolAsync constructors, there is still no constructor argument forwarded into PoolConfig/AsyncPoolConfig, but acquire now always reads self._config.acquire_min_remaining_ttl here. That means facade users cannot opt out or tune the new threshold; in particular, an existing pool configured with idle_timeout <= 60s now fails config validation because the hidden default threshold is 60s and there is no way to pass a smaller value through these constructors.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in 3f7dddf. Both SandboxPoolSync.__init__ and SandboxPoolAsync.__init__ now accept acquire_min_remaining_ttl (default 60s, matching PoolConfig) and forward it through to the underlying config. Two new facade tests in test_pool_config.py exercise the idle_timeout=30s + acquire_min_remaining_ttl=10s combination end-to-end so the previously broken construction path is now covered.

Codex bot caught that ``SandboxPoolSync`` and ``SandboxPoolAsync``
constructors had explicit kwargs forwarding every other ``PoolConfig``
field, but not the newly added ``acquire_min_remaining_ttl``. With the
default 60s baked into ``PoolConfig`` validation, any pool configured
with ``idle_timeout <= 60s`` would now fail at construction time and
facade users had no way to override.

Adds the kwarg (default 60s, matching ``PoolConfig``) to both
constructors and forwards it through to the underlying config. Adds
two facade tests: building a pool with ``idle_timeout=30s`` and
``acquire_min_remaining_ttl=10s`` succeeds and the values land in the
config as written.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f7dddf9e6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

private const val DEFAULT_DEGRADED_THRESHOLD = 3
private val DEFAULT_ACQUIRE_READY_TIMEOUT = Duration.ofSeconds(30)
private val DEFAULT_ACQUIRE_HEALTH_CHECK_POLLING_INTERVAL = Duration.ofMillis(200)
private val DEFAULT_ACQUIRE_MIN_REMAINING_TTL: Duration = Duration.ofSeconds(60)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid rejecting existing short idle timeouts

When callers keep the new default but already configure idleTimeout to 60 seconds or less, this added 60s default immediately trips the validation below that requires acquireMinRemainingTtl < idleTimeout, so previously valid pools fail during config construction unless every caller learns to pass Duration.ZERO or a smaller value. This is a public SDK behavior break (the Python config adds the same 60s default), so the default should be derived from idleTimeout or kept opt-in to preserve existing short-idle configurations.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — addressed in 4e4f12c. The default is now derived from idleTimeout instead of being a fixed 60s:

  • Builder field is nullable (Duration? in Kotlin / timedelta | None in Python).
  • null resolves at build() time to min(60s, idleTimeout / 2) — always strictly less than idleTimeout, so existing pools with short idle timeouts keep working without code changes.
  • Pass Duration.ZERO to opt out, or any explicit positive value to override.
  • The strict < idleTimeout validation still fires for explicit values, so misconfigurations are still caught.

New tests default scales for short idle_timeout / caps at 60s for long idle_timeout cover both branches in Kotlin and Python.

val now = Instant.now()

stateStore.reapExpiredIdle(poolName, now)
stateStore.reapExpiredIdle(poolName, now, config.acquireMinRemainingTtl)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Kill live sandboxes reaped for min TTL

When acquireMinRemainingTtl is positive, this sweep now removes not-yet-expired idle entries from the store, but unlike the shrink path it does not return those IDs to onDiscardSandbox, so the live sandboxes are left running until their server TTL expires while reconcile immediately creates replacements. With larger configured thresholds this can keep up to a full idle buffer of discarded-but-still-running sandboxes around for a long time and temporarily exceed pool resource/quota expectations.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root cause as the acquire-path leak — addressed in 4e4f12c. reapExpiredIdle(name, now, minTtl) now returns the alive evicted IDs (excluding fully-expired ones, since the server has reaped them). PoolReconciler routes the returned list through the existing onDiscardSandbox callback, which is the same path shrinkExcessIdle uses to fire killSandbox.

The Redis Lua script returns the alive list explicitly so the round-trip survives. New tests cover reapExpiredIdle with minRemainingTtl returns alive evicted entries and excludes already-expired from the alive list, both against in-memory and real-Redis stores.

}
val poolName = config.poolName
val sandboxId = stateStore.tryTakeIdle(poolName)
val sandboxId = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clean up idle sandboxes discarded during acquire

When an acquire runs with a positive acquireMinRemainingTtl, the new store variant consumes every idle ID that is below the threshold before returning an eligible ID or null, but this acquire path only receives the returned ID and has no chance to kill the skipped live sandboxes. In pools that are acquired after entries age past the threshold, those skipped sandboxes are dropped from pool accounting and replaced by direct creates/reconcile while still running until their server-side TTL expires, temporarily exceeding the intended pool size and resource usage.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real bug, addressed in 4e4f12c. The store API now surfaces the discarded-alive IDs to the caller:

  • New TakeIdleResult carrying sandboxId + discardedAliveSandboxIds. tryTakeIdle(name, minTtl) returns it.
  • SandboxPool.acquire calls a new killDiscardedAlive helper on the returned list (best-effort, errors logged and swallowed).
  • Already-expired entries are intentionally excluded from the list — the server has already reaped them, a kill round-trip would be wasted.

Verified by new unit tests in both Kotlin (tryTakeIdle surfaces alive entries below the threshold so callers can kill them / silently drops fully-expired entries) and Python equivalents, against both in-memory and real-Redis stores.

…ingTtl default

Addresses three bot reviews on PR opensandbox-group#986 head (commit 3f7dddf).

### opensandbox-group#1 + opensandbox-group#2 — Resource leak: skipped near-expiry sandboxes are now killed

Previously, when ``acquireMinRemainingTtl`` was positive, both
``tryTakeIdle`` and ``reapExpiredIdle`` silently dropped near-expiry idle
entries from store membership but never told the pool, so the live
sandboxes kept running on the server (consuming quota / cost) until their
own TTL elapsed — temporarily exceeding the intended pool size.

Redesigned the store API to surface those IDs to callers:

* New ``TakeIdleResult`` value type carrying ``sandbox_id`` plus a
  ``discarded_alive_sandbox_ids`` list. ``tryTakeIdle(name, minTtl)`` now
  returns it. Already-expired entries (server has reaped them) are
  intentionally excluded — a kill round-trip would be wasted.
* ``reapExpiredIdle(name, now, minTtl)`` now returns the alive evicted IDs
  (same exclusion rule).
* Both Lua scripts now return arrays so the discarded-alive list survives
  the round-trip; the empty-pool fast-path still returns ``nil`` so clients
  can distinguish "nothing to do" cleanly.

Wired into the pool:

* ``SandboxPool.acquire`` (Kotlin) and ``SandboxPoolSync/Async.acquire``
  (Python) call ``_kill_discarded_alive`` on the returned list. Failures
  are logged and swallowed — the primary acquire outcome must not depend
  on a janitor failure.
* ``PoolReconciler.reapExpiredIdle`` routes the returned list through the
  existing ``onDiscardSandbox`` callback, which already triggers a
  best-effort kill (same path used by ``shrinkExcessIdle``).

### opensandbox-group#3 — Default no longer breaks existing users with short ``idleTimeout``

A 60s default would fail validation (``acquireMinRemainingTtl <
idleTimeout``) for any pool configured with ``idleTimeout <= 60s``,
silently breaking user code on upgrade. Builder field is now nullable:
``null`` resolves at ``build()`` time to ``min(60s, idleTimeout / 2)``.
This is always strictly less than ``idleTimeout``, so no existing pool
construction breaks. Pass ``Duration.ZERO`` to opt out, or any explicit
positive value to override.

The strict ``< idleTimeout`` validation is retained for explicit user
values so misconfigurations (e.g. ``acquireMinRemainingTtl ==
idleTimeout``) are still caught.

### Tests

* Kotlin ``:sandbox:test`` and ``:sandbox-pool-redis:test`` (with real
  Redis 7 via ``OPENSANDBOX_TEST_REDIS_URL``) — 14/14 in the Redis suite.
* Python ``pytest`` — 197 passed.
* New cases verify, in both languages and both stores: alive entries below
  threshold are surfaced; fully-expired entries are silently dropped;
  ``reapExpiredIdle`` excludes already-expired from the alive list; the
  default scales to ``min(60s, idleTimeout/2)``; explicit ``ZERO`` opts
  out; explicit values still get validated against ``idleTimeout``.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@Pangjiping Pangjiping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix for a real race condition. A few findings — one bug, one perf concern, two nits.

Comment thread sdks/sandbox/python/src/opensandbox/sync/pool.py Outdated
Comment thread sdks/sandbox/python/src/opensandbox/pool_async.py Outdated
Four review items from @Pangjiping on 4e4f12c.

### TakeIdleResult is now a data class

Auto-generates ``equals``/``hashCode``/``toString``, so test assertions like
``assertEquals(expected, store.tryTakeIdle(...))`` work and debugger output is
useful. The ``EMPTY`` singleton optimization is preserved (data classes
permit secondary references just fine).

### PoolStateStore defaults: dead-if removed

Both ``tryTakeIdle(name, minTtl)`` and ``reapExpiredIdle(name, now, minTtl)``
default impls had identical ``if (zero-or-negative) { ... } else { ... }``
branches. The branches collapsed to a single unconditional delegate. KDoc
updated to explain why custom stores that predate the methods always see an
empty discarded-alive list.

### Python kill log fires only on success

``_kill_sandbox_best_effort`` now returns ``bool``:

- ``True`` on a confirmed kill
- ``False`` if no manager is available, or if the kill raised (swallowed)

``_kill_discarded_alive`` only emits the ``"Killed near-expiry idle sandbox"``
debug line when the return value is ``True``. Mirrors the Kotlin pattern
where the equivalent log is inside the ``try`` block after ``killSandbox``.

The existing ``Callable[[str], None]`` reconciler callback continues to
accept the method (PEP 484 ``None`` return ≈ "no return constraint").

### Async pool: concurrent kills via asyncio.gather

``SandboxPoolAsync._kill_discarded_alive`` used to ``await`` each kill
sequentially, so a batch of N near-expiry IDs blocked ``acquire()`` on N
serial RPCs. Now uses ``asyncio.gather`` so the calls overlap, then maps
the success bool back to the per-ID debug log.

### Tests added (test_pool_kill_discarded.py)

- ``_kill_sandbox_best_effort`` returns True/False correctly across
  success / failure / missing-manager
- ``_kill_discarded_alive`` debug log fires only for successful kills,
  verified with ``caplog``
- async kill batch finishes well under serial-time (5 × 50ms sleeps
  complete in <150ms) and ``max_in_flight >= 2``

### Test results

Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7
via ``OPENSANDBOX_TEST_REDIS_URL``): all green.
Python ``pytest``: 203 passed (197 + 6 new).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33ee6f3dcb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
val poolName = config.poolName
val sandboxId = stateStore.tryTakeIdle(poolName)
val takeResult = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Renew warm sandboxes before applying min-TTL gate

This min-TTL check uses the store expiry, but putIdle stamps that as put time + idleTimeout while the server TTL starts earlier in buildSandboxFromSpec() before readiness and warmupSandboxPreparer. Fresh evidence after the prior comment is that the current createOneSandbox() still returns the id after the preparer without renewing before the reconciler enqueues it, so when warmup takes time (for example 45s with idleTimeout=120s and acquireMinRemainingTtl=60s) the store can report >60s remaining while the server has far less, letting this gate still hand out a sandbox that expires during acquire/connect. Renew to idleTimeout or stamp the store expiry from server truth before putIdle.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in a170c1acreateOneSandbox / _create_one_sandbox now calls sandbox.renew(config.idleTimeout) after the warmup preparer runs and before returning the id to the reconciler. The store stamp at putIdle is now consistent with the server-side TTL, so the min-TTL gate cannot overestimate remaining TTL by the warmup duration. Kotlin and both Python (sync + async) paths updated.

New tests (test_sync_create_one_sandbox_renews_before_returning_id / async) use a RenewTrackingSandbox fake that records every renew call and assert renewed == [idle_timeout] after warmup completes.

val sandboxId = stateStore.tryTakeIdle(poolName)
val takeResult = stateStore.tryTakeIdle(poolName, config.acquireMinRemainingTtl)
val sandboxId = takeResult.sandboxId
killDiscardedAlive(poolName, takeResult.discardedAliveSandboxIds, source = "acquire")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Connect to the eligible idle before cleanup

When many FIFO entries are below acquireMinRemainingTtl but a later entry is eligible, this call kills every discarded ID synchronously before connecting to the eligible sandbox. In distributed pools with slow kill RPCs or a large idle buffer, that cleanup can consume the candidate's remaining TTL before connect()/renew() runs, so a sandbox that passed the min-TTL gate can still expire during acquisition. Defer this best-effort cleanup until after the selected sandbox is connected/renewed, or run it in the background.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real bug. Fixed in a170c1a:

  • SandboxPool.acquire now stashes discardedAliveSandboxIds in pendingKill and only schedules the cleanup after the candidate is connected and renewed.
  • The cleanup is offloaded to warmupExecutor.submit(...) so the caller does not wait for N kill RPCs (Python sync uses the same ThreadPoolExecutor; Python async uses asyncio.create_task with the task tracked on _warmup_tasks so shutdown can wait for it).
  • Fall-through paths (FAIL_FAST and DIRECT_CREATE) still schedule the cleanup so discarded-alive sandboxes never linger silently.
  • New tests in test_pool_kill_discarded.py (test_sync_schedule_kill_discarded_alive_does_not_block_caller / async equivalent) submit a 100ms-per-call kill manager and assert scheduling latency stays under 50ms / 10ms while the kills still complete on the executor / event loop afterwards.

Pangjiping
Pangjiping previously approved these changes Jun 6, 2026
Copy link
Copy Markdown
Collaborator

@Pangjiping Pangjiping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Well-structured fix for a real production race condition. Good backward-compat story and thorough test coverage across both Kotlin and Python SDKs.

Minor suggestions (non-blocking):

  1. PR description says default is Duration.ZERO but code auto-derives min(60s, idleTimeout/2) — worth updating the description.
  2. Kotlin killDiscardedAlive runs serial kill calls inside acquire() — consider background/async execution like the Python asyncio.gather path.
  3. Consider a changelog note about the default behavioral change (existing users now get 60s threshold instead of binary expiry).

None of these block merge.

@Pangjiping Pangjiping added bug Something isn't working sdks labels Jun 6, 2026
…utIdle

Two bot reviews on 33ee6f3.

### Fix A — connect to the eligible idle before cleanup

Previously ``acquire`` killed every below-threshold-but-still-alive
sandbox synchronously *before* connecting to the chosen candidate. With
slow kill RPCs or a large discarded-alive list, that cleanup could
consume the candidate's remaining TTL — so a sandbox that just passed
``acquireMinRemainingTtl`` could still expire during ``connect()`` /
``renew()``. The original race this PR was meant to fix.

Reordered ``SandboxPool.acquire`` (Kotlin and Python sync+async) to:
1. Take the candidate from the store.
2. Stash ``discardedAliveSandboxIds`` in ``pendingKill``.
3. Connect, run any user health check, ``renew`` for ``sandboxTimeout``.
4. **Then** schedule the kill cleanup, fire-and-forget:
   - Kotlin: ``warmupExecutor.submit`` (falls back to inline if executor
     is missing — better to slow the caller than drop the cleanup).
   - Python sync: ``ThreadPoolExecutor.submit`` on the warmup executor
     (same fallback).
   - Python async: ``asyncio.create_task`` registered on the existing
     ``_warmup_tasks`` set so shutdown can wait on it.

The same scheduling fires on the FAIL_FAST / DIRECT_CREATE fall-through
paths so discarded-alive sandboxes never linger silently.

### Fix B — renew warm sandbox before recording idle expiry

``putIdle`` stamps store-side expiry as ``putIdle_time + idleTimeout``,
but the server-side TTL has been ticking since
``buildSandboxFromSpec()`` /``_build_sandbox_from_spec()`` — through the
readiness wait *and* any ``warmupSandboxPreparer`` work. With a long
preparer (e.g. 45s with ``idleTimeout=120s`` and
``acquireMinRemainingTtl=60s``), the store can report >60s remaining
while the server has far less, letting the min-TTL gate hand out a
sandbox that still expires during acquire/connect.

Added one ``sandbox.renew(config.idleTimeout)`` call right before
returning the id from ``createOneSandbox`` / ``_create_one_sandbox``,
so the server-side TTL is reset to match the store's stamp. Kotlin and
Python sync+async all updated.

### Tests added (test_pool_kill_discarded.py)

* ``_schedule_kill_discarded_alive`` returns immediately for both sync
  (executor-backed) and async (``create_task``); verified by submitting
  a 100ms-delay manager and asserting <50ms / <10ms scheduling latency.
  The kills still complete on the executor / event loop afterwards.
* ``_create_one_sandbox`` calls ``renew(idle_timeout)`` exactly once
  before returning the id — covered for both sync and async pools using
  a ``RenewTracking*Sandbox`` fake that records ``renew`` calls.

### Test results

Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7
via ``OPENSANDBOX_TEST_REDIS_URL``): all green.
Python ``pytest``: 207 passed (203 + 4 new).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI's SDK Tests failed two quality jobs on the previous head:

- ``Kotlin SDK Quality And Tests (sandbox)`` — ``spotlessCheck`` (ktlint
  parser) errored on a KDoc that contained an unbalanced literal
  ``{...}`` followed by ``"..."``. Rewrote the comment as plain prose so
  the parser stops trying to resolve braces inside the docstring. Also
  picked up the Spotless auto-format pass on a few unrelated lines in
  ``InMemoryPoolStateStore`` / ``TakeIdleResult`` (whitespace + line
  break normalization).

- ``Python SDK Quality (sandbox)`` — ``ruff check`` flagged two
  categories from this PR:
  * ``I001`` import ordering in the new helpers in
    ``pool_types`` / sync+async reconcilers / sync+async pool. Fixed
    via ``ruff --fix`` (split into two import groups, alphabetized).
  * ``UP037`` quoted self-referential type annotations in the new
    ``_RenewTracking{Sync,Async}Sandbox`` test fakes. Quotes removed.
  * ``B905`` ``zip()`` without ``strict=`` in the new async
    ``_kill_discarded_alive`` parallel kill loop. Now ``strict=True``.

No source-level behavior change. Existing Python suite still 207/207;
Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7
via ``OPENSANDBOX_TEST_REDIS_URL``) green; ``spotlessCheck`` and
``ruff check`` both clean locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@asiudgufgbukbsa
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @Pangjiping! 🙏

Quick status on the three non-blocking suggestions:

1. PR description outdated — fixed (updated the body just now). Reflects the auto-derived default min(60s, idleTimeout/2), the discarded-alive kill design, the warmup-renew fix, and lists all 7 commits with links.

2. Kotlin killDiscardedAlive runs serial kill calls inside acquire() — this was already addressed in a170c1af (pushed after the LGTM, hence likely missed). SandboxPool.acquire now calls scheduleKillDiscardedAlive(...) after the candidate is connected and renewed, and the cleanup is offloaded to warmupExecutor.submit(...) so the caller does not wait on N kill RPCs (mirrors the Python asyncio.gather / create_task path). Falls back to inline only when no executor is available, e.g. mid-shutdown — better to slow the caller than drop the cleanup. Verified by test_sync_schedule_kill_discarded_alive_does_not_block_caller on the Python side; the Kotlin side is structurally equivalent.

3. Changelog note on default behavioral change — the project doesn't carry per-SDK CHANGELOG files yet (only server/, components/*), so I added a clearly-marked Behavioral change worth noting section to the PR description with a before/after table and the Duration.ZERO opt-out path. If maintainers want it captured elsewhere (e.g. release notes via the PR title prefix or a dedicated doc), happy to add it where it fits the project's release tooling.

Recap of all changes since the dismissed LGTM review

  • a170c1af — Fix A (defer kill until after candidate connect/renew) + Fix B (renew warm sandbox to idleTimeout before putIdle so store stamp matches server-side TTL); 4 new tests.
  • 8d97a64f — fix CI lint failures: ktlint KDoc parse error in RedisPoolStateStore (rewrote a docstring containing literal {...} that confused the parser), ruff I001 import order in pool helpers, UP037 quoted self-references in test fakes, B905 zip(... strict=True) in async kill loop.

Local verification on 8d97a64f:

  • Kotlin :sandbox:test + :sandbox-pool-redis:test (real Redis 7 via OPENSANDBOX_TEST_REDIS_URL) + spotlessCheck — all green.
  • Python pytest — 207 passed; ruff check — clean.

CI jobs on the new head are currently action_required (first-time-contributor gate) — would appreciate a workflow approval when you have a moment so the rerun can show green.

CI's Python SDK Quality (sandbox) job failed at the pyright step with 10
errors after I ignored that lint in earlier rounds. All on this PR's
new code; tests and ruff are unaffected.

### Three categories

**(1) ``timedelta | None`` not assignable to ``timedelta`` (4 errors)**

The auto-default change in 4e4f12c made ``PoolConfig.acquire_min_remaining_ttl``
declared as ``timedelta | None`` (``None`` means "auto-derive from idle_timeout"
at ``__post_init__``). Even though ``__post_init__`` always resolves it to a
concrete ``timedelta`` via ``object.__setattr__``, pyright can't see through
that, so passing it to the helper functions tripped.

Fix: widen the helper signatures to accept ``timedelta | None`` and treat
``None`` as zero. This matches what ``__post_init__`` would do anyway and
removes the need for ``cast`` at every call site.

**(2) ``_kill_sandbox_best_effort`` returns ``bool`` but reconciler wants
``Callable[[str], None]`` (2 errors)**

In 33ee6f3 ``_kill_sandbox_best_effort`` was changed to return ``bool`` so
``_kill_discarded_alive`` could fire the success debug log only on a confirmed
kill. But the reconciler's ``on_discard_sandbox`` parameter is typed
``Callable[[str], None]``, and ``bool`` is not a subtype of ``None``.

Fix: introduce a thin ``_discard_sandbox_callback`` adapter on both pool
facades that calls ``_kill_sandbox_best_effort`` and drops the return. The
reconciler now receives the adapter; the bool is still available to
``_kill_discarded_alive`` for the success-only log.

**(3) ``getattr(store, ...)`` returns ``object`` (4 errors)**

The Python ``hasattr``-based store-compat shim in ``pool_types.py`` uses
``getattr(store, "try_take_idle_min_ttl", None)``, which pyright types as
``object``. Calling that or awaiting it tripped four errors in the helpers.

Fix: ``cast`` each ``getattr`` result to the actual signature
(``Awaitable[TakeIdleResult]`` / ``Iterable[str] | None`` etc.). Behavior
unchanged at runtime — ``callable(method)`` still gates the call.

### Verification

- ``uv run pyright`` — 0 errors
- ``uv run ruff check`` — clean
- ``uv run pytest`` — 207 passed
- Kotlin ``:sandbox:test`` + ``:sandbox-pool-redis:test`` (real Redis 7) +
  ``spotlessCheck`` — all green

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@asiudgufgbukbsa
Copy link
Copy Markdown
Contributor Author

Pushed e5026d10 with the pyright fixes (CI on 8d97a64f flagged 10 typing errors in the new code — narrowed the helper signatures to accept timedelta | None, added a thin _discard_sandbox_callback adapter so the reconciler's Callable[[str], None] still type-checks while _kill_sandbox_best_effort keeps its bool return for the success-only log, and cast-ed the getattr(store, ...) results).

All local checks pass on e5026d10:

  • uv run pyright — 0 errors
  • uv run ruff check — clean
  • uv run pytest — 207 passed
  • Kotlin :sandbox:test + :sandbox-pool-redis:test (real Redis 7 via OPENSANDBOX_TEST_REDIS_URL) + spotlessCheck — all green

@Pangjiping — would appreciate workflow approval when you have a moment so the new run can show green 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working sdks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes

3 participants