Add minimum remaining TTL check on pool acquire to avoid dispensing near-expiry sandboxes #983

@Pangjiping

Problem

When using the distributed sandbox warm pool (e.g. code-interpreter template), pool.acquire() intermittently returns sandboxes that expire during the checkReady phase, causing READY_TIMEOUT failures. The renew call never gets a chance to execute.

User report: "Just acquired it asynchronously, next second it's already expired, check fails, and forceRenew never runs."

Root cause: tryTakeIdle uses a binary expiry check (expiresAt > now) — a sandbox with 1ms remaining TTL still passes. During the subsequent checkReady polling (up to 30s), the sandbox expires server-side and becomes unreachable.
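
The timing gap can be illustrated with a small sketch. The function names below are hypothetical and the 30s window is taken from the checkReady polling limit mentioned above; this is not the SDK's code, only the arithmetic behind the failure.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical mirror of the current binary check: any TTL above zero passes.
fun passesBinaryCheck(expiresAt: Instant, now: Instant): Boolean =
    expiresAt.isAfter(now)

// True only if the sandbox is still alive after a polling window of the given length.
fun outlivesWindow(expiresAt: Instant, now: Instant, window: Duration): Boolean =
    expiresAt.isAfter(now.plus(window))

// A sandbox with 1ms of TTL left is dispensed, yet cannot survive a 30s checkReady poll.
fun demoNearExpiry(): Pair<Boolean, Boolean> {
    val now = Instant.parse("2024-01-01T00:00:00Z")
    val expiresAt = now.plusMillis(1)
    val dispensed = passesBinaryCheck(expiresAt, now)
    val survives = outlivesWindow(expiresAt, now, Duration.ofSeconds(30))
    return dispensed to survives
}
```

Under these assumptions demoNearExpiry() yields (true, false): the entry is handed out but is gone before checkReady can finish.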

Affected code paths:

| Store | Current logic |
| --- | --- |
| InMemoryPoolStateStore.tryTakeIdle | if (entry.expiresAt.isAfter(now)) return sandboxId |
| RedisPoolStateStore TAKE_IDLE_SCRIPT (Lua) | if tonumber(expires_at) > now_ms then return sandbox_id |

Additionally, reconciler.reapExpiredIdle only cleans up already-expired entries — it won't proactively reclaim near-expiry sandboxes to trigger replenishment.

Proposed Solution

1. Add acquireMinRemainingTtl to PoolConfig

data class PoolConfig(
    // ... existing fields
    val acquireMinRemainingTtl: Duration = Duration.ofSeconds(60),
)
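
A self-contained sketch of the new field, with the existing PoolConfig fields omitted. The init validation is an assumption added for illustration and is not part of the proposal; it only guards against a negative duration, which would make the threshold lie in the past.

```kotlin
import java.time.Duration

// Sketch only: the real PoolConfig has additional fields that are not shown in this issue.
data class PoolConfig(
    val acquireMinRemainingTtl: Duration = Duration.ofSeconds(60),
) {
    init {
        // Assumed validation: a negative minimum would accept already-expired sandboxes.
        require(!acquireMinRemainingTtl.isNegative) {
            "acquireMinRemainingTtl must not be negative"
        }
    }
}
```

With this shape, PoolConfig(acquireMinRemainingTtl = Duration.ZERO) would reproduce the current binary expiry check.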

2. Update tryTakeIdle condition

Before:

if (entry.expiresAt.isAfter(now)) return sandboxId

After:

if (entry.expiresAt.isAfter(now.plus(minRemainingTtl))) return sandboxId
// else: discard and continue to next entry
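
The "discard and continue" behavior could look roughly like the sketch below. IdleEntry and InMemoryIdleQueueSketch are hypothetical stand-ins, since the internals of InMemoryPoolStateStore are not shown in this issue; only the threshold comparison is taken from the proposal.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical idle entry; the real store's entry type is not shown in the issue.
data class IdleEntry(val sandboxId: String, val expiresAt: Instant)

class InMemoryIdleQueueSketch {
    private val idle = ArrayDeque<IdleEntry>()

    @Synchronized
    fun putIdle(entry: IdleEntry) {
        idle.addLast(entry)
    }

    // Returns the first idle sandbox that expires after now + minRemainingTtl.
    // Entries that fail the check are dropped from the idle queue, not re-queued.
    @Synchronized
    fun tryTakeIdle(now: Instant, minRemainingTtl: Duration): String? {
        val threshold = now.plus(minRemainingTtl)
        while (true) {
            val entry = idle.removeFirstOrNull() ?: return null
            if (entry.expiresAt.isAfter(threshold)) return entry.sandboxId
            // Near-expiry or expired: discard and continue with the next entry.
        }
    }

    @Synchronized
    fun idleCount(): Int = idle.size
}

// Scenario from the acceptance criteria: TTL = 5s is skipped when minRemainingTtl = 60s.
fun demoSkipNearExpiry(): Pair<String?, Int> {
    val now = Instant.parse("2024-01-01T00:00:00Z")
    val store = InMemoryIdleQueueSketch()
    store.putIdle(IdleEntry("near-expiry", now.plusSeconds(5)))
    store.putIdle(IdleEntry("fresh", now.plusSeconds(600)))
    val taken = store.tryTakeIdle(now, Duration.ofSeconds(60))
    return taken to store.idleCount()
}
```

In this sketch the near-expiry entry is removed as a side effect of the acquire, so it cannot block later calls. Whether the real store should also destroy the discarded sandbox server-side is left open here.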

Redis Lua — before:

if tonumber(expires_at) > now_ms then return sandbox_id

After:

if tonumber(expires_at) > (now_ms + min_remaining_ttl_ms) then return sandbox_id
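
On the caller side, the script needs the extra threshold passed in milliseconds. The sketch below is an assumption about how that might be prepared: takeIdleScriptArgs and its argument order are hypothetical (the script's actual KEYS/ARGV layout is not shown in this issue), and luaConditionMirror simply restates the updated Lua comparison in Kotlin so it can be unit-tested without Redis.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical helper: renders now_ms and min_remaining_ttl_ms as script arguments.
// The argument order is an assumption, not the real script's contract.
fun takeIdleScriptArgs(now: Instant, minRemainingTtl: Duration): List<String> =
    listOf(now.toEpochMilli().toString(), minRemainingTtl.toMillis().toString())

// Kotlin restatement of: tonumber(expires_at) > (now_ms + min_remaining_ttl_ms)
fun luaConditionMirror(expiresAtMs: Long, nowMs: Long, minRemainingTtlMs: Long): Boolean =
    expiresAtMs > nowMs + minRemainingTtlMs
```

The comparison is strict, so an entry whose remaining TTL equals the minimum exactly is also rejected, matching the in-memory isAfter check.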

3. (Optional) Proactive reconciler cleanup

Make reapExpiredIdle also evict idle entries whose remaining TTL is below minRemainingTtl, so the pool replenishes with fresh sandboxes before an acquire has to skip them.
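
A minimal sketch of that reconciler pass, assuming a plain mutable list as the idle set. IdleSandbox and reapNearExpiryIdle are hypothetical names; the real reconciler's data structures and replenishment hook are not shown in this issue.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical idle record used only for this sketch.
data class IdleSandbox(val sandboxId: String, val expiresAt: Instant)

// Removes every idle entry that does not expire after now + minRemainingTtl
// (this covers already-expired entries too) and returns the evicted ids.
fun reapNearExpiryIdle(
    idle: MutableList<IdleSandbox>,
    now: Instant,
    minRemainingTtl: Duration,
): List<String> {
    val threshold = now.plus(minRemainingTtl)
    val evicted = idle.filter { !it.expiresAt.isAfter(threshold) }.map { it.sandboxId }
    idle.removeAll { !it.expiresAt.isAfter(threshold) }
    return evicted
}
```

The caller could use the size of the returned list to decide how many fresh sandboxes to create; that wiring is left out because it depends on the pool's existing replenishment logic.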

Change Scope

| File | Change | Lines |
| --- | --- | --- |
| PoolConfig | Add acquireMinRemainingTtl field + default | ~3 |
| InMemoryPoolStateStore.tryTakeIdle | Update expiry condition | ~2 |
| RedisPoolStateStore TAKE_IDLE_SCRIPT | Update Lua condition | ~1 |
| SandboxPool.acquire / store interface | Pass minRemainingTtl parameter | ~3 |
| Total | | ~10-15 |

Acceptance Criteria

  • acquireMinRemainingTtl defaults to 60s; setting it to Duration.ZERO restores the previous binary expiry check
  • Sandboxes whose remaining TTL does not exceed minRemainingTtl are skipped during acquire (discarded, try next)
  • Skipped near-expiry sandboxes are removed from idle set (not left to block future acquires)
  • Pool replenishment triggered when near-expiry sandboxes are discarded
  • Existing tests pass without modification (backward compatible)
  • New unit test: acquire skips sandbox with TTL = 5s when minRemainingTtl = 60s

Labels

enhancement, sdk, pool, reliability

Priority

High — causes intermittent READY_TIMEOUT in production with no user-side workaround.
