Commit b787f3b

docs: add ISSUE-007 and Step 5.5 for ScheduledExecutorService churn
ISSUE-007 documents that PendingVRpc.monitorDeadline() and per-session heartbeat scheduling both contribute to ScheduledExecutorService heap churn. At ~100ms heartbeat intervals (10 fires/sec/session) and ~1ms vRPC p50, cancelled deadline futures accumulate as zombies in the DelayQueue, inflating O(log n) insert cost for heartbeats. Mitigations ranked from quick fix to long-term solution.

Step 5.5 added to THREADING_REFACTOR_PLAN.md between Steps 5 and 6: switch both heartbeat and deadline monitoring to a pool-internal ScheduledThreadPoolExecutor with setRemoveOnCancelPolicy(true), eliminating zombie accumulation while avoiding new dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e14eb71 commit b787f3b

2 files changed

Lines changed: 474 additions & 0 deletions

java-bigtable/ISSUES.md

Lines changed: 50 additions & 0 deletions
@@ -105,3 +105,53 @@ a recovery mechanism. Consider a typed `VRpcTask` wrapper that enforces the "alw
contract at compile time.

**Files:** `session/VRpcImpl.java`, `middleware/RetryingVRpc.java`

---

## SessionPoolImpl — PendingVRpc

### ISSUE-007: `PendingVRpc.monitorDeadline()` causes `ScheduledExecutorService` heap churn under load

Every RPC that cannot immediately acquire a session goes through `PendingVRpc` and schedules a
deadline-monitoring `ScheduledFuture` on the pool's `ScheduledExecutorService`. Under normal
conditions the future is cancelled almost immediately, because sessions are expected to become
available within ~1 ms at p50. Unless the executor's remove-on-cancel policy is enabled (it is
off by default), `ScheduledFuture.cancel(false)` marks the future cancelled but does **not**
remove it from the executor's underlying delay queue, which is a binary heap. Cancelled futures
remain in the heap until their deadline expires naturally (typically seconds to minutes),
inflating the queue and increasing the O(log n) insert/remove cost of every subsequent schedule
operation.
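
The retention described above can be observed on a stock `ScheduledThreadPoolExecutor`. The following is a minimal sketch rather than code from the repository: the class `ZombieDemo` and the method `zombiesAfter` are hypothetical names, and it assumes the pool's executor behaves like a default-configured `ScheduledThreadPoolExecutor`.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of java-bigtable.
final class ZombieDemo {

    // Schedules `cycles` tasks with a 60 s delay, cancels each one immediately,
    // and reports how many entries are still in the executor's queue afterwards.
    static int zombiesAfter(int cycles) {
        // Default configuration: the remove-on-cancel policy is off.
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        try {
            for (int i = 0; i < cycles; i++) {
                ScheduledFuture<?> deadline =
                    executor.schedule(() -> {}, 60, TimeUnit.SECONDS);
                deadline.cancel(false); // marks the future cancelled; the entry stays queued
            }
            return executor.getQueue().size();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued after cancel: " + zombiesAfter(1_000));
    }
}
```

With the default policy, every cancelled entry should still be counted by `getQueue().size()`, which is the accumulation this issue describes.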

**Why this matters at the given operating point:**

Both `PendingVRpc.monitorDeadline()` and per-session heartbeat scheduling hit the same
`ScheduledExecutorService`, and their effects compound.

**Heartbeat pressure:** At ~100 ms per heartbeat, each session fires 10 `schedule()` calls/sec.
Heartbeat tasks run and reschedule; they are never cancelled, so they produce no zombies, but
they sustain O(log n) heap churn at 10N ops/sec for N sessions. This is not a low frequency in
context: at a vRPC p50 of ~1 ms, a heartbeat fires every ~100 vRPCs, so background scheduling
is a steady, recurring cost alongside per-RPC work rather than a rare event.

**Deadline monitor zombie accumulation:** Because `cancel(false)` leaves the cancelled future
in the `DelayQueue`, each deadline future occupies a heap slot for its full deadline. At ~1 ms
p50 session-wait, a deadline future (say, 60 s) is created and cancelled almost immediately.
At 10 000 RPC/s with 10 % transiently pending, 1 000 futures/sec are added and each lives
~60 s, giving a steady-state zombie count of ~60 000 entries (1 000/s × 60 s).

**Compounding:** Heartbeat inserts pay O(log n) against a queue inflated by zombie deadline
futures. With 10 sessions and 60 000 zombies, each heartbeat insert costs up to
log₂(60 010) ≈ 16 comparisons instead of log₂(10) ≈ 3. The absolute cost is small today, but
the zombie count grows linearly with RPC throughput and deadline length, and the number of
heartbeat inserts grows linearly with session count.
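
The arithmetic behind the zombie count and the comparison estimate can be reproduced with a few lines of Java. This is a back-of-envelope sketch only: `ChurnEstimate` and its methods are hypothetical helpers, the inputs are the illustrative figures quoted above, and the comparison count is the worst-case sift-up depth of a binary heap, not a measurement.

```java
// Hypothetical helper for the back-of-envelope numbers; not repository code.
final class ChurnEstimate {

    // Little's law: entries in the queue = arrival rate x time each entry stays.
    static long steadyStateZombies(double rpcPerSec, double pendingFraction, double deadlineSec) {
        return Math.round(rpcPerSec * pendingFraction * deadlineSec);
    }

    // Worst-case comparisons for one binary-heap insert: log2 of the queue size.
    static double heapComparisons(long queueSize) {
        return Math.log(queueSize) / Math.log(2);
    }

    public static void main(String[] args) {
        int sessions = 10;
        long zombies = steadyStateZombies(10_000, 0.10, 60);
        System.out.println("zombies = " + zombies);
        System.out.println("inflated queue comparisons = " + heapComparisons(zombies + sessions));
        System.out.println("zombie-free comparisons    = " + heapComparisons(sessions));
    }
}
```

Changing the deadline or the pending fraction shows how the queue size scales linearly with both, while the per-insert cost moves only logarithmically.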

**Mitigations (in increasing order of impact):**

1. **Short-circuit**: if the deadline has already expired when `PendingVRpc.start()` is called,
   reject immediately without scheduling a future.
2. **`setRemoveOnCancelPolicy(true)`** on the `ScheduledThreadPoolExecutor`: removes cancelled
   tasks from the queue eagerly in O(log n), which eliminates zombie accumulation. Requires
   controlling the executor construction, because the `ScheduledExecutorService` interface does
   not expose this setter.
3. **Hashed wheel timer** (e.g., Netty's `HashedWheelTimer`): O(1) insert and O(1) cancel with
   no zombie accumulation. Neither deadline monitoring nor heartbeat checking requires
   sub-millisecond precision; a wheel tick of ~10 ms is appropriate for both. This is the right
   long-term fix for both sources of churn.
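
Mitigations 1 and 2 can be combined in a small wrapper around an executor that the pool constructs itself. The sketch below is an illustration under assumptions, not the repository's implementation: `DeadlineMonitorSketch`, `monitorDeadline(long, Runnable)`, and `queuedTasks()` are hypothetical names, and the deadline is assumed to be expressed as a `System.nanoTime()` value.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of mitigations 1 and 2; names are not from the repository.
final class DeadlineMonitorSketch {
    private final ScheduledThreadPoolExecutor scheduler;

    DeadlineMonitorSketch() {
        // The owner constructs the executor itself so it can set the policy.
        scheduler = new ScheduledThreadPoolExecutor(1);
        scheduler.setRemoveOnCancelPolicy(true); // mitigation 2: cancel removes the entry
    }

    // Returns the scheduled future, or null if the deadline had already passed
    // and onExpired was run inline instead of being scheduled (mitigation 1).
    ScheduledFuture<?> monitorDeadline(long deadlineNanos, Runnable onExpired) {
        long remainingNanos = deadlineNanos - System.nanoTime();
        if (remainingNanos <= 0) {
            onExpired.run();
            return null;
        }
        return scheduler.schedule(onExpired, remainingNanos, TimeUnit.NANOSECONDS);
    }

    // Number of entries currently held in the executor's queue.
    int queuedTasks() {
        return scheduler.getQueue().size();
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

A caller would cancel the returned future once a session is acquired; with the policy enabled, the cancelled entry leaves the queue at that point instead of waiting for its deadline. Returning `null` for the expired case is a simplification for the sketch, and a real implementation might prefer a completed or no-op future.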

**Files:** `session/SessionPoolImpl.java` (`PendingVRpc.monitorDeadline()`) and
`session/SessionImpl.java` (`scheduleHeartbeatCheck()`)
