RUBY-3364 fix connection pool thread starvation under oversubscription #3028
comandeo-mongo merged 8 commits into mongodb:master from
Conversation
Drives the pool under high thread:connection oversubscription and reports per-thread service counts and wait-time distribution. Used to diagnose thread starvation under contention. Kept as a permanent diagnostic tool under profile/ alongside driver_bench. Tunable via THREADS, POOL_SIZE, DURATION_SEC, WAIT_TIMEOUT env vars. Requires a running cluster at MONGODB_URI (defaults to local replica set).
Pull request overview
Addresses RUBY-3364 by improving fairness in Mongo::Server::ConnectionPool under heavy thread oversubscription, aiming to prevent long-lived starvation cycles.
Changes:
- Add per-gate waiter counters (`@size_waiters`, `@max_connecting_waiters`) and incorporate them into checkout gating logic.
- Add a new RSpec regression test to detect starvation/unfair service distribution under oversubscription.
- Add a profiling harness to reproduce and measure pool fairness characteristics outside CI.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lib/mongo/server/connection_pool.rb | Introduces waiter counters and modifies checkout gate loops to reduce mutex barging/starvation. |
| spec/mongo/server/connection_pool_fairness_spec.rb | Adds a regression spec asserting no checkout timeouts and roughly even per-thread service. |
| profile/connection_pool_fairness.rb | Adds a standalone profiling script to measure fairness and wait-time distributions. |
```diff
 until (max_size == 0 || unavailable_connections < max_size) && !must_wait
   wait = deadline - Utils.monotonic_time
   raise_check_out_timeout!(connection_global_id) if wait <= 0
-  @size_cv.wait(wait)
+  @size_waiters += 1
+  begin
+    @size_cv.wait(wait)
+  ensure
```
Forcing a @size_cv.wait solely because @size_waiters > 0 can create a lost-wakeup/spurious-timeout scenario: a thread may enter the wait even though unavailable_connections < max_size is already true, and if no further @size_cv.signal occurs (e.g., multiple check-ins happened before this thread began waiting, leaving capacity available but no subsequent signals), it will sleep until deadline and raise a checkout timeout despite capacity being available. This fairness mechanism likely needs a handoff/ticket-style turnstile (or some form of waiter-to-waiter signaling) so a “queue tax” wait does not depend on an external signal when the predicate is already satisfied.
Good catch. This is a real lost-wakeup: a newcomer entering during the wake-up window (observing a stale @size_waiters > 0 before the waking waiter has decremented it) can be parked on @size_cv with capacity available, and no further signal is guaranteed.
Fixed in b7b50f2: after a successful checkout updates @checked_out_connections, signal @size_cv if there is a waiter AND the gate predicate is satisfied for them. The predicate guard matters — an unconditional signal would wake a waiter whose predicate still fails, causing it to re-queue itself at the back of the FIFO and break the wait-queue-fairness CMAP spec test.
Reproduction test added in fa55ebd (white-box, instruments the pool CV to force the race window open). Verified it fails on the prior commit and passes after the fix.
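For reference, a minimal sketch of the baton-pass described above, assuming the pool's `@lock` is held at the point the signal is issued. The helper name is hypothetical; in the PR the logic lives inline in the checkout path.

```ruby
# Hypothetical helper condensing the fix in b7b50f2: after a successful
# checkout updates @checked_out_connections, pass the baton to the next
# size-gate waiter only if its predicate would already hold, so it does
# not wake, fail the predicate, and requeue at the back of the FIFO.
def signal_size_gate_if_satisfied
  return if @size_waiters.zero?
  return unless max_size.zero? || unavailable_connections < max_size

  @size_cv.signal
end
```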
```diff
@@ -1353,10 +1375,16 @@ def wait_for_connection(connection_global_id, deadline)
     decrement_connection_requests_and_signal
     raise_check_out_timeout!(connection_global_id)
   end
-  @max_connecting_cv.wait(wait)
+  @max_connecting_waiters += 1
+  begin
```
The same must_wait pattern on @max_connecting_cv can block even when a connection is already available (@available_connections.any?), and can similarly lead to a spurious timeout if the thread starts waiting after the last relevant signal and no further signals occur. Consider ensuring the fairness mechanism does not require an additional CV signal when the gate predicate is already satisfied (e.g., via an explicit FIFO/turnstile handoff among waiters).
Same fix applied to @max_connecting_cv in wait_for_connection. Commit b7b50f2.
Predicate-guarded signal (@available_connections.any? || @pending_connections.length < @max_connecting) avoids the FIFO-reordering side effect that would otherwise surface on the CMAP wait-queue-fairness spec test.
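The analogous guard for the connecting gate, using the predicate quoted above (again, the helper name is hypothetical and the caller is assumed to hold `@lock`):

```ruby
# Hypothetical helper: wake a max_connecting waiter only when the gate
# predicate already holds for it, mirroring the size-gate baton pass.
def signal_max_connecting_gate_if_satisfied
  return if @max_connecting_waiters.zero?
  return unless @available_connections.any? ||
                @pending_connections.length < @max_connecting

  @max_connecting_cv.signal
end
```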
```ruby
# Drain queue into array.
rows = []
rows << results.pop until results.empty?
```
This harness enqueues a full per-operation row [tid, t1, t2, dur, err] into results for the entire run, then drains it into rows. Under fast operations (or longer durations), this can grow to a very large number of Ruby objects and can OOM the process, making the diagnostic unreliable. Consider aggregating statistics online (counts/histograms per thread) and only optionally storing full rows behind an env flag or sampling rate.
Noted. Deferring to a follow-up. At the default workload (200 threads × 30s ≈ 120k rows, ~24 MB of Ruby objects) the harness is fine; it would need online aggregation for multi-minute runs. Since this script lives under profile/ and is run by humans for diagnostic purposes rather than in CI, the memory ceiling is acceptable for now. I will add a header note.
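One possible shape for that follow-up, sketched here with illustrative bucket edges (not part of this PR): keep O(threads) state by aggregating per thread online instead of queueing one row per operation.

```ruby
# Illustrative online aggregation: each worker owns a Stats instance
# (no locking needed) and results are merged after the threads join,
# replacing the shared `results` queue. Bucket edges are arbitrary.
BUCKETS = [0.001, 0.01, 0.1, 1.0, 10.0].freeze # seconds

Stats = Struct.new(:count, :errors, :total, :max, :histogram) do
  def record(duration, error)
    self.count += 1
    self.errors += 1 if error
    self.total += duration
    self.max = duration if duration > max
    histogram[BUCKETS.index { |edge| duration <= edge } || BUCKETS.size] += 1
  end

  def self.fresh
    new(0, 0, 0.0, 0.0, Array.new(BUCKETS.size + 1, 0))
  end
end
```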
```ruby
# satisfied. Without this, re-entering threads (those that just
# checked a connection back in) barge past existing waiters and
# the 195 blocked threads in a 200:5 scenario never wake.
must_wait = @size_waiters > 0
```
The new must_wait gating can block even when max_size == 0 (unlimited pool). With max_size == 0, the size gate should never wait; however the current condition still waits whenever @size_waiters > 0, which can introduce unnecessary blocking and even checkout timeouts in an otherwise-unbounded pool. Consider bypassing the must_wait logic entirely when max_size == 0 (or folding it into must_wait so it is always false in this case).
```diff
-must_wait = @size_waiters > 0
+# Do not apply this gate to unlimited pools, where max_size == 0
+# means there is no size slot to wait for.
+must_wait = max_size != 0 && @size_waiters > 0
```
Applied: must_wait = max_size != 0 && @size_waiters > 0. Commit b7b50f2.
max_connecting does not have an unlimited sentinel (it is always a positive integer), so no parallel change is needed for the second gate.
Copilot review on PR mongodb#3028 identified two correctness bugs in the initial anti-barging fix:

1. Lost wakeup (comments #1/#2): When a successful waiter leaves the size gate, it does not signal the next waiter. A newcomer that arrived during the wake-up window (observing a stale size_waiters > 0 before the waking waiter has decremented it) can be parked on size_cv with capacity already available, and will not be woken by any subsequent event unless an external check-in happens. Symptom: the newcomer waits for the full wait_queue_timeout despite capacity being free, even under light load.

   Fix: after a successful checkout (once @checked_out_connections is updated), signal size_cv if there are waiters AND the gate predicate is satisfied for them. The predicate guard matters: signaling unconditionally would wake a waiter whose predicate still fails, causing it to requeue itself at the back of the FIFO and break the wait-queue-fairness CMAP spec test. The same pattern is mirrored in wait_for_connection for @max_connecting_cv.

2. Unlimited pool (comment mongodb#4): must_wait fired even when max_size == 0, introducing pointless blocking for unbounded pools where there is no size constraint to wait on. Skip the gate in that case.

Validation:
- New white-box regression test reproduces the lost wakeup deterministically by instrumenting size_cv.wait to release the lock during the wake-up window.
- spec/spec_tests/cmap_spec.rb wait-queue-fairness case (CMAP unified test): 16 examples, 0 failures.
- spec/mongo/server/connection_pool_spec.rb and spec/mongo/server/connection_pool/: 115 examples, 0 failures, 3 pending.
- spec/spec_tests/cmap_spec.rb (full): 226 examples, 0 failures.
- profile/connection_pool_fairness.rb at 200:5 ratio: 0 timeouts, even distribution.
Deterministically reproduces the lost-wakeup scenario Copilot identified on PR mongodb#3028 by monkey-patching size_cv.wait on the pool's own CV instance to release the mutex for 300 ms after the first wait returns. That window lets a newcomer enter while @size_waiters is still stale (not yet decremented), queue up on size_cv, and verify it is woken by the baton-pass signal rather than waiting for wait_queue_timeout.

Without the fix: newcomer latency == wait_queue_timeout (test fails). With the fix: newcomer latency << wait_queue_timeout (test passes).
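A sketch of that instrumentation technique, with the pool internals accessed by the instance-variable names used elsewhere in this PR (the actual spec code differs):

```ruby
# Open the race window: after the first wait returns, drop the pool
# lock for 300 ms so a newcomer can observe the stale @size_waiters
# count and park on size_cv before the waking waiter decrements it.
size_cv = pool.instance_variable_get(:@size_cv)
lock    = pool.instance_variable_get(:@lock)

original_wait = size_cv.method(:wait)
window_opened = false

size_cv.define_singleton_method(:wait) do |*args|
  result = original_wait.call(*args)
  unless window_opened
    window_opened = true
    lock.unlock # release the mutex the waiter just re-acquired
    sleep 0.3
    lock.lock
  end
  result
end
```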
Copilot review on PR mongodb#3028 flagged that the harness accumulates every operation into memory. At default settings the memory footprint is modest, but long runs could grow unbounded. Documented in the header with guidance to switch to online aggregation for multi-minute runs.
Summary
Fixes thread starvation in `Mongo::Server::ConnectionPool` when worker threads significantly outnumber the pool size.
Root cause: mutex barging at the checkout predicate gate. A thread returning from `do_check_in` re-enters `retrieve_and_connect_connection` while `unavailable_connections < max_size` is still true, bypasses the wait loop entirely, and takes the freshly released connection before any thread blocked in `@size_cv.wait` can re-acquire `@lock`. With 200 threads and a 5-connection pool, 5 threads form a closed cycle and the other 195 starve for the entire run (each completes 2 ops and hits the wait timeout once).
This is not non-FIFO condition variable behavior: MRI's `::ConditionVariable` signals waiters in insertion order. The barging happens at the mutex re-acquisition boundary, not inside the CV. The full diagnosis and the rationale for the minimal fix (vs. a larger per-waiter-CV redesign) are in the design doc that accompanies this change.
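To make the mechanism concrete, here is a toy illustration (not driver code) of barging at the mutex re-acquisition boundary:

```ruby
# cv.signal makes a waiter runnable, but the waiter still has to win
# the mutex; a thread that just released a slot and immediately takes
# again can grab it first, re-parking the waiter indefinitely.
lock  = Mutex.new
cv    = ConditionVariable.new
slots = 1

take = lambda do
  lock.synchronize do
    cv.wait(lock) while slots.zero? # re-checks predicate after waking
    slots -= 1
  end
end

give_back = lambda do
  lock.synchronize do
    slots += 1
    cv.signal # waiter is runnable but holds nothing yet
  end
  # Calling take.call right here lets this thread see slots == 1 and
  # take the slot before the signaled waiter re-acquires lock: the
  # closed 5-thread cycle described above.
end
```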
Fix
Introduce per-gate waiter counters (`@size_waiters`, `@max_connecting_waiters`) and require any thread that arrives with a non-zero counter to enter the wait loop at least once, regardless of whether the predicate is currently satisfied. Once a thread has served one wait cycle it is eligible on the next iteration.
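Condensed, the size gate now has this shape (simplified from the diffs above; the `max_connecting` gate mirrors it):

```ruby
# Simplified size-gate loop after this change. A thread arriving while
# others wait (must_wait) takes one wait cycle even if capacity looks
# free, so it cannot barge past the FIFO of existing waiters.
must_wait = max_size != 0 && @size_waiters > 0

until (max_size == 0 || unavailable_connections < max_size) && !must_wait
  wait = deadline - Utils.monotonic_time
  raise_check_out_timeout!(connection_global_id) if wait <= 0

  @size_waiters += 1
  begin
    @size_cv.wait(wait)
  ensure
    @size_waiters -= 1
  end

  must_wait = false # one served wait cycle makes this thread eligible
end
```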
Preserves the existing architecture and synchronization primitives; no changes to `do_check_in` signaling or `@connection_requests` counting.

Validation
All existing suites pass against a local replica set:
- `spec/mongo/server/connection_pool_spec.rb` + `connection_pool/` subdir: 114 examples, 0 failures, 3 pending
- `spec/spec_tests/cmap_spec.rb`: 226 examples, 0 failures
- `spec/integration/connection_pool_populator_spec.rb` + `connection_spec.rb`: 23 examples, 0 failures, 10 pending
- `bundle exec rubocop lib/mongo/server/connection_pool.rb spec/mongo/server/connection_pool_fairness_spec.rb`: clean

Changes
- `lib/mongo/server/connection_pool.rb`: add `@size_waiters`/`@max_connecting_waiters` counters; gate both checkout waits on them
- `spec/mongo/server/connection_pool_fairness_spec.rb`: new regression test asserting per-thread service-count ratio and zero checkout timeouts under oversubscription. Verified to fail on the pre-fix code (ratio = 0.0, min = 1, max ≈ 4000) and pass on the fixed code (ratio >= 0.99).
- `profile/connection_pool_fairness.rb`: fairness diagnostic harness kept under `profile/` alongside `driver_bench`. Used to reproduce RUBY-3364 and to validate the fix; not run in default CI.

Test Plan
- Run `profile/connection_pool_fairness.rb` locally and confirm even distribution
- Confirm the `ratio >= 0.5` threshold in the regression spec is appropriate for all CI variants (local observation is `>= 0.99`; 0.5 was chosen to absorb CI noise)