Commit da3872f
Fix Reader Thread Scaling Bottlenecks due to PausePoint Mutex (valkey-io#1037)
## Fix Reader Thread Scaling Bottleneck: PausePoint Mutex Contention
## Problem
Adding reader threads **degrades** performance instead of improving it:
| Reader Threads | TEXT 10K (req/s) | NUMERIC 10K (req/s) | TAG 10K (req/s) |
|:-:|:-:|:-:|:-:|
| 1 | 1,195 | 2,229 | 1,588 |
| 2 | 538 (↓55%) | 573 (↓74%) | 578 (↓64%) |
| 4 | 415 (↓65%) | 409 (↓82%) | 377 (↓76%) |
| 8 | 440 (↓63%) | 510 (↓77%) | 465 (↓71%) |
| 16 | 459 (↓62%) | 523 (↓77%) | 433 (↓73%) |
Throughput drops 2.2x to 3.9x at 2 threads and stays flat from 4 threads
up. Adding reader threads is actively harmful.
## Root Cause (perf proof)
```
Baseline perf profile (8 reader threads, TEXT 10K):
61.28% absl::Mutex::Lock
11.17% absl::Mutex::Unlock
4.09% SearchNonVectorQuery
2.54% InternedString::DecrementRefCount
```
`BACKGROUND_PAUSEPOINT` is called multiple times per query on every
reader thread. It unconditionally calls `PausePoint()`, which locks a
single global `absl::Mutex`. All threads serialize on this lock: `Lock`
and `Unlock` together account for about 72% of CPU time (61.28% + 11.17%).
## Fix
Skip the mutex entirely when no pause points are registered (the normal
production case). A plain `bool` flag is set to `true` only when
`FT._DEBUG` registers a pause point. The macros check this flag inline
before calling into the mutex-protected path.
```cpp
// BEFORE: always calls PausePoint() → always locks mutex
#define BACKGROUND_PAUSEPOINT(name) \
  if (!vmsdk::IsMainThread()) { \
    PAUSEPOINT(name); \
  }

// AFTER: skip entirely when no pause points are active
#define BACKGROUND_PAUSEPOINT(name) \
  do { \
    if (vmsdk::debug::pause_points_enabled && !vmsdk::IsMainThread()) { \
      PAUSEPOINT(name); \
    } \
  } while (false)
```
The flag is non-atomic — it's only written by the main thread during
`FT._DEBUG` commands and read by reader threads. This is safe because:
- False negatives (reading stale `false` after a pause point is set)
only delay the pause point taking effect by one query — acceptable for a
debug tool.
- False positives cannot occur — once set to `true`, it stays `true`.
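The fast-path idea can be sketched in isolation. The snippet below is a minimal illustration, not the commit's code: the `sketch` namespace, `ShouldPause`, `RegisterPausePoint`, `SlowPathIsRegistered`, and the `slow_path_calls` counter are all hypothetical names, and `std::mutex` stands in for `absl::Mutex`. It also swaps the plain `bool` for a relaxed `std::atomic<bool>`, which on mainstream hardware compiles to an ordinary load while avoiding the formal data race of an unsynchronized read and write.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <set>
#include <string>

namespace sketch {

// Hypothetical stand-ins for the pause point registry described above.
std::atomic<bool> pause_points_enabled{false};
std::mutex registry_mutex;
std::set<std::string> registry;      // guarded by registry_mutex
std::atomic<int> slow_path_calls{0}; // instrumentation for this sketch only

// Slow path: takes the global lock on every call.
bool SlowPathIsRegistered(const std::string& name) {
  slow_path_calls.fetch_add(1, std::memory_order_relaxed);
  std::lock_guard<std::mutex> lock(registry_mutex);
  return registry.count(name) > 0;
}

// Fast path: a single relaxed load and no lock while nothing is registered.
bool ShouldPause(const std::string& name) {
  if (!pause_points_enabled.load(std::memory_order_relaxed)) return false;
  return SlowPathIsRegistered(name);
}

// Debug-command path: insert the entry under the lock, then raise the flag.
// The flag is never lowered again in this sketch.
void RegisterPausePoint(const std::string& name) {
  {
    std::lock_guard<std::mutex> lock(registry_mutex);
    registry.insert(name);
  }
  pause_points_enabled.store(true, std::memory_order_relaxed);
}

}  // namespace sketch
```

Until `RegisterPausePoint` runs, `ShouldPause` never touches `registry_mutex`, which is the property the macro change relies on. A reader that still sees a stale `false` simply skips the pause point on that call.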
## Results
Benchmark: 500 clients, 30 seconds per test, core-pinned (server cores
0-31, client cores 32-63).
### TEXT — 10K matches (req/s)
| Reader Threads | Baseline | + PausePoint Fix | vs Baseline |
|:-:|:-:|:-:|:-:|
| 1 | 1,195 | 1,581 | 1.3x |
| 2 | 538 | 3,012 | 5.6x |
| 4 | 415 | 5,633 | 13.6x |
| 8 | 440 | 7,564 | 17.2x |
| 16 | 459 | 9,068 | 19.8x |
### NUMERIC — 10K matches (req/s)
| Reader Threads | Baseline | + PausePoint Fix | vs Baseline |
|:-:|:-:|:-:|:-:|
| 1 | 2,229 | 4,094 | 1.8x |
| 2 | 573 | 7,047 | 12.3x |
| 4 | 409 | 13,702 | 33.5x |
| 8 | 510 | 18,208 | 35.7x |
| 16 | 523 | 10,447 | 20.0x |
### TAG — 10K matches (req/s)
| Reader Threads | Baseline | + PausePoint Fix | vs Baseline |
|:-:|:-:|:-:|:-:|
| 1 | 1,588 | 2,336 | 1.5x |
| 2 | 578 | 3,245 | 5.6x |
| 4 | 377 | 5,092 | 13.5x |
| 8 | 465 | 9,354 | 20.1x |
| 16 | 433 | 10,912 | 25.2x |
### 100 matches — no regression
| Reader Threads | TEXT Baseline | TEXT Fixed | NUMERIC Baseline | NUMERIC Fixed | TAG Baseline | TAG Fixed |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 86,671 | 85,061 | 100,574 | 95,488 | 91,206 | 95,206 |
| 2 | 59,449 | 84,228 | 63,224 | 92,685 | 61,895 | 94,552 |
| 4 | 44,872 | 80,494 | 40,106 | 79,978 | 40,269 | 89,594 |
| 8 | 44,661 | 72,067 | 46,526 | 78,173 | 45,405 | 78,578 |
| 16 | 45,542 | 72,358 | 46,547 | 74,804 | 42,161 | 76,501 |
Light queries also improve significantly at 2+ threads (no longer
serializing on the mutex).
## After Fix — Perf Profile (16T, TEXT 10K)
```
44.05% InternedString::DecrementRefCount
14.80% TermIterator::InsertValidKeyIterator
12.54% TermIterator::FindMinimumValidKey
10.78% SearchNonVectorQuery
0.00% absl::Mutex::Lock
```
Zero mutex contention. The next bottleneck is `DecrementRefCount` at 44%
— atomic ref counting on `InternedStringPtr` during iteration causes
cache-line bouncing across cores. I am addressing this in a follow-up.
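To illustrate why reference counting shows up during iteration, here is a small sketch that uses `std::shared_ptr` as a stand-in for `InternedStringPtr` (an assumption; the project's type may behave differently). The function names are hypothetical, and this is not necessarily the approach the follow-up will take.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

namespace sketch {

// Stand-in for InternedStringPtr: a reference-counted handle to a string.
using SharedStr = std::shared_ptr<const std::string>;

// Copies the handle for every element. Each iteration increments the shared
// count on copy and decrements it on destruction; in a multithreaded program
// these are atomic read-modify-write operations on memory that every thread
// holding the same string also writes to.
std::size_t TotalLengthCopying(const std::vector<SharedStr>& keys) {
  std::size_t total = 0;
  for (SharedStr key : keys) total += key->size();
  return total;
}

// Borrows each handle by const reference. The reference count is never
// modified, so the loop only reads shared memory.
std::size_t TotalLengthBorrowing(const std::vector<SharedStr>& keys) {
  std::size_t total = 0;
  for (const SharedStr& key : keys) total += key->size();
  return total;
}

}  // namespace sketch
```

Both functions return the same result; they differ only in whether the loop writes to the shared counter.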
## Impact
- Reader thread scaling works as intended — performance improves with
additional threads instead of degrading.
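- Heavy queries (10K matches) improve 19.8x to 25.2x at 16 threads vs
baseline, with a peak of 35.7x (NUMERIC at 8 threads).
- No regression at 1 thread for 10K-match queries (1.3x to 1.8x faster).
At 100 matches, 1-thread throughput stays within about 5% of baseline
(TEXT -1.9%, NUMERIC -5.1%, TAG +4.4%).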
- Light queries (100 matches) also improve at 2+ threads.
---------
Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>

1 parent 7cb0d02 commit da3872f
2 files changed
Lines changed: 23 additions & 6 deletions