Fix Reader Thread Scaling Bottlenecks due to PausePoint Mutex (valkey-io#1037)

KarthikSubbarao · web-flow · commit da3872fc8b93 · 2026-05-16T12:22:02.000-07:00
## Fix Reader Thread Scaling Bottleneck: PausePoint Mutex Contention

## Problem

Adding reader threads **degrades** performance instead of improving it:

| Reader Threads | TEXT 10K (req/s) | NUMERIC 10K (req/s) | TAG 10K (req/s) |
|:-:|:-:|:-:|:-:|
| 1 | 1,195 | 2,229 | 1,588 |
| 2 | 538 (↓55%) | 573 (↓74%) | 578 (↓64%) |
| 4 | 415 (↓65%) | 409 (↓82%) | 377 (↓76%) |
| 8 | 440 (↓63%) | 510 (↓77%) | 465 (↓71%) |
| 16 | 459 (↓62%) | 523 (↓77%) | 433 (↓73%) |

Throughput drops roughly 2-4x (55-74%) at 2 threads and stays at about
that level from 4 threads up. Reader threads are actively harmful.

## Root Cause (perf proof)

```
Baseline perf profile (8 reader threads, TEXT 10K):
    61.28%  absl::Mutex::Lock
    11.17%  absl::Mutex::Unlock
     4.09%  SearchNonVectorQuery
     2.54%  InternedString::DecrementRefCount
```

`BACKGROUND_PAUSEPOINT` is called multiple times per query on every
reader thread. It unconditionally calls `PausePoint()`, which locks a
single global `absl::Mutex`. All threads serialize on this lock: about
72% of CPU time (61.28% + 11.17%) goes to `absl::Mutex::Lock` and
`absl::Mutex::Unlock`.

## Fix

Skip the mutex entirely when no pause points are registered (the normal
production case). A plain `bool` flag is set to `true` only when
`FT._DEBUG` registers a pause point. The macros check this flag inline
before calling into the mutex-protected path.

```cpp
// BEFORE: always calls PausePoint() → always locks mutex
#define BACKGROUND_PAUSEPOINT(name) \
  if (!vmsdk::IsMainThread()) {     \
    PAUSEPOINT(name);               \
  }

// AFTER: skip entirely when no pause points are active
#define BACKGROUND_PAUSEPOINT(name)                                     \
  do {                                                                  \
    if (vmsdk::debug::pause_points_enabled && !vmsdk::IsMainThread()) { \
      PAUSEPOINT(name);                                                 \
    }                                                                   \
  } while (false)
```
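
The effect of the flag check can be illustrated in isolation. The following is a minimal sketch, not code from this change: it uses `std::mutex` in place of `absl::Mutex`, and the `sketch_fastpath` namespace, the `lock_acquisitions` counter, and the `CountLocks` helper are hypothetical names introduced only to count how often the global lock is taken with and without the check.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

namespace sketch_fastpath {

// Hypothetical stand-ins for the real vmsdk::debug state.
std::mutex pause_point_lock;
bool pause_points_enabled = false;  // never written while readers run here
std::atomic<std::uint64_t> lock_acquisitions{0};

// Slow path: always takes the single global lock.
void PausePointSlow() {
  std::lock_guard<std::mutex> guard(pause_point_lock);
  lock_acquisitions.fetch_add(1, std::memory_order_relaxed);
}

// "Before" behavior: every call reaches the lock.
void PausePointBefore() { PausePointSlow(); }

// "After" behavior: the lock is skipped unless the flag is set.
void PausePointAfter() {
  if (pause_points_enabled) {
    PausePointSlow();
  }
}

// Calls `fn` `calls` times on each of `threads` threads and returns how
// many times the global lock was acquired in total.
std::uint64_t CountLocks(void (*fn)(), int threads, int calls) {
  assert(threads > 0 && calls > 0);
  lock_acquisitions.store(0);
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([fn, calls] {
      for (int i = 0; i < calls; ++i) fn();
    });
  }
  for (auto& th : pool) th.join();
  return lock_acquisitions.load();
}

}  // namespace sketch_fastpath
```

With the flag left `false`, `CountLocks(PausePointBefore, 4, 1000)` returns 4000, while `CountLocks(PausePointAfter, 4, 1000)` returns 0: the reader threads never touch the shared lock.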

The flag is non-atomic: it is written only by the main thread during
`FT._DEBUG` commands and read by reader threads. This is safe in
practice because:
- A stale `false` (read just after a pause point is registered) only
delays the pause point taking effect by one query, which is acceptable
for a debug tool.
- The flag is never reset. Once set to `true` it stays `true`, so the
worst case after pause points are cleared is an unnecessary trip through
the mutex-protected path, never a skipped pause point.
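
Strictly by the C++ memory model, an unsynchronized write and read of a plain `bool` from different threads is a data race, so the argument above relies on how compilers and hardware behave in practice. As a hedged alternative, not what this commit does, the flag could be a `std::atomic<bool>` accessed with relaxed ordering; on mainstream targets such as x86-64 and AArch64 a relaxed load compiles to an ordinary load. The `sketch_atomic` namespace and the helper names below are hypothetical.

```cpp
#include <atomic>
#include <cassert>

namespace sketch_atomic {

// Hypothetical replacement for the plain `bool` flag.
std::atomic<bool> pause_points_enabled{false};

// Would be called on the main thread when a pause point is registered.
inline void EnablePausePoints() {
  pause_points_enabled.store(true, std::memory_order_relaxed);
}

// Stand-in for the condition inside BACKGROUND_PAUSEPOINT: returns true
// when the mutex-protected slow path would be taken.
inline bool WouldTakeSlowPath(bool is_main_thread) {
  return pause_points_enabled.load(std::memory_order_relaxed) &&
         !is_main_thread;
}

// Usage example: the slow path is skipped until a pause point is enabled,
// and is always skipped on the main thread.
inline void Demo() {
  assert(!WouldTakeSlowPath(/*is_main_thread=*/false));
  EnablePausePoints();
  assert(WouldTakeSlowPath(/*is_main_thread=*/false));
  assert(!WouldTakeSlowPath(/*is_main_thread=*/true));
}

}  // namespace sketch_atomic
```

Relaxed ordering is enough for this sketch because the flag only gates entry to a path that takes the mutex anyway; the mutex provides the ordering for the pause point table itself.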

## Results

Benchmark: 500 clients, 30 seconds per test, core-pinned (server cores
0-31, client cores 32-63).

### TEXT — 10K matches (req/s)

| Reader Threads | Baseline | + PausePoint Fix | vs Baseline |
|:-:|:-:|:-:|:-:|
| 1 | 1,195 | 1,581 | 1.3x |
| 2 | 538 | 3,012 | 5.6x |
| 4 | 415 | 5,633 | 13.6x |
| 8 | 440 | 7,564 | 17.2x |
| 16 | 459 | 9,068 | 19.8x |

### NUMERIC — 10K matches (req/s)

| Reader Threads | Baseline | + PausePoint Fix | vs Baseline |
|:-:|:-:|:-:|:-:|
| 1 | 2,229 | 4,094 | 1.8x |
| 2 | 573 | 7,047 | 12.3x |
| 4 | 409 | 13,702 | 33.5x |
| 8 | 510 | 18,208 | 35.7x |
| 16 | 523 | 10,447 | 20.0x |

### TAG — 10K matches (req/s)

| Reader Threads | Baseline | + PausePoint Fix | vs Baseline |
|:-:|:-:|:-:|:-:|
| 1 | 1,588 | 2,336 | 1.5x |
| 2 | 578 | 3,245 | 5.6x |
| 4 | 377 | 5,092 | 13.5x |
| 8 | 465 | 9,354 | 20.1x |
| 16 | 433 | 10,912 | 25.2x |

### 100 matches — no regression

| Reader Threads | TEXT Baseline | TEXT Fixed | NUMERIC Baseline | NUMERIC Fixed | TAG Baseline | TAG Fixed |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 86,671 | 85,061 | 100,574 | 95,488 | 91,206 | 95,206 |
| 2 | 59,449 | 84,228 | 63,224 | 92,685 | 61,895 | 94,552 |
| 4 | 44,872 | 80,494 | 40,106 | 79,978 | 40,269 | 89,594 |
| 8 | 44,661 | 72,067 | 46,526 | 78,173 | 45,405 | 78,578 |
| 16 | 45,542 | 72,358 | 46,547 | 74,804 | 42,161 | 76,501 |

Light queries also improve by roughly 1.4x to 2.2x at 2+ threads, since
they no longer serialize on the mutex.

## After Fix — Perf Profile (16T, TEXT 10K)

```
    44.05%  InternedString::DecrementRefCount
    14.80%  TermIterator::InsertValidKeyIterator
    12.54%  TermIterator::FindMinimumValidKey
    10.78%  SearchNonVectorQuery
     0.00%  absl::Mutex::Lock
```

Zero mutex contention. The next bottleneck is `DecrementRefCount` at 44%
— atomic ref counting on `InternedStringPtr` during iteration causes
cache-line bouncing across cores. I am addressing this in a follow-up.
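
To illustrate where per-key reference counting cost comes from, here is a minimal sketch that uses `std::shared_ptr` as a stand-in for `InternedStringPtr`. It is not the follow-up fix, which this commit does not describe; `KeyPtr`, `TotalLengthByCopy`, and `TotalLengthByRef` are hypothetical names. Copying the pointer on every iteration step increments and later decrements the shared count, while borrowing it by reference leaves the count untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

namespace sketch_refcount {

// Hypothetical stand-in for InternedStringPtr: a reference-counted pointer.
using KeyPtr = std::shared_ptr<const std::string>;

// Copies each KeyPtr: one refcount increment and one decrement per key.
std::size_t TotalLengthByCopy(const std::vector<KeyPtr>& keys) {
  std::size_t total = 0;
  for (KeyPtr key : keys) {  // the copy bumps the shared refcount
    total += key->size();
  }
  return total;
}

// Borrows each KeyPtr by reference: the refcount is never modified.
std::size_t TotalLengthByRef(const std::vector<KeyPtr>& keys) {
  std::size_t total = 0;
  for (const KeyPtr& key : keys) {
    total += key->size();
  }
  return total;
}

}  // namespace sketch_refcount
```

Both functions return the same result; the difference is only in how often the shared counter, and therefore its cache line, is written when many threads iterate over the same keys.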

## Impact

- Reader thread scaling works as intended — performance improves with
additional threads instead of degrading.
- Heavy queries (10K matches) achieve roughly 20x to 25x improvement at
16 threads vs baseline, and up to 35.7x for NUMERIC at 8 threads.
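- No regression at 1 thread for 10K-match queries; at 100 matches,
1-thread throughput is within about 5% of baseline (TEXT and NUMERIC
slightly lower, TAG slightly higher).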
- Light queries (100 matches) also improve at 2+ threads.

---------

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
diff --git a/vmsdk/src/debug.cc b/vmsdk/src/debug.cc
@@ -29,6 +29,8 @@ struct Waiter {
 };
 
 absl::flat_hash_map<std::string, std::vector<Waiter>> pause_point_waiters;
+// Global flag: set to true when any pause point is registered.
+bool pause_points_enabled{false};
 
 static std::string ToString(const std::source_location& location) {
   std::string os;
@@ -85,6 +87,7 @@ void PausePointControl(absl::string_view point, bool enable) {
   if (enable) {
     if (!pause_point_waiters.contains(point)) {
       pause_point_waiters[point];
+      pause_points_enabled = true;
     }
     CHECK(pause_point_waiters.contains(point));
   } else {
@@ -139,6 +142,9 @@ void PausePointList(ValkeyModuleCtx* ctx) {
 void ClearAllPausePoints() {
   absl::MutexLock lock(&pause_point_lock);
   pause_point_waiters.clear();
+  // Note: pause_points_enabled stays true. No reason to reset it —
+  // it's only checked as a fast-path skip in production where no
+  // pause points are ever set.
 }
 
 //
diff --git a/vmsdk/src/debug.h b/vmsdk/src/debug.h
@@ -21,6 +21,11 @@
 namespace vmsdk {
 namespace debug {
 
+// Global flag: set to true when any pause point is registered via FT._DEBUG.
+// Non-atomic — only written by the main thread during FT._DEBUG commands.
+// Checked inline by PAUSEPOINT/BACKGROUND_PAUSEPOINT macros for fast skip.
+extern bool pause_points_enabled;
+
 //
 // PausePoints are a tool to help with debugging of background processes.
 //
@@ -30,14 +35,20 @@ namespace debug {
 // current thread. A unique label is provided to distinguish this pause point
 // with others.
 //
-#define PAUSEPOINT(name) \
-  vmsdk::debug::PausePoint(name, std::source_location::current())
+#define PAUSEPOINT(name)                                               \
+  do {                                                                 \
+    if (vmsdk::debug::pause_points_enabled) {                          \
+      vmsdk::debug::PausePoint(name, std::source_location::current()); \
+    }                                                                  \
+  } while (false)
 void PausePoint(absl::string_view point, std::source_location location);
 
-#define BACKGROUND_PAUSEPOINT(name) \
-  if (!vmsdk::IsMainThread()) {     \
-    PAUSEPOINT(name);               \
-  }                                 \
+#define BACKGROUND_PAUSEPOINT(name)                                     \
+  do {                                                                  \
+    if (vmsdk::debug::pause_points_enabled && !vmsdk::IsMainThread()) { \
+      PAUSEPOINT(name);                                                 \
+    }                                                                   \
+  } while (false)
 //
 // This function is used by the control machinery (FT.DEBUG) to enable/disable
 // and test PausePoints.