
[EFA] Add MC_EFA_CQ_THREADS env var and reduce idle CPU spin #2113

Open

yuhuiaws wants to merge 1 commit into kvcache-ai:main from yuhuiaws:feat/efa-cq-threads-env-and-sleep


Conversation

@yuhuiaws

Summary

  • Add MC_EFA_CQ_THREADS environment variable to cap the number of CQ polling threads in EFA transport. When multiple EFA consumers coexist in the same process (e.g., KV cache transfer + DeepEP MoE all-to-all), each creates one poller thread per context. This env var allows limiting thread count to reduce contention.
  • Replace std::this_thread::yield() with std::this_thread::sleep_for(std::chrono::microseconds(10)) in workerThreadFunc idle path. On Linux, yield() compiles to sched_yield() which busy-spins at 100% CPU when there is no CQ work, wasting cores that could serve other EFA consumers.

Motivation

When running SGLang PD (Prefill-Decode) disaggregated serving with DeepSeek-V4 on AWS EFA, Mooncake handles KV cache transfer while UCCL-EP (DeepEP) handles MoE all-to-all — both over EFA in the same process. Without thread capping, Mooncake spawns one CQ poller per EFA context, and the yield() busy-spin consumes significant CPU, starving the DeepEP communication threads.

With MC_EFA_CQ_THREADS=1 and the sleep-based idle loop, CPU usage drops substantially with no measurable impact on transfer latency.

Test plan

  • Verify MC_EFA_CQ_THREADS is respected: set to 1, confirm only 1 CQ thread spawns
  • Verify unset MC_EFA_CQ_THREADS preserves default behavior (one thread per context)
  • Verify MC_EFA_CQ_THREADS > context count is safely ignored
  • Measure transfer latency with and without the sleep change
  • Run SGLang PD serving with Mooncake EFA backend end-to-end

🤖 Generated with Claude Code

Two changes to EFA transport CQ polling:

1. Add MC_EFA_CQ_THREADS environment variable to cap the number of CQ
   polling threads. When running multiple EFA consumers (e.g. KV transfer
   + DeepEP all-to-all) in the same process, each creates threads per
   context. This allows limiting contention.

2. Replace std::this_thread::yield() with sleep_for(10us) in the idle
   path of workerThreadFunc. yield() on Linux compiles to sched_yield()
   which busy-spins at 100% CPU when there is no CQ work, wasting cores
   that could serve other EFA consumers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the ability to configure the number of EFA completion queue worker threads via the MC_EFA_CQ_THREADS environment variable and replaces std::this_thread::yield() with a 10-microsecond sleep in the worker thread loop to reduce CPU usage during idle periods. A review comment correctly identified that std::stoull can throw exceptions if the environment variable contains invalid data, which could lead to an unexpected crash during initialization; adding error handling or using a non-throwing parsing function was recommended.

Comment on lines +112 to +115
    if (cq_env) {
        size_t cq_val = std::stoull(cq_env);
        if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
    }

Severity: medium

The use of std::stoull can throw std::invalid_argument or std::out_of_range exceptions if the environment variable MC_EFA_CQ_THREADS contains a malformed string or an empty value. Since this is read during initialization, it could cause the process to crash unexpectedly. It is safer to wrap this in a try-catch block or use a non-throwing parsing method like strtoull to ensure robustness against invalid user input.

    if (cq_env) {
        try {
            size_t cq_val = std::stoull(cq_env);
            if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
        } catch (const std::exception& e) {
            LOG(WARNING) << "Invalid MC_EFA_CQ_THREADS value: " << cq_env << " (" << e.what() << ")";
        }
    }

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@whn09 (Collaborator) left a comment


Thanks for picking this up — the underlying issue (Mooncake CQ pollers competing with DeepEP for CPU when both run in the same process) is real, and capping the poller count is a reasonable knob. A few requested changes before we merge:

1. PR description does not follow the Mooncake template.

The repo expects the sections from .github/pull_request_template.md:

  • ## Description (link to relevant issue / scenario)
  • ## Module (check Transfer Engine here)
  • ## Type of Change (probably New feature + Bug fix)
  • ## How Has This Been Tested? (this is where the actual numbers go — see point 4)
  • ## Checklist (self-review, ./scripts/code_format.sh, docs, tests)

Please rewrite the body so future readers can navigate it the same way as other Mooncake PRs.

2. New env var should go through mooncake-common/src/environ.cpp, not raw getenv.

The Mooncake codebase has a centralized Environ singleton (mooncake-common/src/environ.cpp) that owns all MC_* vars — MC_MAX_WR, MC_MAX_EP_PER_CTX, MC_NUM_CQ_PER_CTX, etc. Reading MC_EFA_CQ_THREADS directly via std::getenv and parsing with std::stoull bypasses that pattern, and on top of that, an invalid input like MC_EFA_CQ_THREADS=abc will throw an uncaught std::invalid_argument and crash the process. Please:

  • Add efa_cq_threads_ (or similar) to Environ with a GetInt("MC_EFA_CQ_THREADS", 0) (0 = unset / use default).
  • Read it from Environ::Get().GetEfaCqThreads() (or whatever accessor name fits), so parsing/error handling is uniform.
  • Drop the <cstdlib> include and the inline std::stoull call.

3. Concerns about std::this_thread::sleep_for(std::chrono::microseconds(10)).

Two things worry me:

  • Actual sleep time: on Linux with CONFIG_HZ=1000, nanosleep's minimum effective sleep is ~50–100 µs (timer slack + scheduler tick), not 10 µs. So the description's "10us" is optimistic. Whatever the real sleep length turns out to be, that quantum is now added to KV-transfer tail latency every time the poller goes idle.
  • yield() doesn't always busy-spin: sched_yield() only burns CPU when there are no other runnable threads in the same priority class. In the exact scenario this PR targets (Mooncake + DeepEP coexisting), there are other runnable threads, so yield() should actually deschedule. The CPU saving may be smaller than expected in that case, while the latency penalty from sleeping is real.

Could you:

(a) Provide measured numbers (Mooncake CPU before/after, transfer p50/p99 latency before/after, DeepEP all-to-all latency before/after) in the rewritten "How Has This Been Tested?" section, and
(b) Consider making the sleep duration configurable (e.g. a second env MC_EFA_CQ_IDLE_SLEEP_US, default whatever your measurements show is right) so users can trade latency vs CPU based on their workload?

Long-term, the cleanest fix is to switch the EFA CQ to FI_WAIT_FD and have the worker thread block on epoll instead of polling — but that's a bigger change and out of scope for this PR.

4. Other concerns.

  • The current test plan checkboxes are all unchecked. Before requesting merge, please run them and report results, especially the latency measurement.
  • Once you switch to Environ::Get().GetEfaCqThreads() the thread-count guard becomes:
    size_t cq_val = Environ::Get().GetEfaCqThreads();
    if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
    i.e. 0 = "user did not configure, keep default" — same semantics as getenv returning nullptr, but without exception risk.
  • The line // One poller thread per context for responsive CQ draining under load is now stale once the env var caps it. Either update it or add a follow-up note explaining the trade-off when the env is set.

Happy to re-review once these are addressed.
