[EFA] Add MC_EFA_CQ_THREADS env var and reduce idle CPU spin (#2113)
yuhuiaws wants to merge 1 commit into
Conversation
Two changes to EFA transport CQ polling:

1. Add `MC_EFA_CQ_THREADS` environment variable to cap the number of CQ polling threads. When running multiple EFA consumers (e.g. KV transfer + DeepEP all-to-all) in the same process, each creates threads per context. This allows limiting contention.
2. Replace `std::this_thread::yield()` with `sleep_for(10us)` in the idle path of `workerThreadFunc`. `yield()` on Linux compiles to `sched_yield()`, which busy-spins at 100% CPU when there is no CQ work, wasting cores that could serve other EFA consumers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces the ability to configure the number of EFA completion queue worker threads via the MC_EFA_CQ_THREADS environment variable and replaces std::this_thread::yield() with a 10-microsecond sleep in the worker thread loop to reduce CPU usage during idle periods. A review comment correctly identified that std::stoull can throw exceptions if the environment variable contains invalid data, which could lead to an unexpected crash during initialization; adding error handling or using a non-throwing parsing function was recommended.
```cpp
if (cq_env) {
    size_t cq_val = std::stoull(cq_env);
    if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
}
```
The use of std::stoull can throw std::invalid_argument or std::out_of_range exceptions if the environment variable MC_EFA_CQ_THREADS contains a malformed string or an empty value. Since this is read during initialization, it could cause the process to crash unexpectedly. It is safer to wrap this in a try-catch block or use a non-throwing parsing method like strtoull to ensure robustness against invalid user input.
```cpp
if (cq_env) {
  try {
    size_t cq_val = std::stoull(cq_env);
    if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
  } catch (const std::exception& e) {
    LOG(WARNING) << "Invalid MC_EFA_CQ_THREADS value: " << cq_env << " (" << e.what() << ")";
  }
}
```
Codecov Report: ✅ All modified and coverable lines are covered by tests.
whn09 left a comment:
Thanks for picking this up — the underlying issue (Mooncake CQ pollers competing with DeepEP for CPU when both run in the same process) is real, and capping the poller count is a reasonable knob. A few requested changes before we merge:
1. PR description does not follow the Mooncake template.
The repo expects the sections from .github/pull_request_template.md:
- `## Description` (link to relevant issue / scenario)
- `## Module` (check Transfer Engine here)
- `## Type of Change` (probably New feature + Bug fix)
- `## How Has This Been Tested?` (this is where the actual numbers go; see point 4)
- `## Checklist` (self-review, `./scripts/code_format.sh`, docs, tests)
Please rewrite the body so future readers can navigate it the same way as other Mooncake PRs.
2. New env var should go through mooncake-common/src/environ.cpp, not raw getenv.
The Mooncake codebase has a centralized Environ singleton (mooncake-common/src/environ.cpp) that owns all MC_* vars — MC_MAX_WR, MC_MAX_EP_PER_CTX, MC_NUM_CQ_PER_CTX, etc. Reading MC_EFA_CQ_THREADS directly via std::getenv and parsing with std::stoull bypasses that pattern, and on top of that, an invalid input like MC_EFA_CQ_THREADS=abc will throw an uncaught std::invalid_argument and crash the process. Please:
- Add `efa_cq_threads_` (or similar) to `Environ` with a `GetInt("MC_EFA_CQ_THREADS", 0)` (0 = unset / use default).
- Read it from `Environ::Get().GetEfaCqThreads()` (or whatever accessor name fits), so parsing/error handling is uniform.
- Drop the `<cstdlib>` include and the inline `std::stoull` call.
3. Concerns about std::this_thread::sleep_for(std::chrono::microseconds(10)).
Two things worry me:
- Actual sleep time: on Linux with `CONFIG_HZ=1000`, `nanosleep`'s minimum effective sleep is ~50–100 µs (timer slack + scheduler tick), not 10 µs. So the description's "10us" is optimistic. Whatever the real sleep length turns out to be, that quantum is now added to KV-transfer tail latency every time the poller goes idle.
- `yield()` doesn't always busy-spin: `sched_yield()` only burns CPU when there are no other runnable threads in the same priority class. In the exact scenario this PR targets (Mooncake + DeepEP coexisting), there are other runnable threads, so `yield()` should actually deschedule. The CPU saving may be smaller than expected in that case, while the latency penalty from sleeping is real.
Could you:
(a) Provide measured numbers (Mooncake CPU before/after, transfer p50/p99 latency before/after, DeepEP all-to-all latency before/after) in the rewritten "How Has This Been Tested?" section, and
(b) Consider making the sleep duration configurable (e.g. a second env MC_EFA_CQ_IDLE_SLEEP_US, default whatever your measurements show is right) so users can trade latency vs CPU based on their workload?
Long-term, the cleanest fix is to switch the EFA CQ to FI_WAIT_FD and have the worker thread block on epoll instead of polling — but that's a bigger change and out of scope for this PR.
4. Other concerns.
- The current test plan checkboxes are all unchecked. Before requesting merge, please run them and report results, especially the latency measurement.
- Once you switch to `Environ::Get().GetEfaCqThreads()` the thread-count guard becomes:

  ```cpp
  size_t cq_val = Environ::Get().GetEfaCqThreads();
  if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
  ```

  i.e. 0 = "user did not configure, keep default", the same semantics as `getenv` returning `nullptr`, but without exception risk.
- The line `// One poller thread per context for responsive CQ draining under load` is now stale once the env var caps it. Either update it or add a follow-up note explaining the trade-off when the env is set.
Happy to re-review once these are addressed.
Summary
- Add `MC_EFA_CQ_THREADS` environment variable to cap the number of CQ polling threads in EFA transport. When multiple EFA consumers coexist in the same process (e.g., KV cache transfer + DeepEP MoE all-to-all), each creates one poller thread per context. This env var allows limiting thread count to reduce contention.
- Replace `std::this_thread::yield()` with `std::this_thread::sleep_for(std::chrono::microseconds(10))` in the `workerThreadFunc` idle path. On Linux, `yield()` compiles to `sched_yield()`, which busy-spins at 100% CPU when there is no CQ work, wasting cores that could serve other EFA consumers.

Motivation
When running SGLang PD (Prefill-Decode) disaggregated serving with DeepSeek-V4 on AWS EFA, Mooncake handles KV cache transfer while UCCL-EP (DeepEP) handles MoE all-to-all, both over EFA in the same process. Without thread capping, Mooncake spawns one CQ poller per EFA context, and the `yield()` busy-spin consumes significant CPU, starving the DeepEP communication threads.

With `MC_EFA_CQ_THREADS=1` and the sleep-based idle loop, CPU usage drops substantially with no measurable impact on transfer latency.
- [ ] `MC_EFA_CQ_THREADS` is respected: set to 1, confirm only 1 CQ thread spawns
- [ ] `MC_EFA_CQ_THREADS` unset preserves default behavior (one thread per context)
- [ ] `MC_EFA_CQ_THREADS` > context count is safely ignored

🤖 Generated with Claude Code