
[EFA] Add MC_EFA_CQ_THREADS env var and reduce idle CPU spin #2113

Open

yuhuiaws wants to merge 1 commit into kvcache-ai:main from yuhuiaws:feat/efa-cq-threads-env-and-sleep


Conversation

@yuhuiaws

Summary

  • Add MC_EFA_CQ_THREADS environment variable to cap the number of CQ polling threads in EFA transport. When multiple EFA consumers coexist in the same process (e.g., KV cache transfer + DeepEP MoE all-to-all), each creates one poller thread per context. This env var allows limiting thread count to reduce contention.
  • Replace std::this_thread::yield() with std::this_thread::sleep_for(std::chrono::microseconds(10)) in workerThreadFunc idle path. On Linux, yield() compiles to sched_yield() which busy-spins at 100% CPU when there is no CQ work, wasting cores that could serve other EFA consumers.

Motivation

When running SGLang PD (Prefill-Decode) disaggregated serving with DeepSeek-V4 on AWS EFA, Mooncake handles KV cache transfer while UCCL-EP (DeepEP) handles MoE all-to-all — both over EFA in the same process. Without thread capping, Mooncake spawns one CQ poller per EFA context, and the yield() busy-spin consumes significant CPU, starving the DeepEP communication threads.

With MC_EFA_CQ_THREADS=1 and the sleep-based idle loop, CPU usage drops substantially with no measurable impact on transfer latency.

Test plan

  • Verify MC_EFA_CQ_THREADS is respected: set to 1, confirm only 1 CQ thread spawns
  • Verify unset MC_EFA_CQ_THREADS preserves default behavior (one thread per context)
  • Verify MC_EFA_CQ_THREADS > context count is safely ignored
  • Measure transfer latency with and without the sleep change
  • Run SGLang PD serving with Mooncake EFA backend end-to-end

🤖 Generated with Claude Code

Two changes to EFA transport CQ polling:

1. Add MC_EFA_CQ_THREADS environment variable to cap the number of CQ
   polling threads. When running multiple EFA consumers (e.g. KV transfer
   + DeepEP all-to-all) in the same process, each creates threads per
   context. This allows limiting contention.

2. Replace std::this_thread::yield() with sleep_for(10us) in the idle
   path of workerThreadFunc. yield() on Linux compiles to sched_yield()
   which busy-spins at 100% CPU when there is no CQ work, wasting cores
   that could serve other EFA consumers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the ability to configure the number of EFA completion queue worker threads via the MC_EFA_CQ_THREADS environment variable and replaces std::this_thread::yield() with a 10-microsecond sleep in the worker thread loop to reduce CPU usage during idle periods. A review comment correctly identified that std::stoull can throw exceptions if the environment variable contains invalid data, which could lead to an unexpected crash during initialization; adding error handling or using a non-throwing parsing function was recommended.

Comment on lines +112 to +115
    if (cq_env) {
        size_t cq_val = std::stoull(cq_env);
        if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
    }

Severity: medium

The use of std::stoull can throw std::invalid_argument or std::out_of_range exceptions if the environment variable MC_EFA_CQ_THREADS contains a malformed string or an empty value. Since this is read during initialization, it could cause the process to crash unexpectedly. It is safer to wrap this in a try-catch block or use a non-throwing parsing method like strtoull to ensure robustness against invalid user input.

    if (cq_env) {
        try {
            size_t cq_val = std::stoull(cq_env);
            if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
        } catch (const std::exception& e) {
            LOG(WARNING) << "Invalid MC_EFA_CQ_THREADS value: " << cq_env << " (" << e.what() << ")";
        }
    }

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@whn09 (Collaborator) left a comment


Thanks for picking this up — the underlying issue (Mooncake CQ pollers competing with DeepEP for CPU when both run in the same process) is real, and capping the poller count is a reasonable knob. A few requested changes before we merge:

1. PR description does not follow the Mooncake template.

The repo expects the sections from .github/pull_request_template.md:

  • ## Description (link to relevant issue / scenario)
  • ## Module (check Transfer Engine here)
  • ## Type of Change (probably New feature + Bug fix)
  • ## How Has This Been Tested? (this is where the actual numbers go — see point 4)
  • ## Checklist (self-review, ./scripts/code_format.sh, docs, tests)

Please rewrite the body so future readers can navigate it the same way as other Mooncake PRs.

2. New env var should go through mooncake-common/src/environ.cpp, not raw getenv.

The Mooncake codebase has a centralized Environ singleton (mooncake-common/src/environ.cpp) that owns all MC_* vars — MC_MAX_WR, MC_MAX_EP_PER_CTX, MC_NUM_CQ_PER_CTX, etc. Reading MC_EFA_CQ_THREADS directly via std::getenv and parsing with std::stoull bypasses that pattern, and on top of that, an invalid input like MC_EFA_CQ_THREADS=abc will throw an uncaught std::invalid_argument and crash the process. Please:

  • Add efa_cq_threads_ (or similar) to Environ with a GetInt("MC_EFA_CQ_THREADS", 0) (0 = unset / use default).
  • Read it from Environ::Get().GetEfaCqThreads() (or whatever accessor name fits), so parsing/error handling is uniform.
  • Drop the <cstdlib> include and the inline std::stoull call.

3. Concerns about std::this_thread::sleep_for(std::chrono::microseconds(10)).

Two things worry me:

  • Actual sleep time: on Linux with CONFIG_HZ=1000, nanosleep's minimum effective sleep is ~50–100 µs (timer slack + scheduler tick), not 10 µs. So the description's "10us" is optimistic. Whatever the real sleep length turns out to be, that quantum is now added to KV-transfer tail latency every time the poller goes idle.
  • yield() doesn't always busy-spin: sched_yield() only burns CPU when there are no other runnable threads in the same priority class. In the exact scenario this PR targets (Mooncake + DeepEP coexisting), there are other runnable threads, so yield() should actually deschedule. The CPU saving may be smaller than expected in that case, while the latency penalty from sleeping is real.

Could you:

(a) Provide measured numbers (Mooncake CPU before/after, transfer p50/p99 latency before/after, DeepEP all-to-all latency before/after) in the rewritten "How Has This Been Tested?" section, and
(b) Consider making the sleep duration configurable (e.g. a second env MC_EFA_CQ_IDLE_SLEEP_US, default whatever your measurements show is right) so users can trade latency vs CPU based on their workload?

Long-term, the cleanest fix is to switch the EFA CQ to FI_WAIT_FD and have the worker thread block on epoll instead of polling — but that's a bigger change and out of scope for this PR.

4. Other concerns.

  • The current test plan checkboxes are all unchecked. Before requesting merge, please run them and report results, especially the latency measurement.
  • Once you switch to Environ::Get().GetEfaCqThreads() the thread-count guard becomes:
    size_t cq_val = Environ::Get().GetEfaCqThreads();
    if (cq_val > 0 && cq_val < num_threads) num_threads = cq_val;
    i.e. 0 = "user did not configure, keep default" — same semantics as getenv returning nullptr, but without exception risk.
  • The line // One poller thread per context for responsive CQ draining under load is now stale once the env var caps it. Either update it or add a follow-up note explaining the trade-off when the env is set.

Happy to re-review once these are addressed.
