Skip to content

[DEP-950][feat] add multi-rank sleep/wakeup support to MPI executor path#14636

Open
hhzhang16 wants to merge 23 commits into
NVIDIA:mainfrom
hhzhang16:hannahz/dep-950-add-multi-rank-sleepwakeup-support-to-mpi-executor-path
Open

[DEP-950][feat] add multi-rank sleep/wakeup support to MPI executor path#14636
hhzhang16 wants to merge 23 commits into
NVIDIA:mainfrom
hhzhang16:hannahz/dep-950-add-multi-rank-sleepwakeup-support-to-mpi-executor-path

Conversation

@hhzhang16

@hhzhang16 hhzhang16 commented May 27, 2026

Copy link
Copy Markdown
Contributor

Extends sleep() / wakeup() to work correctly when tensor_parallel_size > 1.

Previously the MPI/IPC path rejected multi-rank sleep/wakeup because control_action() only pauses the local rank-0 executor loop; it does not run code in peer rank processes. Since release_with_tag() / materialize_with_tag() must execute in each rank's own process, this MR adds a dedicated control communicator and listener thread for non-rank-0 workers.

Control-Listener Infrastructure (py_executor.py)

  • Duplicates a secondary MPI communicator (_control_comm) in start_worker() after PP broadcast-comm setup. This is collective and runs on the main thread before the worker thread starts.
  • Spawns a _control_listener_thread daemon on non-rank-0 ranks.
    • The listener blocks on _CONTROL_ACTION_TAG.
    • It executes the requested VMM operation locally.
    • It sends a structured ACK on _CONTROL_ACK_TAG with status: ok/error.
  • Adds _control_action_lock to serialize concurrent callers.
    • control_action() uses an Event barrier, not a mutex.
    • Without the lock, two concurrent RPCs can both pass the barrier and interleave on _control_comm or mutate VMM out of order.
    • The lock covers both single-rank and multi-rank sleep/wakeup.
  • During shutdown, rank 0 sends a "shutdown" sentinel to all non-rank-0 listeners.

Coordination (base_worker.py)

  • Removes the world_size > 1 NotImplementedError from _check_sleep_wakeup_preconditions().
  • Adds _multi_rank_sleep_wakeup():
    • Acquires _control_action_lock.
    • Enters control_action().
    • Broadcasts the requested action/tags to non-rank-0 workers.
    • Runs the local VMM operation on rank 0.
    • Drains all peer ACKs before raising any aggregate error.
  • Updates sleep() and wakeup() so both branches are serialized:
    • world_size == 1
    • world_size > 1

Proxy Allowlist (proxy.py, rpc_proxy.py)

  • Adds _MULTI_RANK_ALLOWED_METHODS = {"sleep", "wakeup"}.
  • Updates _check_collective_rpc_guard() to accept method=.
  • Multi-rank collective_rpc() still rejects general methods, but allows sleep() / wakeup() because those now coordinate internally across ranks.

Sleep/Wakeup Sequence

sequenceDiagram
    participant API as API caller
    participant R0 as Rank 0 worker
    participant Loop0 as Rank 0 executor loop
    participant C as _control_comm
    participant R1 as Rank 1 listener
    participant R2 as Rank 2 listener

    API->>R0: sleep(tags) or wakeup(tags)
    R0->>R0: acquire _control_action_lock
    R0->>Loop0: enter control_action()
    Loop0-->>R0: drained and paused

    R0->>C: send action/tags to rank 1
    C->>R1: deliver control action
    R0->>C: send action/tags to rank 2
    C->>R2: deliver control action

    par local VMM ops
        R0->>R0: release/materialize local VMM
    and peer VMM ops
        R1->>R1: release/materialize local VMM
        R2->>R2: release/materialize local VMM
    end

    R1-->>C: ACK status ok/error
    R2-->>C: ACK status ok/error
    C-->>R0: ACKs

    R0->>R0: drain all ACKs
    R0->>Loop0: exit control_action()
    R0->>R0: release _control_action_lock
    R0-->>API: success or aggregate error
Loading

Error Handling

flowchart TD
    Start["Rank 0 broadcasts action"]
    Local["Rank 0 local VMM op"]
    Peer1["Rank 1 VMM op"]
    Peer2["Rank 2 VMM op"]

    Ack1["Rank 1 ACK: status ok/error"]
    Ack2["Rank 2 ACK: status ok/error"]

    Drain["Rank 0 drains all ACKs"]
    AnyError{"Any local or peer error?"}
    Raise["Raise aggregate RuntimeError"]
    Return["Return success"]

    Start --> Local
    Start --> Peer1 --> Ack1 --> Drain
    Start --> Peer2 --> Ack2 --> Drain
    Local --> Drain
    Drain --> AnyError
    AnyError -->|yes| Raise
    AnyError -->|no| Return
Loading

The important guarantee is that rank 0 always drains peer ACKs before raising. That prevents stale ACKs from being consumed by the next sleep/wakeup operation.

Tests

  • test_sleep_collective_rpc_guards.py

    • Replaces removed-contract tests with routing and allowlist tests.
    • Adds TestMultiRankSleepWakeupLock.
      • Uses SpyLock to verify acquire/release ordering.
      • Adds a concurrent-overlap test.
      • Patches CUDA/VMM globals once in the main thread so both worker threads share the same mock scope.
  • TestMultiRankAckErrorPropagation

    • Drives _multi_rank_sleep_wakeup() directly with a fake recv() that returns scripted ACKs.
    • test_peer_error_ack_raises verifies a single non-ok ACK surfaces its error detail.
    • test_multiple_peer_errors_aggregated uses world_size=3 with two error ACKs and verifies both messages appear in the aggregate error.
  • TestMultiRankRank0LocalFailureDrainsAcks

    • Patches release_with_tag() to raise.
    • Asserts all peer ACKs are still received before the method raises.
    • Covers the combined failure case where rank 0 fails and a peer also returns an error; both messages must appear.
  • TestSingleRankLockAcquired

    • Uses a minimal SpyLock to verify single-rank sleep() / wakeup() also acquire _control_action_lock.
    • Parametrized over both methods.
    • The engine stub does not need _control_comm because the single-rank path never touches it.
  • _make_proto_worker

    • Shared by the ACK/error-path test classes.
    • Defaults to world_size=3 so tests cover multiple peers instead of only the degenerate single-peer case.
  • test_mpi_sleep_wakeup.py (new, @pytest.mark.gpu2)

    • Adds TP=2 integration coverage using TinyLlama.
    • _per_device_gpu_memory() queries NVML per device.
    • This ensures a rank-1 leak on GPU 1 cannot be masked by rank 0 releasing enough memory on GPU 0.

Summary by CodeRabbit

Release Notes

  • New Features

    • Sleep/wakeup memory management operations now support multi-rank distributed deployments with coordinated control across all GPU ranks
  • Bug Fixes

    • Multi-rank configurations can now use sleep/wakeup functionality (previously unavailable for distributed setups)
  • Tests

    • Added comprehensive test coverage for multi-rank sleep/wakeup operations, including memory monitoring validation

Review Change Stack

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@hhzhang16 hhzhang16 requested review from a team as code owners May 27, 2026 14:56
@hhzhang16 hhzhang16 requested review from JunyiXu-nv and joyang-nv May 27, 2026 14:56
@hhzhang16 hhzhang16 changed the title feat: add multi-rank sleep/wakeup support to MPI executor path [DEP-950][feat] add multi-rank sleep/wakeup support to MPI executor path May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR enables coordinated GPU virtual memory sleep/wakeup across multiple MPI-distributed ranks. It adds a dedicated control communicator, background listener threads on non-rank-0 workers, orchestration logic on rank 0 to broadcast commands and collect ACKs, updates RPC guards to allowlist these operations for multi-rank deployments, and includes integration and unit tests for the full control flow.

Changes

Multi-rank GPU VMM Sleep/Wakeup Support

Layer / File(s) Summary
MPI control channel setup and lifecycle
tensorrt_llm/_torch/pyexecutor/py_executor.py
Introduces _CONTROL_ACTION_TAG and _CONTROL_ACK_TAG for a dedicated control message channel, adds _control_comm, _control_listener_thread, and _control_action_lock executor state, duplicates the main MPI communicator on startup when world_size > 1, starts the listener daemon on non-rank-0 workers, and sends "shutdown" messages on rank 0 during shutdown to cleanly exit listener threads.
Non-rank-0 control listener loop
tensorrt_llm/_torch/pyexecutor/py_executor.py
Implements _control_listener_loop that runs as a background thread on each non-rank-0 worker: receives sleep/wakeup/shutdown actions from rank 0, executes corresponding VMM operations, returns ACKs with optional error payloads, synchronizes CUDA device state, and cleans up MPI thread-local communicator on exit.
BaseWorker multi-rank coordination
tensorrt_llm/executor/base_worker.py
Introduces _multi_rank_sleep_wakeup helper that coordinates across ranks: rank 0 executes local VMM operations, broadcasts control messages to all other ranks, receives and aggregates ACKs/errors; sleep() and wakeup() now dispatch to multi-rank path when world_size > 1 and use _control_action_lock to serialize concurrent control operations; adds traceback import for error reporting.
RPC guard allowlisting for sleep/wakeup
tensorrt_llm/executor/proxy.py, tensorrt_llm/executor/rpc_proxy.py
Adds _MULTI_RANK_ALLOWED_METHODS allowlist and extends _check_collective_rpc_guard with method parameter to permit model_world_size > 1 only for sleep/wakeup; updates docstrings to describe rank-0 RPC shim behavior for multi-rank sleep/wakeup.
Multi-GPU integration tests
tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py
Adds TP=2 integration tests with TinyLlama: test_mpi_sleep_wakeup_tp2 and test_mpi_sleep_wakeup_kv_cache_only_tp2 verify sleep/wakeup operations reduce and recover GPU memory (when NVML available) and maintain output correctness across sleep/wakeup cycles.
Guard, locking, and protocol unit tests
tests/unittest/executor/test_sleep_collective_rpc_guards.py
Expands test suite with multi-rank dispatch verification, _control_action_lock concurrency testing, ACK/error aggregation from peers, rank-0 local failure recovery with peer ACK draining, single-rank lock validation, and RPC guard allowlisting enforcement for sleep/wakeup vs other methods.

Sequence Diagram(s)

sequenceDiagram
  participant Rank0Worker as Rank 0 Worker
  participant Rank0Lock as _control_action_lock
  participant ControlComm as control_comm
  participant RankNListener as Rank N Listener
  Rank0Worker->>Rank0Lock: acquire
  Rank0Worker->>Rank0Worker: release_with_tag (local)
  Rank0Worker->>ControlComm: send sleep/wakeup action
  ControlComm->>RankNListener: recv action
  RankNListener->>RankNListener: release/materialize_with_tag
  RankNListener->>ControlComm: send ACK
  Rank0Worker->>ControlComm: recv all ACKs
  Rank0Worker->>Rank0Lock: release
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 65.91% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description is comprehensive, covering the motivation (multi-rank sleep/wakeup support), key implementation details (control communicator, listener thread, error handling), and test coverage. However, it does not follow the required repository template structure with sections for ticket ID/type, explicit description, test coverage, and checklist. Restructure the description to follow the repository template: add ticket ID and type prefix, separate explicit description section, itemize test coverage section, and include the PR checklist with verification items.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding multi-rank sleep/wakeup support to the MPI executor path, which aligns with the changeset across multiple files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (3)
tests/unittest/executor/test_sleep_collective_rpc_guards.py (1)

502-577: No QA test-list update is needed for this unittest file.

These additions are narrow unit coverage under tests/unittest/, so they do not need a tests/integration/test_lists/qa/* entry on their own. Any QA-list work belongs to the separate integration-test changes in this PR, not this file.

As per coding guidelines, "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/executor/test_sleep_collective_rpc_guards.py` around lines 502
- 577, This unittest-only change adds narrow unit coverage under tests/unittest
(e.g., TestProxyCollectiveRpcGuards and TestIpcProxyRpcClientGuard) and does not
require any QA test-list updates; update the PR description or commit message to
explicitly state that QA list updates are unnecessary for this change
(mentioning the added tests in TestProxyCollectiveRpcGuards and
TestIpcProxyRpcClientGuard) so reviewers know no
tests/integration/test_lists/qa/* entry is required.
tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py (1)

81-191: QA list update looks unnecessary for this addition.

This lands under tests/unittest/_torch/multi_gpu/, not tests/integration/defs/, so I do not see a required update under tests/integration/test_lists/qa/ for this PR.

As per coding guidelines "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py` around lines 81 -
191, This PR adds unit tests (functions test_mpi_sleep_wakeup_tp2 and
test_mpi_sleep_wakeup_kv_cache_only_tp2) under the multi-GPU unittest suite and
therefore does not require QA list changes; update the PR description (or the
commit message) with a one-line note stating "QA list update unnecessary —
changes are limited to unit tests under the unittest multi-GPU suite" so
reviewers see you considered QA guidance and explicitly declared it unnecessary.
tensorrt_llm/executor/base_worker.py (1)

719-723: ⚡ Quick win

Use PEP 585 generics in these new annotations.

Since this repo targets Python 3.10+, the new/modified signatures should use list[...] instead of List[...] for consistency with the rest of the codebase.

Suggested fix
-    def _multi_rank_sleep_wakeup(
-        self,
-        action: str,
-        tags: List[ExecutorMemoryType],
-    ) -> None:
+    def _multi_rank_sleep_wakeup(
+        self,
+        action: str,
+        tags: list[ExecutorMemoryType],
+    ) -> None:
@@
-    def sleep(self, sleep_tags: List[str]) -> None:
+    def sleep(self, sleep_tags: list[str]) -> None:
@@
-    def wakeup(self, wakeup_tags: List[str]) -> None:
+    def wakeup(self, wakeup_tags: list[str]) -> None:
As per coding guidelines, "Prefer built-in types `list`, `dict`, and `tuple` to legacy `typing.List`, `typing.Dict`, `typing.Tuple` in Python"; the retrieved learning also notes this repo requires Python >=3.10 and can use PEP 585 generics like `list[str]`.

Also applies to: 812-812, 861-861

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/base_worker.py` around lines 719 - 723, The function
_multi_rank_sleep_wakeup currently uses typing.List for its parameter
annotation; update this and all other occurrences in this file to use PEP 585
built-in generics (e.g., use list[ExecutorMemoryType] instead of
List[ExecutorMemoryType]). Locate _multi_rank_sleep_wakeup and the other
annotated functions mentioned in the comment (the additional occurrences noted
around the other two locations) and replace typing.List, typing.Dict,
typing.Tuple usages with their modern built-in equivalents to match the codebase
style.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2116-2153: The control listener decodes msg["tags"] and constructs
ExecutorMemoryType(t) before entering the try/except, so decoding errors can
bypass the finally ACK; move the decoding/validation of msg (accessing
msg["tags"], building tags list via ExecutorMemoryType) inside the try block
that ends with the finally which sends the ACK, and narrow the except to
specific exceptions (e.g., KeyError, TypeError, ValueError, and the runtime
errors from ExecutorMemoryType/materialize/release) instead of a bare except
Exception so any decode/runtime failure results in an error ACK being sent via
self._control_comm.send in the finally; keep the existing action handling
(sleep/wakeup/shutdown/unknown) and logging unchanged.

In `@tensorrt_llm/executor/base_worker.py`:
- Around line 747-748: Reformat the wrapped import of materialize_with_tag and
release_with_tag from tensorrt_llm._torch.virtual_memory to a normal inline
import to satisfy Flake8/isort; replace the parenthesized, wrapped form with a
single-line import statement that imports materialize_with_tag and
release_with_tag from tensorrt_llm._torch.virtual_memory (so references to
materialize_with_tag and release_with_tag in base_worker.py continue to work).
- Around line 784-788: Refactor the rank-0 VMM local error handling so ACK
draining always runs by moving the ACK drain logic out of the broad except and
into a finally block (ensure the drain code that currently follows the
try/except always executes); narrow the exception capture around the local VMM
call (the block that calls release_with_tag / materialize_with_tag) to only
catch CUDA-related errors (RuntimeError and torch.OutOfMemoryError) and build
local_error/wrap into RuntimeError for those cases, but for any other unexpected
exception re-raise it after the finally drain so it is not masked; keep the
existing local_error variable semantics for the CUDA-related path and ensure
traceback.format_exc() is only included for the CUDA-handled branch.

In `@tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py`:
- Around line 114-145: Snapshot the set of active devices from
_per_device_gpu_memory() into a variable (e.g., active_devices = {dev for dev,
bytes in mem_active.items() if bytes > 0}) when
process_gpu_memory_info_available is true, assert that active_devices is
non-empty before calling llm._collective_rpc("sleep", ...) to ensure the test
can fail, then use that active_devices set for both the post-sleep and
post-wakeup comparisons (iterating over active_devices and comparing mem_sleep
and mem_wakeup for those devices) instead of relying on mem_active/mem_sleep
emptiness checks inside the loops.
- Around line 57-77: The NVML error handling in _per_device_gpu_memory() is too
broad; narrow the excepts so we don't silently hide probe/init failures: only
swallow ImportError when importing pynvml (return {}), but let pynvml.nvmlInit()
raise pynvml.NVMLError (do not convert it to {}); in the per-device loop around
pynvml.nvmlDeviceGetComputeRunningProcesses(handle) only catch specific NVML
errors that mean “no data available” (e.g., pynvml.NVMLError_NotSupported or
NVMLError_NoPermission) and set result[idx]=0 for those, otherwise re-raise
other pynvml.NVMLError exceptions so real probe failures surface.

In `@tests/unittest/executor/test_sleep_collective_rpc_guards.py`:
- Around line 397-404: Reindent the multi-line context-manager "with patch(...)"
blocks in tests/unittest/executor/test_sleep_collective_rpc_guards.py so
continuation lines align under the opening parenthesis (use 4 spaces) and remove
the stray backslash continuation; specifically reformat the block that patches
"tensorrt_llm._torch.virtual_memory.release_with_tag",
"tensorrt_llm._torch.virtual_memory.materialize_with_tag",
"torch.cuda.synchronize", "gc.collect", and "torch.cuda.empty_cache" (and the
similar block around the other occurrence) so each patch(...) call is a properly
indented argument to the initial with and no over-indented continuation lines
remain, conforming to Python 3.10+ 4-space indentation and fixing Flake8 E127.
- Around line 193-195: Update the failing multiline docstrings in the test file
so they conform to Ruff D205: for the docstring that begins
"_control_action_lock must be acquired before and released after the
control_action + send/recv sequence." (and the other occurrences reported),
either collapse the docstring to a single-line or insert a blank line between
the one-line summary and the following descriptive text; ensure you update each
offending docstring (the ones around the _control_action_lock comment and the
other reported ranges) to follow that pattern.

---

Nitpick comments:
In `@tensorrt_llm/executor/base_worker.py`:
- Around line 719-723: The function _multi_rank_sleep_wakeup currently uses
typing.List for its parameter annotation; update this and all other occurrences
in this file to use PEP 585 built-in generics (e.g., use
list[ExecutorMemoryType] instead of List[ExecutorMemoryType]). Locate
_multi_rank_sleep_wakeup and the other annotated functions mentioned in the
comment (the additional occurrences noted around the other two locations) and
replace typing.List, typing.Dict, typing.Tuple usages with their modern built-in
equivalents to match the codebase style.

In `@tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py`:
- Around line 81-191: This PR adds unit tests (functions
test_mpi_sleep_wakeup_tp2 and test_mpi_sleep_wakeup_kv_cache_only_tp2) under the
multi-GPU unittest suite and therefore does not require QA list changes; update
the PR description (or the commit message) with a one-line note stating "QA list
update unnecessary — changes are limited to unit tests under the unittest
multi-GPU suite" so reviewers see you considered QA guidance and explicitly
declared it unnecessary.

In `@tests/unittest/executor/test_sleep_collective_rpc_guards.py`:
- Around line 502-577: This unittest-only change adds narrow unit coverage under
tests/unittest (e.g., TestProxyCollectiveRpcGuards and
TestIpcProxyRpcClientGuard) and does not require any QA test-list updates;
update the PR description or commit message to explicitly state that QA list
updates are unnecessary for this change (mentioning the added tests in
TestProxyCollectiveRpcGuards and TestIpcProxyRpcClientGuard) so reviewers know
no tests/integration/test_lists/qa/* entry is required.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 32c9d98e-0f54-4127-a4bd-bfbf14156c00

📥 Commits

Reviewing files that changed from the base of the PR and between 021e4d8 and ffdc5f8.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/base_worker.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/rpc_proxy.py
  • tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py
  • tests/unittest/executor/test_sleep_collective_rpc_guards.py

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/executor/base_worker.py Outdated
Comment thread tensorrt_llm/executor/base_worker.py Outdated
Comment thread tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py
Comment thread tests/unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py
Comment thread tests/unittest/executor/test_sleep_collective_rpc_guards.py Outdated
Comment thread tests/unittest/executor/test_sleep_collective_rpc_guards.py Outdated
hhzhang16 added 2 commits May 27, 2026 11:13
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@chienchunhung chienchunhung self-requested a review May 27, 2026 20:23
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

@chienchunhung chienchunhung left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Overall approach looks good. I took the first pass and left a couple of comments.

Question: what are the expected behavior if the sleep/wakeup actions succeeded partially, e.g., some ranks succeed while some fail? Do we reconcile?

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
hhzhang16 added 2 commits May 28, 2026 14:28
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
…nnahz/dep-950-add-multi-rank-sleepwakeup-support-to-mpi-executor-path
@hhzhang16

Copy link
Copy Markdown
Contributor Author

Question: what are the expected behavior if the sleep/wakeup actions succeeded partially, e.g., some ranks succeed while some fail? Do we reconcile?

There's no reconciliation -- this is an intentional design decision matching how the rest of the executor treats unrecoverable errors. If any rank fails its VMM op, then the system is in an inconsistent state due to some ranks having released/materialized their memory and some not. Attempting to reconcile would require another coordinated round trip and could itself fail. Instead, all errors are aggregated and a single RuntimeError is raised on rank 0 listing all failing ranks.

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tests/unittest/executor/test_sleep_collective_rpc_guards.py
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
hhzhang16 added 3 commits June 3, 2026 11:56
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
…nnahz/dep-950-add-multi-rank-sleepwakeup-support-to-mpi-executor-path
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

@chienchunhung chienchunhung left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. It'd be great to address the following in the follow-up PRs (1) add the load tests, (2) harden the corner case thread cleanup.

hhzhang16 added 2 commits June 4, 2026 02:25
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
…nnahz/dep-950-add-multi-rank-sleepwakeup-support-to-mpi-executor-path

@Superjomn Superjomn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53634 [ run ] triggered by Bot. Commit: 26bcb4f Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53634 [ run ] completed with state FAILURE. Commit: 26bcb4f
/LLM/main/L0_MergeRequest_PR pipeline #42780 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54322 [ run ] triggered by Bot. Commit: 971d10c Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54322 [ run ] completed with state FAILURE. Commit: 971d10c
/LLM/main/L0_MergeRequest_PR pipeline #43393 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator

The real PR-caused failure is the new MPI sleep/wakeup path:

  • unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py::test_mpi_sleep_wakeup_tp2
  • unittest/_torch/multi_gpu/test_mpi_sleep_wakeup.py::test_mpi_sleep_wakeup_kv_cache_only_tp2

Both fail on DGX_H100-2_GPUs-PyTorch-Others-1 with RequestError: Hang detected on rank 0 in PyExecutor.

The hang happens after sleep() / wakeup(), when the test calls llm.generate() again. Log evidence shows rank 1’s executor loop stuck in _handle_control_request() waiting on control_action_done, while the new sleep/wakeup listener thread is blocked in _sleep_wakeup_comm.recv(...). That matches the old guard this PR removed: non-rank-0 processes can enter the control-action pause with no matching caller to release their local control_action_done.

So the listener executes/receives sleep-wakeup messages, but it does not solve the executor control barrier on peer ranks. The PR’s rank-0-only _multi_rank_sleep_wakeup() enters engine.control_action() on rank 0, while peer event loops can still pause and never resume.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
auto-merge was automatically disabled June 17, 2026 13:54

Head branch was pushed to by a user without write access

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54881 [ run ] triggered by Bot. Commit: cefe870 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54881 [ run ] completed with state FAILURE. Commit: cefe870
/LLM/main/L0_MergeRequest_PR pipeline #43887 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator

@hhzhang16 One recommendation: Initialize the sleep/wakeup communicator/listener only when multi-rank sleep/wakeup can actually be used, e.g. PyTorch backend + sleep_config enabled + MPI executor path, rather than all multi-rank PyExecutor workers.

I believe the current implementation would initiate the listener thread even when Ray orchestration is used, which is not necessary.

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
…nnahz/dep-950-add-multi-rank-sleepwakeup-support-to-mpi-executor-path
@hhzhang16

Copy link
Copy Markdown
Contributor Author

@chienchunhung good catch, added gate

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55327 [ run ] triggered by Bot. Commit: cd99391 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55327 [ run ] completed with state FAILURE. Commit: cd99391
/LLM/main/L0_MergeRequest_PR pipeline #44278 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55559 [ run ] triggered by Bot. Commit: cd99391 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55559 [ run ] completed with state FAILURE. Commit: cd99391
/LLM/main/L0_MergeRequest_PR pipeline #44481 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55856 [ run ] triggered by Bot. Commit: 7be30a0 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55856 [ run ] completed with state FAILURE. Commit: 7be30a0
/LLM/main/L0_MergeRequest_PR pipeline #44747 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants