Hongbinl/dev offload on lifuz #4795
Open
lhb8125 wants to merge 4 commits into
Carries forward lifuz's uncommitted working-tree changes from /lustre/fsw/coreai_dlalgo_llm/users/lifuz/NT3/Megatron-LM-nanz as of 2026-05-14, including the cuda_graphs.py / experts.py SRELU support, the moe_utils.py MoEAuxLossAutoScaler CUDA-graph-safety fix (save metadata instead of save_for_backward; sketched below), and the training.py / common_config.py / utils.py touch-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
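For context, a minimal sketch of the CUDA-graph-safety pattern this commit describes for moe_utils.py: instead of stashing the aux-loss tensor via save_for_backward (which keeps a reference to storage that a captured graph may reuse or overwrite between capture and replay), forward saves only shape/dtype/device metadata and backward rebuilds the gradient from it. The class name and scaling hook follow Megatron-LM's MoEAuxLossAutoScaler, but this is an illustration of the pattern, not the actual diff.

```python
import torch

class MoEAuxLossAutoScaler(torch.autograd.Function):
    """Sketch: scale the aux-loss gradient without holding a tensor
    reference across forward/backward."""

    main_loss_backward_scale = torch.tensor(1.0)

    @staticmethod
    def forward(ctx, output, aux_loss):
        # CUDA-graph-safety fix: save metadata only, not the tensor itself.
        ctx.aux_loss_shape = aux_loss.shape
        ctx.aux_loss_dtype = aux_loss.dtype
        ctx.aux_loss_device = aux_loss.device
        return output

    @staticmethod
    def backward(ctx, grad_output):
        # Rebuild the aux-loss gradient from metadata alone.
        scale = MoEAuxLossAutoScaler.main_loss_backward_scale
        aux_loss_grad = torch.ones(
            ctx.aux_loss_shape, dtype=ctx.aux_loss_dtype, device=ctx.aux_loss_device
        ) * scale
        return grad_output, aux_loss_grad
```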
…inflight_offloads throttle

Backport of NVIDIA#4514 onto the lifuz WIP base (29195cf + lifuz cuda_graphs / experts / moe_utils etc.):

* TransformerConfig.fine_grained_offloading_max_inflight_offloads: Optional[int] = None — a per-offload-group-name cap on how many D2H copies may be in flight before the main stream wait_event's on the oldest event for that name; 0 = wait after every offload, None = no joins (see the throttle sketch after this message). Required to be non-None when fine-grained activation offloading is combined with local + full_iteration CUDA graphs.
* PipelineOffloadManager.init_model_chunk_offload_handler, ChunkOffloadHandler.__init__, and FineGrainedActivationOffloadingInterface.init_chunk_handler grow a max_inflight_offloads parameter that flows end-to-end. ChunkOffloadHandler tracks pending offload events per group_name and drains older ones via _drain_offload_pending when the cap is exceeded.
* gpt_model.preprocess_for_fine_grained_offloading and mamba_model.preprocess_for_fine_grained_offloading pass the new config knob through to init_chunk_handler.
* TransformerConfig.__post_init__ now compares cuda_graph_scope against [CudaGraphScope.full_iteration] (the normalized list form) instead of the literal string "full_iteration"; the prior check was always False after normalization and silently rejected legitimate local + full_iteration configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
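A minimal sketch of the throttle in the first bullet, assuming offloads are D2H copies enqueued on a side stream. The OffloadThrottle class and record_offload helper are hypothetical names for illustration; per the commit, the real logic lives in ChunkOffloadHandler and its _drain_offload_pending helper.

```python
from collections import defaultdict, deque
from typing import Optional

import torch

class OffloadThrottle:
    """Hypothetical illustration of the per-group-name in-flight cap."""

    def __init__(self, max_inflight_offloads: Optional[int]):
        # None = never join; 0 = main stream waits after every offload;
        # k > 0 = at most k D2H copies in flight per offload group name.
        self.max_inflight = max_inflight_offloads
        self.pending = defaultdict(deque)  # group_name -> events, oldest first

    def record_offload(self, group_name: str, d2h_stream: torch.cuda.Stream):
        """Call right after enqueueing a D2H copy for group_name on d2h_stream."""
        if self.max_inflight is None:
            return
        event = torch.cuda.Event()
        event.record(d2h_stream)  # completes when the copy finishes
        self.pending[group_name].append(event)
        self._drain(group_name)

    def _drain(self, group_name: str):
        # Make the main stream wait on the oldest events until the number of
        # in-flight copies for this group is back under the cap.
        queue = self.pending[group_name]
        while len(queue) > self.max_inflight:
            torch.cuda.current_stream().wait_event(queue.popleft())
```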
Wraps MambaMixer._ssm_training with the FineGrainedActivationOffloadingInterface context manager so the activations that the fused mamba_split_conv1d_scan_combined kernel saves for backward (zxBCdt plus the depthwise conv1d output for x/B/C) can be evicted to pinned host memory and reloaded on demand.

* megatron/core/ssm/mamba_mixer.py — import off_interface; add a self.offload_ssm_training flag in MambaMixer.__init__ (gated on config.fine_grained_activation_offloading and "mamba_ssm_training" in config.offload_modules); wrap the _ssm_training call in forward() with off_interface(...) / group_offload(...) using the throttle knob from PR NVIDIA#4514 via delay_offload=config.delay_offload_until_cuda_graph (see the wiring sketch after this message).
* megatron/core/transformer/transformer_layer.py — mirror the offload_mamba_ssm_training flag in _set_offload_modules for symmetry / future layer-level CUDA-graph compat checks (currently unread because MambaMixer reads config.offload_modules directly).
* megatron/core/transformer/transformer_config.py — add "mamba_ssm_training" to the allowlist in the fine-grained-offload validation block so --offload-modules mamba_ssm_training survives arg validation.

Measured impact at seq=32k on lyris GB200 (baseline vs. offload + PR NVIDIA#4514 max_inflight=1): max_allocated -4.24 GB per rank, reserved snapshot -2.1 GB, iter-60 throughput within 0.1% of baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
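A hypothetical sketch of the mamba_mixer.py wiring from the first bullet. The commit names off_interface and the delay_offload knob but not their exact signatures, so the context manager below is a runnable no-op stand-in and the module is heavily simplified; the real _ssm_training calls the fused mamba_split_conv1d_scan_combined kernel.

```python
from contextlib import contextmanager

import torch

@contextmanager
def off_interface(group_name, delay_offload=False):
    # Stand-in for the real FineGrainedActivationOffloadingInterface context
    # manager named in the commit; a no-op here so the sketch runs.
    yield

class MambaMixer(torch.nn.Module):  # heavily simplified
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Gate on the global switch plus the per-module allowlist, as the
        # commit describes.
        self.offload_ssm_training = (
            config.fine_grained_activation_offloading
            and "mamba_ssm_training" in config.offload_modules
        )

    def _ssm_training(self, hidden_states):
        # Placeholder for the fused mamba_split_conv1d_scan_combined path.
        return hidden_states

    def forward(self, hidden_states):
        if self.training and self.offload_ssm_training:
            # Inside the context, activations saved for backward by the fused
            # scan kernel become eligible for eviction to pinned host memory.
            with off_interface(
                "mamba_ssm_training",
                delay_offload=self.config.delay_offload_until_cuda_graph,
            ):
                return self._ssm_training(hidden_states)
        return self._ssm_training(hidden_states)
```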
Force-pushed from 359fdc2 to fbf3f11.
What does this PR do?

Issue tracking

For PRs from open-source community contributors:
Linked issue:

Contribution process

Pre-checks

Code review

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"
Expert reviewers are assigned based on .github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside megatron/core, this step is skipped.

Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.

Merge
Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.