
[PyTorch] Add op-level activation offload opt-out API#3108

Merged: timmoon10 merged 12 commits into NVIDIA:main from lhb8125:codex/te-op-offload-control on Jun 11, 2026


@lhb8125 lhb8125 commented Jun 9, 2026

Contributor

Summary

Follow-up to #3047.

This PR adds an op-level activation offload policy for saved activation tensors so downstream fused grouped MLP users can opt individual TE ops out of activation offloading without changing the surrounding CPU offload context.

  • add BasicOperation.set_activation_offloading(enabled)
  • route activation offload marking through BasicOperation.maybe_mark_activation_offload
  • use mark_not_offload for ops whose saved activations are opted out
  • keep global CPU offload context checks and start_offload calls at their original call sites
  • add unit coverage for the per-op mark/opt-out behavior
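The per-op policy listed above can be pictured with a small toy model. This is only a sketch: `ToyOp` is a hypothetical stand-in, not Transformer Engine's `BasicOperation`, and the op names are borrowed from the discussion later in this thread. The only behavior taken from the PR is that offloading defaults to enabled and that `set_activation_offloading(enabled)` flips it per op.

```python
class ToyOp:
    """Hypothetical stand-in for a TE basic op; not the real BasicOperation."""

    def __init__(self, name):
        self.name = name
        # Offloading defaults to enabled, so ops nobody touches behave as before.
        self._activation_offloading_enabled = True

    def set_activation_offloading(self, enabled):
        """Opt this op's saved activations in or out of offloading."""
        self._activation_offloading_enabled = bool(enabled)


# Three ops that would sit inside one fused grouped MLP.
ops = [ToyOp("expert_fc1"), ToyOp("moe_act"), ToyOp("expert_fc2")]

# Opt a single op out; the surrounding offload context is left unchanged.
ops[1].set_activation_offloading(False)

policy = {op.name: op._activation_offloading_enabled for op in ops}
print(policy)
```

The flag lives on the op rather than on the tensor, so a caller can set it once at model construction time.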

Testing

  • git diff --check
  • python3.12 -m py_compile on the changed TE op files and tests/pytorch/test_fusible_ops.py

Signed-off-by: hongbinl <hongbinl@nvidia.com>
@lhb8125 lhb8125 requested a review from timmoon10 as a code owner June 9, 2026 12:27
@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 9, 2026
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR introduces a per-op activation CPU offload opt-out API on BasicOperation, routing all activation offload marking through a new mark_for_cpu_offload_if_needed helper and replacing scattered if is_cpu_offload_enabled(): mark_activation_offload(...) patterns across 14 files.

  • BasicOperation._activation_offloading_enabled and set_activation_offloading(enabled) let callers opt individual ops out of CPU offloading without altering the surrounding context; FusedOperation.set_activation_offloading delegates to its constituent basic_ops.
  • mark_for_cpu_offload_if_needed centralises the enabled-check and routes tensors to mark_activation_offload (opt-in) or mark_not_offload (opt-out), correctly handling None, plain torch.Tensor, QuantizedTensorStorage, and GroupedTensorStorage (including list slices) through filter_supported_and_extend.
  • The unit test correctly patches transformer_engine.pytorch.ops.op.mark_activation_offload and mark_not_offload (the names as imported into op.py's namespace), ensuring the monkeypatch is effective.
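A rough sketch of the routing summarized in the bullets above, assuming nothing beyond what they state. Every name below is a local stand-in defined in the snippet (strings play the role of tensors, and `flatten_supported` is a hypothetical simplification of `filter_supported_and_extend`); none of it is imported from Transformer Engine.

```python
# Toy model of the routing; all names are local stand-ins, not TE imports.
offloaded, kept = [], []
cpu_offload_enabled = True  # stand-in for the surrounding CPU offload context


def mark_activation_offload(*tensors):
    offloaded.extend(tensors)


def mark_not_offload(*tensors):
    kept.extend(tensors)


def flatten_supported(items):
    """Drop None entries and expand lists/tuples element by element."""
    flat = []
    for item in items:
        if item is None:
            continue
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_supported(item))
        else:
            flat.append(item)
    return flat


class ToyBasicOp:
    def __init__(self):
        self._activation_offloading_enabled = True  # default: opted in

    def set_activation_offloading(self, enabled):
        self._activation_offloading_enabled = bool(enabled)

    def mark_for_cpu_offload_if_needed(self, *tensors):
        if not cpu_offload_enabled:
            return  # outside an offload context nothing is marked
        tensors = flatten_supported(tensors)
        if self._activation_offloading_enabled:
            mark_activation_offload(*tensors)
        else:
            mark_not_offload(*tensors)


class ToyFusedOp:
    """A fused op forwards the setting to each of its basic ops."""

    def __init__(self, basic_ops):
        self.basic_ops = list(basic_ops)

    def set_activation_offloading(self, enabled):
        for op in self.basic_ops:
            op.set_activation_offloading(enabled)


fc1, act = ToyBasicOp(), ToyBasicOp()
act.set_activation_offloading(False)
fc1.mark_for_cpu_offload_if_needed("x", None, ["w0", "w1"])
act.mark_for_cpu_offload_if_needed("y")
print(offloaded, kept)
```

Keeping the enabled-check and the opt-in/opt-out branch in one helper is what lets call sites shrink to a single line.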

Confidence Score: 5/5

Safe to merge. The opt-out flag is additive and defaults to enabled, so existing call sites are unaffected. The V2 and V1 offloading paths both handle the new mark_not_offload correctly relative to the existing push_tensor/_check_if_offload mechanism.

All changed ops call mark_for_cpu_offload_if_needed before save_for_backward, preserving the required ordering. The filter_supported_and_extend helper correctly handles None, single tensors, and list slices. The test correctly patches the names in op.py's namespace. The only finding is the unused exclude parameter, which has no runtime impact.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/pytorch/ops/op.py | Core change: adds _activation_offloading_enabled, set_activation_offloading, and mark_for_cpu_offload_if_needed to BasicOperation; adds set_activation_offloading delegation to FusedOperation. The exclude kwarg in mark_for_cpu_offload_if_needed is defined but never used by any caller in the codebase. |
| transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py | Replaces single bulk mark_activation_offload(*activation_tensors) with per-op mark_for_cpu_offload_if_needed calls; start_offload is still called on all non-None tensors first (correct ordering before save_for_backward). |
| transformer_engine/pytorch/ops/basic/grouped_linear.py | Removes maybe_mark_and_start_activation_offload; _save_ctx now calls mark_for_cpu_offload_if_needed directly; start_offload is still called at compute time in each _fuser_forward_* variant. List slices passed to mark_for_cpu_offload_if_needed are handled correctly by filter_supported_and_extend (iterated element-by-element). |
| tests/pytorch/test_fusible_ops.py | New test test_activation_offloading_policy correctly patches op_module.mark_activation_offload and op_module.mark_not_offload (names in op.py's namespace) and validates the enable/disable toggle. |
| transformer_engine/pytorch/ops/fused/forward_linear_bias_activation.py | Simple refactor: if is_cpu_offload_enabled(): mark_activation_offload(saved_input) → linear_op.mark_for_cpu_offload_if_needed(saved_input). Correct and no ordering issues. |
| transformer_engine/pytorch/ops/basic/basic_linear.py | Removes is_cpu_offload_enabled import and guard; the check is now inside mark_for_cpu_offload_if_needed. Correctly called before ctx.save_for_backward. |
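The remark about patching names "in op.py's namespace" reflects a general Python rule: a `from ... import name` statement creates a separate binding in the importing module, so a test has to patch the name where it is looked up. The sketch below illustrates this with two throwaway modules built in memory; `toy_cpu_offload` and `toy_op` are hypothetical stand-ins, and this is not the PR's actual test.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module that defines the marker.
cpu_offload = types.ModuleType("toy_cpu_offload")
cpu_offload.mark_activation_offload = lambda *tensors: "real"

# Hypothetical stand-in for op.py after `from ... import mark_activation_offload`:
# the name is bound a second time, inside this module's own namespace.
op_module = types.ModuleType("toy_op")
op_module.mark_activation_offload = cpu_offload.mark_activation_offload
exec(
    "def forward(x):\n"
    "    return mark_activation_offload(x)\n",
    op_module.__dict__,
)

# Patching the defining module leaves op_module's binding untouched.
with mock.patch.object(cpu_offload, "mark_activation_offload", return_value="fake"):
    via_source_patch = op_module.forward("x")

# Patching the name where it is looked up takes effect.
with mock.patch.object(op_module, "mark_activation_offload", return_value="fake") as spy:
    via_namespace_patch = op_module.forward("x")
    spy.assert_called_once_with("x")

print(via_source_patch, via_namespace_patch)
```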

Sequence Diagram

sequenceDiagram
    participant Caller as Caller / Fuser
    participant Op as BasicOperation
    participant CPUOffload as cpu_offload module

    Caller->>Op: set_activation_offloading(False)
    note over Op: _activation_offloading_enabled = False

    Caller->>Op: fuser_forward(...)
    Op->>CPUOffload: "start_offload(*tensors)  [if cpu_offloading]"
    Op->>Op: "mark_for_cpu_offload_if_needed(*tensors)"
    alt op opted in
        Op->>CPUOffload: "mark_activation_offload(*include_tensors)"
    else op opted out
        Op->>CPUOffload: "mark_not_offload(*exclude_tensors)"
        note over CPUOffload: sets _TE_do_not_offload=True on component tensors (V2)
    end
    Op->>Op: "ctx.save_for_backward(*tensors)"
    note over CPUOffload: push_tensor hook: _check_if_offload respects _TE_do_not_offload
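The ordering in the diagram (mark first, then `save_for_backward`) matters because the placement decision is made when the tensor is saved. A toy illustration follows, assuming only what the diagram states about the `_TE_do_not_offload` flag; `ToyTensor` and the simplified `save_for_backward` are hypothetical and do not reproduce the real hook.

```python
class ToyTensor:
    def __init__(self, name):
        self.name = name


def mark_not_offload(*tensors):
    """Tag tensors so the save-time check keeps them on the GPU."""
    for tensor in tensors:
        tensor._TE_do_not_offload = True


def save_for_backward(tensors):
    """Toy save hook: placement is decided at save time, based on the flag."""
    placement = {}
    for tensor in tensors:
        pinned = getattr(tensor, "_TE_do_not_offload", False)
        placement[tensor.name] = "gpu" if pinned else "cpu"
    return placement


# Correct order: mark first, then save.
a = ToyTensor("act")
mark_not_offload(a)
marked_then_saved = save_for_backward([a])

# Wrong order: the hook has already decided by the time the flag is set.
b = ToyTensor("act")
saved_then_marked = save_for_backward([b])
mark_not_offload(b)

print(marked_then_saved, saved_then_marked)
```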

Reviews (8): Last reviewed commit: "Debug failure with grouped linear" | Re-trigger Greptile

Comment thread transformer_engine/pytorch/ops/basic/grouped_linear.py Outdated
Comment thread transformer_engine/pytorch/ops/op.py Outdated
Signed-off-by: hongbinl <hongbinl@nvidia.com>
@lhb8125 lhb8125 changed the title [PyTorch] Add op-level CPU offload opt-out API [PyTorch] Add op-level activation offload opt-out API Jun 9, 2026
lhb8125 added 3 commits June 9, 2026 05:53
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Comment thread transformer_engine/pytorch/ops/op.py Outdated
Comment thread transformer_engine/pytorch/ops/op.py Outdated

@ptrendx ptrendx left a comment

Member


A general question about the motivation of this feature: I believe we already have a heuristic to not offload tensors that are too small. Is that not enough?

CC @pggPL

Signed-off-by: hongbinl <hongbinl@nvidia.com>
@lhb8125

lhb8125 commented Jun 9, 2026

Contributor Author

A general question about the motivation of this feature: I believe we already have a heuristic to not offload tensors that are too small. Is that not enough?

CC @pggPL

This is to support selective offloading for Nemotron model training. With the fused MLP, MCore cannot tell whether a tensor comes from expert_fc1, moe_act, or expert_fc2, so we need to expose an API for manually setting the offloading strategy of individual ops. cc @timmoon10

Signed-off-by: hongbinl <hongbinl@nvidia.com>
Comment thread tests/pytorch/test_fusible_ops.py Outdated
Signed-off-by: hongbinl <hongbinl@nvidia.com>
@ptrendx

ptrendx commented Jun 9, 2026

Member

Ok, but does it actually matter to you that a specific tensor gets offloaded rather than getting the right amount of data to be offloaded?

@xrennvidia

Collaborator

Ok, but does it actually matter to you that a specific tensor gets offloaded rather than getting the right amount of data to be offloaded?

Different activation tensors hold different amounts of data, so selectively offloading activations is how we control the amount of data offloaded. If we offload too much, offloading latency can be exposed; if we offload too little, we get OOM. I think the essence of fine-grained offloading is to let users control the offloading of each tensor separately, so that we can make better perf tradeoffs.
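To make the point about volume concrete, here is a small arithmetic sketch. The sizes are invented for illustration (they are not measurements from any model), and `offloaded_mib` is a hypothetical helper.

```python
# Hypothetical per-layer activation sizes in MiB; real sizes depend on the model.
activation_mib = {"expert_fc1": 512, "moe_act": 256, "expert_fc2": 512}


def offloaded_mib(policy):
    """Sum the sizes of ops whose offloading is enabled (the default is enabled)."""
    return sum(size for name, size in activation_mib.items() if policy.get(name, True))


all_on = offloaded_mib({})
skip_act = offloaded_mib({"moe_act": False})
only_fc1 = offloaded_mib({"moe_act": False, "expert_fc2": False})
print(all_on, skip_act, only_fc1)
```

Each per-op choice moves the offloaded volume by that op's activation size, which is the knob being argued for here.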

@ptrendx

ptrendx commented Jun 10, 2026

Member

Well, but that could be controlled by a coarser mechanism. E.g. if we provided you with a way to say "offload up to X MBs per layer" and then choose the tensors out of all eligible in that layer wouldn't that also fulfill the same purpose, but potentially be easier to integrate/not require exposing the full control?

@xrennvidia

Copy link
Copy Markdown
Collaborator

Well, but that could be controlled by a coarser mechanism. E.g. if we provided you with a way to say "offload up to X MBs per layer" and then choose the tensors out of all eligible in that layer wouldn't that also fulfill the same purpose, but potentially be easier to integrate/not require exposing the full control?

I talked to @lhb8125, and he told me there is a fraction config that controls the number of layers that offload. The layers that offload will offload all of their activations; the other layers do not offload at all. This method can have the following drawbacks:

  1. Assume there are 100 layers and the first 50 layers each offload 2x the activations of the baseline. If we reload the activations of the 49th layer during the 100th layer's bprop, and so on, the compute window is long enough, but the reloaded tensors accumulate and may run out of GPU memory.
  2. If we instead reload the activations of the 49th layer during the 50th layer's bprop, and so on, the compute window is only one layer, which may not be enough to overlap the reload of 2x activations.

Your idea of configuring the number of bytes to offload per layer can fulfill the same purpose. But what is the definition of a layer? Is it the MoE, or the whole Transformer? This is ambiguous and could confuse users, and I still do not know how we would define it clearly and expose it to users. I personally feel that using module names such as expert_fc1 and moe_act is more straightforward.
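For comparison, one possible reading of the "offload up to X MBs per layer" idea is a greedy pick under a byte budget. This is purely a sketch of the proposal as discussed: `pick_under_budget` is hypothetical, no such API exists in this PR, and the sizes are invented.

```python
def pick_under_budget(sizes_mib, budget_mib):
    """Greedily pick the largest tensors that still fit in the budget."""
    chosen, used = [], 0
    for name, size in sorted(sizes_mib.items(), key=lambda kv: kv[1], reverse=True):
        if used + size <= budget_mib:
            chosen.append(name)
            used += size
    return chosen, used


# Hypothetical sizes in MiB for the tensors of one "layer".
sizes = {"expert_fc1": 512, "moe_act": 256, "expert_fc2": 384}

tight = pick_under_budget(sizes, 700)
loose = pick_under_budget(sizes, 800)
print(tight, loose)
```

Which tensors get picked depends on what is grouped into `sizes`, which is the "what is a layer" ambiguity raised above; naming ops explicitly avoids that question.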

timmoon10 and others added 2 commits June 11, 2026 00:36
Handle inclusion and exclusion in same function. Check whether CPU offloading is enabled internally. Tweak documentation and style.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Use BasicOperation.mark_for_cpu_offload_if_needed at op call sites and keep explicit offload synchronization checks where needed.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Comment thread transformer_engine/pytorch/ops/basic/grouped_linear.py
Comment thread tests/pytorch/test_fusible_ops.py Outdated
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

@timmoon10 timmoon10 left a comment

Member


I've made some changes so that we can use the same helper function to include and exclude tensors from offloading.

@timmoon10

Member

/te-ci pytorch

@timmoon10 timmoon10 merged commit 9b06f26 into NVIDIA:main Jun 11, 2026
22 of 25 checks passed
timmoon10 added a commit that referenced this pull request Jun 11, 2026
This reverts commit 9b06f26.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 added a commit that referenced this pull request Jun 11, 2026
Revert "[PyTorch] Add op-level activation offload opt-out API (#3108)"

This reverts commit 9b06f26.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

4 participants