[PyTorch] Add op-level activation offload opt-out API by lhb8125 · Pull Request #3108 · NVIDIA/TransformerEngine

lhb8125 · 2026-06-09T12:27:28Z

Summary

Follow-up to #3047.

This PR adds an op-level activation offload policy for saved activation tensors so downstream fused grouped MLP users can opt individual TE ops out of activation offloading without changing the surrounding CPU offload context.

- add `BasicOperation.set_activation_offloading(enabled)`
- route activation offload marking through `BasicOperation.maybe_mark_activation_offload`
- use `mark_not_offload` for ops whose saved activations are opted out
- keep global CPU offload context checks and `start_offload` calls at their original call sites
- add unit coverage for the per-op mark/opt-out behavior
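
The per-op policy listed above can be pictured with a minimal sketch. It is not the TE implementation: `ToyOp`, the two recording lists, and the stand-in marker functions are hypothetical, and only the names `set_activation_offloading`, `maybe_mark_activation_offload`, `mark_activation_offload`, and `mark_not_offload` are taken from this description. The global CPU offload check, which the PR keeps at the call sites, is left out.

```python
from typing import Any, List

offloaded: List[Any] = []    # stand-in record of tensors marked for offload
kept_on_gpu: List[Any] = []  # stand-in record of tensors marked as not offloadable


def mark_activation_offload(*tensors: Any) -> None:
    """Stand-in for the TE marker; here it only records its arguments."""
    offloaded.extend(tensors)


def mark_not_offload(*tensors: Any) -> None:
    """Stand-in for the TE opt-out marker; here it only records its arguments."""
    kept_on_gpu.extend(tensors)


class ToyOp:
    """Hypothetical op carrying a per-op activation offload policy."""

    def __init__(self) -> None:
        # Offloading defaults to enabled, so untouched ops behave as before.
        self._activation_offloading_enabled = True

    def set_activation_offloading(self, enabled: bool) -> None:
        self._activation_offloading_enabled = bool(enabled)

    def maybe_mark_activation_offload(self, *tensors: Any) -> None:
        # Drop missing activations, then route by this op's policy.
        present = [t for t in tensors if t is not None]
        if self._activation_offloading_enabled:
            mark_activation_offload(*present)
        else:
            mark_not_offload(*present)


fc1, act = ToyOp(), ToyOp()
act.set_activation_offloading(False)  # opt only this op out
fc1.maybe_mark_activation_offload("fc1_input", None)
act.maybe_mark_activation_offload("act_input")
```

In this sketch the surrounding offload context is never touched; only the op that called `set_activation_offloading(False)` has its saved activation routed to the opt-out marker.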

Testing

- `git diff --check`
- `python3.12 -m py_compile` on the changed TE op files and `tests/pytorch/test_fusible_ops.py`

Signed-off-by: hongbinl <hongbinl@nvidia.com>

greptile-apps · 2026-06-09T12:33:28Z

Greptile Summary

This PR introduces a per-op activation CPU offload opt-out API on BasicOperation, routing all activation offload marking through a new mark_for_cpu_offload_if_needed helper and replacing scattered if is_cpu_offload_enabled(): mark_activation_offload(...) patterns across 14 files.

BasicOperation._activation_offloading_enabled and set_activation_offloading(enabled) let callers opt individual ops out of CPU offloading without altering the surrounding context; FusedOperation.set_activation_offloading delegates to its constituent basic_ops.
mark_for_cpu_offload_if_needed centralises the enabled-check and routes tensors to mark_activation_offload (opt-in) or mark_not_offload (opt-out), correctly handling None, plain torch.Tensor, QuantizedTensorStorage, and GroupedTensorStorage (including list slices) through filter_supported_and_extend.
The unit test correctly patches transformer_engine.pytorch.ops.op.mark_activation_offload and mark_not_offload (the names as imported into op.py's namespace), ensuring the monkeypatch is effective.
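
The point about patching names "as imported into op.py's namespace" is general Python behaviour and can be reproduced without TE. The sketch below builds two throwaway in-memory modules; `toy_offload`, `toy_op`, and `forward` are hypothetical names used only for illustration.

```python
import sys
import types
from unittest import mock

# `toy_offload` defines the function; `toy_op` imports it by name, the way
# `from ... import mark_activation_offload` would in a real module.
toy_offload = types.ModuleType("toy_offload")
exec("def mark_activation_offload(*tensors):\n    return 'real'\n", toy_offload.__dict__)
sys.modules["toy_offload"] = toy_offload

toy_op = types.ModuleType("toy_op")
exec(
    "from toy_offload import mark_activation_offload\n"
    "def forward(x):\n"
    "    return mark_activation_offload(x)\n",
    toy_op.__dict__,
)
sys.modules["toy_op"] = toy_op

# Patching the defining module does not change the name already bound in toy_op.
with mock.patch.object(toy_offload, "mark_activation_offload", return_value="patched"):
    result_patch_definition = toy_op.forward(1)

# Patching the name in the importing module's namespace does take effect.
with mock.patch.object(toy_op, "mark_activation_offload", return_value="patched"):
    result_patch_importer = toy_op.forward(1)
```

This is why a test that patched only the defining module would silently exercise the real function.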

Confidence Score: 5/5

Safe to merge. The opt-out flag is additive and defaults to enabled, so existing call sites are unaffected. The V2 and V1 offloading paths both handle the new mark_not_offload correctly relative to the existing push_tensor/_check_if_offload mechanism.

All changed ops call mark_for_cpu_offload_if_needed before save_for_backward, preserving the required ordering. The filter_supported_and_extend helper correctly handles None, single tensors, and list slices. The test correctly patches the names in op.py's namespace. The only finding is the unused exclude parameter, which has no runtime impact.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/pytorch/ops/op.py | Core change: adds `_activation_offloading_enabled`, `set_activation_offloading`, and `mark_for_cpu_offload_if_needed` to `BasicOperation`; adds `set_activation_offloading` delegation to `FusedOperation`. The `exclude` kwarg in `mark_for_cpu_offload_if_needed` is defined but never used by any caller in the codebase. |
| transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py | Replaces single bulk `mark_activation_offload(*activation_tensors)` with per-op `mark_for_cpu_offload_if_needed` calls; `start_offload` is still called on all non-None tensors first (correct ordering before `save_for_backward`). |
| transformer_engine/pytorch/ops/basic/grouped_linear.py | Removes `maybe_mark_and_start_activation_offload`; `_save_ctx` now calls `mark_for_cpu_offload_if_needed` directly; `start_offload` is still called at compute time in each `_fuser_forward_*` variant. List slices passed to `mark_for_cpu_offload_if_needed` are handled correctly by `filter_supported_and_extend` (iterated element-by-element). |
| tests/pytorch/test_fusible_ops.py | New test `test_activation_offloading_policy` correctly patches `op_module.mark_activation_offload` and `op_module.mark_not_offload` (names in `op.py`'s namespace) and validates the enable/disable toggle. |
| transformer_engine/pytorch/ops/fused/forward_linear_bias_activation.py | Simple refactor: `if is_cpu_offload_enabled(): mark_activation_offload(saved_input)` → `linear_op.mark_for_cpu_offload_if_needed(saved_input)`. Correct and no ordering issues. |
| transformer_engine/pytorch/ops/basic/basic_linear.py | Removes `is_cpu_offload_enabled` import and guard; the check is now inside `mark_for_cpu_offload_if_needed`. Correctly called before `ctx.save_for_backward`. |

Sequence Diagram

sequenceDiagram
    participant Caller as Caller / Fuser
    participant Op as BasicOperation
    participant CPUOffload as cpu_offload module

    Caller->>Op: set_activation_offloading(False)
    note over Op: _activation_offloading_enabled = False

    Caller->>Op: fuser_forward(...)
    Op->>CPUOffload: "start_offload(*tensors)  [if cpu_offloading]"
    Op->>Op: "mark_for_cpu_offload_if_needed(*tensors)"
    alt op opted in
        Op->>CPUOffload: "mark_activation_offload(*include_tensors)"
    else op opted out
        Op->>CPUOffload: "mark_not_offload(*exclude_tensors)"
        note over CPUOffload: sets _TE_do_not_offload=True on component tensors (V2)
    end
    Op->>Op: "ctx.save_for_backward(*tensors)"
    note over CPUOffload: push_tensor hook: _check_if_offload respects _TE_do_not_offload
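
The ordering in the diagram matters because the opt-out is a flag that a later save-time hook reads. A toy sketch of that interaction follows, assuming nothing about the real offload code beyond the `_TE_do_not_offload` attribute name shown in the diagram; `ToyTensor`, `check_if_offload`, and `push_tensors` are hypothetical stand-ins.

```python
class ToyTensor:
    """Hypothetical stand-in for a saved activation tensor."""

    def __init__(self, name: str) -> None:
        self.name = name


def mark_not_offload(*tensors: ToyTensor) -> None:
    # Flag each tensor so the later save-time hook leaves it on the GPU.
    for t in tensors:
        t._TE_do_not_offload = True


def check_if_offload(t: ToyTensor) -> bool:
    # Assumed policy for this sketch: offload unless the flag is set.
    return not getattr(t, "_TE_do_not_offload", False)


def push_tensors(tensors):
    # Stand-in for the save-time hook: decide per tensor where it ends up.
    return {t.name: ("cpu" if check_if_offload(t) else "gpu") for t in tensors}


fc1_in = ToyTensor("fc1_in")
act_in = ToyTensor("act_in")
mark_not_offload(act_in)  # must run before the save-time hook sees the tensor
placement = push_tensors([fc1_in, act_in])
```

If the marking ran after the hook, the flag would be set too late to influence the decision, which is why the review checks that every op marks before `save_for_backward`.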

Reviews (8): Last reviewed commit: "Debug failure with grouped linear"


ptrendx

A general question about the motivation of this feature - I believe we already have a heuristic to not offload tensors which are too small, is that not enough?

CC @pggPL


lhb8125 · 2026-06-09T13:18:40Z

> A general question about the motivation of this feature - I believe we already have a heuristic to not offload tensors which are too small, is that not enough?
>
> CC @pggPL

This is to support selective offloading for Nemotron model training. With the fused MLP, MCore doesn't know which tensor comes from expert_fc1, moe_act, or expert_fc2, so we need to expose an API to manually set the offloading strategy for individual ops. cc @timmoon10


ptrendx · 2026-06-09T15:29:00Z

Ok, but does it actually matter to you that a specific tensor gets offloaded rather than getting the right amount of data to be offloaded?

xrennvidia · 2026-06-10T01:12:05Z

> Ok, but does it actually matter to you that a specific tensor gets offloaded rather than getting the right amount of data to be offloaded?

Different activation tensors hold different amounts of data, and selectively offloading activations is how we control the amount of data that gets offloaded. If we offload too much, offloading latency can be exposed; if we offload too little, we get OOM. I think the essence of fine-grained offloading is to let users control the offloading of each tensor separately, so that we can make better perf tradeoffs.

ptrendx · 2026-06-10T12:07:34Z

Well, but that could be controlled by a coarser mechanism. E.g. if we provided you with a way to say "offload up to X MBs per layer" and then choose the tensors out of all eligible in that layer wouldn't that also fulfill the same purpose, but potentially be easier to integrate/not require exposing the full control?
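
One way to picture the "offload up to X MBs per layer" idea is a greedy pick under a byte budget. This is only an illustration of the proposal, not an existing TE feature: the function `pick_for_offload`, the largest-first rule, and the tensor sizes are all hypothetical; the op names are borrowed from earlier in this thread.

```python
from typing import Dict, List


def pick_for_offload(sizes_mb: Dict[str, float], budget_mb: float) -> List[str]:
    """Greedily choose tensors to offload without exceeding the budget."""
    chosen: List[str] = []
    used = 0.0
    # Consider the largest tensors first; skip any that would overshoot the budget.
    for name, size in sorted(sizes_mb.items(), key=lambda kv: kv[1], reverse=True):
        if used + size <= budget_mb:
            chosen.append(name)
            used += size
    return chosen


# Made-up activation sizes for one layer, in MB.
layer = {"expert_fc1": 512.0, "moe_act": 256.0, "expert_fc2": 384.0}
chosen = pick_for_offload(layer, budget_mb=800.0)
```

With these numbers the budget admits `expert_fc1` and `moe_act` but not `expert_fc2`; the caller states only an amount, and the selection rule decides which tensors move.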

xrennvidia · 2026-06-10T15:16:36Z

> Well, but that could be controlled by a coarser mechanism. E.g. if we provided you with a way to say "offload up to X MBs per layer" and then choose the tensors out of all eligible in that layer wouldn't that also fulfill the same purpose, but potentially be easier to integrate/not require exposing the full control?

I talked to @lhb8125; he told me there is a fraction config that controls the number of layers that do offloading. The layers that do offloading offload all of their activations, and the other layers don't offload at all. That method has the following drawbacks:

- Assume there are 100 layers and the first 50 offload 2x the activations of the baseline. If we reload the 49th layer's activations starting from the 100th layer's bprop, and so on, the compute window is long enough, but the reloaded tensors accumulate and may exhaust GPU memory.
- If we instead reload the 49th layer's activations starting from the 50th layer's bprop, and so on, the compute window is only one layer, which may not be enough to overlap the reload of 2x activations.

Your idea of configuring the number of bytes to offload per layer could fulfill the same purpose. But what is the definition of a layer? Is it the MoE block or the whole Transformer layer? This is ambiguous and could confuse users. I still don't know how we would define it clearly and expose it to users. I personally feel that using module names like expert_fc1 and moe_act is more straightforward.


timmoon10

I've made some changes so that we can use the same helper function to include and exclude tensors from offloading.

timmoon10 · 2026-06-11T01:43:47Z

/te-ci pytorch


Add TE op CPU offload opt-out API

b9e1f62

Signed-off-by: hongbinl <hongbinl@nvidia.com>

lhb8125 requested a review from timmoon10 as a code owner June 9, 2026 12:27

github-actions Bot added the community-contribution label Jun 9, 2026

greptile-apps Bot reviewed Jun 9, 2026


Comment thread transformer_engine/pytorch/ops/basic/grouped_linear.py Outdated

Comment thread transformer_engine/pytorch/ops/op.py Outdated

Rename op API to activation offloading

3c11784

Signed-off-by: hongbinl <hongbinl@nvidia.com>

lhb8125 changed the title ~~[PyTorch] Add op-level CPU offload opt-out API~~ [PyTorch] Add op-level activation offload opt-out API Jun 9, 2026

lhb8125 added 3 commits June 9, 2026 05:53

Move CPU offload gating to TE op call sites

02dc770

Signed-off-by: hongbinl <hongbinl@nvidia.com>

Preserve grouped linear offload start semantics

b93a9af

Signed-off-by: hongbinl <hongbinl@nvidia.com>

Use setter for activation offload policy

0b41302

Signed-off-by: hongbinl <hongbinl@nvidia.com>

ptrendx reviewed Jun 9, 2026


Comment thread transformer_engine/pytorch/ops/op.py Outdated

ptrendx reviewed Jun 9, 2026


Comment thread transformer_engine/pytorch/ops/op.py Outdated

ptrendx reviewed Jun 9, 2026


Limit activation offload policy helper to marking

1b81786

Signed-off-by: hongbinl <hongbinl@nvidia.com>

Move CPU offload imports to op module scope

df5e715

Signed-off-by: hongbinl <hongbinl@nvidia.com>

greptile-apps Bot reviewed Jun 9, 2026


Comment thread tests/pytorch/test_fusible_ops.py Outdated

Patch activation offload test bound symbols

f26534d

Signed-off-by: hongbinl <hongbinl@nvidia.com>

lhb8125 mentioned this pull request Jun 10, 2026

[feat] Support fine-grained activation offloading in fused group mlp NVIDIA/Megatron-LM#5082

Open

timmoon10 and others added 2 commits June 11, 2026 00:36

Refactor base class offloading infrastructure

7d9128c

Handle inclusion and exclusion in same function. Check whether CPU offloading is enabled internally. Tweak documentation and style. Signed-off-by: Tim Moon <tmoon@nvidia.com>

Propagate activation offload policy helper

9c3d262

Use BasicOperation.mark_for_cpu_offload_if_needed at op call sites and keep explicit offload synchronization checks where needed. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>

greptile-apps Bot reviewed Jun 11, 2026


Comment thread transformer_engine/pytorch/ops/basic/grouped_linear.py

timmoon10 reviewed Jun 11, 2026


Comment thread tests/pytorch/test_fusible_ops.py Outdated

timmoon10 added 2 commits June 11, 2026 01:21

Move test into TestFuser suite

62b3da9

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Debug failure with grouped linear

d858ba2

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 approved these changes Jun 11, 2026


timmoon10 merged commit 9b06f26 into NVIDIA:main Jun 11, 2026
22 of 25 checks passed

timmoon10 mentioned this pull request Jun 11, 2026

Revert "[PyTorch] Add op-level activation offload opt-out API" #3120

Merged

timmoon10 added a commit that referenced this pull request Jun 11, 2026

Revert "[PyTorch] Add op-level activation offload opt-out API (#3108)"

4d0bbae

This reverts commit 9b06f26. Signed-off-by: Tim Moon <tmoon@nvidia.com>
