[PyTorch] Add op-level activation offload opt-out API #3108
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Greptile Summary
This PR introduces a per-op activation CPU offload opt-out API on
Confidence Score: 5/5
Safe to merge. The opt-out flag is additive and defaults to enabled, so existing call sites are unaffected. The V2 and V1 offloading paths both handle the new All changed ops call No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller as Caller / Fuser
participant Op as BasicOperation
participant CPUOffload as cpu_offload module
Caller->>Op: set_activation_offloading(False)
note over Op: _activation_offloading_enabled = False
Caller->>Op: fuser_forward(...)
Op->>CPUOffload: "start_offload(*tensors) [if cpu_offloading]"
Op->>Op: "mark_for_cpu_offload_if_needed(*tensors)"
alt op opted in
Op->>CPUOffload: "mark_activation_offload(*include_tensors)"
else op opted out
Op->>CPUOffload: "mark_not_offload(*exclude_tensors)"
note over CPUOffload: sets _TE_do_not_offload=True on component tensors (V2)
end
Op->>Op: "ctx.save_for_backward(*tensors)"
note over CPUOffload: push_tensor hook: _check_if_offload respects _TE_do_not_offload
Reviews (8): Last reviewed commit: "Debug failure with grouped linear"
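The flow in the sequence diagram can be sketched in plain Python. This is a toy model, not the Transformer Engine implementation: `ToyTensor`, `ToyOp`, and the public `check_if_offload` helper are hypothetical stand-ins, and the explicit `cpu_offloading` argument is an assumption made only to keep the example self-contained. The method and attribute names (`set_activation_offloading`, `mark_for_cpu_offload_if_needed`, `_TE_do_not_offload`) are taken from the diagram.

```python
class ToyTensor:
    """Stand-in for a saved activation; not a torch.Tensor."""

    def __init__(self, name):
        self.name = name


def mark_activation_offload(*tensors):
    # Toy version: tag tensors as eligible for offloading.
    for t in tensors:
        t.activation_offloading = True


def mark_not_offload(*tensors):
    # Toy version: tag tensors so the offload hook skips them.
    for t in tensors:
        t._TE_do_not_offload = True


def check_if_offload(tensor):
    # Toy version of the hook-side check: the opt-out tag always wins.
    if getattr(tensor, "_TE_do_not_offload", False):
        return False
    return getattr(tensor, "activation_offloading", False)


class ToyOp:
    def __init__(self):
        # Ops are opted in by default, so existing call sites are unaffected.
        self._activation_offloading_enabled = True

    def set_activation_offloading(self, enabled):
        self._activation_offloading_enabled = bool(enabled)

    def mark_for_cpu_offload_if_needed(self, *tensors, cpu_offloading=True):
        # Assumption: the caller passes the offload state explicitly here.
        if not cpu_offloading:
            return
        if self._activation_offloading_enabled:
            mark_activation_offload(*tensors)
        else:
            mark_not_offload(*tensors)


fc1, act = ToyOp(), ToyOp()
act.set_activation_offloading(False)  # opt this op out

x, y = ToyTensor("fc1_input"), ToyTensor("act_input")
fc1.mark_for_cpu_offload_if_needed(x)
act.mark_for_cpu_offload_if_needed(y)

offloaded = [t.name for t in (x, y) if check_if_offload(t)]
print(offloaded)  # ['fc1_input']
```

Only the tensor saved by the opted-in op is reported as offloadable; the surrounding offload context is the same for both ops.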
This is to support selective offloading for Nemotron model training. With the fused MLP, MCore doesn't know whether a tensor comes from expert_fc1, moe_act, or expert_fc2, so we need to expose an API to manually set the offloading strategy for different ops. cc @timmoon10
Ok, but does it actually matter to you that a specific tensor gets offloaded rather than getting the right amount of data to be offloaded?
Different activation tensors hold different amounts of data, and selectively offloading activations is how we control the amount of data that gets offloaded. If we offload too much, the offloading latency can be exposed; if we offload too little, we get OOM. I think the essence of fine-grained offloading is to let users control the offloading of each tensor separately, so that they can make better performance tradeoffs.
Well, but that could be controlled by a coarser mechanism. E.g. if we provided you with a way to say "offload up to X MBs per layer" and then chose the tensors out of all eligible ones in that layer, wouldn't that also fulfill the same purpose, but potentially be easier to integrate and not require exposing the full control?
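To make the "offload up to X MBs per layer" idea concrete, here is a minimal sketch of one possible selection rule: greedily pick the largest eligible tensors that still fit in the budget. The function `select_for_offload`, the greedy rule, and the tensor sizes are all hypothetical; nothing like this exists in this PR, and only the tensor names come from the discussion above.

```python
def select_for_offload(sizes_bytes, budget_bytes):
    """Greedily pick tensors, largest first, without exceeding the budget."""
    chosen = []
    remaining = budget_bytes
    by_size = sorted(sizes_bytes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if size <= remaining:
            chosen.append(name)
            remaining -= size
    return chosen


MiB = 1024 * 1024

# Made-up activation sizes for one layer, for illustration only.
layer = {
    "expert_fc1": 64 * MiB,
    "moe_act": 128 * MiB,
    "expert_fc2": 32 * MiB,
}

picked = select_for_offload(layer, budget_bytes=100 * MiB)
print(picked)  # ['expert_fc1', 'expert_fc2']
```

With a 100 MiB budget the largest tensor does not fit, so the rule falls back to the two smaller ones. A budget-based rule like this decides which tensors are offloaded on the user's behalf, which is the control the per-op API leaves with the caller.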
I talked to @lhb8125, who told me there is a fraction config that controls the number of layers that do offloading. The layers that do offloading offload all of their activations, and the other layers don't offload at all. This method can have the following drawbacks:
Your idea of configuring the number of bytes to offload per layer could serve the same purpose. But what is the definition of a layer? Is it the MoE block, or the whole Transformer layer? This is ambiguous and could be confusing for users. I still don't know how we would clearly define this and expose it to users. I personally feel like using the module name like
Handle inclusion and exclusion in same function. Check whether CPU offloading is enabled internally. Tweak documentation and style. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Use BasicOperation.mark_for_cpu_offload_if_needed at op call sites and keep explicit offload synchronization checks where needed. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10
left a comment
I've made some changes so that we can use the same helper function to include and exclude tensors from offloading.
/te-ci pytorch
This reverts commit 9b06f26. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Summary
Follow-up to #3047.
This PR adds an op-level activation offload policy for saved activation tensors so downstream fused grouped MLP users can opt individual TE ops out of activation offloading without changing the surrounding CPU offload context.
Testing