pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs (#5754) #5754
Closed
guanghou-ml wants to merge 1 commit into
@guanghou-ml has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104184777.
guanghou-ml pushed a commit to guanghou-ml/FBGEMM that referenced this pull request on May 9, 2026
…memory NaNs (pytorch#5754)

Summary:

## TL;DR

`pack_segments_backward_cuda` allocates the input-gradient tensor with `at::empty(...)`. When any `lengths[seq] > max_length`, the unpack kernel never writes the tail rows of that segment, leaving them as **uninitialized memory** that propagates into upstream gradients and causes NaN cascades. Switch the allocator to `at::zeros`.

## Bug

The unpack kernel writes only positions `cumsum[seq] + cell` for `cell < min(lengths[seq], max_length)`. When `lengths[seq] > max_length`, positions `[cumsum[seq] + max_length, cumsum[seq] + lengths[seq])` are **never written** and retain whatever was in the freshly allocated buffer.

These rows correspond to events that the forward pass truncated, so they MUST receive zero gradient. With `at::empty` they instead receive garbage. The garbage flows upstream and triggers NaN/Inf cascades in deep networks. For example, LayerNorm backward amplifies random O(1) magnitude values via `1/sqrt(var+eps)` into Inf/NaN within a few hundred steps.

## Fix

`at::empty(shape, ...)` → `at::zeros(shape, ...)` for the output tensor. One-line change. The added cost is one device-side memset over the gradient buffer per backward call, which is negligible relative to the unpack kernel and downstream backward work.

## Repro

```
import torch
import fbgemm_gpu  # registers the torch.ops.fbgemm.* operators

lengths = torch.tensor([10, 5, 8], dtype=torch.int32, device="cuda")
t_in = torch.randn(23, 8, device="cuda", requires_grad=True)
out = torch.ops.fbgemm.pack_segments(t_in, lengths, max_length=4)
out.backward(torch.ones_like(out))
# Print abs-max of rows 0..9 (seq 0 has length 10 > max_length=4,
# so rows 4..9 are the truncated tail).
print(t_in.grad[:10].abs().amax(dim=1))
```

Real values captured by running the snippet 5 times. Each row shows abs-max across the cell dimension. `lengths[0] = 10`, `max_length = 4`, so rows 0..3 are in-bounds (expected `≈1`) and rows 4..9 are truncated (expected `0`).

**BEFORE fix (`at::empty`)**: rows 4..9 vary wildly across trials, confirming uninitialized memory:

```
row:          0      1      2      3      4      5      6      7      8      9
trial 0: 1.0000 1.0000 1.0000 1.0000 1.8152 1.9762 0.8584 2.3934 2.4721 0.0000
trial 1: 1.0000 1.0000 1.0000 1.0000 2.2231 1.6936 1.8451 1.9498 1.6774 0.5991
trial 2: 1.0000 1.0000 1.0000 1.0000 1.7331 1.6970 1.5790 1.6874 2.4351 1.9974
trial 3: 1.0000 1.0000 1.0000 1.0000 1.1584 2.8627 1.8524 3.2550 1.2574 1.0000
trial 4: 1.0000 1.0000 1.0000 1.0000 2.0911 1.8118 1.6238 1.3304 1.3858 1.6397
```

**AFTER fix (`at::zeros`)**: rows 4..9 are exactly 0 and identical across trials:

```
row:          0      1      2      3      4      5      6      7      8      9
trial 0: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 1: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 2: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 3: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 4: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
```

Differential Revision: D104184777
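The write pattern described under Bug can be modeled on the host in a few lines. This is a minimal pure-Python sketch, not the CUDA kernel: `unpack_backward_model` and its `zero_init` flag are hypothetical names, each row is reduced to a single scalar, and NaN stands in for whatever an `at::empty` allocation happens to contain.

```python
import math


def unpack_backward_model(grad_packed, lengths, max_length, zero_init):
    """Host-side model of the backward unpack (hypothetical helper).

    grad_packed[seq][cell] is the gradient of packed cell `cell` of segment
    `seq`. Rows that no write touches keep the initial fill value: 0.0 when
    zero_init is True, NaN otherwise (a stand-in for at::empty garbage).
    """
    fill = 0.0 if zero_init else math.nan
    grad_in = [fill] * sum(lengths)
    start = 0  # running value of cumsum[seq]
    for seq, length in enumerate(lengths):
        # Only cells below min(lengths[seq], max_length) are ever written.
        for cell in range(min(length, max_length)):
            grad_in[start + cell] = grad_packed[seq][cell]
        start += length
    return grad_in


lengths, max_length = [10, 5, 8], 4
grad_packed = [[1.0] * max_length for _ in lengths]

before = unpack_backward_model(grad_packed, lengths, max_length, zero_init=False)
after = unpack_backward_model(grad_packed, lengths, max_length, zero_init=True)

# Rows still holding the NaN sentinel were never written by the scatter loop.
never_written = [row for row, g in enumerate(before) if math.isnan(g)]
print(never_written)
print(after[:10])
```

With the repro's `lengths = [10, 5, 8]` and `max_length = 4`, the unwritten rows are the tails of all three segments, not only rows 4..9 of segment 0; zero-initializing the buffer is what gives them a defined gradient.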
43bb6fe to 87c1279
87c1279 to d277c44
guanghou-ml pushed a commit to guanghou-ml/FBGEMM that referenced this pull request on May 25, 2026
d277c44 to 0d234b2
This pull request has been merged in b25d23c.
Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2685

## TL;DR

`pack_segments_backward_cuda` allocates the input-gradient tensor with `at::empty(...)`. When any `lengths[seq] > max_length`, the unpack kernel never writes the tail rows of that segment, leaving them as uninitialized memory that propagates into upstream gradients and causes NaN cascades. Switch the allocator to `at::zeros`.

## Bug

The unpack kernel writes only positions `cumsum[seq] + cell` for `cell < min(lengths[seq], max_length)`. When `lengths[seq] > max_length`, positions `[cumsum[seq] + max_length, cumsum[seq] + lengths[seq])` are never written and retain whatever was in the freshly allocated buffer.

These rows correspond to events that the forward pass truncated, so they MUST receive zero gradient. With `at::empty` they instead receive garbage. The garbage flows upstream and triggers NaN/Inf cascades in deep networks. For example, LayerNorm backward amplifies random O(1) magnitude values via `1/sqrt(var+eps)` into Inf/NaN within a few hundred steps.

## Fix

`at::empty(shape, ...)` → `at::zeros(shape, ...)` for the output tensor. One-line change. The added cost is one device-side memset over the gradient buffer per backward call, which is negligible relative to the unpack kernel and downstream backward work.

## Repro

Real values captured by running the snippet 5 times. Each row shows abs-max across the cell dimension. `lengths[0] = 10`, `max_length = 4`, so rows 0..3 are in-bounds (expected `≈1`) and rows 4..9 are truncated (expected `0`).

**BEFORE fix (`at::empty`)**: rows 4..9 vary wildly across trials, confirming uninitialized memory.

**AFTER fix (`at::zeros`)**: rows 4..9 are exactly 0 and identical across trials.

Reviewed By: q10

Differential Revision: D104184777
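The LayerNorm example in the Bug section rests on the factor `1/sqrt(var+eps)`. The sketch below only evaluates that factor for a few variances to show how large it can get. It is an illustration under stated assumptions, not a model of the full LayerNorm backward: `layernorm_rstd` is a hypothetical helper, and `eps=1e-5` is assumed because it is the default of `torch.nn.LayerNorm`.

```python
import math


def layernorm_rstd(var, eps=1e-5):
    """Reciprocal standard deviation used to normalize a row (hypothetical helper)."""
    return 1.0 / math.sqrt(var + eps)


# As the per-row variance shrinks, the factor approaches its cap of 1/sqrt(eps).
for var in (1.0, 1e-2, 1e-4, 0.0):
    print(f"var={var:g}  1/sqrt(var+eps)={layernorm_rstd(var):.1f}")

max_gain = layernorm_rstd(0.0)
```

Under these assumptions the factor is capped at `1/sqrt(eps)`, roughly 316 for `eps=1e-5`, so an O(1) garbage gradient that reaches a low-variance row can be scaled up by about two orders of magnitude in one layer.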