pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs (#5754) by guanghou-ml · Pull Request #5754 · pytorch/FBGEMM

guanghou-ml · 2026-05-09T18:19:06Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2685

TL;DR

pack_segments_backward_cuda allocates the input-gradient tensor with at::empty(...). When any lengths[seq] > max_length, the unpack kernel never writes the tail rows of that segment, leaving them as uninitialized memory that propagates into upstream gradients and causes NaN cascades. Switch the allocator to at::zeros.

Bug

The unpack kernel writes only positions cumsum[seq] + cell for cell < min(lengths[seq], max_length). When lengths[seq] > max_length, positions [cumsum[seq] + max_length, cumsum[seq] + lengths[seq]) are never written and retain whatever was in the freshly-allocated buffer.
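The write pattern described above can be modelled in a few lines of plain Python to see which rows are left untouched. This is an illustrative sketch, not the FBGEMM kernel: `unwritten_rows` is a hypothetical helper, and `cumsum[seq]` is assumed to be the exclusive prefix sum of `lengths`.

```python
def unwritten_rows(lengths, max_length):
    """Return the gradient row indices that the unpack write pattern never touches.

    Hypothetical helper modelling the description: for each segment, only
    cells below min(length, max_length) are written.
    """
    skipped = []
    offset = 0  # exclusive prefix sum of lengths, i.e. cumsum[seq]
    for length in lengths:
        written = min(length, max_length)
        # Rows [offset + written, offset + length) are the truncated tail.
        skipped.extend(range(offset + written, offset + length))
        offset += length
    return skipped


# Same lengths and max_length as the Repro section.
print(unwritten_rows([10, 5, 8], max_length=4))
```

For those lengths the helper lists rows 4..9, 14, and 19..22, so 11 of the 23 gradient rows depend entirely on the buffer's initial contents.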

These rows correspond to events that the forward pass truncated, so they MUST receive zero gradient. With at::empty they instead receive garbage. The garbage flows upstream and triggers NaN/Inf cascades in deep networks — for example, LayerNorm backward amplifies random O(1) magnitude values via 1/sqrt(var+eps) into Inf/NaN within a few hundred steps.
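As a rough illustration of the amplification factor mentioned above, the sketch below evaluates 1/sqrt(var+eps) for a few variances. The value eps = 1e-5 is an assumption (a common LayerNorm default), and `inv_std` is a hypothetical helper. It shows only the scale factor, not a full LayerNorm backward.

```python
import math


def inv_std(var, eps=1e-5):
    """Scale factor 1/sqrt(var + eps) that appears in LayerNorm's backward."""
    return 1.0 / math.sqrt(var + eps)


# Smaller variance means a larger multiplier on whatever gradient flows through.
for var in (1.0, 1e-2, 1e-4, 0.0):
    print(f"var={var:g}  1/sqrt(var+eps)={inv_std(var):.1f}")
```

Under these assumptions the factor reaches a few hundred as the variance approaches zero, which is how garbage of magnitude around 1 can grow quickly once it passes through several such layers.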

Fix

at::empty(shape, ...) → at::zeros(shape, ...) for the output tensor. One-line change. The added cost is one device-side memset over the gradient buffer per backward call, which is negligible relative to the unpack kernel and downstream backward work.
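The effect of the allocator change can be sketched by running the same write loop against two differently initialized buffers. This is a simplified pure-Python model, not the CUDA code: each row is a single scalar rather than a feature vector, `unpack_backward` is a hypothetical name, and `GARBAGE` is an arbitrary sentinel standing in for uninitialized memory.

```python
def unpack_backward(grad_rows, lengths, max_length, buffer):
    """Copy packed gradient rows into `buffer`, writing only in-bounds cells.

    grad_rows[seq][cell] is the upstream gradient for one packed cell;
    `buffer` stands in for the freshly allocated input-gradient tensor.
    """
    offset = 0
    for seq, length in enumerate(lengths):
        for cell in range(min(length, max_length)):
            buffer[offset + cell] = grad_rows[seq][cell]
        offset += length
    return buffer


lengths, max_length = [10, 5, 8], 4
total = sum(lengths)
ones = [[1.0] * max_length for _ in lengths]  # upstream gradient of all ones

GARBAGE = 123.456  # arbitrary sentinel, models whatever at::empty returned
like_empty = unpack_backward(ones, lengths, max_length, [GARBAGE] * total)
like_zeros = unpack_backward(ones, lengths, max_length, [0.0] * total)

print(like_empty[:10])  # truncated rows keep the sentinel
print(like_zeros[:10])  # truncated rows are zero
```

The write loop is identical in both calls. Only the buffer's starting contents differ, which is why the fix needs no kernel changes.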

Repro

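import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm operators)

# Three segments of lengths 10, 5, and 8 (23 rows total), 8 features per row.
lengths = torch.tensor([10, 5, 8], dtype=torch.int32, device="cuda")
t_in = torch.randn(23, 8, device="cuda", requires_grad=True)
# Pack with max_length=4; every segment here is longer than 4, so each is truncated.
out = torch.ops.fbgemm.pack_segments(t_in, lengths, max_length=4)
out.backward(torch.ones_like(out))
# Print abs-max of rows 0..9 (seq 0 has length 10 > max_length=4,
# so rows 4..9 are the truncated tail).
print(t_in.grad[:10].abs().amax(dim=1))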

Real values captured by running the snippet 5 times. Each entry is the abs-max of one gradient row, taken across its 8 feature columns. lengths[0] = 10, max_length = 4, so rows 0..3 are in-bounds (expected ≈1) and rows 4..9 are truncated (expected 0).

BEFORE fix (at::empty) — rows 4..9 vary wildly across trials, confirming uninitialized memory:

        row:    0      1      2      3      4      5      6      7      8      9
trial 0:    1.0000 1.0000 1.0000 1.0000 1.8152 1.9762 0.8584 2.3934 2.4721 0.0000
trial 1:    1.0000 1.0000 1.0000 1.0000 2.2231 1.6936 1.8451 1.9498 1.6774 0.5991
trial 2:    1.0000 1.0000 1.0000 1.0000 1.7331 1.6970 1.5790 1.6874 2.4351 1.9974
trial 3:    1.0000 1.0000 1.0000 1.0000 1.1584 2.8627 1.8524 3.2550 1.2574 1.0000
trial 4:    1.0000 1.0000 1.0000 1.0000 2.0911 1.8118 1.6238 1.3304 1.3858 1.6397

AFTER fix (at::zeros) — rows 4..9 are exactly 0 and identical across trials:

        row:    0      1      2      3      4      5      6      7      8      9
trial 0:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 1:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 2:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 3:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 4:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
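The tables above can also be checked mechanically. The sketch below builds the expected per-row abs-max from `lengths` and `max_length` and flags rows that deviate. Both helper names and the tolerance are assumptions made for illustration, and the observed values are copied from trial 0 of each table.

```python
def expected_abs_max(lengths, max_length, grad_value=1.0):
    """Per-row expected abs-max: grad_value for in-bounds rows, 0 for truncated rows."""
    expected = []
    for length in lengths:
        kept = min(length, max_length)
        expected += [grad_value] * kept + [0.0] * (length - kept)
    return expected


def mismatched_rows(observed, expected, tol=1e-4):
    """Indices where an observed abs-max differs from the expected value."""
    return [i for i, (o, e) in enumerate(zip(observed, expected)) if abs(o - e) > tol]


expected = expected_abs_max([10, 5, 8], max_length=4)[:10]

# Trial 0, rows 0..9, copied from the BEFORE and AFTER tables.
before_trial0 = [1.0, 1.0, 1.0, 1.0, 1.8152, 1.9762, 0.8584, 2.3934, 2.4721, 0.0]
after_trial0 = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(mismatched_rows(before_trial0, expected))
print(mismatched_rows(after_trial0, expected))
```

Row 9 of the BEFORE trial 0 happened to read 0.0000, so this check flags only rows 4..8 for that trial. A single run can under-report the problem, which is one reason to repeat the snippet several times as was done here.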

Reviewed By: q10

Differential Revision: D104184777

meta-codesync · 2026-05-09T18:19:14Z

@guanghou-ml has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104184777.


meta-codesync · 2026-05-27T18:05:46Z

This pull request has been merged in b25d23c.

meta-cla Bot added the cla signed label May 9, 2026

meta-codesync Bot added fb-exported meta-exported labels May 9, 2026

meta-codesync Bot changed the title ~~pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs~~ pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs (#5754) May 9, 2026

guanghou-ml force-pushed the export-D104184777 branch from 43bb6fe to 87c1279 Compare May 9, 2026 19:51

guanghou-ml force-pushed the export-D104184777 branch from 87c1279 to d277c44 Compare May 25, 2026 03:21

guanghou-ml force-pushed the export-D104184777 branch from d277c44 to 0d234b2 Compare May 26, 2026 02:40

meta-codesync Bot closed this in b25d23c May 27, 2026

facebook-github-tools Bot added the Merged label May 27, 2026

guanghou-ml wants to merge 1 commit into pytorch:main from guanghou-ml:export-D104184777
