pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs (#5754) #5754

Closed
guanghou-ml wants to merge 1 commit into pytorch:main from guanghou-ml:export-D104184777
Conversation

@guanghou-ml guanghou-ml commented May 9, 2026

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2685

TL;DR

pack_segments_backward_cuda allocates the input-gradient tensor with at::empty(...). When any lengths[seq] > max_length, the unpack kernel never writes the tail rows of that segment, leaving them as uninitialized memory that propagates into upstream gradients and causes NaN cascades. Switch the allocator to at::zeros.

Bug

The unpack kernel writes only positions cumsum[seq] + cell for cell < min(lengths[seq], max_length). When lengths[seq] > max_length, positions [cumsum[seq] + max_length, cumsum[seq] + lengths[seq]) are never written and retain whatever was in the freshly-allocated buffer.
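
To make the unwritten range concrete, here is a minimal pure-Python sketch of the index arithmetic described above. It is not the CUDA kernel: `written_rows` is a hypothetical helper name, and `cumsum` is assumed to be the exclusive prefix sum of `lengths`, which is consistent with the formula above.

```python
from itertools import accumulate


def written_rows(lengths, max_length):
    """Return (written, unwritten) row indices of the input-gradient buffer."""
    # Exclusive prefix sum: offsets[seq] is the first row of segment seq.
    offsets = [0] + list(accumulate(lengths))[:-1]
    written = set()
    for seq, length in enumerate(lengths):
        # Only the first min(length, max_length) cells of a segment are written.
        for cell in range(min(length, max_length)):
            written.add(offsets[seq] + cell)
    total = sum(lengths)
    unwritten = [row for row in range(total) if row not in written]
    return sorted(written), unwritten


written, unwritten = written_rows([10, 5, 8], max_length=4)
print(unwritten)
```

For the repro's `lengths = [10, 5, 8]` and `max_length = 4`, the unwritten rows are 4..9, 14, and 19..22. The tables below show only rows 0..9.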

These rows correspond to events that the forward pass truncated, so they MUST receive zero gradient. With at::empty they instead receive garbage. The garbage flows upstream and triggers NaN/Inf cascades in deep networks — for example, LayerNorm backward amplifies random O(1) magnitude values via 1/sqrt(var+eps) into Inf/NaN within a few hundred steps.
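
As a rough illustration of the amplification factor mentioned above (not a reproduction of the training failure), the sketch below evaluates `1/sqrt(var+eps)` for a few variances. The value `eps = 1e-5` is an assumption (it matches the default of `torch.nn.LayerNorm`), the variances are made up for illustration, and `rstd` is a hypothetical helper name.

```python
import math


def rstd(var, eps=1e-5):
    """Reciprocal standard deviation, the factor 1/sqrt(var + eps)."""
    return 1.0 / math.sqrt(var + eps)


# The smaller the variance, the larger the factor applied to incoming values.
for var in (1.0, 1e-4, 0.0):
    print(f"var={var:g}  rstd={rstd(var):.1f}")
```

With this `eps`, the factor is capped at roughly 316 when the variance is zero, so a single pass does not overflow by itself. The cascade described above comes from garbage values being fed through such factors repeatedly over many steps.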

Fix

at::empty(shape, ...) → at::zeros(shape, ...) for the output tensor. One-line change. The added cost is one device-side memset over the gradient buffer per backward call, which is negligible relative to the unpack kernel and downstream backward work.
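
The effect of the initial buffer contents can be sketched with a small CPU reference of the backward scatter. This is an illustrative emulation under stated assumptions, not the FBGEMM implementation: `pack_segments_backward_ref` and its `fill` parameter are hypothetical, and the packed gradient is assumed to have shape `[num_segments, max_length, width]`.

```python
def pack_segments_backward_ref(grad_packed, lengths, max_length, fill=0.0):
    """Scatter packed gradients back to the unpacked [sum(lengths), width] layout.

    `fill` is the value the output buffer starts with: 0.0 mirrors at::zeros,
    any other value stands in for stale memory left behind by at::empty.
    """
    total = sum(lengths)
    width = len(grad_packed[0][0]) if grad_packed and grad_packed[0] else 0
    grad_in = [[fill] * width for _ in range(total)]
    offset = 0
    for seq, length in enumerate(lengths):
        # Rows past max_length in a segment are never touched by this loop.
        for cell in range(min(length, max_length)):
            grad_in[offset + cell] = list(grad_packed[seq][cell])
        offset += length
    return grad_in


lengths, max_length, width = [10, 5, 8], 4, 8
ones = [[[1.0] * width for _ in range(max_length)] for _ in lengths]
grad = pack_segments_backward_ref(ones, lengths, max_length)
row_absmax = [max(abs(v) for v in row) for row in grad[:10]]
print(row_absmax)
```

With the zero fill, rows 4..9 come out as exactly 0. Passing a nonzero `fill` leaves that value in the same rows, which is the behavior the one-line change removes.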

Repro

import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm.* operators)

lengths = torch.tensor([10, 5, 8], dtype=torch.int32, device="cuda")
t_in = torch.randn(23, 8, device="cuda", requires_grad=True)
out = torch.ops.fbgemm.pack_segments(t_in, lengths, max_length=4)
out.backward(torch.ones_like(out))
# Print abs-max of rows 0..9 (seq 0 has length 10 > max_length=4,
# so rows 4..9 are the truncated tail).
print(t_in.grad[:10].abs().amax(dim=1))

Real values captured by running the snippet 5 times. Each row shows abs-max across the cell dimension. lengths[0] = 10, max_length = 4, so rows 0..3 are in-bounds (expected ≈1) and rows 4..9 are truncated (expected 0).

BEFORE fix (at::empty) — rows 4..9 vary wildly across trials, confirming uninitialized memory:

        row:    0      1      2      3      4      5      6      7      8      9
trial 0:    1.0000 1.0000 1.0000 1.0000 1.8152 1.9762 0.8584 2.3934 2.4721 0.0000
trial 1:    1.0000 1.0000 1.0000 1.0000 2.2231 1.6936 1.8451 1.9498 1.6774 0.5991
trial 2:    1.0000 1.0000 1.0000 1.0000 1.7331 1.6970 1.5790 1.6874 2.4351 1.9974
trial 3:    1.0000 1.0000 1.0000 1.0000 1.1584 2.8627 1.8524 3.2550 1.2574 1.0000
trial 4:    1.0000 1.0000 1.0000 1.0000 2.0911 1.8118 1.6238 1.3304 1.3858 1.6397

AFTER fix (at::zeros) — rows 4..9 are exactly 0 and identical across trials:

        row:    0      1      2      3      4      5      6      7      8      9
trial 0:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 1:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 2:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 3:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 4:    1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

Reviewed By: q10

Differential Revision: D104184777

@meta-cla meta-cla Bot added the cla signed label May 9, 2026
Contributor

meta-codesync Bot commented May 9, 2026

@guanghou-ml has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104184777.

@meta-codesync meta-codesync Bot changed the title pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs (#5754) May 9, 2026
guanghou-ml pushed a commit to guanghou-ml/FBGEMM that referenced this pull request May 9, 2026
@guanghou-ml guanghou-ml force-pushed the export-D104184777 branch from 43bb6fe to 87c1279 Compare May 9, 2026 19:51
guanghou-ml pushed a commit to guanghou-ml/FBGEMM that referenced this pull request May 25, 2026
Contributor

meta-codesync Bot commented May 27, 2026

This pull request has been merged in b25d23c.
