pack_segments backward CUDA: zero-init gradient buffer to fix uninit-memory NaNs (#5754)

Guang Hou · meta-codesync[bot] · commit b25d23c3e1cc · 2026-05-27T11:02:27.000-07:00
Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2685

Pull Request resolved: #5754

## TL;DR

`pack_segments_backward_cuda` allocates the input-gradient tensor with `at::empty(...)`. When any `lengths[seq] > max_length`, the unpack kernel never writes the tail rows of that segment, leaving them as **uninitialized memory** that propagates into upstream gradients and causes NaN cascades. Switch the allocator to `at::zeros`.

## Bug

The unpack kernel writes only positions `cumsum[seq] + cell` for `cell < min(lengths[seq], max_length)`. When `lengths[seq] > max_length`, positions `[cumsum[seq] + max_length, cumsum[seq] + lengths[seq])` are **never written** and retain whatever was in the freshly-allocated buffer.

These rows correspond to events that the forward pass truncated, so they MUST receive zero gradient. With `at::empty` they instead receive garbage. The garbage flows upstream and triggers NaN/Inf cascades in deep networks. For example, LayerNorm backward amplifies random O(1) magnitude values via `1/sqrt(var+eps)` into Inf/NaN within a few hundred steps.

## Fix

`at::empty(shape, ...)` → `at::zeros(shape, ...)` for the output tensor. One-line change. The added cost is one device-side memset over the gradient buffer per backward call, which is negligible relative to the unpack kernel and downstream backward work.

## Repro

```
import torch
import fbgemm_gpu  # noqa: F401  (assumed: registers torch.ops.fbgemm.* in OSS builds)

lengths = torch.tensor([10, 5, 8], dtype=torch.int32, device="cuda")
t_in = torch.randn(23, 8, device="cuda", requires_grad=True)
out = torch.ops.fbgemm.pack_segments(t_in, lengths, max_length=4)
out.backward(torch.ones_like(out))
# Print abs-max of rows 0..9 (seq 0 has length 10 > max_length=4,
# so rows 4..9 are the truncated tail).
print(t_in.grad[:10].abs().amax(dim=1))
```

Real values captured by running the snippet 5 times. Each row shows abs-max across the cell dimension. `lengths[0] = 10`, `max_length = 4`, so rows 0..3 are in-bounds (expected `≈1`) and rows 4..9 are truncated (expected `0`).

**BEFORE fix (`at::empty`)**: rows 4..9 vary wildly across trials, confirming uninitialized memory:

```
row:          0      1      2      3      4      5      6      7      8      9
trial 0: 1.0000 1.0000 1.0000 1.0000 1.8152 1.9762 0.8584 2.3934 2.4721 0.0000
trial 1: 1.0000 1.0000 1.0000 1.0000 2.2231 1.6936 1.8451 1.9498 1.6774 0.5991
trial 2: 1.0000 1.0000 1.0000 1.0000 1.7331 1.6970 1.5790 1.6874 2.4351 1.9974
trial 3: 1.0000 1.0000 1.0000 1.0000 1.1584 2.8627 1.8524 3.2550 1.2574 1.0000
trial 4: 1.0000 1.0000 1.0000 1.0000 2.0911 1.8118 1.6238 1.3304 1.3858 1.6397
```

**AFTER fix (`at::zeros`)**: rows 4..9 are exactly 0 and identical across trials:

```
row:          0      1      2      3      4      5      6      7      8      9
trial 0: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 1: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 2: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 3: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
trial 4: 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
```

Reviewed By: q10

Differential Revision: D104184777

fbshipit-source-id: 848b007ed535b884256d50af3095ebc5c7181028
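
The write pattern described in the Bug section can be emulated on the CPU without FBGEMM. The following is a minimal pure-Python sketch, not the CUDA kernel itself: it models one scalar per row, assumes an all-ones upstream gradient, and uses NaN as a stand-in for whatever `at::empty` happens to return. The names `rows_written`, `emulate_backward`, and `SENTINEL` are hypothetical and exist only for illustration.

```python
import math
from itertools import accumulate

# Stand-in for the arbitrary contents of an at::empty buffer (assumption:
# real uninitialized memory can hold any bit pattern, not only NaN).
SENTINEL = float("nan")


def rows_written(lengths, max_length):
    """Rows the unpack step writes: cumsum[seq] + cell, cell < min(len, max_length)."""
    offsets = [0, *accumulate(lengths)]
    return [
        offsets[seq] + cell
        for seq, length in enumerate(lengths)
        for cell in range(min(length, max_length))
    ]


def emulate_backward(lengths, max_length, zero_init):
    """Return one gradient value per input row for an all-ones upstream gradient."""
    fill = 0.0 if zero_init else SENTINEL
    grad = [fill] * sum(lengths)
    for row in rows_written(lengths, max_length):
        grad[row] = 1.0  # only in-bounds rows are ever touched
    return grad


if __name__ == "__main__":
    lengths, max_length = [10, 5, 8], 4
    stale = emulate_backward(lengths, max_length, zero_init=False)
    never_written = [row for row, value in enumerate(stale) if math.isnan(value)]
    print("rows never written:", never_written)
    print("zero-init result:  ", emulate_backward(lengths, max_length, zero_init=True))
```

With the repro's `lengths = [10, 5, 8]` and `max_length = 4`, the rows left untouched are the tails of every over-long segment, not only rows 4..9 of segment 0. Zero-initializing the buffer makes those rows well defined regardless of what the allocator returns.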
diff --git a/fbgemm_gpu/src/sparse_ops/sparse_pack_segments_backward.cu b/fbgemm_gpu/src/sparse_ops/sparse_pack_segments_backward.cu
@@ -70,11 +70,18 @@ DLL_PUBLIC Tensor pack_segments_backward_cuda(
   AT_DISPATCH_INDEX_TYPES(lengths_c.scalar_type(), "unpack_segments_cuda", [&] {
     const auto* const lengths_data = lengths_c.const_data_ptr<index_t>();
 
-    // Create output tensor of appropriate dimensions
+    // Create output tensor of appropriate dimensions.
+    // Use at::zeros (not at::empty): when lengths[seq] > max_length, the
+    // unpack kernel only writes positions cumsum[seq]+cell for cell<max_length,
+    // leaving positions [cumsum[seq]+max_length, cumsum[seq]+lengths[seq])
+    // uninitialized. Those rows correspond to events that were truncated by
+    // forward pack_segments and so MUST receive zero gradient. With at::empty
+    // they would receive uninitialized memory, corrupting upstream gradients
+    // and causing NaN cascades in deep networks.
     auto shape = data_contig->sizes().vec();
     shape.erase(shape.begin());
     shape[0] = total_length;
-    unpacked_tensor = at::empty(shape, data_contig->options());
+    unpacked_tensor = at::zeros(shape, data_contig->options());
 
     if (!(data_contig->size(0) &&
           data_contig->size(1))) { // TODO: What does this mean?
diff --git a/fbgemm_gpu/test/sparse/pack_segments_test.py b/fbgemm_gpu/test/sparse/pack_segments_test.py
@@ -553,6 +553,77 @@ def test_pack_segments_noncontig(
             msg="Expected input gradients to be equal but they are not",
         )
 
+    @unittest.skipIf(*gpu_unavailable)
+    @given(
+        dtype=st.sampled_from(
+            [
+                torch.float,
+                torch.half,
+                torch.bfloat16,
+            ]
+        ),
+    )
+    @settings(deadline=None)
+    def test_pack_segments_backward_truncated(self, dtype: torch.dtype) -> None:
+        """
+        Regression test: when lengths[seq] > max_length, the backward kernel
+        previously left positions [cumsum[seq]+max_length, cumsum[seq]+lengths[seq])
+        in the input gradient as uninitialized memory (allocated via at::empty).
+
+        After the fix (at::empty -> at::zeros), those positions must be exactly 0
+        because they correspond to events that were truncated by the forward pass
+        and so cannot influence the loss.
+
+        Without the fix, these positions contain garbage, which propagates upstream
+        and can cause NaN cascades in deep networks (LayerNorm backward amplification).
+        """
+        # Choose lengths intentionally larger than max_length for some segments
+        max_length = 4
+        lengths_cpu = torch.tensor([10, 5, 8, 2], dtype=torch.int)
+        total_length = int(lengths_cpu.sum().item())
+        cell_size = 8
+
+        # Run multiple trials to detect uninitialized memory:
+        # if positions are uninit, values change across trials.
+        observed_grads = []
+        for _ in range(5):
+            input_data = (
+                torch.randn(total_length, cell_size, dtype=dtype)
+                .cuda()  # noqa: CITRINE(redundant_cuda_to_device)
+                .requires_grad_(True)
+            )
+            lengths = lengths_cpu.cuda()
+
+            packed = torch.ops.fbgemm.pack_segments(
+                t_in=input_data, lengths=lengths, max_length=max_length
+            )
+            grad_out = torch.ones_like(packed)
+            packed.backward(grad_out)
+
+            # pyre-ignore[16]
+            observed_grads.append(input_data.grad.detach().cpu().clone())
+
+        # Verify: positions where cell < min(lengths[seq], max_length) get grad=1
+        # positions where cell >= max_length but cell < lengths[seq] get grad=0
+        cumsum = 0
+        for seq, L in enumerate(lengths_cpu.tolist()):
+            for cell in range(L):
+                row = cumsum + cell
+                expected = 1.0 if cell < max_length else 0.0
+                for trial, grad in enumerate(observed_grads):
+                    actual = grad[row].abs().max().item()
+                    self.assertAlmostEqual(
+                        actual,
+                        expected,
+                        places=2,
+                        msg=(
+                            f"trial={trial} seq={seq} cell={cell} row={row}: "
+                            f"expected grad abs.max={expected}, got {actual}. "
+                            "Truncated rows must receive zero gradient (not uninit memory)."
+                        ),
+                    )
+            cumsum += L
+
 
 extend_test_class(PackedSegmentsTest)
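
The verification loop in `test_pack_segments_backward_truncated` amounts to building a per-row expected gradient: 1 for rows the forward pass kept, 0 for rows it truncated. A minimal sketch of that expectation in pure Python follows; the helper name `expected_row_grads` is hypothetical and does not appear in the test file.

```python
def expected_row_grads(lengths, max_length):
    """Expected abs-max gradient per input row for an all-ones upstream gradient."""
    expected = []
    for length in lengths:
        kept = min(length, max_length)
        expected.extend([1.0] * kept)             # rows copied into the packed output
        expected.extend([0.0] * (length - kept))  # rows truncated by the forward pass
    return expected


if __name__ == "__main__":
    # Same configuration as the regression test.
    grads = expected_row_grads([10, 5, 8, 2], max_length=4)
    print(len(grads), "rows,", int(sum(grads)), "with gradient 1")
```

For the test's `lengths = [10, 5, 8, 2]`, the last segment is shorter than `max_length`, so it contributes no truncated rows. The test therefore covers both the truncated and the non-truncated case in one run.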