Name the OOM-skip threshold and explain the 128*bHSS workspace observation

sudhakarsingh27 · sudhakarsingh27 · commit 1563b1056a9e · 2026-05-22T17:05:58.000-07:00
Address review nits on the deterministic THD-backward OOM guard:
1. Replace the magic number 1_000_000_000 with the named constant
   SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30, so the value is searchable
   and labeled.
2. Replace the prefatory comment with a short note tying the number to
   cuDNN's actual workspace request (~128 * bHSS bytes, measured on
   cuDNN 9.21.0 sm90; see local sweep). At bHSS = 1<<30 the request is
   128 GiB, which doesn't fit on H100's 80 GB.
3. Flag the b>=3 caveat for future readers: cuDNN rounds the batch up
   internally so workspace grows super-linearly past b=2 (b=4 asks for
   4x the b=2 workspace, not 2x). The current fused-essential matrix is
   all b=2, so the threshold stays correct for what the test exercises;
   the note is there so the next person doesn't have to rediscover it.
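
The arithmetic behind items 2 and 3 can be checked with a minimal sketch.
It assumes the ~128 bytes per unit of bHSS reported above and uses it only
for b=2, where the commit says the estimate is exact. The helper names and
the two example shapes are hypothetical and are not taken from the test
matrix.

```python
# Sketch of the workspace estimate from the commit message.
# Assumption: ~128 bytes per unit of b*H*S_q*S_kv (stated for cuDNN 9.21.0 on
# sm90, exact at b=2 with power-of-2 S). Not valid for b>=3, where the commit
# notes that workspace grows faster than this linear model.
BYTES_PER_BHSS = 128
SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30
H100_MEMORY_BYTES = 80 * 10**9  # nominal 80 GB


def bhss(batch_size, num_heads, max_seqlen_q, max_seqlen_kv):
    """Product that the skip condition in the test gates on."""
    return batch_size * num_heads * max_seqlen_q * max_seqlen_kv


def estimated_workspace_bytes(product):
    """Linear estimate of the cuDNN workspace request (hypothetical helper)."""
    return BYTES_PER_BHSS * product


def should_skip(batch_size, num_heads, max_seqlen_q, max_seqlen_kv):
    """Mirrors only the size part of the guard, not the other conditions."""
    product = bhss(batch_size, num_heads, max_seqlen_q, max_seqlen_kv)
    return product >= SM90_DET_FUSED_THD_BWD_MAX_BHSS


# At the threshold itself the estimate is 128 GiB, above an 80 GB device.
at_threshold = estimated_workspace_bytes(SM90_DET_FUSED_THD_BWD_MAX_BHSS)
print(at_threshold // 2**30, "GiB")
print(at_threshold > H100_MEMORY_BYTES)

# Hypothetical shapes: b=2, H=32, S=4096 lands exactly on 1 << 30.
print(should_skip(2, 32, 4096, 4096))
# Hypothetical shapes: b=2, H=12, S=4096 stays below the threshold.
print(should_skip(2, 12, 4096, 4096))
```

The new constant (1,073,741,824) is slightly larger than the old literal
(1,000,000,000), so the check above is only a sanity check on the numbers,
not a replacement for the measured sweep.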

Skip set is unchanged — cp_2_0, cp_2_1, cp_3_1, cp_4_2, cp_4_3.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
diff --git a/tests/pytorch/attention/test_attention_with_cp.py b/tests/pytorch/attention/test_attention_with_cp.py
@@ -639,15 +639,17 @@ def test_cp_with_fused_attention(
         pytest.skip("Deterministic mode does not support non-vanilla softmax with FusedAttention")
     if _deterministic and config.attn_bias_type == "post_scale_bias" and is_training:
         pytest.skip("Deterministic mode does not support post_scale_bias with requires_grad")
-    # Det FusedAttention backward with THD on sm90 OOMs because cuDNN reserves
-    # workspace proportional to b*H*S*S. Gate on that product, not num_heads,
-    # so the skip stays correct if a new config has small b/S but H >= 20.
+    # cuDNN det THD backward workspace on sm90 is ~128 * bHSS bytes; at 1<<30
+    # that's 128 GiB, won't fit on H100's 80 GB. Exact at b=2 + power-of-2 S;
+    # for b>=3 cuDNN rounds batch up internally so workspace grows super-linearly
+    # (e.g. b=4 wants 4x b=2's workspace, not 2x) — revisit if a config uses b>2.
+    SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30
     if (
         _deterministic
         and qkv_format == "thd"
         and get_device_compute_capability() == (9, 0)
         and config.batch_size * config.num_heads * config.max_seqlen_q * config.max_seqlen_kv
-        >= 1_000_000_000
+        >= SM90_DET_FUSED_THD_BWD_MAX_BHSS
     ):
         pytest.skip(
             "Deterministic FusedAttention backward with THD format OOMs on sm90"