
Commit 8996ef1

yeyu-nvidia and claude committed
Fix skip-softmax threshold formula: remove erroneous * sm_scale factor
The BLASST criterion (https://arxiv.org/pdf/2512.12087) checks ln(lambda) on the sm_scale-SCALED attention logits a_ij = q·k/sqrt(d). The Triton kernel stores scores as x = a * log2(e), so the correct threshold in kernel (log2) space is log2(lambda), not log2(lambda) * sm_scale.

The previous code multiplied by sm_scale (~0.088 for head_dim=128), making every threshold ~11× too aggressive. With lambda=0.1 the kernel-space threshold was -0.29 instead of the correct -3.32, skipping most attention tiles and producing garbage output (PSNR ~11 dB). Even lambda=0.0001 was still too aggressive (-1.18 vs the correct -13.29).

Fix: use `log2(lambda)` directly as SKIP_THRESHOLD_LOG2, and restore the default threshold to 0.1 (the standard BLASST value).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
1 parent 3cb983c commit 8996ef1

2 files changed

Lines changed: 18 additions & 7 deletions
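As a sanity check (not part of the diff), the thresholds quoted in the commit message can be reproduced in a few lines of Python; head_dim=128 and sm_scale = 1/sqrt(head_dim) are taken from the message:

```python
import math

# Reproduce the kernel-space thresholds quoted in the commit message.
head_dim = 128
sm_scale = 1.0 / math.sqrt(head_dim)  # ~0.0884 for head_dim=128

for lam in (0.1, 0.0001):
    correct = math.log2(lam)            # threshold in kernel (log2) space
    buggy = math.log2(lam) * sm_scale   # what the old code computed
    print(f"lambda={lam}: correct={correct:.2f}, buggy={buggy:.2f}")
# For lambda=0.1: correct=-3.32, buggy=-0.29 — the "11x too aggressive"
# factor is exactly 1/sm_scale = sqrt(128) ≈ 11.3.
```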


examples/diffusers/quantization/wan2_sage_attention.py

Lines changed: 1 addition & 1 deletion
@@ -450,7 +450,7 @@ def attention_kernel_ctx(kernel: str = KERNEL_FP8):
         }
     }

-_TRITON_SKIP_DEFAULT_THRESHOLD = 0.01
+_TRITON_SKIP_DEFAULT_THRESHOLD = 0.1

 _TRITON_SKIP_CONFIG = {
     "sparse_cfg": {

modelopt/torch/kernels/triton_fa.py

Lines changed: 17 additions & 6 deletions
@@ -996,14 +996,25 @@ def forward(
     # Triton tiles must be powers of 2; pad head dim
     BLOCK_D = triton.next_power_of_2(HEAD_DIM)

-    # Skip-softmax: convert threshold to scaled log2 space for the kernel.
-    # The BLASST reference (https://arxiv.org/pdf/2512.12087) checks
-    # ln(lambda) on unscaled scores. Our kernel works in log2-scaled space
-    # (scores pre-multiplied by qk_scale = sm_scale * LOG2E), so we
-    # pre-scale: threshold_scaled = log2(lambda) * sm_scale.
+    # Skip-softmax: convert lambda threshold to log2 space for the kernel.
+    #
+    # BLASST (https://arxiv.org/pdf/2512.12087) checks the criterion on the
+    # sm_scale-SCALED attention logits a_ij = q·k / sqrt(d):
+    #
+    #     tile_max_a < running_max_a + ln(lambda)
+    #
+    # The Triton kernel stores scores as x = a * log2(e) (for exp2 efficiency),
+    # so a = x * ln(2). Substituting:
+    #
+    #     tile_max_x * ln(2) < running_max_x * ln(2) + ln(lambda)
+    #     tile_max_x < running_max_x + log2(lambda)
+    #
+    # Therefore the threshold in kernel (log2) space is simply log2(lambda).
+    # Do NOT multiply by sm_scale — that factor is already absorbed into the
+    # log2(e) conversion above.
     apply_skip = skip_softmax_threshold is not None and skip_softmax_threshold > 0.0
     if apply_skip:
-        skip_threshold_log2 = math.log2(skip_softmax_threshold) * sm_scale
+        skip_threshold_log2 = math.log2(skip_softmax_threshold)
     else:
         skip_threshold_log2 = 0.0
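The substitution in the new comment can also be checked numerically. A minimal sketch (hypothetical maxima, not kernel code): the BLASST criterion on the scaled logits in ln space and the same criterion on the kernel's log2(e)-scaled scores with log2(lambda) yield identical skip decisions:

```python
import math
import random

LOG2E = math.log2(math.e)
lam = 0.1  # the default BLASST threshold restored by this commit

random.seed(0)
for _ in range(1000):
    # Hypothetical scaled-logit maxima a (as in the BLASST criterion).
    tile_max_a = random.uniform(-30.0, 5.0)
    running_max_a = random.uniform(-30.0, 5.0)

    # Criterion on the scaled logits (ln space).
    skip_ln = tile_max_a < running_max_a + math.log(lam)

    # Same criterion on the kernel's scores x = a * log2(e) (log2 space).
    tile_max_x = tile_max_a * LOG2E
    running_max_x = running_max_a * LOG2E
    skip_log2 = tile_max_x < running_max_x + math.log2(lam)

    assert skip_ln == skip_log2  # decisions agree; no sm_scale factor needed
```

Multiplying the log2-space threshold by sm_scale, as the old code did, breaks this equivalence for any sm_scale != 1.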

0 commit comments
