diff --git a/examples/diffusers/sparsity/README.md b/examples/diffusers/sparsity/README.md
Lines changed: 8 additions & 8 deletions
diff --git a/examples/diffusers/sparsity/wan22_skip_softmax.py b/examples/diffusers/sparsity/wan22_skip_softmax.py
Lines changed: 22 additions & 19 deletions
diff --git a/examples/llm_sparsity/attention_sparsity/README.md b/examples/llm_sparsity/attention_sparsity/README.md
Lines changed: 36 additions & 7 deletions
diff --git a/examples/llm_sparsity/attention_sparsity/hf_sa.py b/examples/llm_sparsity/attention_sparsity/hf_sa.py
Lines changed: 33 additions & 1 deletion
diff --git a/examples/vllm_serve/README.md b/examples/vllm_serve/README.md
Lines changed: 22 additions & 0 deletions
@@ -18,8 +18,8 @@ tiles whose attention scores are negligible during the FlashAttention computatio
 reducing FLOPs without retraining.
 
 Two modes are supported:
-- **Fixed raw threshold** — pass a log2-space threshold directly to the Triton
-  kernel. No calibration needed. Good for quick testing and sweeps.
+- **Fixed threshold** — pass a BLASST lambda threshold directly. No calibration
+  needed. Good for quick testing and sweeps.
 - **Calibrated threshold** — an exponential model
   (`scale_factor = a * exp(b * target_sparsity)`) is calibrated once via the
   Triton calibration kernel, then the target sparsity can be adjusted at runtime
@@ -37,10 +37,10 @@ Two modes are supported:
 ## Quick Start
 
 ```bash
-# Fixed raw threshold (no calibration, fast)
+# Fixed threshold (no calibration, fast)
 python wan22_skip_softmax.py \
     --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
-    --raw-threshold -0.7 \
+    --skip-softmax-threshold 0.61557 \
     --prompt "A cat playing piano" --output out.mp4
 
 # With calibration
@@ -58,17 +58,17 @@ python wan22_skip_softmax.py \
 # Report runtime sparsity (per-layer tile skip ratios)
 python wan22_skip_softmax.py \
     --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
-    --raw-threshold -0.7 --report-avg-sparsity \
+    --skip-softmax-threshold 0.61557 --report-avg-sparsity \
     --prompt "A cat playing piano" --output out.mp4
 ```
 
 ## Threshold Modes
 
 | Mode | How threshold reaches the kernel | Use case |
 |------|----------------------------------|----------|
-| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly as `skip_threshold_log2` — no conversion | Quick testing, sweeps |
-| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then backend computes `threshold = scale_factor / seq_k`, then kernel converts `log2(threshold) * sm_scale` | Production use with automatic seqlen adaptation |
-| **Static lambda** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback when neither raw nor calibrated |
+| **Fixed threshold** (`--skip-softmax-threshold 0.61557`) | Kernel converts the lambda threshold with `log2(lambda)` | Quick testing, sweeps |
+| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then backend computes `threshold = scale_factor / seq_k`, then kernel converts `log2(threshold)` | Production use with automatic seqlen adaptation |
+| **Static lambda** (default `skip_softmax_threshold=0.1`) | Kernel converts `log2(lambda)` | Fallback when neither fixed nor calibrated |
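
Reviewer note: the old `--raw-threshold` values and the new `--skip-softmax-threshold` values in this diff appear to be related by `lambda = 2 ** raw_log2`. The sketch below only illustrates that arithmetic and the calibrated formula from the table. The function names are hypothetical, the `a` and `b` coefficients would come from calibration, and the keep/skip comparison in `tile_is_kept` is an assumption inferred from the CLI help text, not taken from the Triton kernel.

```python
import math


def log2_to_lambda(raw_log2: float) -> float:
    """Old --raw-threshold value (log2 space) -> equivalent lambda threshold."""
    return 2.0 ** raw_log2


def lambda_to_log2(lam: float) -> float:
    """Lambda threshold -> log2-space value, matching the log2(lambda) table entry."""
    if lam <= 0.0:
        raise ValueError("lambda threshold must be positive")
    return math.log2(lam)


def calibrated_lambda(a: float, b: float, target_sparsity: float, seq_k: int) -> float:
    """Calibrated mode: scale_factor = a * exp(b * target), threshold = scale_factor / seq_k."""
    scale_factor = a * math.exp(b * target_sparsity)
    return scale_factor / seq_k


def tile_is_kept(tile_max_log2: float, running_max_log2: float, lam: float) -> bool:
    """Assumed reading of the help text: a tile is kept when its peak score is
    within -log2(lambda) log2 units of the running max."""
    return running_max_log2 - tile_max_log2 <= -lambda_to_log2(lam)


print(round(log2_to_lambda(-0.7), 5))  # old README example value
print(log2_to_lambda(-5.0))            # old script docstring example value
```

Under this reading, a smaller lambda keeps more tiles (less sparsity), and dividing the calibrated scale factor by `seq_k` tightens the threshold as the key sequence grows.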
 
 ## Known Issues
 
 
@@ -21,8 +21,8 @@
 1. **Baseline** — pass ``--baseline`` for dense inference (default diffusers backend).
 2. **Triton baseline** — pass ``--triton-baseline`` for dense Triton FA kernel
    (no skip-softmax, same kernel as sparse runs for apples-to-apples comparison).
-3. **Fixed raw threshold** — pass ``--raw-threshold`` to supply a log2-space
-   threshold directly to the Triton kernel. No calibration data is needed.
+3. **Fixed skip-softmax threshold** — pass ``--skip-softmax-threshold`` to
+   supply the BLASST lambda threshold. No calibration data is needed.
 4. **Calibrated threshold** — pass ``--calibrate`` to run exponential-model
    calibration (``scale_factor = a * exp(b * target_sparsity)``).
 
@@ -40,8 +40,8 @@
     python wan22_skip_softmax.py --baseline --prompt "A cat playing piano" \\
         --output baseline.mp4
 
-    # Fixed raw threshold (no calibration needed)
-    python wan22_skip_softmax.py --raw-threshold -5.0 --report-avg-sparsity \\
+    # Fixed skip-softmax threshold (no calibration needed)
+    python wan22_skip_softmax.py --skip-softmax-threshold 0.03125 --report-avg-sparsity \\
         --prompt "A cat playing piano" --output out.mp4
 
     # With calibration
@@ -150,12 +150,12 @@ def parse_args() -> argparse.Namespace:
         "apples-to-apples comparison with sparse runs)",
     )
     parser.add_argument(
-        "--raw-threshold",
+        "--skip-softmax-threshold",
         type=float,
         default=None,
-        help="Raw skip_threshold_log2 value passed directly to the Triton kernel. "
-        "Negative values (e.g., -5.0 means tile must be within 5 units of running max). "
-        "Bypasses calibration and lambda conversion. Typical range: -1 to -30.",
+        help="Fixed BLASST lambda threshold passed as skip_softmax_threshold. "
+        "Example: 0.03125 keeps tiles within 5 log2-score units of the running max. "
+        "Bypasses calibration. Typical range: 1e-6 to 0.5.",
     )
     parser.add_argument(
         "--skip-first-last",
@@ -214,8 +214,8 @@ def build_sparse_config(args: argparse.Namespace, num_blocks: int) -> dict:
     """Build sparse attention config from CLI args.
 
     Two modes:
-    - **Raw threshold**: ``--raw-threshold`` sets ``skip_softmax_raw_threshold``
-      directly on the Triton kernel — no calibration needed.
+    - **Fixed threshold**: ``--skip-softmax-threshold`` sets
+      ``skip_softmax_threshold`` directly — no calibration needed.
     - **Calibrated**: ``--calibrate`` collects multi-threshold sparsity statistics
       via the Triton calibration kernel, then fits an exponential model:
       ``scale_factor = a * exp(b * sparsity)``.
@@ -229,9 +229,9 @@ def build_sparse_config(args: argparse.Namespace, num_blocks: int) -> dict:
         "enable": True,
     }
 
-    # Raw threshold bypasses calibration and lambda conversion
-    if args.raw_threshold is not None:
-        attn_cfg["skip_softmax_raw_threshold"] = args.raw_threshold
+    # Fixed threshold bypasses calibration.
+    if args.skip_softmax_threshold is not None:
+        attn_cfg["skip_softmax_threshold"] = args.skip_softmax_threshold
 
     sparse_cfg: dict = {
         "*.attn1*": attn_cfg,  # Self-attention only
@@ -246,8 +246,8 @@ def build_sparse_config(args: argparse.Namespace, num_blocks: int) -> dict:
 
     config: dict = {"sparse_cfg": sparse_cfg}
 
-    # Add calibration config only when calibrating (not with raw threshold)
-    if args.calibrate and args.raw_threshold is None:
+    # Add calibration config only when calibrating (not with a fixed threshold)
+    if args.calibrate and args.skip_softmax_threshold is None:
         sparse_cfg["calibration"] = {
             "target_sparse_ratio": {"prefill": args.target_sparsity},
             "threshold_trials": DEFAULT_THRESHOLD_TRIALS,
@@ -407,10 +407,13 @@ def main() -> None:
     else:
         # Build calibration forward loop if needed
         forward_loop = None
-        if args.raw_threshold is not None:
-            print(f"Using fixed raw threshold: {args.raw_threshold} (skipping calibration)")
+        if args.skip_softmax_threshold is not None:
+            print(
+                f"Using fixed skip-softmax threshold: {args.skip_softmax_threshold} "
+                "(skipping calibration)"
+            )
             if args.calibrate:
-                print("Warning: --calibrate is ignored when --raw-threshold is set")
+                print("Warning: --calibrate is ignored when --skip-softmax-threshold is set")
         elif args.calibrate:
             forward_loop = build_calibration_forward_loop(
                 pipe,
@@ -426,7 +429,7 @@ def main() -> None:
             )
         else:
             print(
-                "Warning: neither --baseline, --raw-threshold, nor --calibrate specified; "
+                "Warning: neither --baseline, --skip-softmax-threshold, nor --calibrate specified; "
                 "using default static threshold"
             )
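
Reviewer note: the precedence implied by the `main()` hunk above can be summarized in a small pure function. This is a sketch only. `select_threshold_mode` is a hypothetical helper that does not exist in the script, and the baseline branch is inferred from the warning text rather than shown in the hunk.

```python
from typing import Optional


def select_threshold_mode(
    baseline: bool,
    skip_softmax_threshold: Optional[float],
    calibrate: bool,
) -> str:
    """Return which path a run would take (hypothetical helper, not in the script)."""
    if baseline:
        return "baseline"
    if skip_softmax_threshold is not None:
        # A fixed threshold wins; --calibrate is ignored with a warning.
        return "fixed"
    if calibrate:
        return "calibrated"
    # Nothing selected: fall back to the default static threshold.
    return "static_default"


print(select_threshold_mode(False, 0.03125, True))
```

Making the two flags mutually exclusive at parse time would be an alternative to the warning, but the diff keeps the permissive behavior.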
 
 
@@ -58,7 +58,7 @@ model = mtsa.sparsify(model, config=SKIP_SOFTMAX_CALIB)
 
 ### N:M Sparse Softmax (SPARSE_SOFTMAX_DEFAULT)
 
-Applies N:M structured sparsity to attention scores using the Triton backend. For every M consecutive key positions, keeps only the top-N scores and sets the rest to -inf. Supports M=4 (N=1,2,3) and M=8 (N=1..7). Attention sinks and a local dense window can be configured to preserve important positions.
+Applies N:M structured sparsity to attention scores using the Triton backend. For every M consecutive key positions, keeps only the top-N scores and sets the rest to -inf. Supports M=4 (N=1,2,3) and M=8 (N=1..7). Attention sinks and a local recent-token window can be configured to preserve important positions.
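
Reviewer note: as a plain-Python reference for the N:M rule described above (not the Triton kernel), the sketch below keeps the top-N scores in every group of M consecutive keys and masks the rest to `-inf` before softmax. The function names are hypothetical, and the dense sink and recent-token exclusions are deliberately left out.

```python
import math


def nm_sparsify(scores, n=2, m=4):
    """Keep the n largest scores in each group of m consecutive keys; mask the rest."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    out = []
    for start in range(0, len(scores), m):
        group = scores[start:start + m]
        # Indices of the n largest scores in this group (ties keep the earlier key).
        order = sorted(range(len(group)), key=lambda i: group[i], reverse=True)
        keep = set(order[:n])
        out.extend(s if i in keep else -math.inf for i, s in enumerate(group))
    return out


def softmax(scores):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


row = [0.1, 2.0, -1.0, 0.5, 3.0, 3.5, 0.0, 1.0]
print(softmax(nm_sparsify(row, n=2, m=4)))
```

With 2:4 sparsity, exactly half of the key positions in each full group contribute to the softmax denominator.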
 
 ```python
 from modelopt.torch.sparsity.attention_sparsity.config import SPARSE_SOFTMAX_DEFAULT
@@ -81,8 +81,8 @@ sparse_cfg = {
             "method": "triton_sparse_softmax",
             "sparsity_n": 2,            # Keep top-2 of every 4
             "sparsity_m": 4,            # Group size
-            "num_sink_tokens": 4,       # Keep first 4 tokens dense (attention sinks)
-            "dense_window_size": 128,   # Keep tokens within distance 128 dense
+            "dense_sink_tokens": 4,       # Keep the first 4 tokens dense (excluded from N:M)
+            "dense_recent_tokens": 128,   # Keep the 128 most recent tokens dense (excluded from N:M)
             "backend": "triton",
             "enable": True,
         },
@@ -125,7 +125,7 @@ Apply sparse attention with a fixed threshold:
 ```bash
 python hf_sa.py \
     --pyt_ckpt_path Qwen/Qwen3-8B \
-    --sparse_attn skip_softmax
+    --sparse_attn sparse_softmax
 ```
 
 ### With RULER Calibration
@@ -144,15 +144,19 @@ The calibration process:
 2. Collects attention statistics during forward passes
 3. Determines optimal threshold scale factor for target sparsity ratio
 
+Set the target sparsity ratio in the selected sparse attention config, or override
+both prefill and decode targets from the example script with `--target_sparse_ratio`.
+
 ### Command Line Arguments
 
 | Argument | Default | Description |
 |----------|---------|-------------|
 | `--pyt_ckpt_path` | Required | HuggingFace model path or name |
-| `--sparse_attn` | `skip_softmax` | Configuration: `skip_softmax`, `skip_softmax_calib`, or `sparse_softmax` |
-| `--backend` | `pytorch` | Backend: `pytorch` (skip-softmax) or `triton` (N:M sparse softmax) |
+| `--sparse_attn` | `skip_softmax_calib` | Configuration: `skip_softmax_calib`, `sparse_softmax`, or `skip_softmax_calib_sparse24` |
+| `--backend` | selected config | Backend: `pytorch` (skip-softmax) or `triton` (N:M sparse softmax) |
 | `--seq_len` | `2048` | Maximum sequence length for input prompts |
 | `--export_dir` | `None` | Directory to export the sparsified model |
+| `--target_sparse_ratio` | selected config | Target sparsity ratio for skip-softmax calibration |
 
 ## Output Comparison
 
@@ -175,7 +179,27 @@ python hf_sa.py \
     --export_dir ./exported_sparse_model
 ```
 
-The exported model can be loaded and used with standard HuggingFace APIs.
+Export a 2:4 sparse-softmax checkpoint for vLLM restore:
+
+```bash
+python hf_sa.py \
+    --pyt_ckpt_path Qwen/Qwen3-8B \
+    --sparse_attn sparse_softmax \
+    --export_dir ./exported_sparse24_model
+```
+
+Export calibrated skip-softmax plus 2:4 sparse-softmax metadata for combined vLLM restore:
+
+```bash
+python hf_sa.py \
+    --pyt_ckpt_path Qwen/Qwen3-8B \
+    --sparse_attn skip_softmax_calib_sparse24 \
+    --export_dir ./exported_skip_sparse24_model
+```
+
+The exported checkpoint writes `sparse_attention_config` into `config.json`. For combined
+export, the skip-softmax calibration and 2:4 sparse-softmax metadata are defined in the
+selected config rather than CLI overrides.
 
 ## Custom Configuration
 
@@ -198,6 +222,11 @@ custom_config = {
             "bc": 128,          # Flash Attention block columns
             "backend": "pytorch",
             "collect_stats": True,
+            "sparsity_n": 2,              # Export top-2 of every 4 for vLLM restore
+            "sparsity_m": 4,
+            "dense_sink_tokens": 0,
+            "dense_recent_tokens": 64,
+            "export_sparse_softmax": True,
             "enable": True,
         },
         "default": {"enable": False},
 
@@ -28,7 +28,11 @@
 import modelopt.torch.opt as mto
 import modelopt.torch.sparsity.attention_sparsity as mtsa
 from modelopt.torch.export import export_hf_checkpoint
-from modelopt.torch.sparsity.attention_sparsity.config import SKIP_SOFTMAX_CALIB
+from modelopt.torch.sparsity.attention_sparsity.config import (
+    SKIP_SOFTMAX_CALIB,
+    SKIP_SOFTMAX_CALIB_SPARSE24,
+    SPARSE_SOFTMAX_DEFAULT,
+)
 from modelopt.torch.utils.memory_monitor import launch_memory_monitor
 
 RAND_SEED = 1234
@@ -39,6 +43,8 @@
 # Sparse attention configuration choices
 SPARSE_ATTN_CFG_CHOICES = {
     "skip_softmax_calib": SKIP_SOFTMAX_CALIB,
+    "skip_softmax_calib_sparse24": SKIP_SOFTMAX_CALIB_SPARSE24,
+    "sparse_softmax": SPARSE_SOFTMAX_DEFAULT,
 }
 
 
@@ -172,6 +178,14 @@ def main(args):
             "prefill": args.target_sparse_ratio,
             "decode": args.target_sparse_ratio,
         }
+    calib = sparse_cfg.get("calibration")
+    if isinstance(calib, dict):
+        if args.calib_samples is not None:
+            calib["samples"] = args.calib_samples
+        if args.calib_max_seqlen is not None:
+            calib["max_seqlen"] = args.calib_max_seqlen
+        if args.calib_chunk_size is not None:
+            calib["chunk_size"] = args.calib_chunk_size
 
     model = mtsa.sparsify(model, config=sparse_config)
     print("Sparse attention applied successfully!")
@@ -270,6 +284,24 @@ def main(args):
         default=None,
         help="Target sparsity ratio for calibration (0.0 to 1.0). Overrides config value.",
     )
+    parser.add_argument(
+        "--calib_samples",
+        type=int,
+        default=None,
+        help="Number of RULER samples for calibration. Overrides config value.",
+    )
+    parser.add_argument(
+        "--calib_max_seqlen",
+        type=int,
+        default=None,
+        help="Maximum sequence length for calibration. Overrides config value.",
+    )
+    parser.add_argument(
+        "--calib_chunk_size",
+        type=int,
+        default=None,
+        help="Chunk size for calibration prefill. Overrides config value.",
+    )
 
     args = parser.parse_args()
     main(args)
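
Reviewer note: the new `--calib_*` flags follow an "override only when set" pattern. The self-contained sketch below reproduces that pattern outside `hf_sa.py`. `build_parser` and `apply_calib_overrides` are hypothetical helpers, and the default calibration values in the usage lines are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the three optional calibration overrides from the diff."""
    parser = argparse.ArgumentParser()
    for flag in ("--calib_samples", "--calib_max_seqlen", "--calib_chunk_size"):
        parser.add_argument(flag, type=int, default=None)
    return parser


def apply_calib_overrides(calib: dict, args: argparse.Namespace) -> dict:
    """Copy each CLI value into the calibration dict only if the flag was given."""
    overrides = {
        "samples": args.calib_samples,
        "max_seqlen": args.calib_max_seqlen,
        "chunk_size": args.calib_chunk_size,
    }
    calib.update({key: value for key, value in overrides.items() if value is not None})
    return calib


# Illustrative defaults only; the real values live in the selected config.
calib = {"samples": 24, "max_seqlen": 8192}
args = build_parser().parse_args(["--calib_samples", "8"])
print(apply_calib_overrides(calib, args))
```

Using `None` as the default lets the script tell an omitted flag apart from an explicit value, so config defaults survive unless the user overrides them.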
@@ -95,6 +95,28 @@ MODELOPT_STATE_PATH=<vllm_fq_modelopt_state.pth> python vllm_serve_fakequant.py
 QUANT_CFG=<quant_cfg> QUANT_FILE_PATH=<quantizer_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
 ```
 
+## Serve a model with sparse attention in vLLM
+
+Apply ModelOpt sparse attention at serve time. The launcher replaces vLLM's `FlashAttentionImpl` with `ModelOptSparseAttentionImpl` (Triton kernel with paged KV cache support) on every attention layer right after model load.
+
+The configuration is read from the checkpoint's `config.json` `sparse_attention_config` block, written by ModelOpt's HF export. The launcher restores calibrated skip-softmax metadata and N:M sparse-softmax metadata (`sparsity_n`, `sparsity_m`, `dense_sink_tokens`, `dense_recent_tokens`). Checkpoints exported with both metadata entries use ModelOpt Triton for sparse prefill launches; decode-only launches and launches without active sparse work delegate back to vLLM FlashAttention.
+
+Workflow:
+
+1. Calibrate and export the model with `examples/llm_sparsity/attention_sparsity/hf_sa.py`. This writes `sparse_attention_config` into the exported checkpoint's `config.json`.
+2. Serve the exported checkpoint with `--enforce-eager` (CUDA graph capture is not yet validated with the sparse attention kernel — see Known Problems):
+
+   ```bash
+   python vllm_serve_sparse_attn.py <EXPORT_DIR> --enforce-eager -tp 8 --host 0.0.0.0 --port 8000
+   ```
+
+If the checkpoint has no `sparse_attention_config`, the worker logs a message and passes through — vLLM runs unchanged. Quant-only flows are handled by `vllm_serve_fakequant.py`; combined sparse + quant will land in a follow-up PR.
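
Reviewer note: before serving, it can help to confirm that an exported checkpoint actually carries the block. The sketch below reads `config.json` and returns the `sparse_attention_config` entry if present. `read_sparse_attention_config` is a hypothetical helper, not part of the launcher, and it makes no assumption about the fields inside the block.

```python
import json
from pathlib import Path
from typing import Optional


def read_sparse_attention_config(export_dir: str) -> Optional[dict]:
    """Return the sparse_attention_config block from config.json, or None if absent."""
    config_path = Path(export_dir) / "config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("sparse_attention_config")
```

A `None` result corresponds to the pass-through case described above, where vLLM runs unchanged.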
+
+Notes and limitations:
+
+- vLLM V1 chunked prefill and prefix-cache suffix attention are supported by offsetting query positions into the longer KV span.
+- CUDA graph capture is not validated yet — use `--enforce-eager`.
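
Reviewer note: "offsetting query positions into the longer KV span" can be pictured with a causal mask where the query block is treated as the suffix of the key sequence. The sketch below is a generic illustration of that idea, assuming the queries are the last `q_len` positions of a `kv_len`-long span. It is not the ModelOpt Triton kernel, and `causal_mask_with_offset` is a hypothetical name.

```python
def causal_mask_with_offset(q_len: int, kv_len: int):
    """Boolean mask[q][k]: True where query q may attend to key k.

    Query q sits at absolute position (kv_len - q_len + q), so it sees every
    cached key plus the keys of its own chunk up to and including itself.
    """
    if q_len > kv_len:
        raise ValueError("q_len must not exceed kv_len")
    offset = kv_len - q_len
    return [[k <= offset + q for k in range(kv_len)] for q in range(q_len)]


for row in causal_mask_with_offset(q_len=2, kv_len=5):
    print(row)
```

When `q_len == kv_len` the offset is zero and the mask reduces to the usual lower-triangular causal mask.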
+
 ## Known Problems
 
 1. **MCore reload does not use `MODELOPT_STATE_PATH`**; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won’t align).