[4/n] Add vLLM integration for modelopt sparse attention (#1127)
### What does this PR do?
Type of change: New feature, new example, new tests, documentation.
Adds vLLM integration for ModelOpt sparse attention with paged KV cache
support.
This PR extends the ModelOpt Triton flash attention path so K/V can be
read directly from vLLM's paged KV cache through `block_table` lookup.
This avoids gather-to-contiguous copies when serving exported
sparse-attention checkpoints with vLLM.
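As a rough illustration of the `block_table` lookup described above, the following pure-Python sketch resolves a logical token position to a physical cache slot and reads keys in place. The block size, the list-based cache layout, and the function names are assumptions made for illustration; they are not the ModelOpt Triton kernel or vLLM's internal cache layout.

```python
def paged_slot(block_table, block_size, token_pos):
    """Map a logical token position to (physical_block, offset_in_block).

    block_table: physical block ids for one sequence, in logical order.
    """
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset


def read_keys(kv_cache, block_table, block_size, seq_len):
    """Yield keys for one sequence directly from the paged cache.

    kv_cache: list of blocks, each a list of `block_size` key entries.
    Entries are read where they live instead of being gathered into a
    contiguous buffer first.
    """
    for pos in range(seq_len):
        block, offset = paged_slot(block_table, block_size, pos)
        yield kv_cache[block][offset]


# Hypothetical cache: 4 physical blocks with 2 slots each.
kv_cache = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"], ["d0", "d1"]]
block_table = [2, 0, 3]  # logical block 0 lives in physical block 2, and so on
keys = list(read_keys(kv_cache, block_table, block_size=2, seq_len=5))
```

The point of the indirection is that the sequence's blocks can be scattered anywhere in the cache while the reader still sees them in logical order.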
The vLLM integration swaps vLLM's `FlashAttentionImpl` with
`ModelOptSparseAttentionImpl` after model load. The sparse configuration
is read from the exported checkpoint's `config.json`
`sparse_attention_config` block, written by
`examples/llm_sparsity/attention_sparsity/hf_sa.py`.
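A minimal sketch of reading that block from an exported checkpoint is shown below. Only the file name `config.json` and the key `sparse_attention_config` come from this PR; the helper name and the flat layout of the example dictionary are assumptions, and the real launcher may parse and validate the block differently.

```python
import json
import tempfile
from pathlib import Path


def load_sparse_attention_config(checkpoint_dir):
    """Return the `sparse_attention_config` dict from config.json, or None."""
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    # A missing block means the checkpoint carries no sparse metadata.
    return config.get("sparse_attention_config")


# Usage with a throwaway checkpoint directory (field values are made up).
with tempfile.TemporaryDirectory() as ckpt:
    (Path(ckpt) / "config.json").write_text(
        json.dumps({"sparse_attention_config": {"sparsity_n": 2, "sparsity_m": 4}})
    )
    sparse_cfg = load_sparse_attention_config(ckpt)
```

Returning `None` for a missing block lets the caller leave vLLM's attention path untouched when the checkpoint was not exported with sparse metadata.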
The restored checkpoint metadata supports:
- calibrated skip-softmax metadata (`threshold_scale_factor`,
`target_sparse_ratio`)
- N:M sparse-softmax metadata (`sparsity_n`, `sparsity_m`)
- dense token preservation metadata (`dense_sink_tokens`,
`dense_recent_tokens`)
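To make the N:M and dense-token fields concrete, here is a small reference sketch for a single query row. Per the README change in this commit, N:M keeps the top-N scores in every group of M consecutive key positions and sets the rest to -inf, while sink and recent tokens are excluded from N:M and stay dense. The function name and the per-token way the dense exemptions are applied are assumptions; the Triton kernel may handle group boundaries differently.

```python
import math


def nm_sparse_scores(scores, n, m, dense_sink_tokens=0, dense_recent_tokens=0):
    """Apply N:M sparsity to one query's pre-softmax attention scores.

    scores: list of floats, one per key position, for the newest query token.
    """
    k_len = len(scores)
    out = list(scores)
    for start in range(0, k_len, m):
        group = range(start, min(start + m, k_len))
        # Indices of the top-n scores within this group of m keys.
        keep = set(sorted(group, key=lambda i: scores[i], reverse=True)[:n])
        for i in group:
            is_sink = i < dense_sink_tokens
            is_recent = i >= k_len - dense_recent_tokens
            if i not in keep and not (is_sink or is_recent):
                out[i] = -math.inf
    return out


scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.8, 0.6]
masked = nm_sparse_scores(scores, n=2, m=4)
masked_dense = nm_sparse_scores(
    scores, n=2, m=4, dense_sink_tokens=1, dense_recent_tokens=1
)
```

With 2:4 sparsity each group of four keys keeps its two largest scores; adding one sink token preserves position 0 even though it is the smallest score in its group.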
The vLLM path uses ModelOpt Triton for sparse prefill launches.
Decode-only launches, cascade/prefix-cache metadata, and launches
without active sparse work delegate back to vLLM FlashAttention.
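The routing rule in the paragraph above can be sketched as a small decision function. `LaunchInfo` and `choose_backend` are hypothetical names introduced for illustration and do not correspond to vLLM or ModelOpt types; the actual implementation inspects vLLM's attention metadata.

```python
from dataclasses import dataclass


@dataclass
class LaunchInfo:
    """Hypothetical summary of one attention launch (not a vLLM type)."""

    num_prefill_tokens: int
    uses_cascade: bool
    sparse_enabled: bool


def choose_backend(info):
    """Return which path would handle the launch under the stated rules."""
    if not info.sparse_enabled:
        return "vllm_flash_attention"  # no active sparse work
    if info.uses_cascade:
        return "vllm_flash_attention"  # cascade / prefix-cache metadata
    if info.num_prefill_tokens == 0:
        return "vllm_flash_attention"  # decode-only launch
    return "modelopt_triton"  # sparse prefill
```

Ordering the fallback checks first keeps the sparse path as the single remaining case, which mirrors how the description lists the delegations.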
### Limitations
- Sparse attention is enabled for sparse prefill only.
- Decode-only launches currently fall back to vLLM FlashAttention.
- Attention sinks from vLLM FlashAttention are rejected until the
ModelOpt Triton path supports them.
- CUDA graph capture is not validated with this sparse attention path
yet; use `--enforce-eager`.
- Quant-only serving remains covered by `vllm_serve_fakequant.py`.
- Combined sparse attention + quantization serving is not handled by
this launcher in this PR and is planned as follow-up work.
### Usage
Export a checkpoint with calibrated skip-softmax and sparse24 metadata:
```bash
python examples/llm_sparsity/attention_sparsity/hf_sa.py \
--pyt_ckpt_path /path/to/hf-model \
--sparse_attn skip_softmax_calib_sparse24 \
--target_sparse_ratio 0.5 \
--calib_samples 64 \
--calib_max_seqlen 16384 \
--calib_chunk_size 4096 \
--seq_len 2048 \
--export_dir /path/to/modelopt-skipsoftmax-sparse24-export
```
Serve the exported checkpoint with the vLLM sparse-attention launcher:
```bash
PYTHONPATH=$PWD python examples/vllm_serve/vllm_serve_sparse_attn.py \
/path/to/modelopt-skipsoftmax-sparse24-export \
--tensor-parallel-size 8 \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code \
--enforce-eager
```
Send a request through the OpenAI-compatible endpoint:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/path/to/modelopt-skipsoftmax-sparse24-export",
"messages": [{"role": "user", "content": "Explain sparse attention in one paragraph."}],
"max_tokens": 128
}'
```
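The same request can be built from Python with only the standard library. This is a sketch that mirrors the `curl` call above; the helper name is an assumption, and the commented-out send step requires the server from the previous step to be running.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "/path/to/modelopt-skipsoftmax-sparse24-export",
    "Explain sparse attention in one paragraph.",
)

# Sending the request needs a live server:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```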
### Testing
GitHub CI on the latest commit is green:
- DCO
- code-quality
- docs build / deploy preview
- unit tests, including Linux, Windows, multi-version, partial-install,
and launcher jobs
- example tests
- GPU tests, including required GPU gate
- regression tests, including required regression gate
- `codecov/project`
Focused test coverage added/updated for this PR includes:
- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_worker.py`
- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_config.py`
- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py`
- `tests/unit/torch/sparsity/attention_sparsity/test_triton_skip_softmax.py`
- `tests/gpu/torch/sparsity/attention_sparsity/test_vllm_plugin.py`
- `tests/gpu/torch/kernels/common/attention/test_triton_fa_paged.py`
- `tests/gpu/torch/kernels/sparsity/attention/test_triton_fa_skip_softmax.py`
- `tests/gpu/torch/kernels/sparsity/attention/test_triton_fa_sparse_nm.py`
- `tests/gpu/torch/kernels/sparsity/attention/test_triton_fa_calibrate.py`
Manual / NEL eval validation:
- Served a ModelOpt exported sparse-attention checkpoint through
`examples/vllm_serve/vllm_serve_sparse_attn.py`.
- Launched RULER64K NEL evals on DFW with `coreai_nvfm_llm`.
- Current partial RULER64K prediction scores, before final `results.yml`
is written:
  - `skipsoftmax-only`: 98.59% over 4500 flushed samples
  - `skipsoftmax-r0.7`: 99.70% over 1000 flushed samples
  - `skipsoftmax-r0.9`: 99.70% over 1000 flushed samples
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - no new
PIP dependency.
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A - documentation, examples, and tests are updated for this vLLM
integration path.
### Additional Information
Follow-up work:
- Validate and enable CUDA graph capture for the sparse vLLM path.
- Add combined sparse attention + quantization serving once the combined
path is tested.
- Investigate whether skip-softmax should also be enabled during decode.
---------
Signed-off-by: Kai Xu <kaix@nvidia.com>
-|**Raw threshold** (`--raw-threshold -0.7`) | Passed directly as `skip_threshold_log2`, no conversion | Quick testing, sweeps |
-|**Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then backend computes `threshold = scale_factor / seq_k`, then kernel converts `log2(threshold) * sm_scale` | Production use with automatic seqlen adaptation |
-|**Static lambda** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback when neither raw nor calibrated |
+|**Fixed threshold** (`--skip-softmax-threshold 0.61557`) | Kernel converts the lambda threshold with `log2(lambda)` | Quick testing, sweeps |
+|**Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then backend computes `threshold = scale_factor / seq_k`, then kernel converts `log2(threshold)` | Production use with automatic seqlen adaptation |
+|**Static lambda** (default `skip_softmax_threshold=0.1`) | Kernel converts `log2(lambda)` | Fallback when neither fixed nor calibrated |
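The conversions in the updated table rows can be worked through numerically. The sketch below follows the formulas as written in the added (`+`) rows; the function names and the sample coefficients `a` and `b` are assumptions, since the fitted calibration values are not part of this diff.

```python
import math


def calibrated_threshold_log2(a, b, target_sparsity, seq_k):
    """Calibrated mode: scale factor, then per-sequence threshold, then log2."""
    scale_factor = a * math.exp(b * target_sparsity)
    threshold = scale_factor / seq_k  # shrinks as the key sequence grows
    return math.log2(threshold)


def fixed_threshold_log2(lam):
    """Fixed or static mode: the kernel converts lambda with log2."""
    return math.log2(lam)


# Hypothetical calibration coefficients, for illustration only.
short_ctx = calibrated_threshold_log2(a=1.0, b=0.0, target_sparsity=0.5, seq_k=1024)
long_ctx = calibrated_threshold_log2(a=1.0, b=0.0, target_sparsity=0.5, seq_k=2048)
fixed = fixed_threshold_log2(0.61557)
```

Note that `log2(0.61557)` is approximately -0.7, so the new `--skip-softmax-threshold 0.61557` example corresponds to the old `--raw-threshold -0.7` example in the removed row.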
examples/llm_sparsity/attention_sparsity/README.md (36 additions & 7 deletions)
@@ -58,7 +58,7 @@ model = mtsa.sparsify(model, config=SKIP_SOFTMAX_CALIB)

 ### N:M Sparse Softmax (SPARSE_SOFTMAX_DEFAULT)

-Applies N:M structured sparsity to attention scores using the Triton backend. For every M consecutive key positions, keeps only the top-N scores and sets the rest to -inf. Supports M=4 (N=1,2,3) and M=8 (N=1..7). Attention sinks and a local dense window can be configured to preserve important positions.
+Applies N:M structured sparsity to attention scores using the Triton backend. For every M consecutive key positions, keeps only the top-N scores and sets the rest to -inf. Supports M=4 (N=1,2,3) and M=8 (N=1..7). Attention sinks and a local recent-token window can be configured to preserve important positions.

 ```python
 from modelopt.torch.sparsity.attention_sparsity.config import SPARSE_SOFTMAX_DEFAULT
@@ -81,8 +81,8 @@ sparse_cfg = {
 "method": "triton_sparse_softmax",
 "sparsity_n": 2, # Keep top-2 of every 4
 "sparsity_m": 4, # Group size
-"num_sink_tokens": 4, # Keep first 4 tokens dense (attention sinks)
-"dense_window_size": 128, # Keep tokens within distance 128 dense
+"dense_sink_tokens": 4, # Exclude first 4 tokens from N:M and keep dense
+"dense_recent_tokens": 128, # Exclude recent 128 tokens from N:M and keep dense
 "backend": "triton",
 "enable": True,
 },
@@ -125,7 +125,7 @@ Apply sparse attention with a fixed threshold:
 ```bash
 python hf_sa.py \
     --pyt_ckpt_path Qwen/Qwen3-8B \
-    --sparse_attn skip_softmax
+    --sparse_attn sparse_softmax
 ```

 ### With RULER Calibration
@@ -144,15 +144,19 @@ The calibration process:
 2. Collects attention statistics during forward passes
 3. Determines optimal threshold scale factor for target sparsity ratio

+Set the target sparsity ratio in the selected sparse attention config, or override
+both prefill and decode targets from the example script with `--target_sparse_ratio`.
+
 ### Command Line Arguments

 | Argument | Default | Description |
 |----------|---------|-------------|
 | `--pyt_ckpt_path` | Required | HuggingFace model path or name |
-| `--sparse_attn` | `skip_softmax` | Configuration: `skip_softmax`, `skip_softmax_calib`, or `sparse_softmax` |
-| `--backend` | `pytorch` | Backend: `pytorch` (skip-softmax) or `triton` (N:M sparse softmax) |
+| `--sparse_attn` | `skip_softmax_calib` | Configuration: `skip_softmax_calib`, `sparse_softmax`, or `skip_softmax_calib_sparse24` |
+Apply ModelOpt sparse attention at serve time. The launcher replaces vLLM's `FlashAttentionImpl` with `ModelOptSparseAttentionImpl` (Triton kernel with paged KV cache support) on every attention layer right after model load.
+
+The configuration is read from the checkpoint's `config.json` `sparse_attention_config` block, written by ModelOpt's HF export. The launcher restores calibrated skip-softmax metadata and N:M sparse-softmax metadata (`sparsity_n`, `sparsity_m`, `dense_sink_tokens`, `dense_recent_tokens`). Checkpoints exported with both metadata entries use ModelOpt Triton for sparse prefill launches; decode-only launches and launches without active sparse work delegate back to vLLM FlashAttention.
+
+Workflow:
+
+1. Calibrate and export the model with `examples/llm_sparsity/attention_sparsity/hf_sa.py`. This writes `sparse_attention_config` into the exported checkpoint's `config.json`.
+2. Serve the exported checkpoint with `--enforce-eager` (CUDA graph capture is not yet validated with the sparse attention kernel; see Known Problems):
+If the checkpoint has no `sparse_attention_config`, the worker logs a message and passes through, so vLLM runs unchanged. Quant-only flows are handled by `vllm_serve_fakequant.py`; combined sparse + quant will land in a follow-up PR.
+
+Limitations:
+
+- vLLM V1 chunked prefill and prefix-cache suffix attention are supported by offsetting query positions into the longer KV span.
+- CUDA graph capture is not validated yet; use `--enforce-eager`.
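The query-position offset mentioned in the first limitation can be pictured with a small causal-mask sketch. It assumes the new queries are the trailing positions of the longer KV span, which is the usual arrangement for chunked prefill and prefix-cache suffixes; the function name is hypothetical and the kernel computes this per tile rather than as a full matrix.

```python
def causal_mask_with_offset(q_len, kv_len):
    """Return mask[i][j]: may query i attend to key j?

    The q_len queries are taken to be the last q_len positions of a
    kv_len-long sequence, so query i sits at absolute position
    (kv_len - q_len + i).
    """
    offset = kv_len - q_len
    return [[j <= offset + i for j in range(kv_len)] for i in range(q_len)]


# Two new query tokens appended to three cached tokens.
mask = causal_mask_with_offset(q_len=2, kv_len=5)
```

Without the offset, the first new query would only see key 0 instead of the full cached prefix plus itself.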
+
 ## Known Problems

 1. **MCore reload does not use `MODELOPT_STATE_PATH`**; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won’t align).