Commit c9098b6

[4/n] Add vLLM integration for modelopt sparse attention (#1127)
### What does this PR do?

Type of change: New feature, new example, new tests, documentation.

Adds vLLM integration for ModelOpt sparse attention with paged KV cache support.

This PR extends the ModelOpt Triton flash attention path so K/V can be read directly from vLLM's paged KV cache through `block_table` lookup. This avoids gather-to-contiguous copies when serving exported sparse-attention checkpoints with vLLM.

The vLLM integration swaps vLLM's `FlashAttentionImpl` with `ModelOptSparseAttentionImpl` after model load. The sparse configuration is read from the exported checkpoint's `config.json` `sparse_attention_config` block, written by `examples/llm_sparsity/attention_sparsity/hf_sa.py`.

The restored checkpoint metadata supports:

- calibrated skip-softmax metadata (`threshold_scale_factor`, `target_sparse_ratio`)
- N:M sparse-softmax metadata (`sparsity_n`, `sparsity_m`)
- dense token preservation metadata (`dense_sink_tokens`, `dense_recent_tokens`)

The vLLM path uses ModelOpt Triton for sparse prefill launches. Decode-only launches, cascade/prefix-cache metadata, and launches without active sparse work delegate back to vLLM FlashAttention.

### Limitations

- Sparse attention is enabled for sparse prefill only.
- Decode-only launches currently fall back to vLLM FlashAttention.
- Attention sinks from vLLM FlashAttention are rejected until the ModelOpt Triton path supports them.
- CUDA graph capture is not validated with this sparse attention path yet; use `--enforce-eager`.
- Quant-only serving remains covered by `vllm_serve_fakequant.py`.
- Combined sparse attention + quantization serving is not handled by this launcher in this PR and is planned as follow-up work.

### Usage

Export a checkpoint with calibrated skip-softmax and sparse24 metadata:

```bash
python examples/llm_sparsity/attention_sparsity/hf_sa.py \
    --pyt_ckpt_path /path/to/hf-model \
    --sparse_attn skip_softmax_calib_sparse24 \
    --target_sparse_ratio 0.5 \
    --calib_samples 64 \
    --calib_max_seqlen 16384 \
    --calib_chunk_size 4096 \
    --seq_len 2048 \
    --export_dir /path/to/modelopt-skipsoftmax-sparse24-export
```

Serve the exported checkpoint with the vLLM sparse-attention launcher:

```bash
PYTHONPATH=$PWD python examples/vllm_serve/vllm_serve_sparse_attn.py \
    /path/to/modelopt-skipsoftmax-sparse24-export \
    --tensor-parallel-size 8 \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --enforce-eager
```

Send a request through the OpenAI-compatible endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/path/to/modelopt-skipsoftmax-sparse24-export",
        "messages": [{"role": "user", "content": "Explain sparse attention in one paragraph."}],
        "max_tokens": 128
    }'
```

### Testing

GitHub CI on the latest commit is green:

- DCO
- code-quality
- docs build / deploy preview
- unit tests, including Linux, Windows, multi-version, partial-install, and launcher jobs
- example tests
- GPU tests, including required GPU gate
- regression tests, including required regression gate
- `codecov/project`

Focused test coverage added/updated for this PR includes:

- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_worker.py`
- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_config.py`
- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py`
- `tests/unit/torch/sparsity/attention_sparsity/test_triton_skip_softmax.py`
- `tests/gpu/torch/sparsity/attention_sparsity/test_vllm_plugin.py`
- `tests/gpu/torch/kernels/common/attention/test_triton_fa_paged.py`
- `tests/gpu/torch/kernels/sparsity/attention/test_triton_fa_skip_softmax.py`
- `tests/gpu/torch/kernels/sparsity/attention/test_triton_fa_sparse_nm.py`
- `tests/gpu/torch/kernels/sparsity/attention/test_triton_fa_calibrate.py`

Manual / NEL eval validation:

- Served a ModelOpt exported sparse-attention checkpoint through `examples/vllm_serve/vllm_serve_sparse_attn.py`.
- Launched RULER64K NEL evals on DFW with `coreai_nvfm_llm`.
- Current partial RULER64K prediction scores, before final `results.yml` is written:
  - `skipsoftmax-only`: 98.59% over 4500 flushed samples
  - `skipsoftmax-r0.7`: 99.70% over 1000 flushed samples
  - `skipsoftmax-r0.9`: 99.70% over 1000 flushed samples

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - no new PIP dependency.
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A - documentation, examples, and tests are updated for this vLLM integration path.

### Additional Information

Follow-up work:

- Validate and enable CUDA graph capture for the sparse vLLM path.
- Add combined sparse attention + quantization serving once the combined path is tested.
- Investigate whether skip-softmax should also be enabled during decode.

---------

Signed-off-by: Kai Xu <kaix@nvidia.com>
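
The `block_table` lookup mentioned in the description can be pictured with a small pure-Python sketch. This is only an illustration of paged addressing, not the ModelOpt Triton kernel: the cache layout (a list of fixed-size blocks), the `BLOCK_SIZE` value, and the helper name `paged_lookup` are assumptions made for the example.

```python
# Illustrative sketch of paged KV addressing; names and layout are assumptions,
# not the ModelOpt or vLLM API.
BLOCK_SIZE = 2  # tokens per physical cache block (hypothetical value)

# Physical cache: 4 blocks, each holding BLOCK_SIZE key entries.
kv_cache = [
    ["b0s0", "b0s1"],
    ["b1s0", "b1s1"],
    ["b2s0", "b2s1"],
    ["b3s0", "b3s1"],
]

# block_table[seq][logical_block] -> physical block index.
block_table = [
    [2, 0],  # sequence 0 lives in physical blocks 2 then 0
    [3, 1],  # sequence 1 lives in physical blocks 3 then 1
]


def paged_lookup(kv_cache, block_table, seq_idx, token_pos, block_size=BLOCK_SIZE):
    """Fetch the cache entry for one logical token without copying the sequence."""
    logical_block, offset = divmod(token_pos, block_size)
    physical_block = block_table[seq_idx][logical_block]
    return kv_cache[physical_block][offset]


# Walk sequence 0 in logical order; entries come from non-contiguous blocks.
seq0_keys = [paged_lookup(kv_cache, block_table, 0, pos) for pos in range(4)]
print(seq0_keys)
```

Because each read goes through the table, a kernel can iterate over a sequence's keys in logical order without first gathering them into a contiguous buffer, which is the copy this PR avoids.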
1 parent 910dc49 commit c9098b6

33 files changed

Lines changed: 3069 additions & 295 deletions

examples/diffusers/sparsity/README.md

Lines changed: 8 additions & 8 deletions
@@ -18,8 +18,8 @@ tiles whose attention scores are negligible during the FlashAttention computatio
1818
reducing FLOPs without retraining.
1919

2020
Two modes are supported:
21-
- **Fixed raw threshold** — pass a log2-space threshold directly to the Triton
22-
kernel. No calibration needed. Good for quick testing and sweeps.
21+
- **Fixed threshold** — pass a BLASST lambda threshold directly. No calibration
22+
needed. Good for quick testing and sweeps.
2323
- **Calibrated threshold** — an exponential model
2424
(`scale_factor = a * exp(b * target_sparsity)`) is calibrated once via the
2525
Triton calibration kernel, then the target sparsity can be adjusted at runtime
@@ -37,10 +37,10 @@ Two modes are supported:
3737
## Quick Start
3838

3939
```bash
40-
# Fixed raw threshold (no calibration, fast)
40+
# Fixed threshold (no calibration, fast)
4141
python wan22_skip_softmax.py \
4242
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
43-
--raw-threshold -0.7 \
43+
--skip-softmax-threshold 0.61557 \
4444
--prompt "A cat playing piano" --output out.mp4
4545

4646
# With calibration
@@ -58,17 +58,17 @@ python wan22_skip_softmax.py \
5858
# Report runtime sparsity (per-layer tile skip ratios)
5959
python wan22_skip_softmax.py \
6060
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
61-
--raw-threshold -0.7 --report-avg-sparsity \
61+
--skip-softmax-threshold 0.61557 --report-avg-sparsity \
6262
--prompt "A cat playing piano" --output out.mp4
6363
```
6464

6565
## Threshold Modes
6666

6767
| Mode | How threshold reaches the kernel | Use case |
6868
|------|----------------------------------|----------|
69-
| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly as `skip_threshold_log2` — no conversion | Quick testing, sweeps |
70-
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then backend computes `threshold = scale_factor / seq_k`, then kernel converts `log2(threshold) * sm_scale` | Production use with automatic seqlen adaptation |
71-
| **Static lambda** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback when neither raw nor calibrated |
69+
| **Fixed threshold** (`--skip-softmax-threshold 0.61557`) | Kernel converts the lambda threshold with `log2(lambda)` | Quick testing, sweeps |
70+
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then backend computes `threshold = scale_factor / seq_k`, then kernel converts `log2(threshold)` | Production use with automatic seqlen adaptation |
71+
| **Static lambda** (default `skip_softmax_threshold=0.1`) | Kernel converts `log2(lambda)` | Fallback when neither fixed nor calibrated |
7272

7373
## Known Issues
7474
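
The threshold modes in the table above reduce to a few lines of arithmetic. The sketch below follows the updated README text (lambda converted with `log2`, calibrated threshold divided by `seq_k`); the function names and the sample values of `a`, `b`, and `seq_k` are placeholders, not calibrated results.

```python
import math


def fixed_threshold_log2(lam: float) -> float:
    """Convert a lambda threshold to log2 space, as the README describes."""
    return math.log2(lam)


def calibrated_threshold_log2(a: float, b: float, target_sparsity: float, seq_k: int) -> float:
    """scale_factor = a * exp(b * target); threshold = scale_factor / seq_k; then log2."""
    scale_factor = a * math.exp(b * target_sparsity)
    threshold = scale_factor / seq_k
    return math.log2(threshold)


# The new README values correspond to the old raw log2 thresholds.
print(round(fixed_threshold_log2(0.61557), 3))  # replaces the old --raw-threshold -0.7
print(fixed_threshold_log2(0.03125))            # replaces the old --raw-threshold -5.0

# Placeholder coefficients (a, b) purely for illustration.
print(calibrated_threshold_log2(a=2.0, b=3.0, target_sparsity=0.5, seq_k=4096))
```

Dividing by `seq_k` is what gives the calibrated mode its sequence-length adaptation: for the same target sparsity, longer key sequences get a lower threshold.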

examples/diffusers/sparsity/wan22_skip_softmax.py

Lines changed: 22 additions & 19 deletions
@@ -21,8 +21,8 @@
2121
1. **Baseline** — pass ``--baseline`` for dense inference (default diffusers backend).
2222
2. **Triton baseline** — pass ``--triton-baseline`` for dense Triton FA kernel
2323
(no skip-softmax, same kernel as sparse runs for apples-to-apples comparison).
24-
3. **Fixed raw threshold** — pass ``--raw-threshold`` to supply a log2-space
25-
threshold directly to the Triton kernel. No calibration data is needed.
24+
3. **Fixed skip-softmax threshold** — pass ``--skip-softmax-threshold`` to
25+
supply the BLASST lambda threshold. No calibration data is needed.
2626
4. **Calibrated threshold** — pass ``--calibrate`` to run exponential-model
2727
calibration (``scale_factor = a * exp(b * target_sparsity)``).
2828
@@ -40,8 +40,8 @@
4040
python wan22_skip_softmax.py --baseline --prompt "A cat playing piano" \\
4141
--output baseline.mp4
4242
43-
# Fixed raw threshold (no calibration needed)
44-
python wan22_skip_softmax.py --raw-threshold -5.0 --report-avg-sparsity \\
43+
# Fixed skip-softmax threshold (no calibration needed)
44+
python wan22_skip_softmax.py --skip-softmax-threshold 0.03125 --report-avg-sparsity \\
4545
--prompt "A cat playing piano" --output out.mp4
4646
4747
# With calibration
@@ -150,12 +150,12 @@ def parse_args() -> argparse.Namespace:
150150
"apples-to-apples comparison with sparse runs)",
151151
)
152152
parser.add_argument(
153-
"--raw-threshold",
153+
"--skip-softmax-threshold",
154154
type=float,
155155
default=None,
156-
help="Raw skip_threshold_log2 value passed directly to the Triton kernel. "
157-
"Negative values (e.g., -5.0 means tile must be within 5 units of running max). "
158-
"Bypasses calibration and lambda conversion. Typical range: -1 to -30.",
156+
help="Fixed BLASST lambda threshold passed as skip_softmax_threshold. "
157+
"Example: 0.03125 keeps tiles within 5 log2-score units of the running max. "
158+
"Bypasses calibration. Typical range: 1e-6 to 0.5.",
159159
)
160160
parser.add_argument(
161161
"--skip-first-last",
@@ -214,8 +214,8 @@ def build_sparse_config(args: argparse.Namespace, num_blocks: int) -> dict:
214214
"""Build sparse attention config from CLI args.
215215
216216
Two modes:
217-
- **Raw threshold**: ``--raw-threshold`` sets ``skip_softmax_raw_threshold``
218-
directly on the Triton kernel — no calibration needed.
217+
- **Fixed threshold**: ``--skip-softmax-threshold`` sets
218+
``skip_softmax_threshold`` directly — no calibration needed.
219219
- **Calibrated**: ``--calibrate`` collects multi-threshold sparsity statistics
220220
via the Triton calibration kernel, then fits an exponential model:
221221
``scale_factor = a * exp(b * sparsity)``.
@@ -229,9 +229,9 @@ def build_sparse_config(args: argparse.Namespace, num_blocks: int) -> dict:
229229
"enable": True,
230230
}
231231

232-
# Raw threshold bypasses calibration and lambda conversion
233-
if args.raw_threshold is not None:
234-
attn_cfg["skip_softmax_raw_threshold"] = args.raw_threshold
232+
# Fixed threshold bypasses calibration.
233+
if args.skip_softmax_threshold is not None:
234+
attn_cfg["skip_softmax_threshold"] = args.skip_softmax_threshold
235235

236236
sparse_cfg: dict = {
237237
"*.attn1*": attn_cfg, # Self-attention only
@@ -246,8 +246,8 @@ def build_sparse_config(args: argparse.Namespace, num_blocks: int) -> dict:
246246

247247
config: dict = {"sparse_cfg": sparse_cfg}
248248

249-
# Add calibration config only when calibrating (not with raw threshold)
250-
if args.calibrate and args.raw_threshold is None:
249+
# Add calibration config only when calibrating (not with a fixed threshold)
250+
if args.calibrate and args.skip_softmax_threshold is None:
251251
sparse_cfg["calibration"] = {
252252
"target_sparse_ratio": {"prefill": args.target_sparsity},
253253
"threshold_trials": DEFAULT_THRESHOLD_TRIALS,
@@ -407,10 +407,13 @@ def main() -> None:
407407
else:
408408
# Build calibration forward loop if needed
409409
forward_loop = None
410-
if args.raw_threshold is not None:
411-
print(f"Using fixed raw threshold: {args.raw_threshold} (skipping calibration)")
410+
if args.skip_softmax_threshold is not None:
411+
print(
412+
f"Using fixed skip-softmax threshold: {args.skip_softmax_threshold} "
413+
"(skipping calibration)"
414+
)
412415
if args.calibrate:
413-
print("Warning: --calibrate is ignored when --raw-threshold is set")
416+
print("Warning: --calibrate is ignored when --skip-softmax-threshold is set")
414417
elif args.calibrate:
415418
forward_loop = build_calibration_forward_loop(
416419
pipe,
@@ -426,7 +429,7 @@ def main() -> None:
426429
)
427430
else:
428431
print(
429-
"Warning: neither --baseline, --raw-threshold, nor --calibrate specified; "
432+
"Warning: neither --baseline, --skip-softmax-threshold, nor --calibrate specified; "
430433
"using default static threshold"
431434
)
432435

examples/llm_sparsity/attention_sparsity/README.md

Lines changed: 36 additions & 7 deletions
@@ -58,7 +58,7 @@ model = mtsa.sparsify(model, config=SKIP_SOFTMAX_CALIB)
5858

5959
### N:M Sparse Softmax (SPARSE_SOFTMAX_DEFAULT)
6060

61-
Applies N:M structured sparsity to attention scores using the Triton backend. For every M consecutive key positions, keeps only the top-N scores and sets the rest to -inf. Supports M=4 (N=1,2,3) and M=8 (N=1..7). Attention sinks and a local dense window can be configured to preserve important positions.
61+
Applies N:M structured sparsity to attention scores using the Triton backend. For every M consecutive key positions, keeps only the top-N scores and sets the rest to -inf. Supports M=4 (N=1,2,3) and M=8 (N=1..7). Attention sinks and a local recent-token window can be configured to preserve important positions.
6262

6363
```python
6464
from modelopt.torch.sparsity.attention_sparsity.config import SPARSE_SOFTMAX_DEFAULT
@@ -81,8 +81,8 @@ sparse_cfg = {
8181
"method": "triton_sparse_softmax",
8282
"sparsity_n": 2, # Keep top-2 of every 4
8383
"sparsity_m": 4, # Group size
84-
"num_sink_tokens": 4, # Keep first 4 tokens dense (attention sinks)
85-
"dense_window_size": 128, # Keep tokens within distance 128 dense
84+
"dense_sink_tokens": 4, # Exclude first 4 tokens from N:M and keep dense
85+
"dense_recent_tokens": 128, # Exclude recent 128 tokens from N:M and keep dense
8686
"backend": "triton",
8787
"enable": True,
8888
},
@@ -125,7 +125,7 @@ Apply sparse attention with a fixed threshold:
125125
```bash
126126
python hf_sa.py \
127127
--pyt_ckpt_path Qwen/Qwen3-8B \
128-
--sparse_attn skip_softmax
128+
--sparse_attn sparse_softmax
129129
```
130130

131131
### With RULER Calibration
@@ -144,15 +144,19 @@ The calibration process:
144144
2. Collects attention statistics during forward passes
145145
3. Determines optimal threshold scale factor for target sparsity ratio
146146

147+
Set the target sparsity ratio in the selected sparse attention config, or override
148+
both prefill and decode targets from the example script with `--target_sparse_ratio`.
149+
147150
### Command Line Arguments
148151

149152
| Argument | Default | Description |
150153
|----------|---------|-------------|
151154
| `--pyt_ckpt_path` | Required | HuggingFace model path or name |
152-
| `--sparse_attn` | `skip_softmax` | Configuration: `skip_softmax`, `skip_softmax_calib`, or `sparse_softmax` |
153-
| `--backend` | `pytorch` | Backend: `pytorch` (skip-softmax) or `triton` (N:M sparse softmax) |
155+
| `--sparse_attn` | `skip_softmax_calib` | Configuration: `skip_softmax_calib`, `sparse_softmax`, or `skip_softmax_calib_sparse24` |
156+
| `--backend` | selected config | Backend: `pytorch` (skip-softmax) or `triton` (N:M sparse softmax) |
154157
| `--seq_len` | `2048` | Maximum sequence length for input prompts |
155158
| `--export_dir` | `None` | Directory to export the sparsified model |
159+
| `--target_sparse_ratio` | selected config | Target sparsity ratio for skip-softmax calibration |
156160

157161
## Output Comparison
158162

@@ -175,7 +179,27 @@ python hf_sa.py \
175179
--export_dir ./exported_sparse_model
176180
```
177181

178-
The exported model can be loaded and used with standard HuggingFace APIs.
182+
Export a 2:4 sparse-softmax checkpoint for vLLM restore:
183+
184+
```bash
185+
python hf_sa.py \
186+
--pyt_ckpt_path Qwen/Qwen3-8B \
187+
--sparse_attn sparse_softmax \
188+
--export_dir ./exported_sparse24_model
189+
```
190+
191+
Export calibrated skip-softmax plus 2:4 sparse-softmax metadata for combined vLLM restore:
192+
193+
```bash
194+
python hf_sa.py \
195+
--pyt_ckpt_path Qwen/Qwen3-8B \
196+
--sparse_attn skip_softmax_calib_sparse24 \
197+
--export_dir ./exported_skip_sparse24_model
198+
```
199+
200+
The exported checkpoint writes `sparse_attention_config` into `config.json`. For combined
201+
export, the skip-softmax calibration and 2:4 sparse-softmax metadata are defined in the
202+
selected config rather than CLI overrides.
179203

180204
## Custom Configuration
181205

@@ -198,6 +222,11 @@ custom_config = {
198222
"bc": 128, # Flash Attention block columns
199223
"backend": "pytorch",
200224
"collect_stats": True,
225+
"sparsity_n": 2, # Export top-2 of every 4 for vLLM restore
226+
"sparsity_m": 4,
227+
"dense_sink_tokens": 0,
228+
"dense_recent_tokens": 64,
229+
"export_sparse_softmax": True,
201230
"enable": True,
202231
},
203232
"default": {"enable": False},
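
The N:M rule described in this README (keep the top-N scores in every group of M consecutive keys, mask the rest to -inf, and leave sink and recent tokens dense) can be sketched in pure Python for a single query row. This is not the Triton kernel: the helper name `nm_sparsify_row` is hypothetical, and both the choice to treat the last `dense_recent_tokens` keys of the row as the recent window and the handling of groups that overlap a dense region are assumptions made for the example.

```python
import math


def nm_sparsify_row(scores, n, m, dense_sink_tokens=0, dense_recent_tokens=0):
    """Mask all but the top-n scores in each group of m keys for one query row.

    Assumption: the query sits at the end of the row, so the "recent" window is
    the last `dense_recent_tokens` key positions.
    """
    num_keys = len(scores)
    out = list(scores)
    for start in range(0, num_keys, m):
        group = list(range(start, min(start + m, num_keys)))
        # Indices of the n largest scores in this group.
        keep = set(sorted(group, key=lambda i: scores[i], reverse=True)[:n])
        for i in group:
            is_sink = i < dense_sink_tokens
            is_recent = i >= num_keys - dense_recent_tokens
            if i not in keep and not is_sink and not is_recent:
                out[i] = -math.inf
    return out


row = [0.1, 0.9, 0.5, 0.3, 0.2, 0.4, 0.8, 0.7]
print(nm_sparsify_row(row, n=2, m=4))
print(nm_sparsify_row(row, n=2, m=4, dense_sink_tokens=1, dense_recent_tokens=3))
```

After masking, a softmax over the row assigns zero weight to the `-inf` positions, so only the kept keys contribute to the output.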

examples/llm_sparsity/attention_sparsity/hf_sa.py

Lines changed: 33 additions & 1 deletion
@@ -28,7 +28,11 @@
2828
import modelopt.torch.opt as mto
2929
import modelopt.torch.sparsity.attention_sparsity as mtsa
3030
from modelopt.torch.export import export_hf_checkpoint
31-
from modelopt.torch.sparsity.attention_sparsity.config import SKIP_SOFTMAX_CALIB
31+
from modelopt.torch.sparsity.attention_sparsity.config import (
32+
SKIP_SOFTMAX_CALIB,
33+
SKIP_SOFTMAX_CALIB_SPARSE24,
34+
SPARSE_SOFTMAX_DEFAULT,
35+
)
3236
from modelopt.torch.utils.memory_monitor import launch_memory_monitor
3337

3438
RAND_SEED = 1234
@@ -39,6 +43,8 @@
3943
# Sparse attention configuration choices
4044
SPARSE_ATTN_CFG_CHOICES = {
4145
"skip_softmax_calib": SKIP_SOFTMAX_CALIB,
46+
"skip_softmax_calib_sparse24": SKIP_SOFTMAX_CALIB_SPARSE24,
47+
"sparse_softmax": SPARSE_SOFTMAX_DEFAULT,
4248
}
4349

4450

@@ -172,6 +178,14 @@ def main(args):
172178
"prefill": args.target_sparse_ratio,
173179
"decode": args.target_sparse_ratio,
174180
}
181+
calib = sparse_cfg.get("calibration")
182+
if isinstance(calib, dict):
183+
if args.calib_samples is not None:
184+
calib["samples"] = args.calib_samples
185+
if args.calib_max_seqlen is not None:
186+
calib["max_seqlen"] = args.calib_max_seqlen
187+
if args.calib_chunk_size is not None:
188+
calib["chunk_size"] = args.calib_chunk_size
175189

176190
model = mtsa.sparsify(model, config=sparse_config)
177191
print("Sparse attention applied successfully!")
@@ -270,6 +284,24 @@ def main(args):
270284
default=None,
271285
help="Target sparsity ratio for calibration (0.0 to 1.0). Overrides config value.",
272286
)
287+
parser.add_argument(
288+
"--calib_samples",
289+
type=int,
290+
default=None,
291+
help="Number of RULER samples for calibration. Overrides config value.",
292+
)
293+
parser.add_argument(
294+
"--calib_max_seqlen",
295+
type=int,
296+
default=None,
297+
help="Maximum sequence length for calibration. Overrides config value.",
298+
)
299+
parser.add_argument(
300+
"--calib_chunk_size",
301+
type=int,
302+
default=None,
303+
help="Chunk size for calibration prefill. Overrides config value.",
304+
)
273305

274306
args = parser.parse_args()
275307
main(args)
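
This export script is what produces the `sparse_attention_config` block that the vLLM launcher later restores. As a rough sketch of the consuming side, the snippet below reads that block from `config.json` and applies the fallback rule stated in the PR description (decode-only launches, cascade/prefix-cache metadata, and launches without sparse work go back to vLLM FlashAttention). The function names, the returned strings, and the sample config contents are hypothetical; the actual integration is not reproduced here.

```python
import json
import tempfile
from pathlib import Path


def load_sparse_attention_config(checkpoint_dir):
    """Return the `sparse_attention_config` block from config.json, or None if absent."""
    with (Path(checkpoint_dir) / "config.json").open() as f:
        return json.load(f).get("sparse_attention_config")


def choose_attention_path(sparse_cfg, is_decode_only, uses_cascade):
    """Hypothetical dispatch: sparse prefill uses ModelOpt Triton, everything else falls back."""
    if not sparse_cfg or is_decode_only or uses_cascade:
        return "vllm_flash_attention"
    return "modelopt_triton"


# Write a toy checkpoint directory; the nested layout here is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    toy_config = {"sparse_attention_config": {"sparsity_n": 2, "sparsity_m": 4}}
    (Path(tmp) / "config.json").write_text(json.dumps(toy_config))
    cfg = load_sparse_attention_config(tmp)

print(choose_attention_path(cfg, is_decode_only=False, uses_cascade=False))
print(choose_attention_path(cfg, is_decode_only=True, uses_cascade=False))
print(choose_attention_path(None, is_decode_only=False, uses_cascade=False))
```

Returning `None` for a missing block corresponds to the pass-through case in the serving README, where a checkpoint without `sparse_attention_config` leaves vLLM unchanged.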

examples/vllm_serve/README.md

Lines changed: 22 additions & 0 deletions
@@ -95,6 +95,28 @@ MODELOPT_STATE_PATH=<vllm_fq_modelopt_state.pth> python vllm_serve_fakequant.py
9595
QUANT_CFG=<quant_cfg> QUANT_FILE_PATH=<quantizer_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
9696
```
9797

98+
## Serve a model with sparse attention in vLLM
99+
100+
Apply ModelOpt sparse attention at serve time. The launcher replaces vLLM's `FlashAttentionImpl` with `ModelOptSparseAttentionImpl` (Triton kernel with paged KV cache support) on every attention layer right after model load.
101+
102+
The configuration is read from the checkpoint's `config.json` `sparse_attention_config` block, written by ModelOpt's HF export. The launcher restores calibrated skip-softmax metadata and N:M sparse-softmax metadata (`sparsity_n`, `sparsity_m`, `dense_sink_tokens`, `dense_recent_tokens`). Checkpoints exported with both metadata entries use ModelOpt Triton for sparse prefill launches; decode-only launches and launches without active sparse work delegate back to vLLM FlashAttention.
103+
104+
Workflow:
105+
106+
1. Calibrate and export the model with `examples/llm_sparsity/attention_sparsity/hf_sa.py`. This writes `sparse_attention_config` into the exported checkpoint's `config.json`.
107+
2. Serve the exported checkpoint with `--enforce-eager` (CUDA graph capture is not yet validated with the sparse attention kernel — see Known Problems):
108+
109+
```bash
110+
python vllm_serve_sparse_attn.py <EXPORT_DIR> --enforce-eager -tp 8 --host 0.0.0.0 --port 8000
111+
```
112+
113+
If the checkpoint has no `sparse_attention_config`, the worker logs a message and passes through — vLLM runs unchanged. Quant-only flows are handled by `vllm_serve_fakequant.py`; combined sparse + quant will land in a follow-up PR.
114+
115+
Limitations:
116+
117+
- vLLM V1 chunked prefill and prefix-cache suffix attention are supported by offsetting query positions into the longer KV span.
118+
- CUDA graph capture is not validated yet — use `--enforce-eager`.
119+
98120
## Known Problems
99121

100122
1. **MCore reload does not use `MODELOPT_STATE_PATH`**; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won’t align).
