selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch by hanbitmyths · Pull Request #2473 · microsoft/Olive

hanbitmyths · 2026-05-22T23:45:08Z

This PR hardens SelectiveMixedPrecision (SMP) for real-world LLMs targeting ONNX Runtime GenAI:

QKV-aware quant config overrides (olive/passes/pytorch/quant_utils.py): Normalize the per-layer override dict so that the Q, K, and V projections in the same attention block always share precision. ModelBuilder's GQA fusion requires this; without it, partial overrides silently break export on Qwen-style models.
AUTO kld_memory_mode (olive/passes/pytorch/selective_mixed_precision.py): A new auto setting selects among full, multi_gpu, low_memory, and offload based on visible GPU memory and estimated model footprint, and logs the decision (e.g. KLD memory mode auto-selected: multi_gpu (gpus=3, full=145.14GB, multi_budget=215.86GB, ...)).
New multi_gpu mode: Uses accelerate.dispatch_model + infer_auto_device_map with _no_split_modules honored. After infer_auto_device_map, every model.layers.N.* entry is coalesced to the first device assigned for that layer, and a defensive check falls back to low_memory if a decoder layer still spans devices. A diagnostic info log reports the per-device layer counts.
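The coalescing step described for multi_gpu mode can be sketched on a plain device-map dictionary. This is a minimal illustration, not the code from the PR: the helper names (`coalesce_layer_devices`, `layers_spanning_devices`, `layers_per_device`) and the example map are hypothetical, and the dictionary stands in for whatever `infer_auto_device_map` returns.

```python
import re
from collections import Counter

# Matches "model.layers.N" itself and any "model.layers.N.<submodule>" entry.
LAYER_RE = re.compile(r"^(model\.layers\.\d+)(?:\.|$)")


def coalesce_layer_devices(device_map):
    """Pin every model.layers.N.* entry to the first device seen for layer N."""
    first_device = {}
    coalesced = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            # setdefault keeps the first device recorded for this layer.
            device = first_device.setdefault(match.group(1), device)
        coalesced[name] = device
    return coalesced


def layers_spanning_devices(device_map):
    """Return decoder layers whose entries sit on more than one device."""
    devices = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            devices.setdefault(match.group(1), set()).add(device)
    return sorted(layer for layer, devs in devices.items() if len(devs) > 1)


def layers_per_device(device_map):
    """Count decoder layers per device (the kind of figure a diagnostic log could report)."""
    layer_device = {}
    for name, device in device_map.items():
        match = LAYER_RE.match(name)
        if match:
            layer_device.setdefault(match.group(1), device)
    return dict(Counter(layer_device.values()))


# Hypothetical map in which layer 1 was split across GPU 0 and GPU 1.
raw_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1.self_attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
    "lm_head": 1,
}
fixed_map = coalesce_layer_devices(raw_map)
```

In this sketch a caller would fall back to low_memory whenever `layers_spanning_devices(fixed_map)` is non-empty, mirroring the defensive check described above.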

Validation (A100 VM)

Qwen3-0.6B old vs new export: tokens identical (124 vs 116 overrides, new_missing_qkv_partners=[]), same 657 MB output, ~301 vs 309 tok/s.
Qwen2.5-1.5B-Instruct export + ort-genai: 1.34 GB int4, 290 tok/s.
Qwen2.5-14B-Instruct AUTO → MULTI_GPU (3×A100), 9.44 GB int4, 95 tok/s.

MMLU 0-shot (HF fp16 vs ort-genai int4, greedy)

Model	N	PyTorch	ort-genai	Δ
Qwen3-0.6B	500	36.6%	28.6%	−8.0 pp
Qwen2.5-1.5B-Instruct	500	60.2%	54.2%	−6.0 pp
Qwen2.5-14B-Instruct	250	74.8%	77.2%	+2.4 pp (within ±5.5 pp CI)

14B is essentially lossless; the small-model deltas are inherent to int4 SMP on sub-2B parameters, not regressions introduced here.
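The PR text does not say how the ±5.5 pp interval for the 14B row was derived. One way to sanity-check it, assuming a 95% normal-approximation interval on a single accuracy estimate (an assumption, not necessarily the author's method), is sketched below. For 250 questions at 74.8% this gives a half-width of about 5.4 pp.

```python
import math


def normal_ci_half_width(accuracy, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# (model, N, PyTorch accuracy, ort-genai accuracy) copied from the table above.
rows = [
    ("Qwen3-0.6B", 500, 0.366, 0.286),
    ("Qwen2.5-1.5B-Instruct", 500, 0.602, 0.542),
    ("Qwen2.5-14B-Instruct", 250, 0.748, 0.772),
]

results = {}
for model, n, pytorch_acc, ort_acc in rows:
    delta_pp = (ort_acc - pytorch_acc) * 100
    half_width_pp = normal_ci_half_width(pytorch_acc, n) * 100
    results[model] = (delta_pp, half_width_pp)
    print(f"{model}: delta={delta_pp:+.1f} pp, 95% half-width={half_width_pp:.1f} pp")
```

Under this assumption the 14B delta falls inside the interval, while the two sub-2B deltas fall outside it, which matches the reading that the small-model drops are real accuracy losses rather than sampling noise.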

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass. (24 passed, 1 skipped in test_selective_mixed_precision.py)
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

Release note: SelectiveMixedPrecision now supports an auto setting for kld_memory_mode and a new multi_gpu mode that shards the KLD-scored forward across visible GPUs via Accelerate. Quant config overrides are normalized so Q/K/V projections in the same attention block share precision, ensuring compatibility with ModelBuilder GQA fusion.
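A minimal sketch of the override normalization described in the release note follows. It assumes Hugging Face style module names (`...self_attn.q_proj`, `k_proj`, `v_proj`) and an override config with a `bits` key; both are illustrative assumptions, and `normalize_qkv_overrides` is a hypothetical helper rather than the function in `quant_utils.py`.

```python
import re

# Assumed naming scheme: "<prefix>.self_attn.{q,k,v}_proj".
QKV_RE = re.compile(r"^(?P<block>.+\.self_attn)\.(?P<proj>[qkv]_proj)$")
PROJECTIONS = ("q_proj", "k_proj", "v_proj")


def normalize_qkv_overrides(overrides):
    """Return a copy in which Q/K/V of one attention block share the highest-bit config."""
    normalized = dict(overrides)
    blocks = {}
    for name, config in overrides.items():
        match = QKV_RE.match(name)
        if match:
            blocks.setdefault(match.group("block"), []).append(config)
    for block, configs in blocks.items():
        # "Most precise" is modelled here as the largest bit width.
        best = max(configs, key=lambda cfg: cfg["bits"])
        for proj in PROJECTIONS:
            normalized[f"{block}.{proj}"] = dict(best)
    return normalized


# A partial override: only q_proj in layer 0 was promoted to 8 bits.
partial = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.0.mlp.down_proj": {"bits": 8},
}
normalized = normalize_qkv_overrides(partial)
```

After normalization, `k_proj` and `v_proj` in layer 0 carry the same 8-bit config as `q_proj`, so the attention block no longer has a partial override.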

…TI_GPU dispatch

- Normalize per-layer quant config overrides so Q/K/V projections in the same attention block share precision, required by ModelBuilder for GQA fusion.
- Add AUTO setting for kld_memory_mode that picks among FULL, MULTI_GPU, LOW_MEMORY, OFFLOAD based on available GPU memory and model size.
- Add MULTI_GPU mode that uses Accelerate's dispatch_model with _no_split_modules honored, plus a coalescing pass that pins every model.layers.N.* entry to a single device and falls back to LOW_MEMORY if a decoder layer still spans devices.
- Tests: 24 unit tests covering QKV grouping, AUTO selection thresholds, and the MULTI_GPU device-map coalescing path.
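The AUTO selection can be pictured as a cascade of budget checks. The sketch below is an assumption-laden illustration: the function name, the 0.9 headroom factor, the footprint inputs, and the order of the checks are all hypothetical, since the thread does not spell out the actual thresholds.

```python
def select_kld_memory_mode(gpu_free_gb, full_gb, low_memory_gb, headroom=0.9):
    """Pick a mode name from per-GPU free memory (GB) and footprint estimates (GB).

    gpu_free_gb:   list of free memory per visible GPU
    full_gb:       estimated footprint of scoring with everything on one device
    low_memory_gb: estimated footprint of the reduced-memory path (assumed input)
    """
    if not gpu_free_gb:
        return "offload"
    single_budget = max(gpu_free_gb) * headroom
    multi_budget = sum(gpu_free_gb) * headroom
    if full_gb <= single_budget:
        return "full"
    if len(gpu_free_gb) > 1 and full_gb <= multi_budget:
        return "multi_gpu"
    if low_memory_gb <= single_budget:
        return "low_memory"
    return "offload"


# Three 80 GB GPUs and a 145.14 GB full footprint, as in the 14B validation run.
mode = select_kld_memory_mode([80.0, 80.0, 80.0], full_gb=145.14, low_memory_gb=40.0)
print(mode)
```

With these illustrative numbers the full footprint exceeds any single GPU's budget but fits the combined budget, so the sketch selects multi_gpu.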

Copilot

Pull request overview

This PR strengthens the SelectiveMixedPrecision (SMP) PyTorch pass for LLMs targeting ONNX Runtime GenAI by (a) enforcing Q/K/V consistency in both scored selection and quantization overrides, and (b) adding auto and multi_gpu memory modes for KLD-gradient scoring so that scoring is practical on large models.

Changes:

Add Q/K/V-aware grouping so scored selection promotes attention input projections together, and normalize quantization overrides so Q/K/V share the most-precise config.
Introduce kld_memory_mode with auto resolution plus a new multi_gpu mode using Accelerate dispatch and device-map coalescing/validation.
Expand unit tests to cover QKV grouping/normalization, KLD scoring equivalence across memory modes, and AUTO/MULTI_GPU selection behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`olive/passes/pytorch/selective_mixed_precision.py`	Adds QKV grouping in scored overrides and implements AUTO/FULL/MULTI_GPU/LOW_MEMORY/OFFLOAD KLD scoring paths with heuristics and Accelerate-based sharding.
`olive/passes/pytorch/quant_utils.py`	Adds QKV group discovery + override normalization to ensure attention input projections share a consistent quant config, including support for excluded attention inputs.
`test/passes/pytorch/test_selective_mixed_precision.py`	Adds extensive unit tests for QKV grouping/normalization and KLD scoring/memory-mode behavior, including MULTI_GPU dispatch stubbing.

jambayk · 2026-05-27T20:10:27Z

+        scored_items = [
+            (
+                group,
+                sum(module_numels[module_name] for module_name in group),


Thanks for opening this PR! I had been thinking about keeping Q/K/V on the same settings but didn't get around to it.
I don't think summing the scores is a good way to aggregate them, since it would just make the summed scores for QKV groups higher than those for single modules.

I have a commit on top of your branch at 8f722f3 and created a draft PR, #2475, with an alternative plus some refactoring of the codebase to make it more modular. I also updated prepare_model's QKV config normalization to account for already-quantized modules.

Since your PR was created from a fork, we are unable to run the CI and would need to create a copy of your branch to make it work. Could you please create the branch and PR directly from the original repository? You should already have contributor access.

For this PR, we could also merge the copy PR I made, if you are happy with the changes I made on top. Thanks!
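The concern about summed scores can be illustrated with a toy comparison. The thread does not describe the alternative in #2475, so the size-weighted mean below is only one possible aggregation chosen for illustration; the module names, scores, and element counts are made up.

```python
def aggregate_sum(scores, group):
    """Sum of per-module scores: grows with the number of modules in the group."""
    return sum(scores[name] for name in group)


def aggregate_weighted_mean(scores, numels, group):
    """Mean of per-module scores weighted by parameter count (illustrative alternative)."""
    total = sum(numels[name] for name in group)
    return sum(scores[name] * numels[name] for name in group) / total


# Hypothetical per-module sensitivity scores and parameter counts.
scores = {"q_proj": 0.9, "k_proj": 0.6, "v_proj": 0.6, "down_proj": 2.0}
numels = {"q_proj": 1024 * 1024, "k_proj": 1024 * 256, "v_proj": 1024 * 256}

qkv_group = ("q_proj", "k_proj", "v_proj")
summed = aggregate_sum(scores, qkv_group)
averaged = aggregate_weighted_mean(scores, numels, qkv_group)

# Summing lets the QKV group outrank down_proj even though no single
# projection scores higher than it; the weighted mean does not.
print(summed > scores["down_proj"], averaged > scores["down_proj"])
```

The point of the comparison is that a sum rewards group size, whereas a normalized aggregate keeps grouped and ungrouped modules on the same scale.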

hanbitmyths · 2026-05-27T22:03:03Z

Superseded by #2477.

Continuing review and updates on the origin-branch PR to avoid fork-head workflow.

hanbitmyths · 2026-05-27T22:03:04Z

Closing as superseded by #2477.

…TI_GPU dispatch (#2475)

## Describe your changes

Based on #2473

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

## (Optional) Issue link

---------

Co-authored-by: Sunghoon Choi <sunghcho@microsoft.com>
Co-authored-by: Sunghoon Choi <35605090+hanbitmyths@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

hanbitmyths and others added 3 commits May 22, 2026 16:44

docs: surface KLD memory modes and QKV grouping in pass docstring

cc52a7e

Merge branch 'main' into smp-qkv-aware-multi-gpu

a4a2b2a

hanbitmyths marked this pull request as ready for review May 22, 2026 23:51

Copilot AI review requested due to automatic review settings May 22, 2026 23:51

Copilot started reviewing on behalf of hanbitmyths May 22, 2026 23:51

Copilot AI reviewed May 22, 2026

Comment thread olive/passes/pytorch/quant_utils.py

Comment thread olive/passes/pytorch/selective_mixed_precision.py Outdated

Address SMP review feedback

8e98a92

devang-ml requested a review from jambayk May 27, 2026 18:25

jambayk mentioned this pull request May 27, 2026

selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch #2475

Merged

5 tasks

jambayk reviewed May 27, 2026

hanbitmyths mentioned this pull request May 27, 2026

selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch #2477

Closed

5 tasks

hanbitmyths closed this May 27, 2026

hanbitmyths wants to merge 4 commits into microsoft:main from hanbitmyths:smp-qkv-aware-multi-gpu

3 participants
