Add active-MoE AutoQuant cost accounting#1497
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an active-MoE cost model option to auto-quantization: detects routed MoE modules, applies per-module cost weighting using an expert-activity ratio, threads cost_model and active_moe_expert_ratio through the searcher and API/CLI, and adds unit tests covering behavior and searcher selection. ChangesActive-MoE Cost Model Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/model_quant.py`:
- Around line 300-315: _infer_active_moe_expert_ratio currently calls
_get_first_numeric_config_attr twice which can pick values from two different
config objects; instead iterate the same configs (use _iter_model_configs) and
for each config check both attribute groups (_ACTIVE_MOE_TOP_K_ATTRS and
_ACTIVE_MOE_NUM_EXPERTS_ATTRS) on that single config object, ensure both are
numeric and num_experts > 0, then return min(num_active_experts / num_experts,
1.0); if no single config contains both numeric values return None.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5b320520-fd7c-4c67-b182-efe01e721d39
📒 Files selected for processing (4)
examples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/algorithms.pymodelopt/torch/quantization/model_quant.pytests/unit/torch/quantization/test_autoquant.py
b721f1d to
f681009
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1497 +/- ##
===========================================
- Coverage 77.38% 55.26% -22.12%
===========================================
Files 479 479
Lines 52435 52496 +61
===========================================
- Hits 40578 29014 -11564
- Misses 11857 23482 +11625
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
f681009 to
6f791d1
Compare
| ) | ||
| return config | ||
|
|
||
| def _get_cost_constraints(self) -> tuple[str, float | None]: |
There was a problem hiding this comment.
hi @meenchen - is it possible to unify (adding a hook or something) the check in the model_quant.py _normalize_auto_quantize_constraints and this function? it seems like there is a duplication of checking the constraints and it might be prone to missing as we scale and add more cost models
There was a problem hiding this comment.
Updated in fac22b7. I moved shared validation/normalization into _auto_quantize_cost.normalize_auto_quantize_constraints() and both the public API and the searcher call it, so adding a cost model has one validation path.
| def _normalize_auto_quantize_constraints( | ||
| model: nn.Module, constraints: dict[str, Any] | None | ||
| ) -> dict[str, Any]: | ||
| constraints = {"effective_bits": 4.8} if constraints is None else dict(constraints) |
There was a problem hiding this comment.
can we move this 4.8 magic number to some constant and add comments explain the meaning? it's scattered around in multiple places
There was a problem hiding this comment.
Done in fac22b7. DEFAULT_AUTO_QUANTIZE_EFFECTIVE_BITS is centralized in _auto_quantize_cost.py with a compatibility comment; auto_quantize() gets the default through the shared normalizer.
| ) -> dict[str, Any]: | ||
| constraints = {"effective_bits": 4.8} if constraints is None else dict(constraints) | ||
| cost_model = constraints.get("cost_model", "weight") | ||
| if cost_model not in ("weight", "active_moe"): |
There was a problem hiding this comment.
We should also move these constants somewhere centralized and add comments explain what are they.
There was a problem hiding this comment.
better define 2 classes and define their behaviour/properties using the class.
There was a problem hiding this comment.
Done in fac22b7. Added AutoQuantizeCostModel, WeightCostModel, and ActiveMoECostModel; the searcher now asks the selected class for module cost weights and total cost denominator.
There was a problem hiding this comment.
Done in fac22b7. Constants and cost-model-specific keys moved into _auto_quantize_cost.py, with comments/docstrings for the default target and active-MoE behavior.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Adds an active_moe cost model for AutoQuantize effective-bits accounting. Overall structure is reasonable: separate _auto_quantize_cost.py with a small registry, propagation of cost_weight through QuantRecipeHparam, checkpoint compat (falls back to sum(costs[-1]) when cost_denominator is missing), and unit tests covering the routed-vs-shared expert weighting, config inference, and search-budget lower-bound ordering.
A few concrete things to look at before merging:
-
active_costslooks like dead code. Ininitialize_candidate_stats,hparam.get_cost(recipe)andhparam.get_cost(recipe, cost_weight=hparam.cost_weight)are equivalent becauseget_cost's default already substitutesself.cost_weightwhencost_weight is None. Soactive_costs == costsfor every recipe, and the only test assertion on it is"active_costs" in stats. Either compute something genuinely different (e.g. the unweighted physical cost withcost_weight=1.0, which would actually be useful for reporting) or drop the field and the correspondingcost_weight/state-dict plumbing. -
Constraints are normalized twice.
model_quant.auto_quantizecallsnormalize_auto_quantize_constraintsbeforeapply_mode, and_AutoQuantizeBaseSearcher.before_searchcalls it again onself.constraints. Idempotent today, but it's an easy footgun if either side changes shape — pick one site. -
_resolve_best_recipereconstructs a partial searcher config:searcher.config = {"cost_model": searcher.cost_model}(no"cost"key, no other defaults). Today_get_search_lower_boundsonly readscost_modelso this works, but it's fragile — any future code inrun_search_with_statsthat touchesself.config[...]will break the resolve path. Consider populating fromsearch_statemore completely, or routing throughdefault_search_config. -
Minor: in
modelopt/torch/quantization/model_quant.pythe newfrom ._auto_quantize_cost import normalize_auto_quantize_constraintsis placed afterfrom .algorithms import ...; isort would put the underscore-prefixed sibling first. Also the deleted blank line abovedef auto_quantize(...)makes the module-level constant_AUTO_QUANTIZE_SUPPORTED_ALGORITHMSvisually run into the function definition.
| score = hparam.get_score(recipe) # type: ignore [arg-type] | ||
| cost = hparam.get_cost(recipe) # type: ignore [arg-type] | ||
| active_cost = hparam.get_cost(recipe, cost_weight=hparam.cost_weight) # type: ignore [arg-type] | ||
|
|
There was a problem hiding this comment.
Bot comment.
active_cost = hparam.get_cost(recipe, cost_weight=hparam.cost_weight) is identical to hparam.get_cost(recipe) because get_cost defaults cost_weight to self.cost_weight (which is hparam.cost_weight). So active_costs ends up equal to costs for every entry, and the test only checks key presence. Either drop the field, or compute something actually distinct — e.g. the unweighted physical cost via cost_weight=1.0 for reporting routed-MoE savings vs. total weight size.
There was a problem hiding this comment.
Fixed in 34a6baf. I dropped active_costs from newly generated candidate stats instead of keeping a duplicate of costs; the active-MoE test now asserts the duplicate field is absent while preserving cost_weight for the actual cost accounting.
| ) | ||
|
|
||
| searcher.candidate_stats = candidate_stats | ||
| searcher.cost_model = search_state.get("cost_model", COST_MODEL_WEIGHT) |
There was a problem hiding this comment.
Bot comment.
This builds a stub searcher.config = {"cost_model": ...} with no "cost" key and none of the other default search-config entries. It works today only because _get_search_lower_bounds reads just cost_model. If anything in run_search_with_stats (or anything it calls) starts touching self.config["cost"] or other keys, the get_auto_quantize_config(..., constraints=...) re-solve path will silently break. Consider initializing through default_search_config or persisting/restoring the full config from search_state.
There was a problem hiding this comment.
Fixed in 34a6baf. _resolve_best_recipe() now starts from default_search_config and fills the persisted cost fields (cost_model, cost, active_moe_expert_ratio) instead of using a one-key stub config.
| else: | ||
| raise ValueError(f"Invalid method: {method}. Valid options are 'gradient' or 'kl_div'.") | ||
|
|
||
| constraints = normalize_auto_quantize_constraints(model, constraints) |
There was a problem hiding this comment.
Bot comment.
Constraints are now normalized here and again in _AutoQuantizeBaseSearcher.before_search. The function is idempotent today but it's an easy place for the two sites to drift. Pick one (probably before_search, so external callers using the searcher directly also get normalized constraints) and drop the other.
There was a problem hiding this comment.
Fixed in 34a6baf. Normalization now happens in _AutoQuantizeBaseSearcher.before_search() only, so direct searcher callers and the public API share the same normalization path.
|
Pushed follow-up fixes for the latest review comments: 34a6baf handles the three substantive AutoQuantize cost-model comments, and 3411e4a fixes the minor API formatting issue. Validation: compileall passed, |
| parser.error( | ||
| "--auto_quantize_active_moe_expert_ratio requires " | ||
| "--auto_quantize_cost_model active_moe." | ||
| ) |
There was a problem hiding this comment.
I think for this PR we should keep this, once I update the Autoquant YAML recipe PR to reflect the cost model, I will remove the CLI arg support for AutoQuant as we discussed yesterday, does that align? @meenchen , @shengliangxu ?
juhi10071998
left a comment
There was a problem hiding this comment.
LGTM, left a minor comment
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review: all important previous review comments are addressed in the current diff.
Critical comments resolved:
- 💬 CodeRabbit's
_infer_active_moe_expert_ratiocross-config pairing bug — fixed; logic moved into_auto_quantize_cost.infer_active_moe_expert_ratiowhich iterates configs and checks both top-k and num-experts attributes on the same config object. New testtest_active_moe_ratio_requires_single_config_objectcovers the wrapper/text_configmismatch. - 💬 cjluo-nv's "
active_costsis dead code" — resolved; the duplicated field was dropped frominitialize_candidate_stats, and the test now asserts its absence while still validatingcost_weight. - 💬 cjluo-nv's "constraints normalized twice" — resolved; normalization now lives only in
_AutoQuantizeBaseSearcher.before_search, andauto_quantize()inmodel_quant.pyno longer pre-normalizes. - 💬 cjluo-nv's "
_resolve_best_recipepartial searcher config" — resolved; it now seeds fromdefault_search_configand populatescost_model,cost, andactive_moe_expert_ratiofromsearch_state. - 💬 realAsma / juhi10071998 design feedback (move cost knobs into
constraints, unify validation) — implemented;cost_model/costlive underconstraints, andnormalize_auto_quantize_constraintsis the single validation entry point. - 💬 shengliangxu's magic-number / class-refactor asks — resolved via
DEFAULT_AUTO_QUANTIZE_EFFECTIVE_BITSand theAutoQuantizeCostModel/WeightCostModel/ActiveMoECostModelhierarchy.
Minor remaining nits (non-blocking): default_state_dict hardcodes the literal "weight" instead of using COST_MODEL_WEIGHT, and _resolve_best_recipe reaches into search_state for cost fields that older checkpoints predating this PR will not contain (the .get(..., default) calls do degrade gracefully, but a checkpoint saved before this PR has no cost_denominator, so the or sum(...) fallback kicks in — that path is exercised but not unit-tested).
Test coverage is solid: routed-vs-shared expert weighting, num_experts vs num_local_experts config-attr fallback, single-config inference requirement, and lower-bound search-order preference for active-MoE are all covered.
3411e4a to
0d0cef1
Compare
cjluo-nv
left a comment
There was a problem hiding this comment.
left a comment. Need clarification on the added flags
|
/ok to test adbf9f0 |
adbf9f0 to
93c4138
Compare
|
/ok to test 93c4138 |
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
93c4138 to
b6b8be2
Compare
|
/ok to test b6b8be2 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1497 +/- ##
==========================================
- Coverage 77.38% 77.29% -0.10%
==========================================
Files 479 480 +1
Lines 52435 52564 +129
==========================================
+ Hits 40578 40628 +50
- Misses 11857 11936 +79
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
What does this PR do?
• Type of change: new feature
Adds an active_moe cost model for auto_quantize effective-bits search. This lets AutoQuant account for routed MoE expert weights by active decode weight traffic instead of total checkpoint weight
size, using active_moe_expert_ratio = num_experts_per_tok / num_experts.
The default behavior is unchanged: cost_model="weight" still counts all quantizable weights equally.
Usage
import modelopt.torch.quantization as mtq
model, search_state = mtq.auto_quantize(
model,
constraints={"effective_bits": 5.0},
quantization_formats=[
mtq.NVFP4_DEFAULT_CFG,
mtq.FP8_DEFAULT_CFG,
],
data_loader=calib_dataloader,
forward_step=forward_step,
loss_func=loss_func,
cost_model="active_moe",
# Optional. If omitted, ModelOpt tries to infer this from model.config.
active_moe_expert_ratio=2 / 64,
)
The HF PTQ example also exposes:
--auto_quantize_cost_model active_moe
--auto_quantize_active_moe_expert_ratio 0.03125
Testing
python -m pytest tests/unit/torch/quantization/test_autoquant.py -q -k 'active_moe or quant_recipe_hparam_cost_weight'
python -m pytest tests/unit/torch/quantization/test_autoquant.py -q -k 'not data_parallel_auto_quantize'
Results:
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit
New Features
Bug Fixes
Tests