Support force tokens to % of total experts during calibration #910
`examples/llm_ptq/example_utils.py`:

```diff
@@ -201,6 +201,7 @@ def build_quant_cfg(
     model_type,
     quant_cfg_choices,
     kv_quant_cfg_choices,
+    moe_calib_experts_ratio: float | None = None,
 ) -> dict[str, Any]:
     quant_cfg = {}
     assert qformat in quant_cfg_choices, (
@@ -232,6 +233,20 @@ def build_quant_cfg(
         getattr(mtq, kv_quant_cfg_choices[kv_cache_qformat])["quant_cfg"],
     )

+    if moe_calib_experts_ratio:
+        assert 0 < moe_calib_experts_ratio <= 1, "moe_calib_experts_ratio must be between 0 and 1"
+        if isinstance(quant_cfg["algorithm"], str):
+            quant_cfg["algorithm"] = {
+                "method": quant_cfg["algorithm"],
+                "moe_calib_experts_ratio": moe_calib_experts_ratio,
+            }
+        elif isinstance(quant_cfg["algorithm"], dict):
+            quant_cfg["algorithm"]["moe_calib_experts_ratio"] = moe_calib_experts_ratio
```
Comment on lines +236 to +244 — Contributor:

**Crash when `quant_cfg["algorithm"]` is `None`.** The code will crash when `quant_cfg["algorithm"]` is `None`: neither the `str` nor the `dict` branch matches, and the fallthrough assigns into `None`, raising `TypeError: 'NoneType' object does not support item assignment`. Any user running with a None-algorithm format (e.g., `MXFP8_DEFAULT_CFG`, `MXFP6_DEFAULT_CFG`, `MXFP4_DEFAULT_CFG`, `W4A8_MXFP4_FP8_CFG`, `MXINT8_DEFAULT_CFG`) hits this, since the CLI passes a non-`None` ratio by default.

Proposed fix:

```diff
     if moe_calib_experts_ratio:
+        if quant_cfg["algorithm"] is None:
+            quant_cfg["algorithm"] = {
+                "method": None,
+                "moe_calib_experts_ratio": moe_calib_experts_ratio,
+            }
-        if isinstance(quant_cfg["algorithm"], str):
+        elif isinstance(quant_cfg["algorithm"], str):
             quant_cfg["algorithm"] = {
                 "method": quant_cfg["algorithm"],
                 "moe_calib_experts_ratio": moe_calib_experts_ratio,
             }
         else:
             quant_cfg["algorithm"]["moe_calib_experts_ratio"] = moe_calib_experts_ratio
```

Alternatively, only inject the ratio when the model is actually an MoE model, or change the CLI default to `None`.
Contributor: +1
`examples/llm_ptq/example_utils.py` (continued):

```diff
+    else:
+        warnings.warn(
+            f"Quantization algorithm: {quant_cfg['algorithm']} does not support setting moe_calib_experts_ratio"
+        )

     # Gemma 7B has accuracy regression using alpha 1. We set 0.5 instead.
     if model_type == "gemma" and "int8_sq" in qformat:
         quant_cfg["algorithm"] = {"method": "smoothquant", "alpha": 0.5}
```
`examples/llm_ptq/hf_ptq.py`:

```diff
@@ -906,6 +906,7 @@ def quantize_main(
         model_type,
         QUANT_CFG_CHOICES,
         KV_QUANT_CFG_CHOICES,
+        args.moe_calib_experts_ratio,
     )

     # Exclude MTP layers from quantization if detected (e.g., GLM-4.7's layer 92)
@@ -1126,8 +1127,21 @@ def parse_args() -> argparse.Namespace:
             "(sensitivity scores, costs, etc.). Only used when auto_quantize_bits is specified."
         ),
     )
+    parser.add_argument(
+        "--moe_calib_experts_ratio",
+        type=float,
+        default=1.0,
+        help=(
+            "Fraction of experts to calibrate during forward pass (ratio in (0.0, 1.0]). "
+            "Only used for MOE models; used to reduce the number of experts calibrated during the forward pass. "
+            "Does not impact non-MOE models."
+        ),
+    )
```

Contributor (on the new `parser.add_argument`): Should we add something similar to mcore PTQ as well? cc @realAsma @jenchen13 @ChenhanYu
Comment on lines +1130 to +1139 — Contributor:

**Default silently reduces calibration coverage.** Since the default is not `None`, the ratio is injected for every run rather than only when a user opts in. Consider defaulting to `None` so expanded/reduced-expert calibration is explicitly requested:

```diff
 parser.add_argument(
     "--moe_calib_experts_ratio",
     type=float,
-    default=1.0 / 4,
+    default=None,
     help=(
-        "Percentage of experts to calibrate during forward pass. Only used for MOE models. "
-        "This is used to reduce the number of experts to calibrate during forward pass. "
+        "Ratio of experts to calibrate during forward pass (0, 1]. Only used for MOE models. "
+        "Default behavior routes to all experts if not specified. "
+        "Example: 0.25 calibrates 25%% of experts. "
     ),
 )
```
```diff
-    return parser.parse_args()
+    args = parser.parse_args()
+    if not (0.0 < args.moe_calib_experts_ratio <= 1.0):
+        parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")
+    return args


 def main(args: argparse.Namespace):
```
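The post-parse range check above follows a standard argparse pattern: validate after `parse_args()` and report via `parser.error()`, which prints usage and exits with status 2. A self-contained sketch (argument set trimmed to just this flag):

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--moe_calib_experts_ratio", type=float, default=1.0)
    args = parser.parse_args(argv)
    # parser.error() prints usage to stderr and raises SystemExit(2),
    # so invalid values fail before any model loading happens.
    if not (0.0 < args.moe_calib_experts_ratio <= 1.0):
        parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")
    return args


print(parse_args(["--moe_calib_experts_ratio", "0.25"]).moe_calib_experts_ratio)  # 0.25
```

An alternative design is a custom `type=` callable that raises `argparse.ArgumentTypeError`; the post-parse check used in the diff keeps the error next to the rest of the CLI logic instead.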
`modelopt/torch/quantization/config.py`:

```diff
@@ -1091,6 +1091,16 @@ class QuantizeAlgorithmConfig(ModeloptBaseConfig):
         title="This field specifies the name of the calibration algorithm. If None, no calibration is performed.",
     )

+    moe_calib_experts_ratio: float | None = ModeloptField(
+        default=None,
+        title="% of experts to calibrate during forward pass.",
+        description=(
+            "If specified, we force forward tokens to % of experts during the calibration"
+            " pass. This forward is for calibration purpose only and will not affect the"
+            " actual inference."
+        ),
+    )
+

 class MaxCalibConfig(QuantizeAlgorithmConfig):
     """The config for max calibration algorithm.
```

Contributor: Yea this is good idea to put this here.

Contributor: Is it necessary to put […]

Contributor: for example if we think […]

Collaborator (author): @jenchen13 this is a good question: why would we not include KV cache clamping as part of the quant config too, or other non-essential PTQ features? If we've decided to make this tunable, we probably also want to add it here.

Contributor: +1 on making the KVCache clamping here - this will help the QAD to be aligned with export path

Contributor: @jenchen13 this is a great idea
`modelopt/torch/quantization/mode.py`:

```diff
@@ -225,6 +225,15 @@ def wrapped_calib_func(
         # For backward compatibility
         kwargs["algorithm"] = method

+    moe_calib_experts_ratio = kwargs.pop("moe_calib_experts_ratio", None)
+    if moe_calib_experts_ratio is not None:
```

Suggested change (validate the ratio as soon as it is popped):

```diff
 if moe_calib_experts_ratio is not None:
+    # Validate early to avoid downstream assertion failures in the forward path.
+    if not isinstance(moe_calib_experts_ratio, (int, float)):
+        raise ValueError(
+            f"Invalid moe_calib_experts_ratio {moe_calib_experts_ratio!r}: "
+            "expected a numeric value in the range (0, 1]."
+        )
+    if not (0 < moe_calib_experts_ratio <= 1):
+        raise ValueError(
+            f"Invalid moe_calib_experts_ratio {moe_calib_experts_ratio!r}: "
+            "expected 0 < ratio <= 1."
+        )
```
`modelopt/torch/quantization/plugins/huggingface.py`:

```diff
@@ -458,8 +458,13 @@ def _setup(self):
         elif hasattr(self, "experts") and hasattr(self.experts, "num_experts"):
             num_experts = self.experts.num_experts

-        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+        self.register_buffer(
+            "expert_token_count",
+            torch.zeros(num_experts, dtype=torch.long, device=next(self.parameters()).device),
+            persistent=False,
+        )
         self._count_expert_tokens = False
+        self._moe_calib_experts_ratio = None

         if num_experts == 0:
             warnings.warn(
```
@@ -483,36 +488,48 @@ def _gate_forward_hook(self, module, input, output): | |||||
| logits = output if not isinstance(output, tuple) else output[0] | ||||||
| top_k = self.gate.top_k if hasattr(self.gate, "top_k") else self.top_k | ||||||
| _, indices = torch.topk(logits.float(), top_k, dim=-1) | ||||||
| counts = torch.bincount( | ||||||
| indices.reshape(-1).cpu(), minlength=len(self.expert_token_count) | ||||||
| ) | ||||||
| self.expert_token_count += counts | ||||||
| counts = torch.bincount(indices.reshape(-1), minlength=self.expert_token_count.shape[0]) | ||||||
| self.expert_token_count += counts.to(self.expert_token_count.device) | ||||||
|
|
||||||
| def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: | ||||||
| is_calib = any(getattr(m, "_if_calib", False) for m in self.experts.modules()) | ||||||
| if is_calib: | ||||||
| # If any of the experts are in calibration mode, we will forward all tokens to all experts | ||||||
| self._count_expert_tokens = is_calib | ||||||
| if is_calib and self._moe_calib_experts_ratio: | ||||||
| self._count_expert_tokens = True | ||||||
| assert 0 < self._moe_calib_experts_ratio <= 1, ( | ||||||
| "moe_calib_experts_ratio must be between 0 and 1" | ||||||
| ) | ||||||
|
Comment on lines
+497
to
+501
|
||||||
| # If any of the experts are in calibration mode, we will forward all tokens to | ||||||
| # self._moe_calib_experts_ratio % of the experts to improve the calibration coverage. | ||||||
| # This is used only for calibration, we need to re-calculate the actual outputs again using | ||||||
| # the original top_k | ||||||
| if TRANSFORMERS_VERSION_GE_5_0: | ||||||
| assert hasattr(self, "gate") and hasattr(self.gate, "top_k") | ||||||
| original_top_k = self.gate.top_k | ||||||
| self.gate.top_k = self.gate.num_experts | ||||||
| self.gate.top_k = max( | ||||||
| original_top_k, round(self.gate.num_experts * self._moe_calib_experts_ratio) | ||||||
| ) | ||||||
|
Comment on lines
508
to
+511
|
||||||
| super().forward(hidden_states) | ||||||
| self.gate.top_k = original_top_k | ||||||
| else: | ||||||
| # Path for transformers < 5.0 | ||||||
| original_top_k = self.top_k | ||||||
| if hasattr(self, "num_experts"): | ||||||
| self.top_k = self.num_experts | ||||||
| self.top_k = max( | ||||||
| original_top_k, round(self.num_experts * self._moe_calib_experts_ratio) | ||||||
| ) | ||||||
|
Comment on lines
516
to
+520
|
||||||
| elif hasattr(self, "experts"): | ||||||
| self.top_k = self.experts.num_experts | ||||||
| self.top_k = max( | ||||||
| original_top_k, | ||||||
| round(self.experts.num_experts * self._moe_calib_experts_ratio), | ||||||
| ) | ||||||
| else: | ||||||
| raise ValueError(f"Could not find num_experts in module {self}") | ||||||
| super().forward(hidden_states) | ||||||
| self.top_k = original_top_k | ||||||
| # Enable counting only for the real-routing forward during calibration | ||||||
| self._count_expert_tokens = is_calib | ||||||
| self._count_expert_tokens = False | ||||||
| else: | ||||||
| self._count_expert_tokens = True | ||||||
|
||||||
| self._count_expert_tokens = True | |
| self._count_expert_tokens = is_calib |
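The core arithmetic of the patch is small enough to sanity-check in isolation: the calibration forward widens `top_k` to `max(original_top_k, round(num_experts * ratio))`, so it never routes to fewer experts than normal inference would, and `ratio=1.0` recovers the old all-experts behavior. A dependency-free sketch (the helper name `calib_top_k` is hypothetical):

```python
def calib_top_k(original_top_k: int, num_experts: int, ratio: float) -> int:
    """Top-k used for the calibration-only forward pass."""
    assert 0 < ratio <= 1, "moe_calib_experts_ratio must be between 0 and 1"
    # Never drop below the model's real top_k, even for tiny ratios.
    return max(original_top_k, round(num_experts * ratio))


print(calib_top_k(2, 64, 0.25))  # 16: a quarter of 64 experts
print(calib_top_k(2, 64, 1.0))   # 64: ratio 1.0 recovers all-experts calibration
print(calib_top_k(2, 64, 0.01))  # 2: round(0.64) = 1 is clamped up to the real top_k
```

The `max(...)` clamp is why the reviewed `forward` can safely restore `original_top_k` afterwards: the widened value is only ever used for the throwaway calibration forward.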
Contributor:

**Clarify whether all-experts calibration should be the default during quantization.** The class docstring promises "During calibration, we forward all tokens to all experts so that all experts see sufficient tokens to calibrate" (line 445), but this behavior only activates when `_moe_calib_experts_ratio` is explicitly set in the quantization config. Since it defaults to `None`, users relying on the documented behavior will not get the expanded-expert forward pass.

Additionally, the `else` block at lines 529-530 enables token counting for both inference (`is_calib=False`) and calibration with an unset ratio (`is_calib=True`, ratio `None`), creating unnecessary overhead during inference when tokens should not be counted.

Either set a default ratio (e.g., `1.0` for all experts) when entering calibration mode, or update the docstring to clarify that expanded-expert forwarding requires explicit configuration.
Contributor:

`if moe_calib_experts_ratio:` uses truthiness and will silently ignore `0.0` (and will treat negative values as enabled). Prefer `if moe_calib_experts_ratio is not None:` and validate `0 < ratio <= 1` so invalid values fail fast with a clear error.
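The truthiness pitfall flagged above is easy to demonstrate: `0.0` is falsy, so `if ratio:` skips it silently, while a negative float is truthy and sails straight through the guard (the function names below are illustrative only):

```python
def enabled_by_truthiness(ratio):
    # Mirrors `if moe_calib_experts_ratio:` — falsy values are silently skipped.
    return bool(ratio)


def enabled_by_none_check(ratio):
    # Mirrors `if moe_calib_experts_ratio is not None:` — only None disables the path.
    return ratio is not None


print(enabled_by_truthiness(0.0))   # False: an invalid 0.0 is silently ignored
print(enabled_by_truthiness(-0.5))  # True: a negative value is treated as enabled
print(enabled_by_none_check(0.0))   # True: the None check lets validation reject 0.0 loudly
```

With the `is not None` form, an out-of-range value reaches the `0 < ratio <= 1` validation and fails with a clear error instead of being ignored or half-applied.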