Commit 6403389

Feat: Configurable Eagle ROPE scaling during export (#1238)
### What does this PR do?

JIRA ticket: https://jirasw.nvidia.com/browse/OMNIML-3469

Type of change: New feature

Decouple EAGLE training rope configuration from export rope configuration, enabling separate YaRN rope scaling injection at export time for long-context inference.

#### Changes

**Configurable export rope scaling (`EagleConfig`)**
- Add `eagle_export_rope_scaling` field to `EagleConfig` with a default YaRN config (`factor=32.0`, `original_max_position_embeddings=2048`)
- Set to `{}` to disable rope scaling injection at export

**Simplified training defaults (`default_config.py`)**
- Change the default training rope from `llama3` (theta=500k) to `default` (theta=10k) — models now train with simple positional embeddings; rope scaling is applied only at export
- Add `rope_theta` inside the `rope_scaling` dict for transformers 5.x cross-version compatibility

**Move config validation/rewriting into `EagleConfig` (`config.py`)**
- `_derive_eagle_offline`: derives `eagle_offline` from `data_args.offline_data_path` via the validation context, removing the manual assignment in `main.py`
- `_check_rope_scaling_consistency`: rejects configs where `eagle_export_rope_scaling` is set but the training `rope_type` is not `"default"`
- `_warn_rope_vs_training_seq_len`: warns when `original_max_position_embeddings` differs from `training_seq_len`

**Export rope injection (`hf_spec_export.py`)**
- Inject `eagle_export_rope_scaling` into the exported HF config when the training rope_type is `"default"`
- Fall back to `rope_theta` from the `rope_scaling` dict for transformers 5.x compatibility

**Fix Megatron RotaryEmbedding crash (`megatron_eagle.py`)**
- `dict_to_config()` set `rope_scaling=True` whenever the `rope_scaling` key existed, even without a `"factor"` — causing `RotaryEmbedding` to divide by `None`
- Now only enables `rope_scaling` when the dict actually contains a `"factor"` key

### Usage

Configure in the YAML config (or use the defaults from `eagle3.yaml`):

```yaml
eagle:
  eagle_export_rope_scaling:
    rope_type: yarn
    factor: 32.0
    original_max_position_embeddings: 2048
```

Set to an empty dict to disable export rope injection:

```yaml
eagle:
  eagle_export_rope_scaling: {}
```

### Testing

- New unit tests: `tests/unit/torch/speculative/test_eagle_config.py` — rope consistency validator, seq_len warning, context-derived `eagle_offline`
- New unit tests: `tests/unit/torch/export/test_hf_spec_rope_export.py` — export rope injection, fallback, and empty-config cases

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ (new field has a sensible default; existing configs work unchanged)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ✅
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ❌ (should be added if merging as a feature)

## Summary by CodeRabbit

* **New Features**
  * Add export-time rope-scaling configuration for EAGLE models.
* **Improvements**
  * Stronger validation and context-aware reconciliation between training and export configs.
  * Export now injects rope-scaling and rope-theta when appropriate.
  * Default rope-scaling values updated for EAGLE variants.
  * Model instances now expose export rope-scaling for downstream use.
* **Tests**
  * Added unit tests covering rope-scaling export behavior and configuration validators.

---------

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
1 parent 14b78ae commit 6403389

9 files changed

Lines changed: 289 additions & 17 deletions

examples/speculative_decoding/main.py

Lines changed: 6 additions & 2 deletions
```diff
@@ -48,6 +48,7 @@

 import modelopt.torch.opt as mto
 import modelopt.torch.speculative as mtsp
+from modelopt.torch.speculative.config import EagleConfig
 from modelopt.torch.speculative.utils import load_vlm_or_llm, patch_transformers5_params_loading
 from modelopt.torch.utils import print_rank_0

@@ -266,8 +267,11 @@ def train():
         }
         mtsp.convert(model, [("medusa", config)])
     elif training_args.mode == "eagle3":
-        # eagle_cfg maps directly to EagleConfig fields; eagle_offline is derived here.
-        eagle_cfg["eagle_offline"] = use_offline_training
+        # Validate and rewrite eagle config fields
+        eagle_cfg = EagleConfig.model_validate(
+            eagle_cfg,
+            context={"training_args": training_args, "data_args": data_args},
+        ).model_dump()
         mtsp.convert(model, [("eagle", eagle_cfg)])

     # Load draft vocab cache if the draft model uses a compressed vocabulary
```

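The `EagleConfig.model_validate(..., context=...)` call above relies on Pydantic v2's validation context, which the new validators in `config.py` read via `info.context`. Below is a minimal sketch of that pattern with a toy model; the class and field names are illustrative stand-ins, not the real `EagleConfig`.

```python
# Minimal sketch of Pydantic's validation-context pattern (toy names, not EagleConfig).
from types import SimpleNamespace
from typing import Any

from pydantic import BaseModel, ValidationInfo, model_validator


class ToyConfig(BaseModel):
    offline: bool = False

    @model_validator(mode="before")
    @classmethod
    def _derive_offline(cls, data: Any, info: ValidationInfo) -> Any:
        # Context is optional; validating without it leaves the field at its default.
        ctx = info.context or {}
        data_args = ctx.get("data_args")
        if data_args is not None and isinstance(data, dict):
            data["offline"] = data_args.offline_data_path is not None
        return data


data_args = SimpleNamespace(offline_data_path="/tmp/offline_data")
assert ToyConfig.model_validate({}, context={"data_args": data_args}).model_dump() == {"offline": True}
assert ToyConfig.model_validate({}).model_dump() == {"offline": False}
```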
modelopt/torch/export/plugins/hf_spec_export.py

Lines changed: 12 additions & 0 deletions
```diff
@@ -187,6 +187,18 @@ def _get_config_from_draft_or_base(key: str, model: nn.Module):
                 new_value = str(new_value).replace("torch.", "")
             template_config[key] = new_value

+        # Inject export rope scaling override when training rope_type is "default".
+        rope_cfg = self.model.eagle_config.rope_scaling or {}
+        training_rope_type = rope_cfg.get("rope_type") or rope_cfg.get("type")
+        eagle_export_rope_scaling = getattr(self.model, "eagle_export_rope_scaling", None)
+        if eagle_export_rope_scaling and training_rope_type == "default":
+            template_config["rope_scaling"] = eagle_export_rope_scaling
+
+        # In transformers 5.x, rope_theta is under rope_scaling, not the main config.
+        # Always source from the training rope config (rope_theta is not in export overrides).
+        if template_config.get("rope_theta") is None and rope_cfg:
+            template_config["rope_theta"] = rope_cfg.get("rope_theta")
+
         return template_config

     def export(
```

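To make the rule above easier to read in isolation, here is a hedged, standalone restatement of the same decision logic on plain dicts. The helper name is hypothetical; the real logic runs inline while the exporter assembles `template_config`.

```python
# Standalone restatement of the export-time rope injection rule (hypothetical helper name).
def inject_export_rope(template_config: dict, training_rope_cfg: dict, export_rope: dict | None) -> dict:
    rope_cfg = training_rope_cfg or {}
    # Accept both the "rope_type" key and the legacy "type" key.
    rope_type = rope_cfg.get("rope_type") or rope_cfg.get("type")
    # Override only when training used plain (unscaled) rope and an export config was given.
    if export_rope and rope_type == "default":
        template_config["rope_scaling"] = export_rope
    # transformers 5.x keeps rope_theta inside rope_scaling; fall back to it when missing.
    if template_config.get("rope_theta") is None and rope_cfg:
        template_config["rope_theta"] = rope_cfg.get("rope_theta")
    return template_config


# YaRN is injected only for a "default" training rope; rope_theta falls back from rope_scaling.
cfg = inject_export_rope({}, {"rope_type": "default", "rope_theta": 10000},
                         {"rope_type": "yarn", "factor": 32.0})
assert cfg["rope_scaling"]["rope_type"] == "yarn" and cfg["rope_theta"] == 10000
```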
modelopt/torch/speculative/config.py

Lines changed: 52 additions & 0 deletions
```diff
@@ -15,7 +15,11 @@

 """Configurations for speculative decoding modes."""

+import warnings
 from copy import deepcopy
+from typing import Any
+
+from pydantic import ValidationInfo, model_validator

 from modelopt.torch.opt.config import ModeloptBaseConfig, ModeloptField

@@ -120,3 +124,51 @@ class EagleConfig(ModeloptBaseConfig):
         default=False,
         description="Whether to enable NVTX ranges for profiling eagle forward/loss methods.",
     )
+
+    eagle_export_rope_scaling: dict = ModeloptField(
+        default={"rope_type": "yarn", "factor": 32.0, "original_max_position_embeddings": 2048},
+        description=(
+            "The rope_scaling config to inject into the exported HuggingFace model config. "
+            "Applied when the training rope_type is 'default' (no scaling). "
+            "Set to empty dict {} to disable rope scaling injection at export."
+        ),
+    )
+
+    @model_validator(mode="before")
+    @classmethod
+    def _derive_eagle_offline(cls, data: Any, info: ValidationInfo) -> Any:
+        """Derive ``eagle_offline`` from ``data_args.offline_data_path`` when provided in context."""
+        ctx = info.context if info.context else {}
+        data_args = ctx.get("data_args")
+        if data_args is not None and isinstance(data, dict):
+            data["eagle_offline"] = data_args.offline_data_path is not None
+        return data
+
+    @model_validator(mode="after")
+    def _check_rope_scaling_consistency(self) -> "EagleConfig":
+        if not self.eagle_export_rope_scaling:
+            return self
+        rope_cfg = self.eagle_architecture_config.get("rope_scaling", {}) or {}
+        rope_type = rope_cfg.get("rope_type") or rope_cfg.get("type")
+        if rope_type is not None and rope_type != "default":
+            raise ValueError(
+                f"eagle_export_rope_scaling is set but eagle_architecture_config has "
+                f"rope_type='{rope_type}'. Export rope overwrite is only valid when the "
+                f"training rope_type is 'default' (no scaling)."
+            )
+        return self
+
+    @model_validator(mode="after")
+    def _warn_rope_vs_training_seq_len(self, info: ValidationInfo) -> "EagleConfig":
+        ctx = info.context if info.context else {}
+        training_args = ctx.get("training_args")
+        if training_args is None:
+            return self
+        orig_max_pos = self.eagle_export_rope_scaling.get("original_max_position_embeddings")
+        if orig_max_pos is not None and orig_max_pos != training_args.training_seq_len:
+            warnings.warn(
+                f"eagle_export_rope_scaling.original_max_position_embeddings ({orig_max_pos}) "
+                f"differs from training_seq_len ({training_args.training_seq_len}). "
+                f"This may affect long-context inference quality."
+            )
+        return self
```

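For reference, the context-driven `eagle_offline` derivation can be exercised directly with a stand-in for the real data args object, assuming `eagle_offline` is exposed as a boolean field on `EagleConfig` (which the removed manual assignment in `main.py` implies). A hedged sketch:

```python
# Hedged sketch: eagle_offline is derived from data_args.offline_data_path via the validation context.
from types import SimpleNamespace

from modelopt.torch.speculative.config import EagleConfig

offline_args = SimpleNamespace(offline_data_path="/data/eagle_offline")  # stand-in for the real DataArgs
online_args = SimpleNamespace(offline_data_path=None)

assert EagleConfig.model_validate({}, context={"data_args": offline_args}).eagle_offline is True
assert EagleConfig.model_validate({}, context={"data_args": online_args}).eagle_offline is False
# Without a context (e.g. inside the convert chain) the field keeps its default.
EagleConfig.model_validate({})
```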
modelopt/torch/speculative/eagle/default_config.py

Lines changed: 7 additions & 9 deletions
```diff
@@ -19,15 +19,10 @@
     "hidden_act": "silu",
     "torch_dtype": "bfloat16",
     "position_embedding_type": "rope",
-    "rope_scaling": {
-        "factor": 8.0,
-        "low_freq_factor": 1.0,
-        "high_freq_factor": 4.0,
-        "original_max_position_embeddings": 8192,
-        "rope_type": "llama3",
-        "rope_theta": 500000.0,
-    },
-    "rope_theta": 500000.0,
+    # rope_theta is set both inside rope_scaling and at the top level for cross-version
+    # compatibility: transformers 5.x reads it from rope_scaling, while 4.x reads it top-level.
+    "rope_scaling": {"rope_type": "default", "rope_theta": 10000},
+    "rope_theta": 10000,
     "num_hidden_layers": 1,
     "intermediate_size": 14336,
     "num_attention_heads": 32,
@@ -83,6 +78,8 @@
     "qk_nope_head_dim": 128,
     "qk_rope_head_dim": 64,
     "rms_norm_eps": 0.00001,
+    # rope_theta is set both inside rope_scaling and at the top level for cross-version
+    # compatibility: transformers 5.x reads it from rope_scaling, while 4.x reads it top-level.
     "rope_scaling": {
         "beta_fast": 1.0,
         "beta_slow": 1.0,
@@ -91,6 +88,7 @@
         "mscale_all_dim": 1.0,
         "original_max_position_embeddings": 4096,
         "type": "yarn",
+        "rope_theta": 50000.0,
     },
     "rope_theta": 50000.0,
     "routed_scaling_factor": 2.827,
```

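The duplicated `rope_theta` above exists only so that both transformers generations can find it. A small illustration of the lookup order on plain dicts (the helper name is made up; no transformers import is needed):

```python
# Illustrative lookup order for rope_theta across config layouts (helper name is made up).
def read_rope_theta(config: dict, default: float = 10000.0) -> float:
    # transformers 4.x layout: top-level key.
    if config.get("rope_theta") is not None:
        return config["rope_theta"]
    # transformers 5.x layout: nested under rope_scaling.
    return (config.get("rope_scaling") or {}).get("rope_theta", default)


assert read_rope_theta({"rope_theta": 500000.0}) == 500000.0
assert read_rope_theta({"rope_scaling": {"rope_type": "default", "rope_theta": 10000}}) == 10000
assert read_rope_theta({}) == 10000.0
```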
modelopt/torch/speculative/eagle/eagle_model.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -41,3 +41,4 @@ def modify(
         self.eagle_mix_hidden_states = config.eagle_mix_hidden_states
         self.eagle_use_torch_compile = config.eagle_use_torch_compile
         self.eagle_enable_nvtx = config.eagle_enable_nvtx
+        self.eagle_export_rope_scaling = config.eagle_export_rope_scaling
```

modelopt/torch/speculative/plugins/megatron_eagle.py

Lines changed: 3 additions & 6 deletions
```diff
@@ -107,12 +107,9 @@ def dict_to_config(
     config.position_embedding_type = architecture_config.get("position_embedding_type")
     config.rotary_percent = 1.0
     config.rotary_base = architecture_config.get("rope_theta")
-    config.rope_scaling = "rope_scaling" in architecture_config
-    config.rope_scaling_factor = (
-        architecture_config.get("rope_scaling").get("factor")
-        if "rope_scaling" in architecture_config
-        else None
-    )
+    _rope_scaling_dict = architecture_config.get("rope_scaling", {})
+    config.rope_scaling = isinstance(_rope_scaling_dict, dict) and "factor" in _rope_scaling_dict
+    config.rope_scaling_factor = _rope_scaling_dict.get("factor") if config.rope_scaling else None

     config.draft_vocab_size = architecture_config.get("draft_vocab_size")
     config.use_input_layernorm_in_first_layer = architecture_config.get(
```

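To see why the guard matters, here is a hedged before/after comparison on a plain dict, mirroring the hunk above; `SimpleNamespace` stands in for the Megatron config object.

```python
# Before/after comparison of the dict_to_config rope_scaling guard (SimpleNamespace stands in for the config).
from types import SimpleNamespace

arch = {"rope_scaling": {"rope_type": "default", "rope_theta": 10000}}  # note: no "factor" key
config = SimpleNamespace()

# Old behavior: the flag was set from key presence alone, so rope_scaling_factor stayed None
# and RotaryEmbedding later divided by None.
config.rope_scaling = "rope_scaling" in arch  # True, even though there is no factor
config.rope_scaling_factor = (
    arch.get("rope_scaling").get("factor") if "rope_scaling" in arch else None
)  # None -> downstream crash

# New behavior: scaling is enabled only when a "factor" is actually present.
_rope_scaling_dict = arch.get("rope_scaling", {})
config.rope_scaling = isinstance(_rope_scaling_dict, dict) and "factor" in _rope_scaling_dict
config.rope_scaling_factor = _rope_scaling_dict.get("factor") if config.rope_scaling else None
assert config.rope_scaling is False and config.rope_scaling_factor is None
```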
modelopt_recipes/general/speculative_decoding/eagle3.yaml

Lines changed: 6 additions & 0 deletions
```diff
@@ -55,5 +55,11 @@ eagle:
   eagle_reuse_base_decoder: false
   eagle_report_acc: true
   eagle_enable_nvtx: false
+  # Rope scaling: disable during training (default_config.py uses rope_type=default),
+  # inject YaRN during export for long-context inference.
+  eagle_export_rope_scaling:
+    rope_type: yarn
+    factor: 32.0
+    original_max_position_embeddings: 2048
   # overwrite to modelopt/torch/speculative/eagle/default_config.py
   eagle_architecture_config: {}
```
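As a hedged sanity check, the `eagle:` block of this recipe can be fed straight through the new validators, assuming the recipe loads with `yaml.safe_load` and `eagle` is a top-level key (as the hunk header suggests):

```python
# Hedged sketch: validate the recipe's eagle block with EagleConfig (file path as in this repo).
import yaml

from modelopt.torch.speculative.config import EagleConfig

with open("modelopt_recipes/general/speculative_decoding/eagle3.yaml") as f:
    recipe = yaml.safe_load(f)

eagle_cfg = EagleConfig.model_validate(recipe["eagle"]).model_dump()
assert eagle_cfg["eagle_export_rope_scaling"]["rope_type"] == "yarn"
```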
tests/unit/torch/export/test_hf_spec_rope_export.py (new file)

Lines changed: 73 additions & 0 deletions

```python
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Unit tests for EAGLE export rope scaling logic in hf_spec_export.py."""

from unittest.mock import MagicMock

from modelopt.torch.export.plugins.hf_spec_export import EagleExporter

DEFAULT_ROPE_SCALING = {
    "rope_type": "yarn",
    "factor": 32.0,
    "original_max_position_embeddings": 2048,
}


def _make_exporter(
    rope_type="default",
    rope_theta=10000,
    eagle_export_rope_scaling=None,
):
    if eagle_export_rope_scaling is None:
        eagle_export_rope_scaling = DEFAULT_ROPE_SCALING

    model = MagicMock()
    model.eagle_config.eagle_decoder_type = "llama"
    model.eagle_config.rope_scaling = {"rope_type": rope_type, "rope_theta": rope_theta}
    model.eagle_export_rope_scaling = eagle_export_rope_scaling
    model._draft_model_config = None
    model.config.rope_scaling = None
    model.config.rope_theta = None

    exporter = EagleExporter.__new__(EagleExporter)
    exporter.model = model
    exporter.eagle_decoder_type = "llama"
    exporter.num_hidden_layers = 1
    return exporter


def test_yarn_rope_injected_with_correct_config():
    """YaRN rope_scaling is injected as-is when training rope_type is 'default'."""
    config = _make_exporter(rope_type="default")._export_config()
    assert config["rope_scaling"] == DEFAULT_ROPE_SCALING


def test_rope_not_injected_when_non_default_training_rope():
    """rope_scaling is not overridden when training rope_type is not 'default'."""
    config = _make_exporter(rope_type="llama3")._export_config()
    assert config.get("rope_scaling") is None


def test_rope_not_injected_when_eagle_export_rope_scaling_is_empty():
    """rope_scaling is not injected when eagle_export_rope_scaling is empty dict."""
    config = _make_exporter(eagle_export_rope_scaling={})._export_config()
    assert config.get("rope_scaling") is None


def test_rope_theta_fallback_from_rope_scaling():
    """rope_theta is populated from rope_scaling when not available as top-level attr."""
    config = _make_exporter(rope_type="default", rope_theta=500000)._export_config()
    assert config["rope_theta"] == 500000
```
tests/unit/torch/speculative/test_eagle_config.py (new file)

Lines changed: 129 additions & 0 deletions

```python
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tests for EagleConfig model validators."""

import types
import warnings

import pytest
from pydantic import ValidationError

from modelopt.torch.speculative.config import EagleConfig

# --- rope scaling consistency validator tests ---


def test_rope_consistency_error_non_default_rope_type():
    """Error when eagle_export_rope_scaling is set but training rope_type is not 'default'."""
    cfg = {
        "eagle_export_rope_scaling": {"rope_type": "yarn", "factor": 32.0},
        "eagle_architecture_config": {"rope_scaling": {"rope_type": "llama3"}},
    }
    with pytest.raises(ValidationError, match="rope_type='llama3'"):
        EagleConfig.model_validate(cfg)


def test_rope_consistency_error_non_default_rope_type_alt_key():
    """Error when rope_scaling uses 'type' key instead of 'rope_type' (kimik2-style)."""
    cfg = {
        "eagle_export_rope_scaling": {"rope_type": "yarn", "factor": 32.0},
        "eagle_architecture_config": {"rope_scaling": {"type": "yarn"}},
    }
    with pytest.raises(ValidationError, match="rope_type='yarn'"):
        EagleConfig.model_validate(cfg)


def test_rope_consistency_ok_default_rope_type():
    """No error when training rope_type is 'default'."""
    cfg = {
        "eagle_export_rope_scaling": {"rope_type": "yarn", "factor": 32.0},
        "eagle_architecture_config": {"rope_scaling": {"rope_type": "default"}},
    }
    EagleConfig.model_validate(cfg)


def test_rope_consistency_ok_no_rope_scaling_in_arch():
    """No error when eagle_architecture_config has no rope_scaling (defaults to 'default')."""
    cfg = {
        "eagle_export_rope_scaling": {"rope_type": "yarn", "factor": 32.0},
        "eagle_architecture_config": {},
    }
    EagleConfig.model_validate(cfg)


def test_rope_consistency_ok_empty_export_rope():
    """No error when eagle_export_rope_scaling is empty (disabled)."""
    cfg = {
        "eagle_export_rope_scaling": {},
        "eagle_architecture_config": {"rope_scaling": {"rope_type": "llama3"}},
    }
    EagleConfig.model_validate(cfg)


# --- rope vs training_seq_len warning tests ---


def _make_training_args(training_seq_len: int):
    return types.SimpleNamespace(training_seq_len=training_seq_len)


def test_warn_rope_mismatch():
    """Warning should fire when original_max_position_embeddings != training_seq_len."""
    cfg = {
        "eagle_export_rope_scaling": {
            "rope_type": "yarn",
            "factor": 32.0,
            "original_max_position_embeddings": 2048,
        },
    }
    with pytest.warns(UserWarning, match="differs from training_seq_len"):
        EagleConfig.model_validate(cfg, context={"training_args": _make_training_args(4096)})


def test_no_warn_rope_match():
    """No warning when original_max_position_embeddings == training_seq_len."""
    cfg = {
        "eagle_export_rope_scaling": {
            "rope_type": "yarn",
            "factor": 32.0,
            "original_max_position_embeddings": 2048,
        },
    }
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        EagleConfig.model_validate(cfg, context={"training_args": _make_training_args(2048)})


def test_no_warn_without_context():
    """No warning when context is not provided (e.g. inside convert chain)."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        EagleConfig.model_validate({})


def test_no_warn_missing_orig_max_pos():
    """No warning when original_max_position_embeddings is absent from rope scaling config."""
    cfg = {"eagle_export_rope_scaling": {}}
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        EagleConfig.model_validate(cfg, context={"training_args": _make_training_args(4096)})


def test_no_warn_empty_context():
    """No warning when context dict has no training_args key."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        EagleConfig.model_validate({}, context={})
```
