[OMNIML-4730] Support quantized nn.Embedding by ajrasane · Pull Request #1495 · NVIDIA/Model-Optimizer

ajrasane · 2026-05-14T17:54:45Z

What does this PR do?

Type of change: new feature

Register nn.Embedding in QuantModuleRegistry so the embedding table and lookup activations participate in quantization end-to-end:

- New modelopt/torch/quantization/nn/modules/quant_embedding.py exposes weight_quantizer (embedding table), output_quantizer (lookup activations, off by default), and an input_quantizer placeholder. Embedding inputs are integer indices that cannot be fake-quantized, so direct enable() / enable_quant() / enable_calib() calls on input_quantizer raise, and forward() raises if _disabled is flipped via any back door. Wildcard configs (*input_quantizer) are accepted silently so the stock deny-all → enable-wildcards → opt-out pattern in NVFP4_DEFAULT_CFG and friends still works.
- default_disabled_quantizers.yaml installs parent_class: nn.Embedding, enable: false so embedding quantization is opt-in and existing model behavior is unchanged.
- is_quantized_linear in core_utils.py early-returns False for nn.Embedding so AWQ / SmoothQuant / SVDQuant don't treat it as a GEMM op.
- _process_quantized_modules in unified_export_hf.py routes quantized nn.Embedding modules through _export_quantized_weight, so the exported checkpoint contains the packed NVFP4 / FP8 / INT bytes plus weight_scale* buffers, exactly like Linear layers.

Usage

import copy
import torch.nn as nn
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# Opt embeddings into the stock NVFP4 config; the YAML default leaves them disabled.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"].append(
    {
        "parent_class": "nn.Embedding",
        "quantizer_name": "*weight_quantizer",
        "cfg": {"num_bits": (2, 1), "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)}},
    }
)

# `model` and `forward_loop` (the calibration loop) are assumed to be defined by the caller.
model = mtq.quantize(model, cfg, forward_loop)
export_hf_checkpoint(model, export_dir="./out")
# out/model.safetensors contains: embedding.weight (uint8, NVFP4-packed),
# embedding.weight_scale (FP8 E4M3 per-block), embedding.weight_scale_2 (FP32).
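
As a rough sanity check on the tensors listed in the comment above, the sketch below estimates their shapes and total size for a given embedding table. It assumes two 4-bit values are packed into each uint8 byte, one FP8 scale per block of 16 elements along the last dimension (the block size from the config above), and a single scalar FP32 weight_scale_2. The function name and the exact layout are illustrative assumptions, not taken from the exporter.

```python
def nvfp4_embedding_layout(num_embeddings: int, embedding_dim: int, block_size: int = 16):
    """Estimate exported tensor shapes for an NVFP4-packed embedding table.

    Assumptions (not confirmed by the PR): two FP4 values share one uint8 byte,
    each block of `block_size` values along the last dim has one FP8 scale,
    and weight_scale_2 is a single FP32 scalar.
    """
    if embedding_dim % block_size != 0:
        raise ValueError("embedding_dim must be a multiple of block_size")
    packed_shape = (num_embeddings, embedding_dim // 2)           # uint8, 2 values per byte
    scale_shape = (num_embeddings, embedding_dim // block_size)   # float8_e4m3fn, 1 byte each
    packed_bytes = packed_shape[0] * packed_shape[1]
    scale_bytes = scale_shape[0] * scale_shape[1]
    global_scale_bytes = 4                                        # one float32 value
    return {
        "weight": packed_shape,
        "weight_scale": scale_shape,
        "total_bytes": packed_bytes + scale_bytes + global_scale_bytes,
    }


layout = nvfp4_embedding_layout(num_embeddings=32000, embedding_dim=4096)
bf16_bytes = 32000 * 4096 * 2  # same table stored as 2-byte BF16 values
print(layout)
print(f"size relative to BF16: {layout['total_bytes'] / bf16_bytes:.3f}")
```

Under these assumptions the packed table plus scales comes to a bit over a quarter of the BF16 size, which is the main motivation for packing the embedding at export time.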

Testing

New unit tests tests/unit/torch/quantization/test_quant_embedding.py cover: default quantizer state, no-quant identity, per-tensor and per-row weight fake quant against the manual tensor_quant.fake_tensor_quant reference, output quantizer activation, locked-mutator raises (parametrized over enable / enable_quant / enable_calib), forward-time guard for back-door _disabled = False, and the wildcard-then-opt-out pattern. All 9 cases pass.
Verified end-to-end on an embedding-only model: mtq.quantize with NVFP4_DEFAULT_CFG + the embedding opt-in produces embedding.weight (uint8), embedding.weight_scale (float8_e4m3fn), embedding.weight_scale_2 (float32) in the exported safetensors, with "quant_algo": "NVFP4" in hf_quant_config.json.
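
For readers unfamiliar with the per-tensor and per-row cases the tests compare, here is a minimal pure-Python sketch of symmetric fake quantization. It is a generic amax-based scheme written for illustration; the helper names are hypothetical, and the library's tensor_quant.fake_tensor_quant reference may differ in clamping or rounding details.

```python
def fake_quant(values, amax, num_bits=8):
    """Quantize to a signed integer grid scaled by amax, then dequantize."""
    max_int = 2 ** (num_bits - 1) - 1
    if amax == 0:
        return list(values)
    scale = amax / max_int
    out = []
    for v in values:
        q = max(-max_int, min(max_int, round(v / scale)))  # round, then clamp
        out.append(q * scale)
    return out


def per_tensor(table, num_bits=8):
    """One amax shared by the whole embedding table."""
    amax = max(abs(v) for row in table for v in row)
    return [fake_quant(row, amax, num_bits) for row in table]


def per_row(table, num_bits=8):
    """One amax per embedding row, so small-magnitude rows keep more resolution."""
    return [fake_quant(row, max(abs(v) for v in row), num_bits) for row in table]


def max_abs_error(reference, approx):
    return max(abs(a - b) for a, b in zip(reference, approx))


table = [[0.01, -0.02, 0.03], [1.0, -2.0, 4.0]]
err_tensor = max_abs_error(table[0], per_tensor(table)[0])
err_row = max_abs_error(table[0], per_row(table)[0])
print(err_tensor, err_row)
```

In this toy table the first row is much smaller than the second, so sharing one amax across the table costs it most of its resolution, while the per-row variant does not.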

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ — embedding quantizers are opt-in via parent_class: nn.Embedding, enable: false in default_disabled_quantizers.yaml, so existing model behavior is unchanged.
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: ✅
Did you get Claude approval on this PR?: ❌ — will run /claude review after the PR is up.

Additional Information

Summary by CodeRabbit

New Features
- Opt-in quantization for embedding layers: configurable weight quantization and optional output quantization; input quantization is permanently disabled.
Bug Fixes
- Preserve tied embedding weights during export (packing skipped with a warning) to avoid breaking weight ties.
Tests
- Added unit tests for embedding quantization behavior, export packing, calibration, and tied-weight scenarios.
Documentation
- Changelog updated with 0.45 embedding quantization notes.

coderabbitai · 2026-05-14T17:55:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 015cf74e-2bb9-4261-8b2d-0a09a77101ac

📥 Commits

Reviewing files that changed from the base of the PR and between 4c4db31 and a932284.

📒 Files selected for processing (7)

CHANGELOG.rst
modelopt/torch/export/unified_export_hf.py
modelopt/torch/quantization/nn/__init__.py
modelopt/torch/quantization/nn/modules/quant_embedding.py
modelopt/torch/quantization/utils/core_utils.py
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
tests/unit/torch/quantization/test_quant_embedding.py

✅ Files skipped from review due to trivial changes (2)

modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
CHANGELOG.rst

🚧 Files skipped from review as they are similar to previous changes (4)

modelopt/torch/export/unified_export_hf.py
modelopt/torch/quantization/nn/__init__.py
modelopt/torch/quantization/utils/core_utils.py
modelopt/torch/quantization/nn/modules/quant_embedding.py

📝 Walkthrough

Walkthrough

Adds QuantEmbedding (quantized nn.Embedding) with gated weight quantization, a permanently disabled input-quantizer, optional output quantization, export packing support (with tied-weight skip), calibration exclusion, default-disabled config entry, unit tests, and a changelog entry.

Changes

Quantized Embedding Support

Layer / File(s)	Summary
QuantEmbedding Core Implementation `modelopt/torch/quantization/nn/modules/quant_embedding.py`	Introduces `_QuantEmbedding` with weight quantizer (gated by `quantize_weight()` / export-mode), an `_UnsettableInputQuantizer` that raises on enable attempts, optional `output_quantizer`, `_get_quantized_weight` dynamic backing for `weight`, and `forward()` enforcing disabled input quantizer and applying output quantization.
Integration, Export, and Calibration `modelopt/torch/quantization/nn/__init__.py`, `modelopt/torch/export/unified_export_hf.py`, `modelopt/torch/quantization/utils/core_utils.py`, `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`, `CHANGELOG.rst`	Re-exports `QuantEmbedding`, excludes `nn.Embedding` from linear-module detection (`is_quantized_linear`), adds export branch to pack quantized embedding weights under `fsdp2_aware_weight_update` (skips packing when `.weight` is tied), inserts default-disabled quantizer YAML entry for `nn.Embedding`, and documents the feature in the changelog.
Tests and Export Checks `tests/unit/torch/quantization/test_quant_embedding.py`	Adds unit tests validating default quantizer states, weight fake-quant (per-tensor and per-row), output-quantizer application and calibration behavior, input-quantizer enablement protections, wildcard/config behavior, and export packing vs tied-weight skipping.

sequenceDiagram
  participant Client
  participant QuantEmbedding
  participant WeightQuantizer
  participant OutputQuantizer
  Client->>QuantEmbedding: forward(input_indices)
  QuantEmbedding->>QuantEmbedding: ensure input_quantizer disabled
  QuantEmbedding->>WeightQuantizer: get quantized weight (if enabled or export)
  alt quantized weight returned
    QuantEmbedding->>QuantEmbedding: lookup with quantized weight
  else raw weight used
    QuantEmbedding->>QuantEmbedding: lookup with raw weight
  end
  QuantEmbedding->>OutputQuantizer: apply output quantizer if enabled
  QuantEmbedding-->>Client: return embeddings

Estimated code review effort:
🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs:

NVIDIA/Model-Optimizer#1313: Extends _export_quantized_weight to handle NVFP4 W4A16 packing/scaling logic that interacts with embedding weight packing added here.

Suggested reviewers:

cjluo-nv
meenchen
ChenhanYu
sychen52

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[OMNIML-4730] Support quantized nn.Embedding' directly and clearly summarizes the main change: adding support for quantized embedding layers to the modelopt framework.
Docstring Coverage	✅ Passed	Docstring coverage is 95.65% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	Comprehensive security review completed. No torch.load unsafe patterns, numpy.load with allow_pickle, hardcoded trust_remote_code, eval/exec, nosec comments, or unsafe dependencies detected.


github-actions · 2026-05-14T17:59:03Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1495/
Built to branch `gh-pages` at 2026-05-19 20:29 UTC. Preview will be ready when the GitHub Pages deployment is complete.

cjluo-nv

Bot review — DM the bot to share feedback.

Quantized nn.Embedding support cleanly mirrors the _QuantLinear/QuantLinearConvBase pattern (dynamic-attribute weight, quantize_weight context, _register_temp_attribute), so design-wise it slots in well — no second composition system. The wildcard-tolerance via _UnsettableInputQuantizer is unusual but justified: stock recipes apply *input_quantizer enables, and the YAML parent_class: nn.Embedding, *, enable: false rule is appended last in every preset that uses _default_disabled_quantizer_cfg, so the disabled state is restored before forward. Unit tests cover the lock semantics and weight quant against fake_tensor_quant reference.
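
The ordering argument above (wildcards enable input quantizers, then the embedding rule appended last disables them again) can be sketched with a toy rule resolver. This is a hypothetical stand-in written for illustration, assuming rules are applied in order and the last matching rule wins; it is not the library's actual config matcher.

```python
from fnmatch import fnmatchcase


def resolve_enabled(quantizer_name, parent_class, rules):
    """Apply rules in order; the last matching rule decides the enabled state."""
    enabled = False
    for rule in rules:
        # A rule without "parent_class" applies to every module type.
        class_ok = rule.get("parent_class") in (None, parent_class)
        if class_ok and fnmatchcase(quantizer_name, rule["quantizer_name"]):
            enabled = rule["enable"]
    return enabled


rules = [
    {"quantizer_name": "*", "enable": False},                 # deny-all
    {"quantizer_name": "*input_quantizer", "enable": True},   # enable wildcards
    {"quantizer_name": "*weight_quantizer", "enable": True},
    # Opt-out appended last, so it overrides the wildcards for embeddings only.
    {"parent_class": "nn.Embedding", "quantizer_name": "*", "enable": False},
]

embed_input = resolve_enabled("model.embed_tokens.input_quantizer", "nn.Embedding", rules)
linear_input = resolve_enabled("model.layers.0.q_proj.input_quantizer", "nn.Linear", rules)
print(embed_input, linear_input)
```

If the embedding rule were placed before the wildcard rules instead, the wildcard would win and the embedding's input quantizer would end up enabled, which is why the position of the opt-out matters.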

Three things worth a maintainer look before approving:

1. output_quantizer is silently bypassed under torch.export. _QuantEmbedding.forward does if is_torch_export_mode(): return super().forward(...); that path never calls self.output_quantizer(output). QuantLinearConvBase/QuantInputBase both keep the output_quantizer in the export path. If a user opts into output_quantizer and then torch.exports, they'll lose it without warning. Probably harmless today (output_quantizer is off by default) but it's an inconsistency.
2. Tied embeddings (tied_word_embeddings=True) likely break on export. _export_quantized_weight does setattr(sub_module, weight_name, nn.Parameter(quantized_weight, ...)), replacing embedding.weight with a new Parameter holding packed uint8 bytes. If lm_head.weight was tied to the same Parameter, the tie is severed and lm_head keeps a stale float weight; postprocess_state_dict's tied-weight dedup will then drop one of the keys from the safetensors output. The PR description's example uses an embedding-only model, which sidesteps this, but in real LLMs (Llama/Qwen with tied embeddings) this needs at least a guard or explicit warning.
3. No export-path test. All new tests are pure forward tests; the new _process_quantized_modules branch routing nn.Embedding through _export_quantized_weight has no coverage. Given (2), an export round-trip test on a tiny tied-embedding model would catch the issue. The PR description says it was verified manually on an embedding-only model; that's exactly the case that doesn't exercise the tying path.
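
The output_quantizer bypass raised in the first point can be illustrated with a toy stand-in for the module. The class, method names, and the rounding "quantizer" below are hypothetical and only mimic the control flow described in the comment; one way to avoid the bypass is to let both branches fall through to a shared tail.

```python
def round_to_tenth(rows):
    """Stand-in output quantizer: rounds every value to one decimal place."""
    return [[round(v, 1) for v in row] for row in rows]


class ToyEmbedding:
    def __init__(self, table, output_quantizer):
        self.table = table
        self.output_quantizer = output_quantizer

    def _lookup(self, indices):
        return [list(self.table[i]) for i in indices]

    def forward_early_return(self, indices, export_mode):
        if export_mode:
            return self._lookup(indices)  # early return: output_quantizer never runs
        return self.output_quantizer(self._lookup(indices))

    def forward_shared_tail(self, indices, export_mode):
        if export_mode:
            output = self._lookup(indices)  # export path: plain lookup
        else:
            output = self._lookup(indices)  # normal path (weight quantization omitted in this toy)
        return self.output_quantizer(output)  # applied on both paths


emb = ToyEmbedding([[0.12, 0.57]], round_to_tenth)
bypassed = emb.forward_early_return([0], export_mode=True)
quantized = emb.forward_shared_tail([0], export_mode=True)
print(bypassed, quantized)
```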

Smaller/optional: the _UnsettableInputQuantizer.enable* overrides catch user-facing direct calls, but set_from_attribute_config({"enable": True}) writes _disabled directly via setattr, so the only real defense is the runtime check in forward. The current docstring already explains this; just confirm the runtime guard is the load-bearing one and the method overrides are belt-and-suspenders.

Register nn.Embedding in QuantModuleRegistry so the embedding table and the lookup activations participate in quantization. The literal input is integer indices, so input_quantizer is a non-configurable placeholder that raises on direct enable*() calls and at forward-time if its _disabled flag is flipped. Wildcard configs (e.g. NVFP4_DEFAULT_CFG's *input_quantizer) are accepted silently so the stock deny-all → enable wildcards → opt-out pattern continues to work, and the opt-out is installed by default (parent_class: nn.Embedding in default_disabled_quantizers.yaml). export_hf_checkpoint packs quantized embedding weights through the same path as Linear layers.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

- Apply output_quantizer in the torch.export branch of _QuantEmbedding.forward so users who opt into output activation quantization don't silently lose it during export. Matches QuantInputBase.forward's behavior.
- Detect Python-level weight tying (e.g. tied_word_embeddings → lm_head) in _process_quantized_modules and skip packing the embedding when the .weight Parameter is shared, with a UserWarning. Packing would otherwise reassign the embedding's .weight to a new uint8 Parameter, severing the tie and leaving the tied module pointing at a stale float Parameter.
- Add export-path tests covering the normal pack flow (weight → uint8 + weight_scale + weight_scale_2 buffers) and the tied-embedding skip path (weight unchanged, warning raised, tie preserved).

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

The previous design raised in _QuantEmbedding.forward whenever input_quantizer.is_enabled, on the theory that any non-disable config was an explicit user mistake. That assumption was wrong for wildcard configs: the default QuantizeConfig is just [{"quantizer_name": "*", "cfg": {"num_bits": 8, ...}}] (no embedding opt-out), so the wildcard enables embed_tokens.input_quantizer for tiny Llama-style tests and the forward guard fires, breaking test_peft_save_load and test_transformers_save_load.

Switch _UnsettableInputQuantizer.set_from_attribute_config to absorb the incoming config like a normal quantizer, then force _disabled = True at the end. The "throw on explicit set" semantics are preserved via the .enable / .enable_quant / .enable_calib overrides, which catch the direct mistakes users would actually make. The forward-time guard (and the corresponding test) are removed since the invariant is now maintained at the configure step.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
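
The "absorb the config, then force disabled" behavior described in this commit message can be sketched with toy classes. ToyQuantizer and LockedInputQuantizer are hypothetical stand-ins that only mirror the method names mentioned above; they are not the library's TensorQuantizer or _UnsettableInputQuantizer.

```python
class ToyQuantizer:
    """Minimal stand-in for a quantizer with an enable flag and config setter."""

    def __init__(self):
        self._disabled = False
        self.num_bits = 8

    @property
    def is_enabled(self):
        return not self._disabled

    def enable(self):
        self._disabled = False

    def enable_quant(self):
        self._disabled = False

    def enable_calib(self):
        self._disabled = False

    def set_from_attribute_config(self, cfg):
        for key, value in cfg.items():
            if key == "enable":
                self._disabled = not value
            else:
                setattr(self, key, value)


class LockedInputQuantizer(ToyQuantizer):
    """Accepts wildcard configs silently but can never end up enabled."""

    def __init__(self):
        super().__init__()
        self._disabled = True

    def _refuse(self, *args, **kwargs):
        raise RuntimeError("embedding inputs are integer indices and cannot be quantized")

    # Direct, explicit enable attempts are treated as user mistakes.
    enable = enable_quant = enable_calib = _refuse

    def set_from_attribute_config(self, cfg):
        super().set_from_attribute_config(cfg)  # absorb the config like a normal quantizer
        self._disabled = True                   # then restore the invariant


locked = LockedInputQuantizer()
locked.set_from_attribute_config({"enable": True, "num_bits": 4})
print(locked.is_enabled, locked.num_bits)
```

The design choice here is that the invariant is enforced where configs are applied rather than checked later in forward, so a wildcard config cannot trip a runtime guard.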

cjluo-nv

Bot review — DM the bot to share feedback.

Re-review: all three critical comments from the previous round are addressed.

1. output_quantizer under torch.export: forward() now restructured so return self.output_quantizer(output) lives outside the if/else, so it is applied in both the export and non-export paths, matching QuantInputBase.forward. ✅
2. Tied-embedding export safety: _process_quantized_modules now scans model.named_modules() for any other module whose weight is the same Parameter object as the embedding's, and skips packing with a UserWarning when a tie is found, preserving tied_word_embeddings semantics. ✅
3. Export-path tests: New TestQuantEmbeddingExport covers both the success path (uint8-packed weight + weight_scale/weight_scale_2 registered, no input_scale) and the tied-weight guard (warning + tie preserved). ✅
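
The tied-weight guard summarized above comes down to an object-identity check before replacing the weight. The sketch below uses plain Python objects as stand-ins for modules and Parameters; ToyModule, find_tied_modules, and pack_if_untied are hypothetical names for illustration, not the code in _process_quantized_modules.

```python
import warnings


class ToyModule:
    """Stand-in for a module that owns a `weight` object."""

    def __init__(self, weight=None):
        self.weight = weight


def find_tied_modules(named_modules, target_name):
    """Return names of other modules whose weight is the same object as the target's."""
    target_weight = named_modules[target_name].weight
    if target_weight is None:
        return []
    return [
        name
        for name, module in named_modules.items()
        if name != target_name and getattr(module, "weight", None) is target_weight
    ]


def pack_if_untied(named_modules, target_name, pack_fn):
    """Replace the target weight with its packed form unless the weight is shared."""
    tied = find_tied_modules(named_modules, target_name)
    if tied:
        warnings.warn(f"{target_name}.weight is tied to {tied}; skipping packing", UserWarning)
        return False
    module = named_modules[target_name]
    module.weight = pack_fn(module.weight)
    return True


shared = [0.1, 0.2]
modules = {
    "embed_tokens": ToyModule(shared),
    "lm_head": ToyModule(shared),       # same object: tied
    "proj": ToyModule([0.1, 0.2]),      # equal values, different object: not tied
}
tied_names = find_tied_modules(modules, "embed_tokens")
print(tied_names)
```

Identity (`is`) rather than equality is the relevant test: two modules with equal but separate weights can be packed independently, whereas reassigning a shared weight on one module would leave the other pointing at the old object.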

Design-wise this isn't a new abstraction — quant_embedding.py slots in next to quant_linear.py/quant_conv.py/etc., reusing QuantModule, QuantModuleRegistry, _register_temp_attribute, _register_dynamic_attribute, and the quantize_weight context manager pattern. Previous reviewer explicitly endorsed the design.

Minor docs drift: PR body still says "forward() raises if _disabled is flipped via any back door", but the current forward has no such runtime check. That's fine in practice — nn.Embedding.forward never invokes input_quantizer, so a back-door enable can't cause an integer-index tensor to hit the fake-quant path. Worth tightening the PR description but not a blocker.

Complex PR: spans 7 directories (≥ 5). Looping in a human for approval.

cjluo-nv · 2026-05-19T20:50:09Z

+        # so we disable it once at construction via direct attribute assignment.
+        input_quantizer = _UnsettableInputQuantizer(self.default_quant_desc_input)
+        input_quantizer._disabled = True
+        self._register_temp_attribute("input_quantizer", input_quantizer)


Could you help me understand why we have an input_quantizer here? Isn't this a weight-only quantizer?

cjluo-nv · 2026-05-19T20:50:16Z


+    # Embedding has a 2D weight but is not a GEMM op, so calibration passes that operate
+    # on linear activations (AWQ, SmoothQuant, SVDQuant) must skip it.
+    if isinstance(module, nn.Embedding):


Why do we need this?

cjluo-nv

Do we plan to run any fake-quant evals?

ajrasane requested review from a team as code owners May 14, 2026 17:54

ajrasane requested a review from cjluo-nv May 14, 2026 17:54

ajrasane changed the title ~~feat(quant): support quantized nn.Embedding~~ [OMNIML-4730] Support quantized nn.Embedding May 14, 2026

cjluo-nv reviewed May 14, 2026


ajrasane mentioned this pull request May 15, 2026

[Quantization] Add ModelOpt FP8/NVFP4 weight-only embedding methods vllm-project/vllm#42791

Draft

4 tasks

ajrasane added 3 commits May 19, 2026 20:25

ajrasane force-pushed the ajrasane/quant_embedding branch from 4c4db31 to a932284 Compare May 19, 2026 20:26

coderabbitai Bot approved these changes May 19, 2026


cjluo-nv reviewed May 19, 2026

ajrasane wants to merge 3 commits into main from ajrasane/quant_embedding


2 participants
