Add support for Qwen3Omni30B thinking model #856

Closed
ajrasane wants to merge 12 commits into main from ajrasane/qwen3omni_final

Conversation

@ajrasane
Contributor

@ajrasane ajrasane commented Feb 5, 2026

What does this PR do?

Type of change: Model support

Overview:

Usage

python hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-Omni-30B-A3B-Thinking \
    --qformat nvfp4 \
    --kv_cache_qformat none \
    --export_path ./qwen3_omni_30b_nvfp4 \
    --trust_remote_code \
    --attn_implementation flash_attention_2 \
    --dataset cnn_dailymail \
    --verbose \
    --device "cuda:0" \
    --quant_summary_path ./quant_summary.txt

Testing

Able to quantize the model, export the HF checkpoint, and run inference with vLLM:

Baseline:
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7891|±  |0.0032|
| - humanities     |      2|none  |      |acc   |↑  |0.6978|±  |0.0063|
| - other          |      2|none  |      |acc   |↑  |0.8265|±  |0.0065|
| - social sciences|      2|none  |      |acc   |↑  |0.8794|±  |0.0058|
| - stem           |      2|none  |      |acc   |↑  |0.8002|±  |0.0069|

nvfp4:
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7789|±  |0.0033|
| - humanities     |      2|none  |      |acc   |↑  |0.6895|±  |0.0064|
| - other          |      2|none  |      |acc   |↑  |0.8146|±  |0.0066|
| - social sciences|      2|none  |      |acc   |↑  |0.8723|±  |0.0059|
| - stem           |      2|none  |      |acc   |↑  |0.7859|±  |0.0070|
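For a quick read of the tables above, the per-group accuracy drop from baseline to nvfp4 can be computed directly (values copied from the two tables; the aggregate MMLU drop is about one point):

```python
# Per-group MMLU accuracy from the baseline and nvfp4 tables above.
baseline = {"mmlu": 0.7891, "humanities": 0.6978, "other": 0.8265,
            "social sciences": 0.8794, "stem": 0.8002}
nvfp4 = {"mmlu": 0.7789, "humanities": 0.6895, "other": 0.8146,
         "social sciences": 0.8723, "stem": 0.7859}

# Drop in accuracy points (percentage points), rounded to 2 decimals.
deltas = {group: round((baseline[group] - nvfp4[group]) * 100, 2) for group in baseline}
for group, pts in deltas.items():
    print(f"{group:15s} drop: {pts:.2f} pts")
```

All groups stay well within ~1.5 points of baseline.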

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Summary by CodeRabbit

  • New Features

    • Multimodal support for Qwen3Omni and Mllama: text, image, and video processors, dataloaders, and integrated vLLM inference.
    • Generation options now propagate through calibration, quantization, and post-PTQ previews for accurate multimodal outputs.
    • CLI option to save quantization summaries.
  • Bug Fixes

    • Pre-save generation_config adjusted to avoid HF checkpoint validation errors.
    • Clearer error messaging for VL-model extraction failures.
  • Chores

    • Improved model-type detection, tokenizer handling, and model-specific quantization/export tweaks.

@ajrasane ajrasane requested review from a team as code owners February 5, 2026 01:01
@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds Qwen3Omni multimodal support across PTQ, dataloaders, vLLM inference, and export: new text/image/video processors and dataloaders, model-type mapping, quantization adjustments, calibration-batch propagation through pre/post-quantize, a vLLM runner, and a CLI option to save quantization summaries.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Example utilities & vLLM runner<br>`examples/llm_ptq/example_utils.py`, `examples/llm_ptq/run_vllm.py` | New tokenizer constants and helpers (TOKENIZER_FILES, model type/sampling/quant format detection), ensure-tokenizer copy, export/patch helpers, model-type-specific generation kwargs, and a vLLM runner that detects model type/quant and runs generation. |
| PTQ pipeline & CLI<br>`examples/llm_ptq/hf_ptq.py` | Integrated Qwen3Omni into calibration/quant/export flows; `pre_quantize` now returns `calib_batch`, `post_quantize` accepts `calib_batch`; `calib_batch` threaded through `quantize_main`; added `--quant_summary_path` and `generation_kwargs` propagation. |
| Image/Video processors<br>`modelopt/torch/utils/image_processor.py`, `modelopt/torch/utils/video_dataset_utils.py` | Added Qwen3Omni processors: `Qwen3OmniTextProcessor`, `Qwen3OmniImageProcessor`, `Qwen3OmniVideoProcessor` with preprocess/collate and Arrow-compatible schemas; video utils include dataset loading, caching, and a dataloader factory. |
| Dataset utilities & forward routing<br>`modelopt/torch/utils/dataset_utils.py`, `modelopt/torch/utils/__init__.py` | Added `get_qwen3omni_text_dataloader`, re-exported video utils, introduced `_should_use_generate`, extended `create_forward_loop`/`_forward_loop` and `_process_batch` to accept `generation_kwargs`, and routed forward vs generate accordingly. |
| Model export / mapping / unified export<br>`modelopt/torch/export/model_utils.py`, `modelopt/torch/export/unified_export_hf.py` | Added MODEL_NAME_TO_TYPE entry for Qwen3Omni; `get_language_model_from_vl` may return the thinker; unified export improved for encoder-decoder and VL language_model extraction and fixes generation_config pre-save behavior. |
| Quant plugins & summary output<br>`modelopt/torch/quantization/plugins/huggingface.py`, `modelopt/torch/quantization/model_quant.py` | Optional registration for the Qwen3Omni MoE thinker block; `print_quant_summary` renamed its parameter to `save_path` and now writes the full summary to a file or prints it. |
| Layer utils<br>`modelopt/torch/export/layer_utils.py` | Recognizes `Qwen3OmniMoeThinkerTextSparseMoeBlock` in MoE linear name extraction. |
| Misc examples<br>`examples/llm_eval/run_lm_eval_vllm.sh` | Added an optional `HOST` parameter and updated `BASE_URL` to support remote vLLM hosts. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI as User/CLI
    participant Main as hf_ptq (quantize_main)
    participant Proc as ProcessorSelector
    participant DLFactory as DataLoaderFactory
    participant PreQ as pre_quantize
    participant Quant as quantize_model
    participant PostQ as post_quantize
    participant Export as unified_export

    CLI->>Main: start with --model_name=qwen3omni
    Main->>Proc: get_processor(model_type="qwen3omni")
    Proc-->>Main: Qwen3OmniImageProcessor / TextProcessor / VideoProcessor
    Main->>DLFactory: select dataloader (video / vl / text)
    DLFactory-->>Main: calib_dataloader

    Main->>PreQ: pre_quantize(calib_dataloader,...)
    PreQ-->>Main: preview_ids, generated_before_ptq, calib_batch

    Main->>Quant: quantize_model(calib_batch,...)
    Quant->>Quant: route inference via _should_use_generate / forward
    Quant-->>Main: quantized_model

    Main->>PostQ: post_quantize(..., calib_batch)
    PostQ-->>Main: post-PTQ results

    Main->>Export: export_quantized(...)
    Export->>Export: patch_config_for_unified_export
    Export-->>Main: exported_checkpoint

    alt --quant_summary_path provided
        Main->>Main: write quant summary to file + print confirmation
    else
        Main->>Main: print quant summary to stdout
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.86%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding support for the Qwen3Omni30B thinking model. It matches the PR's primary objective and is specific enough for scanning history. |
| Security Anti-Patterns | ✅ Passed | The only `torch.load(..., weights_only=False)` in the PR loads a self-generated cache with an inline justifying comment; no `numpy.load` with `allow_pickle=True`; all `trust_remote_code` flags are passed through caller parameters; no eval/exec on external input; no nosec bypass comments added; no new non-permissive dependencies. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 219-221: The current code mutates processor.dtype (processor.dtype
= language_model.dtype) which can have unintended side effects; instead avoid
changing the existing processor and either pass dtype explicitly into
get_vlm_dataset_dataloader/collate function or construct a new processor
instance with the desired dtype (e.g., create a new Qwen3OmniImageProcessor
using processor.tokenizer and language_model.dtype) and pass that to
get_vlm_dataset_dataloader so the original processor remains unchanged.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1001-1009: The code currently mutates model.generation_config
in-place (via gen_config and clearing temperature/top_p/top_k) which persists
after export; instead, create a non-mutating copy of model.generation_config (or
record the original values for "temperature","top_p","top_k"), apply the
sampling fixes to that copy (or to the saved serialization data) and use the
copy for writing the checkpoint (e.g., before calling save_pretrained), then
restore the original model.generation_config (or simply never mutate it) so the
caller's model is unchanged; reference symbols: model.generation_config,
gen_config, and the sampling attrs ["temperature","top_p","top_k"] when locating
where to implement the copy/restore behavior.

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 470-472: The code computes batch_size using tensor_data but
iterates keys from batch_data (batch_size =
tensor_data[next(iter(batch_data.keys()))].shape[0]), which can pick a
non-tensor key (e.g., max_new_tokens) and cause .shape access to fail; change
the key source to tensor_data so you select a tensor key (e.g., batch_size =
tensor_data[next(iter(tensor_data.keys()))].shape[0]) or explicitly find the
first value in tensor_data that is a tensor and use its shape[0]; update the
logic around the batch_size variable and references to tensor_data/batch_data to
ensure batch_size is derived from actual tensors (functions/variables to check:
tensor_data, batch_data, batch_size).

In `@modelopt/torch/utils/image_processor.py`:
- Around line 264-277: The code uses torch.tensor(...) for "pixel_values" and
"audio_features" which can change/infer dtypes inconsistently; change those to
preserve original dtype by using
torch.as_tensor(first["pixel_values"]).to(self.device) and
torch.as_tensor(first["audio_features"]).to(self.device) (keeping the existing
.to(self.device) pattern), so the dtype of the incoming data is preserved;
update the handling in image_processor.py where result["pixel_values"] and
result["audio_features"] are set (using the local variables first and result) to
use torch.as_tensor instead of torch.tensor.
🧹 Nitpick comments (5)
modelopt/torch/export/model_utils.py (1)

145-148: Minor: Update comment for consistency.

The comment "Pattern 3: No language_model found" at line 148 is now misleading since a new pattern was inserted above it. Consider renumbering or removing pattern numbering.

💡 Suggested fix
     if hasattr(model, "thinker"):
         return [model, model.thinker]

-    # Pattern 3: No language_model found
+    # No language_model found
     return None
examples/llm_ptq/example_utils.py (1)

284-286: Update docstring to reflect new return types.

The docstring states it returns a MllamaImageProcessor object, but now it can also return Qwen3OmniImageProcessor for the new model type.

📝 Suggested fix
 def get_processor(
     ckpt_path,
     model_type,
     device: torch.device = "auto",
     trust_remote_code=False,
     attn_implementation=None,
 ) -> BaseImageProcessor | ProcessorMixin | None:
     """
-    Returns a :class:`modelopt.torch.utils.image_processor.MllamaImageProcessor` object.
+    Returns an image processor for multimodal models (MllamaImageProcessor, Qwen3OmniImageProcessor)
+    or a ProcessorMixin for models like Whisper.
     """
modelopt/torch/utils/image_processor.py (1)

115-172: Consider interface consistency for preprocess_function.

Qwen3OmniTextProcessor.preprocess_function(text: str) has a different signature than the base class and sibling classes, which use preprocess_function(examples: dict). This could cause issues if callers expect a uniform interface.

Additionally, the dtype attribute is stored but never used in the processor methods.

💡 Suggested approach

Consider either:

  1. Aligning the signature with the base class by wrapping text in a dict:
def preprocess_function(self, examples):
    text = examples.get("text", examples) if isinstance(examples, dict) else examples
    # ... rest of implementation
  2. Or documenting the intentional deviation clearly in the class docstring.

For the unused dtype, either utilize it in tensor conversion or remove if not needed.

modelopt/torch/utils/dataset_utils.py (1)

136-146: Replace assertions with proper exceptions for public API.

Using assert for input validation in a public API function (get_qwen3omni_text_dataloader) is discouraged because assertions can be disabled with -O flag. Use ValueError or TypeError instead.

💡 Suggested fix
-    assert processor is not None, "Please provide a Qwen3OmniTextProcessor."
+    if processor is None:
+        raise ValueError("Please provide a Qwen3OmniTextProcessor.")
 
     if isinstance(num_samples, int):
         num_samples = [num_samples]
 
     if isinstance(dataset_name, str):
         dataset_name = [dataset_name]
 
-    assert len(dataset_name) == len(num_samples), (
-        "dataset_name and num_samples must be the same length"
-    )
+    if len(dataset_name) != len(num_samples):
+        raise ValueError("dataset_name and num_samples must be the same length")
examples/llm_ptq/hf_ptq.py (1)

756-766: Code duplication in Qwen3Omni generation handling.

The logic for handling Qwen3Omni's tuple return from generate() is duplicated between pre_quantize (lines 756-766) and post_quantize (lines 817-827). Consider extracting to a helper function.

♻️ Suggested refactor
def _extract_qwen3omni_generated_ids(result):
    """Extract generated IDs from Qwen3Omni generate() output."""
    if isinstance(result, tuple):
        text_ids, _ = result
        return text_ids.sequences if hasattr(text_ids, "sequences") else text_ids
    return result

Then use in both places:

# In pre_quantize and post_quantize:
result = full_model.generate(**calib_batch, max_new_tokens=100)
generated_ids = _extract_qwen3omni_generated_ids(result)

Also applies to: 817-827

Collaborator

@cjluo-nv cjluo-nv left a comment


@Edwardf0t1 could you review it as well?

Contributor

@Edwardf0t1 Edwardf0t1 left a comment


@ajrasane I'm wondering if the exported checkpoint can be deployed on vLLM/SGLang/TRT-LLM and how's the accuracy?

@ajrasane
Contributor Author

ajrasane commented Feb 6, 2026

@Edwardf0t1, I was able to deploy the thinking checkpoint on vLLM and get good outputs. I have also generated accuracy results with MMLU. PTAL.

return None


def ensure_tokenizer_files(model_path: str, source_model_id: str) -> None:
Collaborator


why do we need this?

Contributor Author


This was required by vLLM for running the model. If the tokenizer files are not saved, then we are unable to deploy the checkpoint with vLLM.
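For illustration, a minimal sketch of what such a helper could do. The real `ensure_tokenizer_files` in `example_utils.py` takes a source model id and its `TOKENIZER_FILES` constant may differ; here a local source directory and a hypothetical file list are assumed:

```python
import shutil
from pathlib import Path

# Hypothetical file list; the actual TOKENIZER_FILES in example_utils.py may differ.
TOKENIZER_FILES = ["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"]


def ensure_tokenizer_files(export_path: str, source_dir: str) -> None:
    """Copy tokenizer files into the export directory if they are missing,
    so vLLM can load the checkpoint without a separate tokenizer download."""
    export_dir = Path(export_path)
    src_dir = Path(source_dir)
    for name in TOKENIZER_FILES:
        src, dst = src_dir / name, export_dir / name
        # Only fill gaps; never overwrite tokenizer files already exported.
        if src.exists() and not dst.exists():
            shutil.copy2(src, dst)
```

As noted below, exporting the tokenizer directly from `hf_ptq.py` may make this copy step unnecessary.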

Contributor


I think we should be able to export the tokenizer files with https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_ptq/hf_ptq.py#L691-L696

print("No custom model files found to copy")


def patch_config_for_unified_export(model_type: str, export_path: str) -> None:
Collaborator


@Edwardf0t1 could you review this part?

else:
calibrate_loop = create_forward_loop(dataloader=calib_dataloader)
calibrate_loop = create_forward_loop(
dataloader=calib_dataloader, generation_kwargs=generation_kwargs
Collaborator


what will happen if generation_kwargs is empty?

Contributor Author


The generation_kwargs are ultimately passed to the model's infer_method (forward/generate). If the kwargs are empty, there should be no issues: unpacking an empty dict is a no-op, so just the batch_data is passed.
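A toy illustration of that point (not the actual hf_ptq code; `infer_method` here is a stand-in for the model's forward/generate):

```python
# Stand-in for the model's infer method (forward or generate).
def infer_method(input_ids, max_new_tokens=None):
    return {"input_ids": input_ids, "max_new_tokens": max_new_tokens}

batch_data = {"input_ids": [101, 2023, 102]}
generation_kwargs = {}  # empty: unpacking contributes no arguments

# Both calls are identical; **{} adds nothing.
assert infer_method(**batch_data, **generation_kwargs) == infer_method(**batch_data)
```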

generated_ids_before_ptq,
is_nemotron_vl_model,
first_text_speech_dataset,
calib_batch: dict | None = None,
Collaborator


please document why we need to add calib_batch

Contributor Author


Done

return processor.tokenizer.batch_decode(input_ids)
# BaseImageProcessor covers MllamaImageProcessor and Qwen3OmniImageProcessor
if processor is not None and isinstance(processor, BaseImageProcessor):
return processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
Collaborator


is 800 and 804 expected?

Contributor Author


I have updated the code since the last review. Is this comment still relevant?

@@ -0,0 +1,136 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator


is this qwen3 omni specific?

Contributor Author


Currently this is qwen3omni specific, but we can later extend it to the other models supported by hf_ptq.py.

@ajrasane ajrasane force-pushed the ajrasane/qwen3omni_final branch from 8740db3 to e202164 Compare March 17, 2026 18:42
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
modelopt/torch/export/unified_export_hf.py (1)

360-384: ⚠️ Potential issue | 🔴 Critical

Unreachable code: Lines 380-382 are dead code.

The condition at line 380 (getattr(model.config, "is_encoder_decoder", False)) is identical to line 360. Since line 360 already handles all encoder-decoder models, the elif branch at line 380 can never be reached.

Control flow analysis:

  1. If is_encoder_decoder is True → line 360 branch executes
  2. If is_encoder_decoder is False → skip line 360, check line 363
  3. If line 363 is also False → check line 380, but this is the same condition as line 360, which we already know is False

If the intent was to handle encoder-decoder models that are also VL (Nemotron/Qwen3Omni), consider restructuring the logic.

🐛 Proposed fix: Remove dead code
         if getattr(model.config, "is_encoder_decoder", False):
             # For encoder-decoder models, we need to pass both the encoder and decoder input ids
             model(fake_input, decoder_input_ids=decoder_fake_input)
         elif (is_vl_model and "nemotron" in model_type) or model_type.startswith("qwen3omni"):
             # For Nemotron VL models, try to run optimization on just the language model part
             language_model_lineage = get_language_model_from_vl(model)
 
             if language_model_lineage is not None:
                 language_model = language_model_lineage[-1]
                 print(
                     f"Running optimization on language model with fake_input shape: {fake_input.shape}"
                 )
                 # Pass use_cache=False to avoid KV cache issues in encoder-decoder models
                 language_model(fake_input, use_cache=False)
             else:
                 raise ValueError(
                     f"Cannot extract language_model from VL model (type: {model_type}). "
                     "This is required for requantization/resmoothing optimization. "
                     "Please ensure the model architecture is supported or file an issue."
                 )
-        elif getattr(model.config, "is_encoder_decoder", False):
-            # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids
-            model(fake_input, decoder_input_ids=decoder_fake_input)
         else:
             model(fake_input)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_hf.py` around lines 360 - 384, The
duplicate check for getattr(model.config, "is_encoder_decoder", False) creates
unreachable code; remove the second identical elif block (the branch starting
with that getattr at the end) and merge its behavior into the first
encoder-decoder handling if needed, or ensure encoder-decoder VL cases are
handled in the earlier VL-specific branch (the logic around is_vl_model,
"nemotron", model_type.startswith("qwen3omni"), and get_language_model_from_vl).
Update the control flow so only one encoder-decoder check remains (keep the
first if that calls model(fake_input, decoder_input_ids=decoder_fake_input)) and
adjust the VL branch to fall through or delegate to that encoder-decoder
handling where appropriate.
modelopt/torch/utils/dataset_utils.py (2)

626-628: ⚠️ Potential issue | 🟠 Major

Missing generation_kwargs in recursive _process_batch call.

When splitting batches due to known max_working_batch_size, the recursive call on line 626-628 doesn't pass generation_kwargs, causing it to default to an empty dict. This means generation parameters won't be propagated during batch splitting.

🐛 Proposed fix
             max_working_batch_size = _process_batch(
-                split_data, infer_method, max_working_batch_size
+                split_data, infer_method, generation_kwargs, max_working_batch_size
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 626 - 628, The recursive
call to _process_batch when splitting into split_data currently omits the
generation_kwargs parameter, so generation options aren't propagated; update the
call to include generation_kwargs (i.e., call _process_batch(split_data,
infer_method, max_working_batch_size, generation_kwargs)) so the same generation
settings are passed through; ensure you reference the _process_batch function
signature and the generation_kwargs variable used earlier in the enclosing scope
to maintain behavior consistency.

658-659: ⚠️ Potential issue | 🟠 Major

Missing generation_kwargs in OOM recovery recursive calls.

The recursive _process_batch calls on lines 658-659 (for OOM recovery by splitting in half) also don't pass generation_kwargs.

🐛 Proposed fix
     # Recursively process each half and track max working batch size
-    max_working_batch_size = _process_batch(split_data_1, infer_method)
-    max_working_batch_size = _process_batch(split_data_2, infer_method, max_working_batch_size)
+    max_working_batch_size = _process_batch(split_data_1, infer_method, generation_kwargs)
+    max_working_batch_size = _process_batch(split_data_2, infer_method, generation_kwargs, max_working_batch_size)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 658 - 659, The OOM
recovery recursion calling _process_batch for split_data_1 and split_data_2
omits passing generation_kwargs, so ensure both recursive calls include the
generation_kwargs parameter (i.e., call _process_batch(split_data_1,
infer_method, generation_kwargs=generation_kwargs, ...) and
_process_batch(split_data_2, infer_method, generation_kwargs=generation_kwargs,
...) while preserving the existing max_working_batch_size argument) so that
generation settings are preserved during batch-splitting; update the calls
referencing _process_batch and variables split_data_1 and split_data_2
accordingly.
♻️ Duplicate comments (3)
modelopt/torch/export/unified_export_hf.py (1)

1176-1186: ⚠️ Potential issue | 🟡 Minor

Model's generation_config mutation persists after export returns.

This block mutates model.generation_config.do_sample in-place. After export_hf_checkpoint returns, callers retain the modified model with do_sample=True, which may surprise users who continue to use the model object.

Consider saving the original value and restoring it after save_pretrained() completes.

♻️ Proposed fix: Restore original do_sample after save
         # Fix generation_config conflicts before saving
         # Some models have temperature/top_p/top_k set but do_sample=False which causes validation errors
+        _orig_do_sample = None
         if hasattr(model, "generation_config") and model.generation_config is not None:
             gen_config = model.generation_config
             if not getattr(gen_config, "do_sample", True):
                 # Enable sampling if sampling params are present
                 if any(
                     getattr(gen_config, attr, None) is not None
                     for attr in ["temperature", "top_p", "top_k"]
                 ):
+                    _orig_do_sample = gen_config.do_sample
                     gen_config.do_sample = True
 
         # Save model
         # Temporarily disable revert_weight_conversion if available — it doesn't handle
         # quantized state dicts (scalar scale tensors have 0 dimensions, causing IndexError).
         # We must patch both the source module and the importing module since
         # modeling_utils does `from core_model_loading import revert_weight_conversion`.
         _patches = _patch_revert_weight_conversion()
 
         try:
             model.save_pretrained(
                 export_dir,
                 state_dict={**post_state_dict, **(extra_state_dict or {})},
                 save_modelopt_state=save_modelopt_state,
             )
         finally:
             _unpatch_revert_weight_conversion(_patches)
+            # Restore original do_sample if we modified it
+            if _orig_do_sample is not None and hasattr(model, "generation_config"):
+                model.generation_config.do_sample = _orig_do_sample
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_hf.py` around lines 1176 - 1186, The
code currently mutates model.generation_config.do_sample in-place causing the
change to leak to the caller; capture the original value (e.g., orig_do_sample =
gen_config.do_sample) before setting gen_config.do_sample = True, proceed with
the existing save flow (save_pretrained()), and then restore
gen_config.do_sample = orig_do_sample after saving completes; reference the same
objects/variables used in the diff (model, model.generation_config, gen_config,
and save_pretrained()) so the temporary change does not persist after
export_hf_checkpoint returns.
modelopt/torch/utils/dataset_utils.py (1)

608-610: ⚠️ Potential issue | 🟠 Major

Bug: batch_data.keys() used instead of tensor_data.keys() when computing batch_size.

Line 610 retrieves batch_size from tensor_data but iterates keys from batch_data. If the first key in batch_data is a scalar parameter (like max_new_tokens), accessing .shape[0] will fail because scalars don't have a .shape attribute.

🐛 Proposed fix
     # Get the batch size of current data
-    batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]
+    batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 608 - 610, The code
computes batch_size by indexing tensor_data with a key from batch_data, which
can pick up non-tensor scalar keys (e.g., max_new_tokens) and cause .shape
access to fail; update the batch_size computation to use a key from tensor_data
instead (e.g., use the first key/value from tensor_data or its values iterator)
so you always read a tensor to call .shape[0]; ensure tensor_data is non-empty
before accessing to avoid StopIteration.
modelopt/torch/utils/image_processor.py (1)

264-277: ⚠️ Potential issue | 🟡 Minor

Inconsistent tensor creation may cause dtype issues.

torch.tensor() (lines 265, 277) infers dtype from input data, while torch.LongTensor() explicitly creates int64 tensors. For pixel_values and audio_features, the original dtype (likely float32/bfloat16) should be preserved or explicitly set.

🔧 Suggested fix
         # Handle pixel values for images
         if first.get("pixel_values") is not None:
-            result["pixel_values"] = torch.tensor(first["pixel_values"]).to(self.device)
+            result["pixel_values"] = torch.as_tensor(first["pixel_values"]).to(self.device)
         ...
         if first.get("audio_features") is not None:
-            result["audio_features"] = torch.tensor(first["audio_features"]).to(self.device)
+            result["audio_features"] = torch.as_tensor(first["audio_features"]).to(self.device)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/image_processor.py` around lines 264 - 277, The code is
inconsistently creating tensors with torch.tensor(...) for pixel_values and
audio_features while using torch.LongTensor(...) for integer lengths, which can
change or infer dtypes incorrectly; update the pixel_values and audio_features
creation to preserve their original dtypes and place them on the device (e.g.,
use torch.as_tensor(...) or torch.tensor(..., dtype=first_dtype,
device=self.device)) instead of torch.tensor(...). Keep image_grid_thw and
audio_feature_lens as explicit LongTensors (result["image_grid_thw"],
result["audio_feature_lens"]) and ensure all tensor constructions reference
self.device consistently.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (1)

676-678: Unused _should_use_generate function - incomplete refactoring?

The _should_use_generate function is defined at lines 850-859 but the call site at line 677 is commented out, using model_type_is_enc_dec instead. Either remove the dead _should_use_generate function or uncomment the intended usage. This appears to be leftover from incomplete refactoring.

♻️ If _should_use_generate is intended, use it
     with torch.no_grad():
-        # use_generate = _should_use_generate(model)
-        use_generate = model_type_is_enc_dec(model)
+        use_generate = _should_use_generate(model)
         infer_method = model.generate if use_generate else model.forward

Or remove the unused function if model_type_is_enc_dec is the correct behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/dataset_utils.py` around lines 676 - 678, The code
currently comments out the call to _should_use_generate and instead uses
model_type_is_enc_dec to set use_generate (affecting infer_method), leaving the
_should_use_generate function unused; either restore the intended behavior by
uncommenting/using _should_use_generate where use_generate is assigned (so
infer_method = model.generate if _should_use_generate(model) else
model.forward), or delete the dead _should_use_generate function (defined at
lines 850-859) to remove dead code and ensure consistency between the
use_generate logic and the actual helper used (model_type_is_enc_dec or
_should_use_generate).
examples/llm_ptq/hf_ptq.py (1)

436-443: Clarify: processor = None forces text-only calibration for Qwen3Omni.

Setting processor = None at line 441 means get_qwen3omni_dataloader will use the text-only fallback path. If multimodal calibration (video/VLM datasets) is ever needed, the processor would need to be preserved. Consider adding a comment clarifying this is intentional for text-only calibration.

📝 Suggested documentation
         if model_type == "qwen3omni":
             print("Disabling talker for Qwen3Omni model")
             full_model.disable_talker()
             language_model = full_model.thinker.model
             tokenizer = processor.tokenizer.tokenizer
+            # Use text-only calibration for Qwen3Omni; set processor=None to skip multimodal path
             processor = None
             default_padding_side = tokenizer.padding_side
             default_pad_token = tokenizer.pad_token
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 436 - 443, The assignment processor
= None inside the model_type == "qwen3omni" branch intentionally forces
text-only calibration (so get_qwen3omni_dataloader will take the text-only
fallback); make this explicit by adding a clarifying comment next to processor =
None that states it disables multimodal processing and forces text-only
calibration for Qwen3Omni, and note that to enable multimodal/video/VLM
calibration the processor must be preserved (i.e., remove the processor = None
line), referencing the surrounding symbols full_model.disable_talker,
language_model = full_model.thinker.model, tokenizer =
processor.tokenizer.tokenizer, and get_qwen3omni_dataloader to guide where the
change applies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 1097-1099: The code mutates processor.dtype directly
(processor.dtype = model_dtype) which can leak state; instead avoid altering the
passed-in Processor instance: create a shallow copy of the processor before
changing dtype (e.g., copy/clone the Processor and set dtype on the clone) or
modify get_vlm_dataset_dataloader/collate_function to accept an explicit dtype
parameter and pass model_dtype through that path; update the call sites around
get_vlm_dataset_dataloader and any collate_function usage to use the cloned
processor or the explicit dtype argument so the original processor instance is
not mutated.
- Around line 1078-1079: The current check "if dataset_name in
get_supported_video_datasets()" fails when dataset_name is a list because a list
cannot be a member of a list of strings; update the condition to handle both str
and list[str] by first checking isinstance(dataset_name, str) and comparing it
to get_supported_video_datasets(), or if isinstance(dataset_name, list) iterate
(or check dataset_name[0]) against get_supported_video_datasets(); adjust the
subsequent assert to validate the element type(s) (e.g., assert
isinstance(dataset_name, str) for the string branch or assert all(isinstance(x,
str) for x in dataset_name) for the list branch) and ensure the branch logic
uses the correct element(s).

In `@examples/llm_ptq/run_vllm.py`:
- Around line 104-111: The LLM is being instantiated with a hardcoded
trust_remote_code=True; update the LLM(...) call to use the CLI flag instead
(e.g., replace trust_remote_code=True with
trust_remote_code=args.trust_remote_code or the appropriate parsed flag name) so
the vLLM LLM class respects the command-line setting.
- Around line 68-76: The code currently hardcodes trust_remote_code=True in
AutoConfig.from_pretrained and AutoProcessor.from_pretrained; add a CLI flag
(e.g., args.trust_remote_code, default False) and use that value instead of the
hardcoded True when calling AutoConfig.from_pretrained and
AutoProcessor.from_pretrained (also update any parser help text and default
behavior), ensuring the examples/***.py exposes the option to callers and
defaults to False to avoid executing remote code by default.

In `@modelopt/torch/utils/video_dataset_utils.py`:
- Around line 121-129: Add an inline comment next to the torch.load(cache_path,
weights_only=False) call explaining why weights_only is set to False and why
this is safe: note that the cache file (loaded into processed_samples and later
turned into processed_dataset via Dataset.from_list) is produced internally by
this module, trusted, and not coming from untrusted external sources, so loading
non-weight tensors is acceptable; include mention that the cache format is
controlled by the code that writes it to ensure compatibility and security.
- Around line 281-325: The collate_function currently grabs only the first item
(first = batch[0]) and discards other samples, so either enforce batch_size==1
or properly aggregate the batch; update collate_function to iterate over batch
and stack/concat per-key (e.g., for "input_ids" and "attention_mask" create a
LongTensor from a list of all items and .to(self.device), for
"pixel_values_videos" and "input_features" build a torch.tensor for each item
then torch.stack and apply self.dtype and device, and similarly stack
"video_grid_thw", "video_second_per_grid" and "feature_attention_mask"), or
alternatively add an explicit validation in
collate_function/get_video_dataset_dataloader that raises a clear error if
len(batch) != 1 when multi-sample batching is not supported; reference
collate_function, batch, first, and get_video_dataset_dataloader when making the
change.

---

Outside diff comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 360-384: The duplicate check for getattr(model.config,
"is_encoder_decoder", False) creates unreachable code; remove the second
identical elif block (the branch starting with that getattr at the end) and
merge its behavior into the first encoder-decoder handling if needed, or ensure
encoder-decoder VL cases are handled in the earlier VL-specific branch (the
logic around is_vl_model, "nemotron", model_type.startswith("qwen3omni"), and
get_language_model_from_vl). Update the control flow so only one encoder-decoder
check remains (keep the first if that calls model(fake_input,
decoder_input_ids=decoder_fake_input)) and adjust the VL branch to fall through
or delegate to that encoder-decoder handling where appropriate.

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 626-628: The recursive call to _process_batch when splitting into
split_data currently omits the generation_kwargs parameter, so generation
options aren't propagated; update the call to include generation_kwargs (i.e.,
call _process_batch(split_data, infer_method, max_working_batch_size,
generation_kwargs)) so the same generation settings are passed through; ensure
you reference the _process_batch function signature and the generation_kwargs
variable used earlier in the enclosing scope to maintain behavior consistency.
- Around line 658-659: The OOM recovery recursion calling _process_batch for
split_data_1 and split_data_2 omits passing generation_kwargs, so ensure both
recursive calls include the generation_kwargs parameter (i.e., call
_process_batch(split_data_1, infer_method, generation_kwargs=generation_kwargs,
...) and _process_batch(split_data_2, infer_method,
generation_kwargs=generation_kwargs, ...) while preserving the existing
max_working_batch_size argument) so that generation settings are preserved
during batch-splitting; update the calls referencing _process_batch and
variables split_data_1 and split_data_2 accordingly.
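The two dataset_utils findings above (forwarding generation_kwargs through every recursive call, and guarding None fields during the OOM split) can be combined in one sketch. The signature below is assumed from the review text, not taken from the actual code.

```python
import torch


def process_batch(tensor_data, infer_method, max_working_batch_size, generation_kwargs=None):
    """Sketch (names assumed): pass generation_kwargs by keyword on every
    recursive call so it never lands in a positional slot, and skip slicing
    for None entries when splitting the batch in half."""
    generation_kwargs = generation_kwargs or {}
    batch_size = next(iter(tensor_data.values())).shape[0]
    if batch_size <= max_working_batch_size:
        return [infer_method(**tensor_data, **generation_kwargs)]
    mid = batch_size // 2
    # Guard None fields: only sliceable values are split.
    split_data_1 = {k: (v[:mid] if v is not None else None) for k, v in tensor_data.items()}
    split_data_2 = {k: (v[mid:] if v is not None else None) for k, v in tensor_data.items()}
    return process_batch(
        split_data_1, infer_method, max_working_batch_size, generation_kwargs=generation_kwargs
    ) + process_batch(
        split_data_2, infer_method, max_working_batch_size, generation_kwargs=generation_kwargs
    )
```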

---

Duplicate comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1176-1186: The code currently mutates
model.generation_config.do_sample in-place causing the change to leak to the
caller; capture the original value (e.g., orig_do_sample = gen_config.do_sample)
before setting gen_config.do_sample = True, proceed with the existing save flow
(save_pretrained()), and then restore gen_config.do_sample = orig_do_sample
after saving completes; reference the same objects/variables used in the diff
(model, model.generation_config, gen_config, and save_pretrained()) so the
temporary change does not persist after export_hf_checkpoint returns.

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 608-610: The code computes batch_size by indexing tensor_data with
a key from batch_data, which can pick up non-tensor scalar keys (e.g.,
max_new_tokens) and cause .shape access to fail; update the batch_size
computation to use a key from tensor_data instead (e.g., use the first key/value
from tensor_data or its values iterator) so you always read a tensor to call
.shape[0]; ensure tensor_data is non-empty before accessing to avoid
StopIteration.

In `@modelopt/torch/utils/image_processor.py`:
- Around line 264-277: The code is inconsistently creating tensors with
torch.tensor(...) for pixel_values and audio_features while using
torch.LongTensor(...) for integer lengths, which can change or infer dtypes
incorrectly; update the pixel_values and audio_features creation to preserve
their original dtypes and place them on the device (e.g., use
torch.as_tensor(...) or torch.tensor(..., dtype=first_dtype,
device=self.device)) instead of torch.tensor(...). Keep image_grid_thw and
audio_feature_lens as explicit LongTensors (result["image_grid_thw"],
result["audio_feature_lens"]) and ensure all tensor constructions reference
self.device consistently.

---

Nitpick comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 436-443: The assignment processor = None inside the model_type ==
"qwen3omni" branch intentionally forces text-only calibration (so
get_qwen3omni_dataloader will take the text-only fallback); make this explicit
by adding a clarifying comment next to processor = None that states it disables
multimodal processing and forces text-only calibration for Qwen3Omni, and note
that to enable multimodal/video/VLM calibration the processor must be preserved
(i.e., remove the processor = None line), referencing the surrounding symbols
full_model.disable_talker, language_model = full_model.thinker.model, tokenizer
= processor.tokenizer.tokenizer, and get_qwen3omni_dataloader to guide where the
change applies.

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 676-678: The code currently comments out the call to
_should_use_generate and instead uses model_type_is_enc_dec to set use_generate
(affecting infer_method), leaving the _should_use_generate function unused;
either restore the intended behavior by uncommenting/using _should_use_generate
where use_generate is assigned (so infer_method = model.generate if
_should_use_generate(model) else model.forward), or delete the dead
_should_use_generate function (defined at lines 850-859) to remove dead code and
ensure consistency between the use_generate logic and the actual helper used
(model_type_is_enc_dec or _should_use_generate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d986f022-58ef-43cb-80f2-82592aabdf00

📥 Commits

Reviewing files that changed from the base of the PR and between f0519b1 and e202164.

📒 Files selected for processing (12)
  • examples/llm_eval/run_lm_eval_vllm.sh
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/run_vllm.py
  • modelopt/torch/export/model_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/model_quant.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/utils/__init__.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt/torch/utils/image_processor.py
  • modelopt/torch/utils/video_dataset_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • modelopt/torch/utils/__init__.py
  • modelopt/torch/export/model_utils.py
  • modelopt/torch/quantization/plugins/huggingface.py

Comment on lines +1097 to +1099
# Set the dtype for proper tensor conversion in collate_function
processor.dtype = model_dtype
calib_dataloader = get_vlm_dataset_dataloader(

⚠️ Potential issue | 🟡 Minor

Side effect: Mutating processor.dtype directly.

Line 1098 mutates processor.dtype = model_dtype, which could affect other callers if the processor is reused. The past review flagged this but the fix (creating a new processor) wasn't applied here.

💡 Suggested approach

Consider creating a copy or documenting that this mutation is intentional:

             # Set the dtype for proper tensor conversion in collate_function
+            # Note: This mutates the processor instance
             processor.dtype = model_dtype

Or pass dtype through the dataloader/collate function rather than mutating the processor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 1097 - 1099, The code mutates
processor.dtype directly (processor.dtype = model_dtype) which can leak state;
instead avoid altering the passed-in Processor instance: create a shallow copy
of the processor before changing dtype (e.g., copy/clone the Processor and set
dtype on the clone) or modify get_vlm_dataset_dataloader/collate_function to
accept an explicit dtype parameter and pass model_dtype through that path;
update the call sites around get_vlm_dataset_dataloader and any collate_function
usage to use the cloned processor or the explicit dtype argument so the original
processor instance is not mutated.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (2)
modelopt/torch/utils/video_dataset_utils.py (1)

286-288: ⚠️ Potential issue | 🟠 Major

collate_function drops all items except the first.

With batch_size > 1, this silently discards samples. Either enforce batch_size == 1 or implement real batch collation.

🐛 Proposed fix (enforce supported behavior)
 def get_video_dataset_dataloader(
@@
 ) -> DataLoader:
@@
     assert processor is not None, "Please provide a valid processor."
+    if batch_size != 1:
+        raise ValueError(
+            "Qwen3OmniVideoProcessor currently supports batch_size=1 only. "
+            "collate_function consumes only one sample."
+        )

Also applies to: 156-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/video_dataset_utils.py` around lines 286 - 288, The
collate_function currently takes only batch[0] (variable first) which silently
drops additional samples; update collate_function to either enforce
batch_size==1 by checking len(batch) and raising a clear exception if >1, or
implement proper collation logic that merges per-sample multimodal fields (e.g.,
stack tensors where possible, keep lists for variable-length modalities like
frames/audio, or pad/pack sequences) so batches are not discarded; locate the
collate_function and the use of variable first to apply the chosen fix.
examples/llm_ptq/example_utils.py (1)

1078-1081: ⚠️ Potential issue | 🟡 Minor

dataset_name list handling is still incomplete for multimodal paths.

Only single-item lists are normalized. Longer lists silently fail membership checks and fall into unsupported-dataset errors.

💡 Proposed fix
-        if isinstance(dataset_name, list) and len(dataset_name) == 1:
-            dataset_name = dataset_name[0]
+        if isinstance(dataset_name, list):
+            if len(dataset_name) != 1:
+                raise ValueError(
+                    "For Qwen3Omni multimodal calibration with a processor, provide exactly one dataset."
+                )
+            dataset_name = dataset_name[0]

Also applies to: 1094-1095

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 1078 - 1081, The code only
normalizes single-item lists for dataset_name then checks membership against
get_supported_video_datasets(), which causes multi-item lists to fail; update
the handling so that if dataset_name is a list and len==1 you unwrap, and if
len>1 you either (a) validate that all entries are supported (e.g., all(name in
get_supported_video_datasets() for name in dataset_name)) or (b) if multimodal
paths accept mixed datasets, validate that at least one/required subset is
supported—then branch accordingly instead of asserting dataset_name is a str;
update the checks around dataset_name and get_supported_video_datasets() (also
apply the same fix at the other occurrence around lines 1094-1095) so multi-item
lists are correctly recognized and validated.
🧹 Nitpick comments (2)
modelopt/torch/utils/video_dataset_utils.py (1)

179-205: Temporary video files rely on manual cleanup only.

_save_video_bytes_to_file() accumulates temp files, but cleanup() is opt-in. Consider auto-registering cleanup to avoid disk buildup in normal script exits.

♻️ Proposed refactor
 import os
 import tempfile
+import atexit
@@
         self._temp_dir = tempfile.mkdtemp(prefix="qwen3omni_video_")
         self._video_counter = 0
+        atexit.register(self.cleanup)

Also applies to: 329-334

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/video_dataset_utils.py` around lines 179 - 205, The
temporary video files created by _save_video_bytes_to_file() are only removed by
an opt-in cleanup(), so register automatic cleanup to avoid disk buildup: modify
the class initializer where _temp_dir is created to either use
tempfile.TemporaryDirectory() or call atexit.register(self.cleanup) (and import
atexit), and ensure the existing cleanup() method removes self._temp_dir and
handles repeated calls safely; update references to _temp_dir, _video_counter,
and cleanup() (and any temp creation at lines ~329-334) so automatic teardown
runs on process exit.
modelopt/torch/utils/image_processor.py (1)

124-133: device="auto" should be normalized before tensor .to(device) calls.

The new processors pass "auto" through to BaseImageProcessor; downstream tensor moves may fail unless this is converted to a real torch device string.

♻️ Proposed fix
 class Qwen3OmniTextProcessor(BaseImageProcessor):
@@
     def __init__(self, processor, device="auto", dtype=None):
@@
-        super().__init__(processor, device)
+        if device == "auto":
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+        super().__init__(processor, device)
         self.dtype = dtype
@@
 class Qwen3OmniImageProcessor(BaseImageProcessor):
@@
     def __init__(self, tokenizer, device="auto", dtype=None, use_audio_in_video=False):
         """Constructor."""
-        super().__init__(tokenizer, device)
+        if device == "auto":
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+        super().__init__(tokenizer, device)

Also applies to: 178-182

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/image_processor.py` around lines 124 - 133, The
constructor currently accepts device="auto" but leaves it unnormalized, causing
later tensor.to(device) calls to fail; update the __init__ in the image
processor (and the other similar constructor around lines 178-182) to normalize
device when device == "auto" by resolving to a concrete torch.device (e.g.,
torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")),
store that normalized value in self.device, and ensure dtype (if provided) is a
torch.dtype; use torch.cuda.is_available() and torch.device to perform the
resolution so downstream .to(self.device) calls always receive a real device
object or string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 167-170: The code calls snapshot_download(...) unconditionally and
fails for local tokenizer paths; update ensure_tokenizer_files to detect when
source_model_id is a local directory (use os.path.isdir or
Path(source_model_id).is_dir()) and in that case set cache_dir to the local path
instead of calling snapshot_download; otherwise keep the existing
snapshot_download call using TOKENIZER_FILES and assign the result to cache_dir
so downstream logic uses the correct directory.

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 534-536: The code determining whether to call model.generate still
uses model_type_is_enc_dec(model) instead of the new routing helper; update the
two call sites (the infer_method selection around the use_generate variable and
the similar block at the later occurrence) to call _should_use_generate(model)
rather than model_type_is_enc_dec(model) so Qwen3Omni and other generation-only
models route to generate(); keep the rest of the logic (infer_method =
model.generate if use_generate else model.forward) unchanged.
- Around line 651-652: When building split_data_1 and split_data_2 from
tensor_data, guard against None (or non-sliceable) fields so the OOM split path
doesn't call __getitem__ on None; update the comprehensions that create
split_data_1 and split_data_2 to check each tensor_data[key] (e.g., use key
existence and "if value is None" or "if not sliceable") and assign None for
those keys, otherwise slice using [:mid, ...] and [mid:, ...]; ensure you
reference the same symbols (tensor_data, split_data_1, split_data_2, mid) so the
behavior matches existing logic.
- Line 590: The recursive calls to _process_batch are passing
max_working_batch_size as the third positional argument so it lands in
generation_kwargs (breaking infer_method(**batch_data, **generation_kwargs));
update those recursive calls inside _process_batch to pass generation_kwargs and
max_working_batch_size by name (e.g., generation_kwargs=generation_kwargs,
max_working_batch_size=max_working_batch_size) or reorder arguments to match the
signature so generation_kwargs is always a mapping when forwarded to
infer_method; locate calls to _process_batch within the same function and change
them to use keyword arguments.

In `@modelopt/torch/utils/image_processor.py`:
- Around line 162-172: The collate_function currently only uses batch[0],
dropping other samples and losing data; update it to either (preferred) stack
supported fields across all samples or (alternative) raise a clear error when
len(batch) > 1; specifically, for collate_function gather all "input_ids" and
"attention_mask" values from each item in batch (skip None entries), convert the
collected lists into a single tensor via torch.LongTensor(torch.stack(...) or
torch.LongTensor(list_of_lists)) and move to self.device, or if you choose
fail-fast, raise a ValueError when len(batch) > 1 with a clear message; apply
the same fix to the other collator implementation referenced (the
collate_function at the later block around lines 255-263) so both handle
multi-sample batches consistently.

---

Duplicate comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 1078-1081: The code only normalizes single-item lists for
dataset_name then checks membership against get_supported_video_datasets(),
which causes multi-item lists to fail; update the handling so that if
dataset_name is a list and len==1 you unwrap, and if len>1 you either (a)
validate that all entries are supported (e.g., all(name in
get_supported_video_datasets() for name in dataset_name)) or (b) if multimodal
paths accept mixed datasets, validate that at least one/required subset is
supported—then branch accordingly instead of asserting dataset_name is a str;
update the checks around dataset_name and get_supported_video_datasets() (also
apply the same fix at the other occurrence around lines 1094-1095) so multi-item
lists are correctly recognized and validated.

In `@modelopt/torch/utils/video_dataset_utils.py`:
- Around line 286-288: The collate_function currently takes only batch[0]
(variable first) which silently drops additional samples; update
collate_function to either enforce batch_size==1 by checking len(batch) and
raising a clear exception if >1, or implement proper collation logic that merges
per-sample multimodal fields (e.g., stack tensors where possible, keep lists for
variable-length modalities like frames/audio, or pad/pack sequences) so batches
are not discarded; locate the collate_function and the use of variable first to
apply the chosen fix.

---

Nitpick comments:
In `@modelopt/torch/utils/image_processor.py`:
- Around line 124-133: The constructor currently accepts device="auto" but
leaves it unnormalized, causing later tensor.to(device) calls to fail; update
the __init__ in the image processor (and the other similar constructor around
lines 178-182) to normalize device when device == "auto" by resolving to a
concrete torch.device (e.g., torch.device("cuda") if torch.cuda.is_available()
else torch.device("cpu")), store that normalized value in self.device, and
ensure dtype (if provided) is a torch.dtype; use torch.cuda.is_available() and
torch.device to perform the resolution so downstream .to(self.device) calls
always receive a real device object or string.

In `@modelopt/torch/utils/video_dataset_utils.py`:
- Around line 179-205: The temporary video files created by
_save_video_bytes_to_file() are only removed by an opt-in cleanup(), so register
automatic cleanup to avoid disk buildup: modify the class initializer where
_temp_dir is created to either use tempfile.TemporaryDirectory() or call
atexit.register(self.cleanup) (and import atexit), and ensure the existing
cleanup() method removes self._temp_dir and handles repeated calls safely;
update references to _temp_dir, _video_counter, and cleanup() (and any temp
creation at lines ~329-334) so automatic teardown runs on process exit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 74edc7ca-e9dc-402b-a2fe-869d8c4cb6ae

📥 Commits

Reviewing files that changed from the base of the PR and between 0427063 and 39a081a.

📒 Files selected for processing (6)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/run_vllm.py
  • modelopt/torch/quantization/model_quant.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt/torch/utils/image_processor.py
  • modelopt/torch/utils/video_dataset_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/model_quant.py

@cjluo-nv cjluo-nv left a comment

Review: Code Simplicity & Avoiding Rebuilt Logic

1. Duplicated collate logic across 3 processor classes

Qwen3OmniImageProcessor.collate_function, Qwen3OmniVideoProcessor.collate_function, and Qwen3OmniTextProcessor.collate_function all repeat the same pattern: take batch[0], convert lists to tensors by key, move to device, handle dtype. The only difference is which keys they handle.

Suggestion: Extract a shared helper in BaseImageProcessor (or a standalone function) that takes a key-to-dtype mapping:

def _collate_first_item(self, batch, float_keys=(), long_keys=()):
    first = batch[0]
    result = {}
    for key in long_keys:
        if first.get(key) is not None:
            result[key] = torch.LongTensor(first[key]).to(self.device)
    for key in float_keys:
        if first.get(key) is not None:
            t = torch.tensor(first[key])
            if self.dtype is not None:
                t = t.to(self.dtype)
            result[key] = t.to(self.device)
    return result

Each subclass then just calls this with different key lists.


2. Duplicated preprocess_function between Qwen3OmniImageProcessor and Qwen3OmniVideoProcessor

Both build a conversation, call apply_chat_template, call process_mm_info, call self.tokenizer(...), then convert to lists with the same all_keys / tolist() pattern. The video processor adds video-byte-to-file logic but the rest is identical.

Suggestion: Factor the shared conversation-build + tokenize + serialize logic into a base method, with subclasses only overriding how they extract media content from examples.
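A template-method sketch of that suggestion (class names, the `extract_media` hook, and the tokenize stub are all hypothetical; the real pipeline calls apply_chat_template, process_mm_info, and the tokenizer):

```python
class BaseQwen3OmniProcessor:
    """Shared conversation-build + tokenize pipeline; subclasses only
    say how to pull media content out of an example."""

    def preprocess_function(self, example):
        content = self.extract_media(example)  # subclass hook
        conversation = [{"role": "user", "content": content}]
        return self.tokenize(conversation)

    def extract_media(self, example):
        raise NotImplementedError

    def tokenize(self, conversation):
        # Stand-in for apply_chat_template + process_mm_info + tokenizer.
        return {"text": str(conversation)}


class ImageProcessor(BaseQwen3OmniProcessor):
    def extract_media(self, example):
        return [{"type": "image", "image": example["image"]}]


out = ImageProcessor().preprocess_function({"image": "img.png"})
```

The video subclass would override `extract_media` to do its byte-to-file conversion before returning the content list.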


3. get_qwen3omni_text_dataloader rebuilds get_dataset_dataloader

dataset_utils.py already has get_dataset_dataloader() which loads text samples, tokenizes, and returns a DataLoader. The new get_qwen3omni_text_dataloader does essentially the same thing but with a different tokenization step (applying chat template first). Rather than creating a whole new function with an inline _Qwen3OmniTextDataset class, consider passing a custom preprocess_fn to the existing get_dataset_dataloader or composing with it.
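The composition the comment suggests, sketched generically (`get_dataset_dataloader`'s real signature differs, and `preprocess_fn` is a hypothetical hook; the batching here is a toy stand-in):

```python
def get_dataset_dataloader(samples, preprocess_fn=None, batch_size=2):
    # Toy stand-in for the existing loader: apply an optional
    # per-sample hook, then batch.
    if preprocess_fn is not None:
        samples = [preprocess_fn(s) for s in samples]
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def qwen3omni_chat_preprocess(sample):
    # Stand-in for "apply the chat template before tokenizing".
    return f"<|user|>\n{sample}\n<|assistant|>"


batches = get_dataset_dataloader(
    ["a", "b", "c"], preprocess_fn=qwen3omni_chat_preprocess
)
```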


4. _should_use_generate is defined but never used

dataset_utils.py adds _should_use_generate() but it's commented out in _forward_loop (# use_generate = _should_use_generate(model)), and the old model_type_is_enc_dec is used instead. This is dead code — either use it or remove it.


5. Duplicated Qwen3Omni generate+unpack logic (pre/post quantize)

In hf_ptq.py, the pattern:

result = full_model.generate(**calib_batch, return_audio=False, thinker_max_new_tokens=100)
if isinstance(result, tuple):
    text_ids, _ = result
    generated_ids = text_ids.sequences if hasattr(text_ids, "sequences") else text_ids
else:
    generated_ids = result

appears identically in both pre_quantize() and post_quantize(). Extract this into a helper like _qwen3omni_generate(model, batch).
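A sketch of that helper, exercised against a stub (the helper name comes from the comment above; `_StubThinker` only mimics the tuple-returning shape of `generate()`):

```python
def _qwen3omni_generate(model, batch, max_new_tokens=100):
    # Consolidates the unpack logic duplicated across pre_quantize()
    # and post_quantize().
    result = model.generate(
        **batch, return_audio=False, thinker_max_new_tokens=max_new_tokens
    )
    if isinstance(result, tuple):
        text_ids, _ = result
        return getattr(text_ids, "sequences", text_ids)
    return result


class _StubThinker:
    # Minimal stand-in returning the (text_ids, audio) tuple shape.
    def generate(self, **kwargs):
        return ([101, 102, 103], None)


ids = _qwen3omni_generate(_StubThinker(), {})
```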


6. get_model_type_from_config duplicates MODEL_NAME_TO_TYPE matching

example_utils.py adds get_model_type_from_config() which reads config.json, then loops over MODEL_NAME_TO_TYPE doing case-insensitive substring matching. This is essentially what model_utils.py:get_model_type() already does when given a model. If the model is already loaded, just use the existing infrastructure. If you need it pre-load, consider adding a from_config classmethod to the existing detection code rather than reimplementing the matching logic in example code.


7. patch_config_for_unified_export is a post-hoc fixup for a root-cause issue

This function patches hf_quant_config.json and config.json after export to add missing exclusion patterns. This suggests the export itself isn't producing the right exclusions. The cleaner fix would be to ensure unified_export_hf.py emits the correct exclusion list in the first place, rather than patching config files after the fact.


8. Mutable default argument generation_kwargs={}

In _process_batch, _forward_loop, and create_forward_loop, the default argument is generation_kwargs={}. Mutable defaults are a well-known Python footgun — use generation_kwargs=None with if generation_kwargs is None: generation_kwargs = {} inside the function body.
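The footgun, demonstrated in isolation:

```python
def bad_append(item, bucket=[]):
    # The default list is created once at function definition and is
    # shared across every call that omits `bucket`.
    bucket.append(item)
    return bucket


def good_append(item, bucket=None):
    # A fresh list per call when none is supplied.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


print(bad_append(1))   # [1]
print(bad_append(2))   # [1, 2]  <- state leaked from the first call
print(good_append(1))  # [1]
print(good_append(2))  # [2]
```

With a dict default like `generation_kwargs={}`, the same sharing means any in-function mutation leaks into later calls.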


9. pre_quantize return value change is a breaking API change

pre_quantize previously returned a 2-tuple (preview_input_ids, generated_ids_before_ptq) and now returns a 3-tuple adding calib_batch. Any external caller would break. Consider making calib_batch an optional part of the return or using a dataclass/named tuple.
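One way to keep the return extensible, sketched with a NamedTuple (field names are hypothetical): attribute access survives future field additions, and callers that still positionally unpack exactly two values fail loudly instead of silently misbinding.

```python
from typing import Any, NamedTuple, Optional


class PreQuantizeResult(NamedTuple):
    preview_input_ids: Any
    generated_ids_before_ptq: Any
    # Optional so non-multimodal callers can ignore it.
    calib_batch: Optional[dict] = None


result = PreQuantizeResult(
    preview_input_ids=[1], generated_ids_before_ptq=[2]
)
```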


10. print_quant_summary silently drops the try/except

The old code had a try/except around print_quant_summary + save_expert_token_count_table. The new code removes it. If save_expert_token_count_table can fail (e.g., on non-MOE models), this will now crash the pipeline instead of continuing gracefully.
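Restoring the defensive wrapper, sketched with stubs (the real helpers live in modelopt; the failure mode here is illustrative):

```python
def print_quant_summary(model):
    # Stub for the real summary printer.
    print(f"quant summary for {model}")


def save_expert_token_count_table(model):
    # Illustrative failure on non-MoE models.
    raise ValueError(f"{model} has no experts")


def report_quant_summary(model):
    # Keep reporting best-effort so an expert-table failure does not
    # abort the quantization pipeline; returns whether it succeeded.
    try:
        print_quant_summary(model)
        save_expert_token_count_table(model)
        return True
    except Exception as e:
        print(f"quant summary reporting skipped: {e}")
        return False


report_quant_summary("dense-model")
```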


11. Qwen3OmniVideoProcessor temp file cleanup never called

The processor creates temp files in __init__ via tempfile.mkdtemp() but cleanup() is never called anywhere in the PR. This will leak temp files. Consider using __del__ or a context manager, or at minimum call cleanup() after dataloader creation.


Summary

The core model support additions (MoE block registration, MODEL_NAME_TO_TYPE entry, get_language_model_from_vl thinker support, layer_utils) are clean and minimal. The main concerns are around duplicated patterns in the processor/dataloader code and rebuilding existing infrastructure rather than extending it. Consolidating the collate, preprocess, and generate-unpack logic would meaningfully reduce the +1271 line count and maintenance burden.

@ajrasane ajrasane force-pushed the ajrasane/qwen3omni_final branch from a5146cc to 201fbae Compare March 18, 2026 20:14
ajrasane added 5 commits April 1, 2026 18:43
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane force-pushed the ajrasane/qwen3omni_final branch from 201fbae to 3938f2e Compare April 1, 2026 18:44
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-07 12:55 UTC

@ajrasane ajrasane force-pushed the ajrasane/qwen3omni_final branch 2 times, most recently from 133ee7b to 63d0af5 Compare April 1, 2026 22:07
ajrasane added 4 commits April 1, 2026 22:08
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
ajrasane added 2 commits April 1, 2026 22:08
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane force-pushed the ajrasane/qwen3omni_final branch from 63d0af5 to 20a4b33 Compare April 1, 2026 22:10
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@cjluo-nv
Collaborator

cjluo-nv commented Apr 2, 2026

Review Comment Status Check

Fully Addressed (7/12)

  • ✅ Default dataset aligned to cnn_dailymail + nemotron-post-training-dataset-v2
  • ✅ Functions moved to example_utils to keep hf_ptq.py short
  • ✅ print_quant_summary now takes a save_path parameter
  • ✅ Model type dict not sorted (phi3 vs phi ordering preserved)
  • ✅ do_sample=True set when temperature/top_p/top_k are specified
  • ✅ Documented why calib_batch is needed for multimodal models
  • ✅ Empty generation_kwargs explained — no issue

Still Need Follow-up (3/12)

  1. ensure_tokenizer_files() in example_utils: @Edwardf0t1 suggested using the existing tokenizer export code at hf_ptq.py#L691-L696 instead. @ajrasane could you confirm whether the separate utility is still needed?
  2. "is 800 and 804 expected?": @ajrasane asked if this is still relevant after the code updates. I need to re-check the current state.
  3. @Edwardf0t1 could you review this part? (example_utils processor/dataset code around line 768) — No visible review from Edward on that section yet. @Edwardf0t1 could you take a look?

Please address the above before I can approve.

@cjluo-nv cjluo-nv requested a review from chadvoegele April 2, 2026 07:10
default=None,
help="Max model length (auto-detected from config if not specified)",
)
parser.add_argument("--prompt", type=str, default="What in Nvidia?", help="Text prompt")
Contributor

Supposed to be "What is Nvidia?"?

"batch_data values must be tensors"
if generation_kwargs is None:
generation_kwargs = {}
# Separate tensor values from scalar parameters (like max_new_tokens)
Contributor

Would max_new_tokens be in generation_kwargs? Why is it in batch_data? Are there other non-tensors in batch_data?

split_data_2.update(scalar_data)

# Recursively process each half and track max working batch size
max_working_batch_size = _process_batch(split_data_1, infer_method)
Contributor

Do you need to pass the generation_kwargs through here too?

@@ -0,0 +1,277 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

[nit] bump copyright year from 2024

"""
assert processor is not None, "Please provide a valid processor."

# Default cache_dir to temp directory
Contributor

Do we need some additional modification on top of the Huggingface ~/.cache/huggingface cache?


processed_dataset = None

# Try to load from cache (use torch.save/load to avoid Arrow 32-bit offset overflow)
Contributor

Is this Arrow 32-bit offset issue specific to the Qwen3OmniVideoProcessor processor and/or the finevideo dataset? If so, can we special case it?

super().__init__(tokenizer, device)
self.dtype = dtype
self.use_audio_in_video = use_audio_in_video
self._temp_dir = tempfile.mkdtemp(prefix="qwen3omni_video_")
Contributor

If you can switch to tempfile.TemporaryDirectory, cleanup is automatic and you can remove cleanup() and __del__():

https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory

import os
import tempfile

def demo_cleanup():
    # The directory is deleted automatically when the TemporaryDirectory
    # object is finalized, even if this function raises.
    tmp = tempfile.TemporaryDirectory(prefix="demo_")
    print(f"Created: {tmp.name}")
    print(f"Exists before crash: {os.path.exists(tmp.name)}")

    raise RuntimeError("crash!")

if __name__ == "__main__":
    demo_cleanup()
