
Make Qwen VL model id as input and update it to Qwen3 #1783

Closed
oyilmaz-nvidia wants to merge 26 commits into NVIDIA-NeMo:main from oyilmaz-nvidia:onur/upt-qwen3

Conversation

@oyilmaz-nvidia
Contributor

Make Qwen Model ID and Revision Configurable

Summary

  • Parameterize model ID and revision in QwenLM and QwenVL — both __init__ and download_weights_on_node now accept model_id and model_revision arguments, replacing hardcoded module-level constants
  • Update defaults to Qwen3 — QwenLM now defaults to Qwen/Qwen3-14B (8268fe3) and QwenVL to Qwen/Qwen3-VL-8B-Instruct (0c351dd)
  • Add model ID validation — a _is_local_path helper guards against accidentally passing a non-Qwen HuggingFace ID; absolute paths and ./-prefixed relative paths bypass the check
  • Remove dead constants — _QWEN_LM_MODEL_ID, _QWEN_LM_MODEL_REVISION, _QWEN2_5_VL_MODEL_ID, _QWEN2_5_VL_MODEL_REVISION, and _QWEN_VARIANTS_INFO are removed; the default strings are now inline in the function signatures
  • Expose fields in stage dataclasses — CaptionGenerationStage and CaptionEnhancementStage gain optional model_id: str | None = None and model_revision: str | None = None fields that propagate to both model construction and weight download

Usage

Override the model at the stage level:

CaptionGenerationStage(
    model_id="Qwen/Qwen3-VL-32B-Instruct",
    model_revision="<revision>",
)

Or directly on the model class:

QwenLM(
    model_dir="models/",
    caption_batch_size=8,
    fp8=False,
    max_output_tokens=512,
    model_id="Qwen/Qwen3-14B",
    model_revision="8268fe3",
)

Files Changed

File Change
nemo_curator/models/qwen_lm.py Parameterize model ID/revision, add validation, default to Qwen3-14B
nemo_curator/models/qwen_vl.py Parameterize model ID/revision, remove _QWEN_VARIANTS_INFO, default to Qwen3-VL-8B
nemo_curator/stages/video/caption/caption_enhancement.py Add model_id/model_revision dataclass fields, pass through to QwenLM
nemo_curator/stages/video/caption/caption_generation.py Add model_id/model_revision dataclass fields, pass through to QwenVL
tests/models/test_qwen_lm.py Update for removed constants and new Qwen3 defaults
tests/models/test_qwen_vl.py Update for removed constants and new Qwen3 defaults
tests/stages/video/caption/test_caption_enhancement.py Assert new fields and updated mock call
tests/stages/video/caption/test_caption_generation.py Assert new fields and updated mock call
tutorials/video/getting-started/video_split_clip_example.py Update description strings to reference Qwen3-VL

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia oyilmaz-nvidia requested review from a team, abhinavg4 and suiyoubi as code owners April 9, 2026 22:59
@copy-pr-bot

copy-pr-bot Bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

This PR parameterizes model_id and model_revision in QwenLM and QwenVL, updates defaults to Qwen3 models, adds per-variant pixel configuration, and migrates model_variant from the legacy "qwen" key to "qwen2.5" / "qwen3" across all stage dataclasses and tests.

The previously reported critical bugs (property/setter conflicts on model_id, NameError for vllm_kwargs, None bypass of model defaults, cls.model_id returning a descriptor in the classmethod, and the mismatch between downloaded and loaded model paths) have all been resolved in this revision.

Confidence Score: 5/5

Safe to merge; all previously-reported critical bugs are resolved and the only remaining finding is a one-line documentation typo in a tutorial help string.

The series of blocking issues from prior review rounds — property/setter AttributeError on model_id, NameError for vllm_kwargs, None bypassing constructor defaults, cls.model_id returning a descriptor in the classmethod, and the download-path / load-path mismatch — have all been addressed. The sole remaining finding is a stale "qwen" option name in the --model-dir help text, which is P2 and does not affect runtime behaviour.

tutorials/video/getting-started/video_split_clip_example.py — one-line help text typo; caption_preparation.py — pre-existing Nemotron pixel-config fallback noted in prior threads.

Important Files Changed

Filename Overview
nemo_curator/models/qwen_lm.py Parameterized model_id/revision, added @property for model_id_names (now correctly satisfies the abstract-property contract in ModelInterface), and properly handles None defaults via variant_info lookup. Previously-reported AttributeError and NameError are resolved.
nemo_curator/models/qwen_vl.py Parameterized model_id/revision, removed dead _QWEN_VARIANTS_INFO constant, and added get_qwen_vl_pixel_config helper; previously-reported property conflicts and classmethod descriptor bugs are resolved.
nemo_curator/stages/video/caption/caption_enhancement.py Adds model_id/model_revision dataclass fields and resolves effective values in setup_on_node before calling download_weights_on_node; None-bypass bug and model_variant propagation issues are fixed.
nemo_curator/stages/video/caption/caption_generation.py Adds model_id/model_revision fields, updates variant check from "qwen" to ("qwen2.5","qwen3"), and properly resolves effective model_id/revision in setup_on_node.
nemo_curator/stages/video/caption/caption_preparation.py Integrates per-variant pixel config via get_qwen_vl_pixel_config and passes image_factor/pixel bounds to split_video_into_windows; Nemotron variants still fall back to Qwen3 pixel config (pre-existing limitation).
nemo_curator/utils/windowing_utils.py Replaces global IMAGE_FACTOR/VIDEO_*_PIXELS constants with MODEL_PIXEL_CONFIG dict keyed by HF model ID, and threads image_factor/pixel bounds through split_video_into_windows and fetch_video as parameters with Qwen3 defaults.
nemo_curator/models/prompt_formatter.py Splits "qwen" into "qwen3"/"qwen2.5", adds fps parameter to _generate_qwen_inputs, and wraps video tensor in a (tensor, metadata) tuple required by Qwen3-VL vLLM.
tutorials/video/getting-started/video_split_clip_example.py Updates CLI choices and defaults from "qwen" to "qwen3"/"qwen2.5", but the --model-dir help text still references "--captioning-algorithm qwen" which is no longer a valid choice.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CaptionPreparationStage\nmodel_variant=qwen3/qwen2.5] --> B[get_qwen_vl_pixel_config\nmodel_variant]
    B --> C{variant in\n_QWEN_VL_VARIANTS?}
    C -->|qwen3| D[image_factor=32\nvideo_min_pixels=128x32²]
    C -->|qwen2.5| E[image_factor=28\nvideo_min_pixels=128x28²]
    C -->|unknown/nemotron| F[fallback: Qwen3 config\nimage_factor=32]
    D --> G[split_video_into_windows\nwith per-variant params]
    E --> G
    F --> G
    G --> H[fetch_video\nsmartresize with correct factor]

    I[CaptionGenerationStage\nor CaptionEnhancementStage] --> J{model_variant}
    J -->|qwen2.5 / qwen3| K[compute effective_model_id\nfrom model_id or variant_info]
    J -->|nemotron*| L[NemotronHVL path]
    K --> M[download_weights_on_node\nmodel_id + model_revision]
    K --> N[QwenVL / QwenLM\n__init__ resolves None via variant_info]

Reviews (21): Last reviewed commit: "Merge branch 'NVIDIA-NeMo:main' into onu..."

Comment thread nemo_curator/stages/video/caption/caption_enhancement.py
Comment on lines +67 to +68
model_id=self.model_id,
model_revision=self.model_revision,

P1 Same None-bypass bug as in caption_enhancement.py

self.model_id and self.model_revision default to None in the dataclass, and are forwarded here as explicit keyword arguments to QwenVL.__init__. Passing model_id=None overrides the function default, causing _is_local_path(None) → Path(None) → TypeError. The same crash occurs on line 87 where QwenVL.download_weights_on_node(self.model_dir, model_id=self.model_id, ...) passes None as model_id and Path(model_dir) / None raises TypeError.

Comment thread nemo_curator/models/qwen_vl.py Outdated
Comment on lines +43 to +44
def _is_local_path(model_id: str) -> bool:
return Path(model_id).is_absolute() or model_id.startswith(("./", "../"))

P2 Duplicated _is_local_path helper

_is_local_path is defined identically in both qwen_lm.py (line 41-42) and qwen_vl.py. Consider moving it to a shared location (e.g. nemo_curator/utils/model_utils.py) to avoid future drift between the two copies.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Signed-off-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 9, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Comment thread nemo_curator/models/qwen_lm.py
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test dc95e4a

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
def setup_on_node(self, node_info: NodeInfo, worker_metadata: WorkerMetadata) -> None: # noqa: ARG002
"""Download weights and initialize vLLM once per node to avoid torch.compile race conditions."""
QwenLM.download_weights_on_node(self.model_dir)
QwenLM.download_weights_on_node(self.model_dir, model_id=self.model_id, model_revision=self.model_revision)

P1 model_variant not forwarded to download_weights_on_node

When a caller sets model_variant="qwen2.5" without an explicit model_id, self.model_id is None here. QwenLM.download_weights_on_node(self.model_dir, model_id=None, model_revision=None) then resolves to cls.DEFAULT_MODEL_ID = "Qwen/Qwen3-14B" and downloads Qwen3 weights. The immediately following _initialize_model() creates QwenLM(model_variant="qwen2.5", model_id=None, ...), which resolves model_id to "Qwen/Qwen2.5-14B-Instruct" and tries to load from model_dir/Qwen/Qwen2.5-14B-Instruct — a path that was never populated.

Suggested change
QwenLM.download_weights_on_node(self.model_dir, model_id=self.model_id, model_revision=self.model_revision)
QwenLM.download_weights_on_node(
self.model_dir,
model_id=self.model_id,
model_revision=self.model_revision,
model_variant=self.model_variant,
)

download_weights_on_node would also need to accept and propagate model_variant so it can resolve the correct default when model_id is None.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Comment thread nemo_curator/stages/video/caption/caption_generation.py Outdated
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Comment thread nemo_curator/models/qwen_vl.py Outdated
"qwen": _QWEN2_5_VL_MODEL_ID,
_QWEN_VL_VARIANTS: dict[str, dict] = {
"qwen": {"model_id": "Qwen/Qwen3-VL-8B-Instruct", "revision": "0c351dd", "image_factor": 32},
"qwen3.5": {"model_id": "Qwen/Qwen3.5-9B", "revision": None, "image_factor": 32},

P1 qwen3.5 variant points to a text-only model in the VL pipeline

Qwen/Qwen3.5-9B does not carry the -VL- infix that every other vision-language model in this codebase uses (Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct), indicating it is a text-only model. Loading it through QwenVL will fail: the AutoProcessor in PromptFormatter won't support the {"type": "video"} chat content, and vLLM will reject video multi-modal inputs against a text-only weight file. The tutorial help text itself says the intended model is Qwen3.5-VL-8B-Instruct, confirming this is a copy-paste mistake.

The same wrong ID appears in prompt_formatter.py line 26 (VARIANT_MAPPING["qwen3.5"] = "Qwen/Qwen3.5-9B"), which is loaded via AutoProcessor.from_pretrained for all Qwen VL input preparation.

Suggested change
"qwen3.5": {"model_id": "Qwen/Qwen3.5-9B", "revision": None, "image_factor": 32},
"qwen3.5": {"model_id": "Qwen/Qwen3.5-VL-8B-Instruct", "revision": None, "image_factor": 32},

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

def __post_init__(self) -> None:
self._skip_intermediate_resize = False
self._image_factor = get_qwen_vl_image_factor(self.model_variant)

P1 Nemotron preprocessing broken by Qwen-specific image factor lookup

get_qwen_vl_image_factor only has entries for Qwen VL variants; for any Nemotron variant it falls through to the new global IMAGE_FACTOR = 32. Previously (before this PR) the global was 28, so Nemotron was always preprocessed with factor=28 and 28²-based pixel bounds. Now Nemotron receives image_factor=32 and the updated pixel bounds (VIDEO_MIN_PIXELS = 128×32² = 131 072, up from 100 352).

Nemotron models use do_preprocess=True (forced in __post_init__), so smart_resize(h, w, factor=image_factor, min_pixels=VIDEO_MIN_PIXELS, ...) is always reached in fetch_video; _skip_intermediate_resize=True only suppresses the standalone resize branch, not the preprocess path. Preprocessed frames sent to vLLM will have 32-aligned dimensions and sit in a larger pixel budget than before, silently changing inference inputs for all Nemotron users.

Consider adding Nemotron variants to get_qwen_vl_image_factor (or a more general get_model_image_factor) with their legacy factor=28:

_NEMOTRON_IMAGE_FACTOR = 28  # legacy preprocessing, matches pre-Qwen3 behaviour

def get_model_image_factor(model_variant: str) -> int:
    if model_variant in _QWEN_VL_VARIANTS:
        return _QWEN_VL_VARIANTS[model_variant].get("image_factor", IMAGE_FACTOR)
    if model_variant.startswith("nemotron"):
        return _NEMOTRON_IMAGE_FACTOR
    return IMAGE_FACTOR

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test c4eba76

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test 385d1b7

Comment thread nemo_curator/utils/windowing_utils.py Outdated
Comment on lines +182 to +189
IMAGE_FACTOR = 32
MIN_PIXELS = 4 * 32 * 32
MAX_PIXELS = 16384 * 32 * 32
MAX_RATIO = 200

VIDEO_MIN_PIXELS = 128 * 28 * 28
VIDEO_MAX_PIXELS = 768 * 28 * 28
VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
VIDEO_MIN_PIXELS = 128 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
VIDEO_TOTAL_PIXELS = 24576 * 32 * 32

P1 32²-based pixel bounds silently break Qwen2.5 preprocessing

VIDEO_MIN_PIXELS, VIDEO_MAX_PIXELS, and VIDEO_TOTAL_PIXELS are now computed with factor 32, but smart_resize for the "qwen2.5" variant is called with image_factor=28 while these global bounds are used unchanged. This raises the effective minimum frame resolution by ~31% (100 352 → 131 072 pixels) for all Qwen2.5 users, silently altering preprocessing relative to Qwen2.5-VL's documented expectations.

The image_factor was correctly made per-variant in get_qwen_vl_image_factor, but the pixel bounds were not. Inside fetch_video (line 344–350), min_pixels=VIDEO_MIN_PIXELS and max_pixels derived from VIDEO_MAX_PIXELS/VIDEO_TOTAL_PIXELS are passed to smart_resize regardless of which variant is running. Consider making these bounds variant-aware too, e.g. by accepting min_pixels/max_pixels parameters alongside image_factor.

Comment thread nemo_curator/utils/windowing_utils.py Outdated
"""
with make_pipeline_named_temporary_file(sub_dir="windowing") as input_file:
if image_factor is None:
image_factor = IMAGE_FACTOR

I think we need to update the min_pixels and max_pixels as well. I think a cleaner way would be to define a const dict with the key using model_id
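One shape such a const dict could take, keyed by HF model ID with the factor²-scaled bounds discussed in this thread (the dict and helper names are illustrative, not the merged code):

```python
# Hypothetical per-model pixel config keyed by HF model ID; values follow the
# 28²- and 32²-based bounds discussed in the review above.
MODEL_PIXEL_CONFIG: dict[str, dict] = {
    "Qwen/Qwen2.5-VL-7B-Instruct": {
        "image_factor": 28,
        "video_min_pixels": 128 * 28 * 28,
        "video_max_pixels": 768 * 28 * 28,
        "video_total_pixels": 24576 * 28 * 28,
    },
    "Qwen/Qwen3-VL-8B-Instruct": {
        "image_factor": 32,
        "video_min_pixels": 128 * 32 * 32,
        "video_max_pixels": 768 * 32 * 32,
        "video_total_pixels": 24576 * 32 * 32,
    },
}

def get_pixel_config(model_id: str) -> dict:
    # Unknown model IDs fall back to the Qwen3 config, this PR's default.
    return MODEL_PIXEL_CONFIG.get(model_id, MODEL_PIXEL_CONFIG["Qwen/Qwen3-VL-8B-Instruct"])
```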

Comment thread nemo_curator/models/prompt_formatter.py Outdated
# Mapping of variants to their HuggingFace model IDs
VARIANT_MAPPING: dict[str, str] = {
"qwen": "Qwen/Qwen2.5-VL-7B-Instruct",
"qwen": "Qwen/Qwen3-VL-8B-Instruct",

I think we should just remove qwen and allow accept either qwen3 or qwen2.5

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test eb9daa2

info_calls = [str(call) for call in mock_logger.info.call_args_list]
assert any("Caption for clip" in call for call in info_calls)
assert any("Enhanced QwenLM Caption" in call for call in info_calls)
assert any("Enhanced qwen LM Caption" in call for call in info_calls)

P1 Verbose log assertion will always fail — wrong substring

The test stage is initialised with model_variant="qwen3", so caption_enhancement.py:179 emits "Enhanced qwen3 LM Caption for clip …". The string "Enhanced qwen LM Caption" is not a substring of "Enhanced qwen3 LM Caption …" because "qwen" is immediately followed by "3", not " LM". The assertion will always fail.

Suggested change
assert any("Enhanced qwen LM Caption" in call for call in info_calls)
assert any("Enhanced qwen3 LM Caption" in call for call in info_calls)
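The failure is easy to reproduce in isolation (the clip ID below is a made-up example):

```python
# With model_variant="qwen3", the stage logs a message of this form:
log_line = "Enhanced qwen3 LM Caption for clip abc123"

# "qwen" is immediately followed by "3", not " LM", so the old substring never matches.
assert "Enhanced qwen LM Caption" not in log_line
assert "Enhanced qwen3 LM Caption" in log_line
```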

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test ce94e4a

@oyilmaz-nvidia
Contributor Author

Closing due to #1827 PR
