Make Qwen VL model id as input and update it to Qwen3 #1783

oyilmaz-nvidia wants to merge 26 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Greptile Summary

This PR parameterizes the Qwen model ID and revision. The previously reported critical bugs (property/setter conflicts on `model_id`, among others) are resolved.

Confidence Score: 5/5. Safe to merge; all previously reported critical bugs are resolved and the only remaining finding is a one-line documentation typo in a tutorial help string. The series of blocking issues from prior review rounds (a property/setter `AttributeError` on `model_id`, a `NameError` for `vllm_kwargs`, `None` bypassing constructor defaults, `cls.model_id` returning a descriptor in the classmethod, and the download-path / load-path mismatch) have all been addressed. The sole remaining finding is a stale "qwen" option name in the `--model-dir` help text, which is P2 and does not affect runtime behaviour.

`tutorials/video/getting-started/video_split_clip_example.py`: one-line help-text typo; `caption_preparation.py`: pre-existing Nemotron pixel-config fallback noted in prior threads.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CaptionPreparationStage\nmodel_variant=qwen3/qwen2.5] --> B[get_qwen_vl_pixel_config\nmodel_variant]
    B --> C{variant in\n_QWEN_VL_VARIANTS?}
    C -->|qwen3| D[image_factor=32\nvideo_min_pixels=128x32²]
    C -->|qwen2.5| E[image_factor=28\nvideo_min_pixels=128x28²]
    C -->|unknown/nemotron| F[fallback: Qwen3 config\nimage_factor=32]
    D --> G[split_video_into_windows\nwith per-variant params]
    E --> G
    F --> G
    G --> H[fetch_video\nsmart_resize with correct factor]
    I[CaptionGenerationStage\nor CaptionEnhancementStage] --> J{model_variant}
    J -->|qwen2.5 / qwen3| K[compute effective_model_id\nfrom model_id or variant_info]
    J -->|nemotron*| L[NemotronHVL path]
    K --> M[download_weights_on_node\nmodel_id + model_revision]
    K --> N[QwenVL / QwenLM\n__init__ resolves None via variant_info]
```
Reviews (21): Last reviewed commit: "Merge branch 'NVIDIA-NeMo:main' into onu..."
```python
model_id=self.model_id,
model_revision=self.model_revision,
```
Same None-bypass bug as in `caption_enhancement.py`

`self.model_id` and `self.model_revision` default to `None` in the dataclass, and are forwarded here as explicit keyword arguments to `QwenVL.__init__`. Passing `model_id=None` overrides the function default, causing `_is_local_path(None)` → `Path(None)` → `TypeError`. The same crash occurs on line 87, where `QwenVL.download_weights_on_node(self.model_dir, model_id=self.model_id, ...)` passes `None` as `model_id` and `Path(model_dir) / None` raises `TypeError`.
```python
def _is_local_path(model_id: str) -> bool:
    return Path(model_id).is_absolute() or model_id.startswith(("./", "../"))
```
Duplicated `_is_local_path` helper

`_is_local_path` is defined identically in both `qwen_lm.py` (lines 41-42) and `qwen_vl.py`. Consider moving it to a shared location (e.g. `nemo_curator/utils/model_utils.py`) to avoid future drift between the two copies.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Signed-off-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
…to onur/upt-qwen3
/ok to test dc95e4a
```diff
 def setup_on_node(self, node_info: NodeInfo, worker_metadata: WorkerMetadata) -> None:  # noqa: ARG002
     """Download weights and initialize vLLM once per node to avoid torch.compile race conditions."""
-    QwenLM.download_weights_on_node(self.model_dir)
+    QwenLM.download_weights_on_node(self.model_dir, model_id=self.model_id, model_revision=self.model_revision)
```
`model_variant` not forwarded to `download_weights_on_node`

When a caller sets `model_variant="qwen2.5"` without an explicit `model_id`, `self.model_id` is `None` here. `QwenLM.download_weights_on_node(self.model_dir, model_id=None, model_revision=None)` then resolves to `cls.DEFAULT_MODEL_ID = "Qwen/Qwen3-14B"` and downloads Qwen3 weights. The immediately following `_initialize_model()` creates `QwenLM(model_variant="qwen2.5", model_id=None, ...)`, which resolves `model_id` to `"Qwen/Qwen2.5-14B-Instruct"` and tries to load from `model_dir/Qwen/Qwen2.5-14B-Instruct`, a path that was never populated.
```diff
-QwenLM.download_weights_on_node(self.model_dir, model_id=self.model_id, model_revision=self.model_revision)
+QwenLM.download_weights_on_node(
+    self.model_dir,
+    model_id=self.model_id,
+    model_revision=self.model_revision,
+    model_variant=self.model_variant,
+)
```
`download_weights_on_node` would also need to accept and propagate `model_variant` so it can resolve the correct default when `model_id` is `None`.
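A sketch of the resolution logic being asked for. `resolve_model_id` and the defaults table are illustrative names; the thread only establishes that `"qwen3"` maps to `Qwen/Qwen3-14B` and `"qwen2.5"` to `Qwen/Qwen2.5-14B-Instruct`:

```python
# Hypothetical single resolver shared by download_weights_on_node and __init__,
# so the download path and the load path always agree.
_QWEN_LM_DEFAULTS = {
    "qwen3": "Qwen/Qwen3-14B",
    "qwen2.5": "Qwen/Qwen2.5-14B-Instruct",
}

def resolve_model_id(model_id, model_variant):
    # An explicit model_id always wins; otherwise fall back to the variant default.
    return model_id if model_id is not None else _QWEN_LM_DEFAULTS[model_variant]

print(resolve_model_id(None, "qwen2.5"))           # Qwen/Qwen2.5-14B-Instruct
print(resolve_model_id("my-org/custom", "qwen3"))  # my-org/custom
```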
| "qwen": _QWEN2_5_VL_MODEL_ID, | ||
| _QWEN_VL_VARIANTS: dict[str, dict] = { | ||
| "qwen": {"model_id": "Qwen/Qwen3-VL-8B-Instruct", "revision": "0c351dd", "image_factor": 32}, | ||
| "qwen3.5": {"model_id": "Qwen/Qwen3.5-9B", "revision": None, "image_factor": 32}, |
`qwen3.5` variant points to a text-only model in the VL pipeline

`Qwen/Qwen3.5-9B` does not carry the `-VL-` infix that every other vision-language model in this codebase uses (`Qwen2.5-VL-7B-Instruct`, `Qwen3-VL-8B-Instruct`), indicating it is a text-only model. Loading it through `QwenVL` will fail: the `AutoProcessor` in `PromptFormatter` won't support the `{"type": "video"}` chat content, and vLLM will reject video multi-modal inputs against a text-only weight file. The tutorial help text itself says the intended model is Qwen3.5-VL-8B-Instruct, confirming this is a copy-paste mistake.

The same wrong ID appears in `prompt_formatter.py` line 26 (`VARIANT_MAPPING["qwen3.5"] = "Qwen/Qwen3.5-9B"`), which is loaded via `AutoProcessor.from_pretrained` for all Qwen VL input preparation.
| "qwen3.5": {"model_id": "Qwen/Qwen3.5-9B", "revision": None, "image_factor": 32}, | |
| "qwen3.5": {"model_id": "Qwen/Qwen3.5-VL-8B-Instruct", "revision": None, "image_factor": 32}, |
```python
def __post_init__(self) -> None:
    self._skip_intermediate_resize = False
    self._image_factor = get_qwen_vl_image_factor(self.model_variant)
```
Nemotron preprocessing broken by Qwen-specific image factor lookup

`get_qwen_vl_image_factor` only has entries for Qwen VL variants; for any Nemotron variant it falls through to the new global `IMAGE_FACTOR = 32`. Previously (before this PR) the global was 28, so Nemotron was always preprocessed with factor 28 and 28²-based pixel bounds. Now Nemotron receives `image_factor=32` and the updated pixel bounds (`VIDEO_MIN_PIXELS` = 128 × 32² = 131 072, up from 100 352).

Nemotron models use `do_preprocess=True` (forced in `__post_init__`), so `smart_resize(h, w, factor=image_factor, min_pixels=VIDEO_MIN_PIXELS, ...)` is always reached in `fetch_video`; `_skip_intermediate_resize=True` only suppresses the standalone resize branch, not the preprocess path. Preprocessed frames sent to vLLM will have 32-aligned dimensions and sit in a larger pixel budget than before, silently changing inference inputs for all Nemotron users.

Consider adding Nemotron variants to `get_qwen_vl_image_factor` (or a more general `get_model_image_factor`) with their legacy factor 28:
```python
_NEMOTRON_IMAGE_FACTOR = 28  # legacy preprocessing, matches pre-Qwen3 behaviour

def get_model_image_factor(model_variant: str) -> int:
    if model_variant in _QWEN_VL_VARIANTS:
        return _QWEN_VL_VARIANTS[model_variant].get("image_factor", IMAGE_FACTOR)
    if model_variant.startswith("nemotron"):
        return _NEMOTRON_IMAGE_FACTOR
    return IMAGE_FACTOR
```
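Run against an abbreviated copy of the variant table quoted earlier in this review, the suggested helper keeps Nemotron at its legacy factor (the table contents here are abridged and only illustrative):

```python
IMAGE_FACTOR = 32
_NEMOTRON_IMAGE_FACTOR = 28  # legacy preprocessing, matches pre-Qwen3 behaviour

# Abbreviated copy of _QWEN_VL_VARIANTS from the diff above
_QWEN_VL_VARIANTS = {
    "qwen": {"model_id": "Qwen/Qwen3-VL-8B-Instruct", "image_factor": 32},
}

def get_model_image_factor(model_variant: str) -> int:
    if model_variant in _QWEN_VL_VARIANTS:
        return _QWEN_VL_VARIANTS[model_variant].get("image_factor", IMAGE_FACTOR)
    if model_variant.startswith("nemotron"):
        return _NEMOTRON_IMAGE_FACTOR
    return IMAGE_FACTOR

print(get_model_image_factor("qwen"))         # 32
print(get_model_image_factor("nemotron_vl"))  # 28: legacy behaviour preserved
```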
/ok to test c4eba76
/ok to test 385d1b7
```diff
 IMAGE_FACTOR = 32
 MIN_PIXELS = 4 * 32 * 32
 MAX_PIXELS = 16384 * 32 * 32
 MAX_RATIO = 200

-VIDEO_MIN_PIXELS = 128 * 28 * 28
-VIDEO_MAX_PIXELS = 768 * 28 * 28
-VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
+VIDEO_MIN_PIXELS = 128 * 32 * 32
+VIDEO_MAX_PIXELS = 768 * 32 * 32
+VIDEO_TOTAL_PIXELS = 24576 * 32 * 32
```
32²-based pixel bounds silently break Qwen2.5 preprocessing

`VIDEO_MIN_PIXELS`, `VIDEO_MAX_PIXELS`, and `VIDEO_TOTAL_PIXELS` are now computed with factor 32, but `smart_resize` for the "qwen2.5" variant is called with `image_factor=28` while these global bounds are used unchanged. This raises the effective minimum frame resolution by ~31% (100 352 → 131 072 pixels) for all Qwen2.5 users, silently altering preprocessing relative to Qwen2.5-VL's documented expectations.

The `image_factor` was correctly made per-variant in `get_qwen_vl_image_factor`, but the pixel bounds were not. Inside `fetch_video` (lines 344-350), `min_pixels=VIDEO_MIN_PIXELS` and `max_pixels` derived from `VIDEO_MAX_PIXELS`/`VIDEO_TOTAL_PIXELS` are passed to `smart_resize` regardless of which variant is running. Consider making these bounds variant-aware too, e.g. by accepting `min_pixels`/`max_pixels` parameters alongside `image_factor`.
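One way to make the bounds variant-aware, with the 28²/32² constants taken from the diff above (the dict layout and function name are illustrative, not the PR's code):

```python
_VIDEO_PIXEL_BOUNDS = {
    # Factor-28 bounds matching Qwen2.5-VL's documented preprocessing
    "qwen2.5": {"min": 128 * 28 * 28, "max": 768 * 28 * 28, "total": 24576 * 28 * 28},
    # Factor-32 bounds for Qwen3-VL
    "qwen3": {"min": 128 * 32 * 32, "max": 768 * 32 * 32, "total": 24576 * 32 * 32},
}

def get_video_pixel_bounds(model_variant: str) -> dict:
    # Unknown variants fall back to the Qwen3 bounds, mirroring the image-factor fallback
    return _VIDEO_PIXEL_BOUNDS.get(model_variant, _VIDEO_PIXEL_BOUNDS["qwen3"])

print(get_video_pixel_bounds("qwen2.5")["min"])  # 100352, the pre-PR minimum
print(get_video_pixel_bounds("qwen3")["min"])    # 131072
```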
| """ | ||
| with make_pipeline_named_temporary_file(sub_dir="windowing") as input_file: | ||
| if image_factor is None: | ||
| image_factor = IMAGE_FACTOR |
I think we need to update `min_pixels` and `max_pixels` as well. A cleaner way would be to define a const dict keyed by `model_id`.
```diff
 # Mapping of variants to their HuggingFace model IDs
 VARIANT_MAPPING: dict[str, str] = {
-    "qwen": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "qwen": "Qwen/Qwen3-VL-8B-Instruct",
```
I think we should just remove `qwen` and accept either `qwen3` or `qwen2.5`.
/ok to test eb9daa2
```diff
 info_calls = [str(call) for call in mock_logger.info.call_args_list]
 assert any("Caption for clip" in call for call in info_calls)
-assert any("Enhanced QwenLM Caption" in call for call in info_calls)
+assert any("Enhanced qwen LM Caption" in call for call in info_calls)
```
Verbose log assertion will always fail: wrong substring

The test stage is initialised with `model_variant="qwen3"`, so `caption_enhancement.py:179` emits `"Enhanced qwen3 LM Caption for clip …"`. The string `"Enhanced qwen LM Caption"` is not a substring of `"Enhanced qwen3 LM Caption …"` because `"qwen"` is immediately followed by `"3"`, not `" LM"`. The assertion will always fail.
```diff
-assert any("Enhanced qwen LM Caption" in call for call in info_calls)
+assert any("Enhanced qwen3 LM Caption" in call for call in info_calls)
```
/ok to test ce94e4a
Closing due to PR #1827.
Make Qwen Model ID and Revision Configurable
Summary
- `QwenLM` and `QwenVL`: both `__init__` and `download_weights_on_node` now accept `model_id` and `model_revision` arguments, replacing hardcoded module-level constants
- `QwenLM` now defaults to `Qwen/Qwen3-14B` (`8268fe3`) and `QwenVL` to `Qwen/Qwen3-VL-8B-Instruct` (`0c351dd`)
- A `_is_local_path` helper guards against accidentally passing a non-Qwen HuggingFace ID; absolute paths and `./`-prefixed relative paths bypass the check
- `_QWEN_LM_MODEL_ID`, `_QWEN_LM_MODEL_REVISION`, `_QWEN2_5_VL_MODEL_ID`, `_QWEN2_5_VL_MODEL_REVISION`, and `_QWEN_VARIANTS_INFO` are removed; the default strings are now inline in the function signatures
- `CaptionGenerationStage` and `CaptionEnhancementStage` gain optional `model_id: str | None = None` and `model_revision: str | None = None` fields that propagate to both model construction and weight download

Usage
Override the model at the stage level:
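A sketch of the stage-level override, assuming the `CaptionGenerationStage` fields listed in the summary above (the exact constructor signature is not shown in this thread, and other required arguments may exist):

```python
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage

stage = CaptionGenerationStage(
    model_dir="/models",
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",  # overrides the Qwen3-VL-8B default
    model_revision="main",
)
```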
Or directly on the model class:
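A corresponding sketch at the model level, assuming the `QwenVL.__init__` and `download_weights_on_node` parameters described in the summary (argument names taken from this PR; further required arguments may exist):

```python
from nemo_curator.models.qwen_vl import QwenVL

QwenVL.download_weights_on_node(
    "/models",
    model_id="Qwen/Qwen3-VL-8B-Instruct",
    model_revision="0c351dd",
)
model = QwenVL(
    model_dir="/models",
    model_id="Qwen/Qwen3-VL-8B-Instruct",
    model_revision="0c351dd",
)
```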
Files Changed
- `nemo_curator/models/qwen_lm.py`
- `nemo_curator/models/qwen_vl.py`: `_QWEN_VARIANTS_INFO` removed, default to Qwen3-VL-8B
- `nemo_curator/stages/video/caption/caption_enhancement.py`: `model_id`/`model_revision` dataclass fields, passed through to `QwenLM`
- `nemo_curator/stages/video/caption/caption_generation.py`: `model_id`/`model_revision` dataclass fields, passed through to `QwenVL`
- `tests/models/test_qwen_lm.py`
- `tests/models/test_qwen_vl.py`
- `tests/stages/video/caption/test_caption_enhancement.py`
- `tests/stages/video/caption/test_caption_generation.py`
- `tutorials/video/getting-started/video_split_clip_example.py`