feat(utils): add pack mode to get_dataset_dataloader
`pack=False` (default) tokenizes each calibration sample with
`padding=True, truncation=True, max_length=...`. On long-document
datasets like cnn_dailymail, that truncates away most of each article
and pads short samples up to the max, so calibration sees heavily
padded, context-poor batches.
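To make that cost concrete, here is a rough sketch that counts the tokens lost to truncation and the slots spent on padding for a toy batch. It is illustrative only: `pad_truncate_stats` is a hypothetical helper, and the integer lists stand in for real tokenizer output.

```python
def pad_truncate_stats(token_lists, max_length):
    """Return (dropped_tokens, pad_tokens) for truncate-and-pad batching.

    Assumes each sample is truncated to max_length and the batch is then
    padded to its longest remaining sample (the effect of padding=True).
    """
    truncated = [toks[:max_length] for toks in token_lists]
    # Tokens cut off the end of over-long samples.
    dropped = sum(len(t) - len(c) for t, c in zip(token_lists, truncated))
    # Pad slots needed to bring every sample up to the batch length.
    batch_len = max(len(c) for c in truncated)
    padded = sum(batch_len - len(c) for c in truncated)
    return dropped, padded


# Hypothetical batch: one long article and two short samples.
samples = [list(range(1200)), list(range(40)), list(range(300))]
dropped, padded = pad_truncate_stats(samples, max_length=512)
print(f"dropped={dropped} pad_slots={padded}")
```

In this toy batch more than half of the long sample is discarded, and a large share of the batch slots hold padding rather than text.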
`pack=True` concatenates the token streams of all raw samples
(separated by `tokenizer.eos_token_id`) and slices the result into
uniform `max_sample_length` chunks. Long documents are no longer
truncated, padding tokens disappear, and every chunk is filled with
real text.
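The packing idea can be sketched in a few lines. This is a minimal illustration, not the code in `dataset_utils`: `pack_token_streams` is a hypothetical name, and both the EOS after the final sample and the dropping of a trailing partial chunk are assumptions made here for simplicity.

```python
def pack_token_streams(token_lists, eos_token_id, chunk_length):
    """Concatenate tokenized samples, EOS-separated, into fixed-size chunks.

    Assumption: an EOS follows every sample (including the last) and any
    trailing remainder shorter than chunk_length is dropped.
    """
    stream = []
    for toks in token_lists:
        stream.extend(toks)
        stream.append(eos_token_id)  # mark the document boundary
    n_chunks = len(stream) // chunk_length
    return [
        stream[i * chunk_length:(i + 1) * chunk_length]
        for i in range(n_chunks)
    ]


# Toy token ids; 0 plays the role of tokenizer.eos_token_id.
chunks = pack_token_streams(
    [[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_token_id=0, chunk_length=4
)
print(chunks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Every chunk has the same length with no padding, and a sample longer than `chunk_length` simply continues into the next chunk.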
Measured on Qwen3-8B Minitron prune to 30L/3584/11776
(cnn_dailymail, 256 samples, seq_length 512):
  pack=False: MMLU 0.486
  pack=True:  MMLU 0.544 (+5.8 pts; Megatron-Bridge ref 0.563)
The default stays `False` for backward compatibility, with a
`warn_rank_0` message nudging callers toward `pack=True`; downstream
examples (hf_ptq.py, vlm_ptq.py, Megatron-LM prune.py / quantize.py)
can opt in incrementally.
Tests: extend `_FakeTokenizer` with `encode()` + `eos_token_id` and
flip `TestGetDatasetDataloaderBlending` / HF tiny-dataset tests to
`pack=True`.
CHANGELOG: pack entry under New Features; fused-TE-spec import fix
entry under Bug Fixes (covering Qwen3-style attention/MLP norm
loading via the new per-context rule keys).
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
CHANGELOG.rst: 6 additions & 1 deletion
@@ -25,8 +25,13 @@ Changelog
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
 - Add NVFP4 W4A16 weight-only quantization (``w4a16_nvfp4``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.W4A16_NVFP4_CFG`` or ``--qformat w4a16_nvfp4`` in ``hf_ptq.py``. vLLM deployment support is in progress.
+- Add ``pack: bool`` option to ``modelopt.torch.utils.dataset_utils.get_dataset_dataloader``. When ``True``, raw samples are concatenated into a single token stream (separated by ``tokenizer.eos_token_id``) and sliced into uniform ``max_sample_length`` chunks, instead of tokenizing each sample with truncate-and-pad. Eliminates padding-token noise from calibration and keeps long-document context intact. Default ``False`` for backward compatibility (with a warning); recommended for pruning and amax-based PTQ.
 
-0.44 (2026-05-18)
+**Bug Fixes**
+
+- Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.