Commit c6e4e98

Improve calibration loop without packing
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 770ad97 commit c6e4e98

6 files changed

Lines changed: 92 additions & 214 deletions

CHANGELOG.rst

Lines changed: 0 additions & 2 deletions
@@ -22,13 +22,11 @@ Changelog
 - Add composable ``$import`` system for recipe YAML configs, enabling reusable config snippets referenced via ``{$import: name}`` markers. All built-in PTQ recipes converted to use imports with shared snippets under ``modelopt_recipes/configs/`` (numeric formats, quant_cfg building blocks, presets). See :ref:`composable-imports`.
 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
 - Add support for ``active_params`` (for MoE models) and ``memory_mb`` constraints in Minitron pruning on top of existing ``params`` constraint. You can also provide multiple constraints. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details. The underlying utility functions ``mcore_param_count``, ``mcore_memory_footprint_mb``, and ``print_mcore_model_stats`` in ``modelopt.torch.nas.plugins.megatron_model_stats`` are also available for standalone use to compute parameter counts and memory footprints (weights + KV-cache + Mamba state) for any Megatron-Core model.
-- Enable ``--calib_mbs>1`` support in Minitron pruning for faster calibration
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
 - Add NVFP4 W4A16 weight-only quantization (``w4a16_nvfp4``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.W4A16_NVFP4_CFG`` or ``--qformat w4a16_nvfp4`` in ``hf_ptq.py``. vLLM deployment support is in progress.
 - Add ``DATASET_COMBOS`` to ``modelopt.torch.utils.dataset_utils`` — single ``--dataset`` tokens that fan out to multiple registered datasets; per-entry ``num_samples`` is split evenly across the members. Initial combos: ``cnn_nemotron_v2_mix`` (``cnn_dailymail`` + ``nemotron-post-training-dataset-v2``, used by ``hf_ptq.py`` when no ``--dataset`` is provided) and ``nemotron-post-training-v3`` (the seven ``nvidia/Nemotron-*`` SFT datasets added in #1498, mirroring the `nemotron-post-training-v3 collection <https://huggingface.co/collections/nvidia/nemotron-post-training-v3>`_). Combo names are listed by ``get_supported_datasets()`` and surfaced in ``--dataset`` help. ``get_dataset_dataloader`` rejects inputs that mix a combo with one of its member datasets (e.g. ``cnn_dailymail,cnn_nemotron_v2_mix``) to avoid double-sampling, and ``get_dataset_samples`` rejects combo names so callers route through the dataloader. ``hf_ptq.py`` default ``--calib_size`` is bumped from ``512`` to ``1024`` so the total calibration sample count under the new default combo matches the previous two-dataset fallback.
 - The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
-- Add ``pack`` option to ``modelopt.torch.utils.dataset_utils.get_dataset_dataloader``. When ``True``, raw samples from each source are concatenated into a per-source token stream (separated by ``tokenizer.eos_token_id``) and sliced into uniform ``max_sample_length`` chunks, preserving the requested per-source ratio in ``num_samples``. Eliminates padding-token noise from calibration and keeps long-document context intact. Default ``False`` for backward compatibility; recommended for pruning and amax-based PTQ.

 **Bug Fixes**

examples/megatron_bridge/README.md

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ torchrun --nproc_per_node 2 prune_minitron.py \
     --hf_model_name_or_path Qwen/Qwen3-8B \
     --prune_target_memory_mb 12288 \
     --seq_length 4096 \
-    --calib_mbs 1 \
+    --calib_batch_size 1 \
     --output_hf_path /tmp/Qwen3-8B-Pruned-12GB
 ```
108108

examples/megatron_bridge/prune_minitron.py

Lines changed: 7 additions & 12 deletions
@@ -102,8 +102,7 @@ def get_args() -> argparse.Namespace:
         "--calib_num_samples", type=int, default=1024, help="Number of samples for calibration"
     )
     # TODO: Add support for pre-training dataset (pre-tokenized)
-    parser.add_argument("--calib_mbs", type=int, default=1, help="Calibration micro-batch size")
-    parser.add_argument("--calib_gbs", type=int, default=1, help="Calibration global batch size")
+    parser.add_argument("--calib_batch_size", type=int, default=1, help="Calibration batch size")
     parser.add_argument("--seq_length", type=int, default=4096)
     # Pruning parameters
     parser.add_argument(
@@ -159,8 +158,8 @@ def get_args() -> argparse.Namespace:
         default=None,
         help=(
             "Batch size used only for KV-cache sizing in --prune_target_memory_mb. "
-            "Defaults to --calib_mbs when not set. "
-            "Use this to target an inference batch size that differs from the calibration micro-batch size."
+            "Defaults to --calib_batch_size when not set. "
+            "Use this to target an inference batch size that differs from the calibration batch size."
         ),
     )

@@ -222,12 +221,6 @@ def get_args() -> argparse.Namespace:
     args = parser.parse_args()

     # Validate pruning target arguments
-    if args.calib_mbs > args.calib_gbs:
-        args.calib_gbs = args.calib_mbs
-        print_rank_0(
-            f"{args.calib_gbs=} is less than {args.calib_mbs=}, setting it to {args.calib_mbs}."
-        )
-
     _nas_targets = [
         args.prune_target_params,
         args.prune_target_active_params,
@@ -302,7 +295,7 @@ def main(args: argparse.Namespace):
         dataset_name=args.calib_dataset_name,
         num_samples=args.calib_num_samples,
         seq_length=args.seq_length,
-        batch_size=args.calib_gbs,
+        batch_size=args.calib_batch_size,
     )

     pruning_config = {
@@ -382,7 +375,9 @@ def score_func(m):
         pruning_config["top_k"] = args.top_k
         # memory_mb constraint requires batch_size and seq_length
         pruning_config["batch_size"] = (
-            args.inference_batch_size if args.inference_batch_size is not None else args.calib_mbs
+            args.inference_batch_size
+            if args.inference_batch_size is not None
+            else args.calib_batch_size
         )
         pruning_config["seq_length"] = args.seq_length
     print_rank_0(f"Pruning constraints: {pruning_constraints}")
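
The fallback in the last hunk can be tried in isolation. The following is a minimal sketch, not the script itself: the flag name `--inference_batch_size` and its `type=int` are assumed from the `args.inference_batch_size` attribute and the help text in the diff, and `build_parser` and `resolve_kv_cache_batch_size` are hypothetical helpers introduced only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy parser holding only the two batch-size flags discussed in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--calib_batch_size", type=int, default=1, help="Calibration batch size")
    # Assumed flag spelling and type; the diff only shows default=None and the help text.
    parser.add_argument("--inference_batch_size", type=int, default=None)
    return parser


def resolve_kv_cache_batch_size(args: argparse.Namespace) -> int:
    """Hypothetical helper: pick the batch size used for KV-cache sizing."""
    if args.inference_batch_size is not None:
        return args.inference_batch_size
    return args.calib_batch_size


parser = build_parser()
default_args = parser.parse_args(["--calib_batch_size", "4"])
override_args = parser.parse_args(["--calib_batch_size", "4", "--inference_batch_size", "16"])

print(resolve_kv_cache_batch_size(default_args))   # falls back to --calib_batch_size
print(resolve_kv_cache_batch_size(override_args))  # explicit inference batch size wins
```

With a single `--calib_batch_size` there is no longer a micro/global pair to reconcile, which is why the `calib_mbs > calib_gbs` validation block could be dropped.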

modelopt/torch/utils/dataset_utils.py

Lines changed: 17 additions & 147 deletions
@@ -29,8 +29,6 @@
 from torch.utils.data import DataLoader
 from tqdm import tqdm

-from .logging import warn_rank_0
-
 if TYPE_CHECKING:
     from transformers import PreTrainedTokenizerBase

@@ -559,103 +557,6 @@ def __len__(self):
         return len(next(iter(self.encodings.values())))


-def _build_packed_input_ids(
-    dataset_name: list[str],
-    num_samples: list[int],
-    max_sample_length: int,
-    tokenizer: "PreTrainedTokenizerBase",
-    apply_chat_template: bool,
-) -> torch.Tensor:
-    """Pack raw samples into a ``(n_chunks, max_sample_length)`` int tensor.
-
-    Each source contributes ``num_sample`` chunks (or fewer if exhausted), so the requested
-    per-source ratio in ``num_samples`` is preserved instead of letting whichever source
-    appears first dominate the budget. Within a source, tokenization runs in batches of
-    ``max(8, num_sample // 4)`` samples so we stop tokenizing once the chunk budget is
-    full, instead of eagerly paying for the entire ``num_sample * 2`` oversample.
-
-    Documents are separated by ``tokenizer.eos_token_id`` when set; ``add_special_tokens=False``
-    avoids injecting a fresh BOS at every sample boundary. Note that packed chunks therefore
-    have no BOS at position 0 — fine for amax / sensitivity calibration where boundary
-    tokens are statistically dominated, less ideal for callers that need BOS-prefixed
-    sequences (use ``pack=False`` for those). When ``apply_chat_template=True``, the rendered
-    samples often already end with the chat EOS marker (e.g. ``<|im_end|>``), which can
-    tokenize to ``eos_token_id`` and produce ``<eos><eos>`` at document boundaries —
-    harmless for calibration statistics but worth noting.
-
-    Sizing note: ``num_sample`` here is the desired chunk count per source. The loader
-    internally fetches ``num_sample * 2`` raw samples. Short-document sources can still
-    under-fill — to recover the target, scale ``num_sample`` itself (which doubles both
-    the target and the internal raw-sample draw). Example: short-row source returning 1
-    chunk for ``num_sample=64`` typically returns 4 chunks for ``num_sample=128`` because
-    the raw draw goes from 128 to 256.
-    """
-    sep_id = tokenizer.eos_token_id
-    if sep_id is None:
-        warn_rank_0(
-            "pack=True: tokenizer has no eos_token_id; raw documents will be concatenated "
-            "without a separator, so calibration activations will span document boundaries. "
-            "Set tokenizer.eos_token_id (or another sentinel) for explicit separators."
-        )
-
-    per_source_chunks: list[list[int]] = []
-    actual_per_source: list[int] = []
-    for ds_name, num_sample in zip(dataset_name, num_samples):
-        # 2x oversample sized for cnn_dailymail-style long docs; short-sample datasets may
-        # still under-fill and trigger the warning below.
-        raw_samples = get_dataset_samples(
-            ds_name,
-            num_sample * 2,
-            apply_chat_template=apply_chat_template,
-            tokenizer=tokenizer,
-        )
-        needed_tokens = num_sample * max_sample_length
-        # max(8, ...) floor keeps the Rust-batched tokenizer happy for small calibrations
-        # (num_sample < 32 → batch is 8); above that, `// 4` grows the batch with the
-        # request while keeping the early-exit check granular enough to actually skip
-        # tokenizing the back half of the 2x oversample on long-doc sources.
-        tokenize_batch_size = max(8, num_sample // 4)
-        stream: list[int] = []
-        for batch_start in range(0, len(raw_samples), tokenize_batch_size):
-            if len(stream) >= needed_tokens:
-                break
-            batch = raw_samples[batch_start : batch_start + tokenize_batch_size]
-            # padding/truncation=False explicit: don't trust subclass __call__ defaults.
-            encoded = tokenizer(batch, add_special_tokens=False, padding=False, truncation=False)[
-                "input_ids"
-            ]
-            for ids in encoded:
-                stream.extend(ids)
-                if sep_id is not None:
-                    stream.append(sep_id)
-                if len(stream) >= needed_tokens:
-                    break
-        available = len(stream) // max_sample_length
-        take = min(num_sample, available)
-        per_source_chunks.extend(
-            stream[i * max_sample_length : (i + 1) * max_sample_length] for i in range(take)
-        )
-        actual_per_source.append(take)
-
-    n_chunks = len(per_source_chunks)
-    total_chunks = sum(num_samples)
-    if n_chunks == 0:
-        raise ValueError(
-            f"pack=True yielded 0 chunks across {len(dataset_name)} source(s); each source "
-            f"needs at least {max_sample_length} tokens after concatenation. Try longer "
-            "samples or a smaller max_sample_length."
-        )
-    if n_chunks < total_chunks:
-        warn_rank_0(
-            f"pack=True produced {n_chunks} chunks (per-source {actual_per_source}) vs "
-            f"requested {total_chunks} (per-source {list(num_samples)}). Some sources "
-            "exhausted before reaching their target. The loader internally fetches "
-            "`num_samples * 2` raw samples per source; for very short-sample sources, "
-            "pass a 2-3x larger `num_samples` so the 2x draw covers the chunk budget."
-        )
-    return torch.tensor(per_source_chunks, dtype=torch.long)
-
-
 def get_dataset_dataloader(
     dataset_name: str | list[str] = "cnn_dailymail",
     tokenizer: "PreTrainedTokenizerBase | None" = None,
@@ -665,7 +566,6 @@ def get_dataset_dataloader(
     device: torch.device | str | None = None,
     include_labels: bool = False,
     apply_chat_template: bool = False,
-    pack: bool = False,
 ) -> DataLoader:
     """Get a dataloader with the dataset name and tokenizer of the target model.
@@ -676,31 +576,13 @@ def get_dataset_dataloader(
             an ``int`` (applied to a single source) or a list aligned with ``dataset_name``.
         tokenizer: Instance of HuggingFace tokenizer.
         batch_size: Batch size of the returned dataloader.
-        num_samples: Number of samples from the dataset. Semantics depend on ``pack``:
-            with ``pack=False`` this is the number of raw samples to fetch and tokenize
-            (each becomes one row of ``max_sample_length`` after truncate-and-pad); with
-            ``pack=True`` this is the number of ``max_sample_length``-token chunks to
-            produce per source. Migrating an existing call site to ``pack=True`` may
-            therefore need a different value to hit the same total-token calibration
-            budget.
+        num_samples: Number of raw samples to fetch and tokenize (each becomes one row of
+            ``max_sample_length`` after truncate-and-pad).
         max_sample_length: Maximum length of a sample.
         device: Target device for the returned dataloader.
         include_labels: Whether to include labels in the dataloader.
         apply_chat_template: Whether to apply the chat template to the samples
             (if supported by the dataset).
-        pack: If True, raw samples from each source are concatenated into a per-source token
-            stream (separated by ``tokenizer.eos_token_id`` when set) and sliced into
-            uniform-length chunks of ``max_sample_length``; the per-source chunks are then
-            concatenated **contiguously by source** (no cross-source interleaving), preserving
-            the requested per-source ratio in ``num_samples``. Avoids the per-sample
-            truncate-and-pad waste of the default path: long documents stay intact, short
-            ones don't introduce padding noise. Recommended for pruning calibration and
-            amax-based PTQ where activation statistics should reflect natural-length
-            contexts rather than padded fragments. ``attention_mask`` is unconditionally
-            all-ones — attention crosses document boundaries (the ``eos`` separator is a
-            token, not a mask boundary). Raises ``ValueError`` if the dataset doesn't yield
-            enough tokens to form a single chunk; emits a rank-0 warning if it yields
-            fewer chunks than requested.

     Returns:
         An instance of dataloader.
@@ -752,30 +634,22 @@ def get_dataset_dataloader(
             expanded_num_samples.append(n)
     dataset_name, num_samples = expanded_names, expanded_num_samples

-    if pack:
-        input_ids = _build_packed_input_ids(
-            dataset_name, num_samples, max_sample_length, tokenizer, apply_chat_template
-        )
-        batch_encoded = {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}
-        if device:
-            batch_encoded = {k: v.to(device) for k, v in batch_encoded.items()}
-    else:
-        all_samples = []
-        for ds_name, num_sample in zip(dataset_name, num_samples):
-            samples = get_dataset_samples(
-                ds_name, num_sample, apply_chat_template=apply_chat_template, tokenizer=tokenizer
-            )
-            all_samples.extend(samples)
-
-        batch_encoded = tokenizer(
-            all_samples,
-            return_tensors="pt",
-            padding=True,
-            truncation=True,
-            max_length=max_sample_length,
+    all_samples = []
+    for ds_name, num_sample in zip(dataset_name, num_samples):
+        samples = get_dataset_samples(
+            ds_name, num_sample, apply_chat_template=apply_chat_template, tokenizer=tokenizer
         )
-        if device:
-            batch_encoded = batch_encoded.to(device)
+        all_samples.extend(samples)
+
+    batch_encoded = tokenizer(
+        all_samples,
+        return_tensors="pt",
+        padding=True,
+        truncation=True,
+        max_length=max_sample_length,
+    )
+    if device:
+        batch_encoded = batch_encoded.to(device)

     if include_labels:
         # Labels are needed when backward is called in the model.
@@ -1044,7 +918,6 @@ def create_forward_loop(
     include_labels: bool = False,
     dataloader: DataLoader | None = None,
     allowed_non_tensor_keys: set | None = None,
-    pack: bool = False,
 ) -> Callable:
     """Creates and returns a forward loop function configured for a specific model, dataset, and tokenizer.
@@ -1066,8 +939,6 @@ def create_forward_loop(
         allowed_non_tensor_keys: Set of key names whose batch values may be non-tensor types.
             Useful when the dataloader yields batches with non-standard fields (e.g., nested
             model outputs).
-        pack: Forwarded to :func:`get_dataset_dataloader`. See its docstring for semantics
-            (including the ``num_samples`` chunk-vs-document distinction).

     Example usage for quantization:
@@ -1105,7 +976,6 @@ def create_forward_loop(
             max_sample_length=max_sample_length,
             device=device,
             include_labels=include_labels,
-            pack=pack,
         )

     return lambda model: _forward_loop(model, dataloader, allowed_non_tensor_keys)
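
To make the behavioral difference concrete, here is a rough sketch contrasting the truncate-and-pad path that remains with the packing path this commit removes. It works on plain lists of token ids rather than a real tokenizer. `EOS_ID`, `PAD_ID`, `truncate_and_pad`, and `pack` are illustrative stand-ins, not names from `dataset_utils.py`, and right-side padding is an assumption (the real padding side depends on the tokenizer).

```python
EOS_ID = 0   # stand-in for tokenizer.eos_token_id
PAD_ID = -1  # stand-in pad id, chosen only so padding is easy to spot


def truncate_and_pad(docs: list[list[int]], max_len: int) -> list[list[int]]:
    """Remaining path: one row per document, truncated, then padded to the longest row."""
    rows = [doc[:max_len] for doc in docs]
    width = max(len(row) for row in rows)
    return [row + [PAD_ID] * (width - len(row)) for row in rows]


def pack(docs: list[list[int]], max_len: int, num_chunks: int) -> list[list[int]]:
    """Removed path: join documents with EOS separators, then slice fixed-size chunks."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    available = len(stream) // max_len  # only full chunks are kept
    take = min(num_chunks, available)
    return [stream[i * max_len : (i + 1) * max_len] for i in range(take)]


docs = [[11, 12], [21, 22, 23, 24, 25, 26], [31, 32, 33]]
padded = truncate_and_pad(docs, max_len=4)
packed = pack(docs, max_len=4, num_chunks=3)

print(padded)  # rows contain pad ids; the long document loses its tail
print(packed)  # no pad ids; chunks cross document boundaries at EOS
```

In the padded output `num_samples` counts documents, while in the packed output it counts chunks, which is the distinction the removed docstring warned about when migrating call sites.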
