Improve calibration loop without packing

kevalmorabia97 · kevalmorabia97 · commit c6e4e987242d · 2026-05-20T09:15:07.000-07:00
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
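
Reviewer note: with the `pack` path removed, `get_dataset_dataloader` keeps only the truncate-and-pad tokenization (`padding=True`, `truncation=True`, `max_length=max_sample_length`). The sketch below is a minimal pure-Python illustration of that behavior, not the library's implementation. The helper name `truncate_and_pad`, the `pad_id=0` default, and right-side padding are assumptions made for illustration; a real tokenizer uses its own pad token and padding side.

```python
def truncate_and_pad(token_rows, max_sample_length, pad_id=0):
    """Hypothetical helper: mimic truncate-then-pad on already-tokenized rows.

    Each row is cut to at most ``max_sample_length`` tokens, then every row is
    right-padded to the length of the longest remaining row.
    """
    truncated = [row[:max_sample_length] for row in token_rows]
    width = max((len(row) for row in truncated), default=0)
    input_ids = [row + [pad_id] * (width - len(row)) for row in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# One long row (truncated to 4 tokens) and one short row (padded to 4 tokens).
batch = truncate_and_pad([[5, 6, 7, 8, 9], [3, 4]], max_sample_length=4)
```

Under these assumptions the short row carries two padding positions, masked out in `attention_mask`. This is the padding overhead that the removed `pack` option was intended to avoid.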
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -22,13 +22,11 @@ Changelog
 - Add composable ``$import`` system for recipe YAML configs, enabling reusable config snippets referenced via ``{$import: name}`` markers. All built-in PTQ recipes converted to use imports with shared snippets under ``modelopt_recipes/configs/`` (numeric formats, quant_cfg building blocks, presets). See :ref:`composable-imports`.
 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
 - Add support for ``active_params`` (for MoE models) and ``memory_mb`` constraints in Minitron pruning on top of existing ``params`` constraint. You can also provide multiple constraints. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details. The underlying utility functions ``mcore_param_count``, ``mcore_memory_footprint_mb``, and ``print_mcore_model_stats`` in ``modelopt.torch.nas.plugins.megatron_model_stats`` are also available for standalone use to compute parameter counts and memory footprints (weights + KV-cache + Mamba state) for any Megatron-Core model.
-- Enable ``--calib_mbs>1`` support in Minitron pruning for faster calibration
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
 - Add NVFP4 W4A16 weight-only quantization (``w4a16_nvfp4``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.W4A16_NVFP4_CFG`` or ``--qformat w4a16_nvfp4`` in ``hf_ptq.py``. vLLM deployment support is in progress.
 - Add ``DATASET_COMBOS`` to ``modelopt.torch.utils.dataset_utils`` — single ``--dataset`` tokens that fan out to multiple registered datasets; per-entry ``num_samples`` is split evenly across the members. Initial combos: ``cnn_nemotron_v2_mix`` (``cnn_dailymail`` + ``nemotron-post-training-dataset-v2``, used by ``hf_ptq.py`` when no ``--dataset`` is provided) and ``nemotron-post-training-v3`` (the seven ``nvidia/Nemotron-*`` SFT datasets added in #1498, mirroring the `nemotron-post-training-v3 collection <https://huggingface.co/collections/nvidia/nemotron-post-training-v3>`_). Combo names are listed by ``get_supported_datasets()`` and surfaced in ``--dataset`` help. ``get_dataset_dataloader`` rejects inputs that mix a combo with one of its member datasets (e.g. ``cnn_dailymail,cnn_nemotron_v2_mix``) to avoid double-sampling, and ``get_dataset_samples`` rejects combo names so callers route through the dataloader. ``hf_ptq.py`` default ``--calib_size`` is bumped from ``512`` to ``1024`` so the total calibration sample count under the new default combo matches the previous two-dataset fallback.
 - The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
-- Add ``pack`` option to ``modelopt.torch.utils.dataset_utils.get_dataset_dataloader``. When ``True``, raw samples from each source are concatenated into a per-source token stream (separated by ``tokenizer.eos_token_id``) and sliced into uniform ``max_sample_length`` chunks, preserving the requested per-source ratio in ``num_samples``. Eliminates padding-token noise from calibration and keeps long-document context intact. Default ``False`` for backward compatibility; recommended for pruning and amax-based PTQ.
 
 **Bug Fixes**
 
diff --git a/examples/megatron_bridge/README.md b/examples/megatron_bridge/README.md
@@ -102,7 +102,7 @@ torchrun --nproc_per_node 2 prune_minitron.py \
     --hf_model_name_or_path Qwen/Qwen3-8B \
     --prune_target_memory_mb 12288 \
     --seq_length 4096 \
-    --calib_mbs 1 \
+    --calib_batch_size 1 \
     --output_hf_path /tmp/Qwen3-8B-Pruned-12GB
 ```
 
diff --git a/examples/megatron_bridge/prune_minitron.py b/examples/megatron_bridge/prune_minitron.py
@@ -102,8 +102,7 @@ def get_args() -> argparse.Namespace:
         "--calib_num_samples", type=int, default=1024, help="Number of samples for calibration"
     )
     # TODO: Add support for pre-training dataset (pre-tokenized)
-    parser.add_argument("--calib_mbs", type=int, default=1, help="Calibration micro-batch size")
-    parser.add_argument("--calib_gbs", type=int, default=1, help="Calibration global batch size")
+    parser.add_argument("--calib_batch_size", type=int, default=1, help="Calibration batch size")
     parser.add_argument("--seq_length", type=int, default=4096)
     # Pruning parameters
     parser.add_argument(
@@ -159,8 +158,8 @@ def get_args() -> argparse.Namespace:
         default=None,
         help=(
             "Batch size used only for KV-cache sizing in --prune_target_memory_mb. "
-            "Defaults to --calib_mbs when not set. "
-            "Use this to target an inference batch size that differs from the calibration micro-batch size."
+            "Defaults to --calib_batch_size when not set. "
+            "Use this to target an inference batch size that differs from the calibration batch size."
         ),
     )
 
@@ -222,12 +221,6 @@ def get_args() -> argparse.Namespace:
     args = parser.parse_args()
 
     # Validate pruning target arguments
-    if args.calib_mbs > args.calib_gbs:
-        args.calib_gbs = args.calib_mbs
-        print_rank_0(
-            f"{args.calib_gbs=} is less than {args.calib_mbs=}, setting it to {args.calib_mbs}."
-        )
-
     _nas_targets = [
         args.prune_target_params,
         args.prune_target_active_params,
@@ -302,7 +295,7 @@ def main(args: argparse.Namespace):
         dataset_name=args.calib_dataset_name,
         num_samples=args.calib_num_samples,
         seq_length=args.seq_length,
-        batch_size=args.calib_gbs,
+        batch_size=args.calib_batch_size,
     )
 
     pruning_config = {
@@ -382,7 +375,9 @@ def score_func(m):
         pruning_config["top_k"] = args.top_k
         # memory_mb constraint requires batch_size and seq_length
         pruning_config["batch_size"] = (
-            args.inference_batch_size if args.inference_batch_size is not None else args.calib_mbs
+            args.inference_batch_size
+            if args.inference_batch_size is not None
+            else args.calib_batch_size
         )
         pruning_config["seq_length"] = args.seq_length
     print_rank_0(f"Pruning constraints: {pruning_constraints}")
diff --git a/modelopt/torch/utils/dataset_utils.py b/modelopt/torch/utils/dataset_utils.py
@@ -29,8 +29,6 @@
 from torch.utils.data import DataLoader
 from tqdm import tqdm
 
-from .logging import warn_rank_0
-
 if TYPE_CHECKING:
     from transformers import PreTrainedTokenizerBase
 
@@ -559,103 +557,6 @@ def __len__(self):
         return len(next(iter(self.encodings.values())))
 
 
-def _build_packed_input_ids(
-    dataset_name: list[str],
-    num_samples: list[int],
-    max_sample_length: int,
-    tokenizer: "PreTrainedTokenizerBase",
-    apply_chat_template: bool,
-) -> torch.Tensor:
-    """Pack raw samples into a ``(n_chunks, max_sample_length)`` int tensor.
-
-    Each source contributes ``num_sample`` chunks (or fewer if exhausted), so the requested
-    per-source ratio in ``num_samples`` is preserved instead of letting whichever source
-    appears first dominate the budget. Within a source, tokenization runs in batches of
-    ``max(8, num_sample // 4)`` samples so we stop tokenizing once the chunk budget is
-    full, instead of eagerly paying for the entire ``num_sample * 2`` oversample.
-
-    Documents are separated by ``tokenizer.eos_token_id`` when set; ``add_special_tokens=False``
-    avoids injecting a fresh BOS at every sample boundary. Note that packed chunks therefore
-    have no BOS at position 0 — fine for amax / sensitivity calibration where boundary
-    tokens are statistically dominated, less ideal for callers that need BOS-prefixed
-    sequences (use ``pack=False`` for those). When ``apply_chat_template=True``, the rendered
-    samples often already end with the chat EOS marker (e.g. ``<|im_end|>``), which can
-    tokenize to ``eos_token_id`` and produce ``<eos><eos>`` at document boundaries —
-    harmless for calibration statistics but worth noting.
-
-    Sizing note: ``num_sample`` here is the desired chunk count per source. The loader
-    internally fetches ``num_sample * 2`` raw samples. Short-document sources can still
-    under-fill — to recover the target, scale ``num_sample`` itself (which doubles both
-    the target and the internal raw-sample draw). Example: short-row source returning 1
-    chunk for ``num_sample=64`` typically returns 4 chunks for ``num_sample=128`` because
-    the raw draw goes from 128 to 256.
-    """
-    sep_id = tokenizer.eos_token_id
-    if sep_id is None:
-        warn_rank_0(
-            "pack=True: tokenizer has no eos_token_id; raw documents will be concatenated "
-            "without a separator, so calibration activations will span document boundaries. "
-            "Set tokenizer.eos_token_id (or another sentinel) for explicit separators."
-        )
-
-    per_source_chunks: list[list[int]] = []
-    actual_per_source: list[int] = []
-    for ds_name, num_sample in zip(dataset_name, num_samples):
-        # 2x oversample sized for cnn_dailymail-style long docs; short-sample datasets may
-        # still under-fill and trigger the warning below.
-        raw_samples = get_dataset_samples(
-            ds_name,
-            num_sample * 2,
-            apply_chat_template=apply_chat_template,
-            tokenizer=tokenizer,
-        )
-        needed_tokens = num_sample * max_sample_length
-        # max(8, ...) floor keeps the Rust-batched tokenizer happy for small calibrations
-        # (num_sample < 32 → batch is 8); above that, `// 4` grows the batch with the
-        # request while keeping the early-exit check granular enough to actually skip
-        # tokenizing the back half of the 2x oversample on long-doc sources.
-        tokenize_batch_size = max(8, num_sample // 4)
-        stream: list[int] = []
-        for batch_start in range(0, len(raw_samples), tokenize_batch_size):
-            if len(stream) >= needed_tokens:
-                break
-            batch = raw_samples[batch_start : batch_start + tokenize_batch_size]
-            # padding/truncation=False explicit: don't trust subclass __call__ defaults.
-            encoded = tokenizer(batch, add_special_tokens=False, padding=False, truncation=False)[
-                "input_ids"
-            ]
-            for ids in encoded:
-                stream.extend(ids)
-                if sep_id is not None:
-                    stream.append(sep_id)
-                if len(stream) >= needed_tokens:
-                    break
-        available = len(stream) // max_sample_length
-        take = min(num_sample, available)
-        per_source_chunks.extend(
-            stream[i * max_sample_length : (i + 1) * max_sample_length] for i in range(take)
-        )
-        actual_per_source.append(take)
-
-    n_chunks = len(per_source_chunks)
-    total_chunks = sum(num_samples)
-    if n_chunks == 0:
-        raise ValueError(
-            f"pack=True yielded 0 chunks across {len(dataset_name)} source(s); each source "
-            f"needs at least {max_sample_length} tokens after concatenation. Try longer "
-            "samples or a smaller max_sample_length."
-        )
-    if n_chunks < total_chunks:
-        warn_rank_0(
-            f"pack=True produced {n_chunks} chunks (per-source {actual_per_source}) vs "
-            f"requested {total_chunks} (per-source {list(num_samples)}). Some sources "
-            "exhausted before reaching their target. The loader internally fetches "
-            "`num_samples * 2` raw samples per source; for very short-sample sources, "
-            "pass a 2-3x larger `num_samples` so the 2x draw covers the chunk budget."
-        )
-    return torch.tensor(per_source_chunks, dtype=torch.long)
-
-
 def get_dataset_dataloader(
     dataset_name: str | list[str] = "cnn_dailymail",
     tokenizer: "PreTrainedTokenizerBase | None" = None,
@@ -665,7 +566,6 @@ def get_dataset_dataloader(
     device: torch.device | str | None = None,
     include_labels: bool = False,
     apply_chat_template: bool = False,
-    pack: bool = False,
 ) -> DataLoader:
     """Get a dataloader with the dataset name and tokenizer of the target model.
 
@@ -676,31 +576,13 @@ def get_dataset_dataloader(
             an ``int`` (applied to a single source) or a list aligned with ``dataset_name``.
         tokenizer: Instance of HuggingFace tokenizer.
         batch_size: Batch size of the returned dataloader.
-        num_samples: Number of samples from the dataset. Semantics depend on ``pack``:
-            with ``pack=False`` this is the number of raw samples to fetch and tokenize
-            (each becomes one row of ``max_sample_length`` after truncate-and-pad); with
-            ``pack=True`` this is the number of ``max_sample_length``-token chunks to
-            produce per source. Migrating an existing call site to ``pack=True`` may
-            therefore need a different value to hit the same total-token calibration
-            budget.
+        num_samples: Number of raw samples to fetch and tokenize (each becomes one row of
+            ``max_sample_length`` after truncate-and-pad).
         max_sample_length: Maximum length of a sample.
         device: Target device for the returned dataloader.
         include_labels: Whether to include labels in the dataloader.
         apply_chat_template: Whether to apply the chat template to the samples
             (if supported by the dataset).
-        pack: If True, raw samples from each source are concatenated into a per-source token
-            stream (separated by ``tokenizer.eos_token_id`` when set) and sliced into
-            uniform-length chunks of ``max_sample_length``; the per-source chunks are then
-            concatenated **contiguously by source** (no cross-source interleaving), preserving
-            the requested per-source ratio in ``num_samples``. Avoids the per-sample
-            truncate-and-pad waste of the default path: long documents stay intact, short
-            ones don't introduce padding noise. Recommended for pruning calibration and
-            amax-based PTQ where activation statistics should reflect natural-length
-            contexts rather than padded fragments. ``attention_mask`` is unconditionally
-            all-ones — attention crosses document boundaries (the ``eos`` separator is a
-            token, not a mask boundary). Raises ``ValueError`` if the dataset doesn't yield
-            enough tokens to form a single chunk; emits a rank-0 warning if it yields
-            fewer chunks than requested.
 
     Returns:
         An instance of dataloader.
@@ -752,30 +634,22 @@ def get_dataset_dataloader(
             expanded_num_samples.append(n)
     dataset_name, num_samples = expanded_names, expanded_num_samples
 
-    if pack:
-        input_ids = _build_packed_input_ids(
-            dataset_name, num_samples, max_sample_length, tokenizer, apply_chat_template
-        )
-        batch_encoded = {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids)}
-        if device:
-            batch_encoded = {k: v.to(device) for k, v in batch_encoded.items()}
-    else:
-        all_samples = []
-        for ds_name, num_sample in zip(dataset_name, num_samples):
-            samples = get_dataset_samples(
-                ds_name, num_sample, apply_chat_template=apply_chat_template, tokenizer=tokenizer
-            )
-            all_samples.extend(samples)
-
-        batch_encoded = tokenizer(
-            all_samples,
-            return_tensors="pt",
-            padding=True,
-            truncation=True,
-            max_length=max_sample_length,
+    all_samples = []
+    for ds_name, num_sample in zip(dataset_name, num_samples):
+        samples = get_dataset_samples(
+            ds_name, num_sample, apply_chat_template=apply_chat_template, tokenizer=tokenizer
         )
-        if device:
-            batch_encoded = batch_encoded.to(device)
+        all_samples.extend(samples)
+
+    batch_encoded = tokenizer(
+        all_samples,
+        return_tensors="pt",
+        padding=True,
+        truncation=True,
+        max_length=max_sample_length,
+    )
+    if device:
+        batch_encoded = batch_encoded.to(device)
 
     if include_labels:
         # Labels are needed when backward is called in the model.
@@ -1044,7 +918,6 @@ def create_forward_loop(
     include_labels: bool = False,
     dataloader: DataLoader | None = None,
     allowed_non_tensor_keys: set | None = None,
-    pack: bool = False,
 ) -> Callable:
     """Creates and returns a forward loop function configured for a specific model, dataset, and tokenizer.
 
@@ -1066,8 +939,6 @@ def create_forward_loop(
         allowed_non_tensor_keys: Set of key names whose batch values may be non-tensor types.
             Useful when the dataloader yields batches with non-standard fields (e.g., nested
             model outputs).
-        pack: Forwarded to :func:`get_dataset_dataloader`. See its docstring for semantics
-            (including the ``num_samples`` chunk-vs-document distinction).
 
     Example usage for quantization:
 
@@ -1105,7 +976,6 @@ def create_forward_loop(
             max_sample_length=max_sample_length,
             device=device,
             include_labels=include_labels,
-            pack=pack,
         )
 
     return lambda model: _forward_loop(model, dataloader, allowed_non_tensor_keys)
diff --git a/modelopt/torch/utils/plugins/megatron_calibration.py b/modelopt/torch/utils/plugins/megatron_calibration.py
diff --git a/tests/unit/torch/utils/test_dataset_utils.py b/tests/unit/torch/utils/test_dataset_utils.py