
Commit a3c62c2

ChenhanYu and claude committed

refactor(examples): update imports, docstrings, and README for dataset/

- _specdec_aug → conversation_utils: update imports in ptv2/ptv3 scripts and generalize the module docstring
- augmentations.yaml: reference both ptv2 and ptv3 in header comment
- speculative_decoding/README.md: update all paths to ../dataset/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>

1 parent 510c4b2 commit a3c62c2

File tree: 5 files changed, +23 −17 lines

examples/dataset/augmentations.yaml (1 addition, 1 deletion)

```diff
@@ -1,4 +1,4 @@
-# Augmentation specs for make_nemotron_ptv2_dataset.py
+# Augmentation specs for make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py
 #
 # Each entry defines one augmentation variant applied cyclically across the dataset.
 # The augmented copy is the same size as the source — each row gets exactly one variant.
```
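The header's "applied cyclically" behavior can be sketched as a simple index-modulo assignment. This is a minimal illustration, not the real script: the variant names below are hypothetical placeholders, and the actual selection logic lives in `make_augment_fn` in `conversation_utils.py`.

```python
# Cyclic variant assignment: each dataset row gets exactly one augmentation
# variant, so the augmented copy is the same size as the source.
# Variant names are made up for illustration only.
variants = ["redirect_japanese", "style_concise", "format_markdown"]


def pick_variant(idx: int) -> str:
    """Cycle through the variant list by row index."""
    return variants[idx % len(variants)]


# Rows 0 and 3 receive the same variant; the cycle repeats every len(variants) rows.
assignments = [pick_variant(i) for i in range(6)]
```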

examples/dataset/conversation_utils.py (7 additions, 5 deletions)

```diff
@@ -14,7 +14,7 @@
 # limitations under the License.

 """
-Shared augmentation utilities for Nemotron speculative-decoding dataset scripts.
+Shared conversation manipulation and augmentation utilities for dataset preparation.

 Imported by make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py.
@@ -24,10 +24,10 @@

 Conversation format
 -------------------
-Each conversation is kept as a full message list (system + user + assistant turns)
-with only the *last* assistant turn stripped — that is the response the target model
-will generate. All prior assistant turns are preserved so the model has the full
-multi-turn context it needs to produce a coherent next response.
+Each conversation is stripped down to a skeleton of system + user turns only — all
+assistant turns are removed. The downstream generation pipeline (query.py) feeds this
+skeleton to the target model turn-by-turn, appending each generated response before
+sending the next user turn, so the model produces coherent multi-turn continuations.

 Augmentations are applied only to the *last* user message (the new prompt), not to
 earlier user turns that are already part of the established context.
@@ -152,6 +152,8 @@ def strip_assistant_turns(example: dict[str, Any], idx: int) -> dict[str, Any]:
     Rows with no user turns are returned empty and filtered out by the caller.
     """
     messages = [m for m in example["messages"] if m["role"] in ("system", "user")]
+    if not any(m["role"] == "user" for m in messages):
+        return {"messages": []}
     return {"messages": messages}
```
examples/dataset/make_nemotron_ptv2_dataset.py (1 addition, 1 deletion)

```diff
@@ -61,7 +61,7 @@

 from datasets import concatenate_datasets, load_dataset

-from _specdec_aug import (
+from conversation_utils import (
     has_tool_turns,
     load_augmentations,
     make_augment_fn,
```

examples/dataset/make_nemotron_ptv3_dataset.py (5 additions, 1 deletion)

```diff
@@ -85,7 +85,7 @@
 import yaml
 from datasets import concatenate_datasets, load_dataset

-from _specdec_aug import has_tool_turns, load_augmentations, make_augment_fn, normalize_messages, strip_assistant_turns
+from conversation_utils import has_tool_turns, load_augmentations, make_augment_fn, normalize_messages, strip_assistant_turns

 logging.basicConfig(
     level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s", datefmt="%H:%M:%S"
@@ -270,6 +270,10 @@ def main() -> None:
     if non_augmentable is not None:
         parts_to_combine.append(non_augmentable)

+    if not parts_to_combine:
+        logger.warning("No data to combine — all rows were filtered out. Exiting.")
+        return
+
     combined = concatenate_datasets(parts_to_combine)
     logger.info("Combined (pre-shuffle): %d rows", len(combined))
```
examples/speculative_decoding/README.md (9 additions, 9 deletions)

````diff
@@ -48,7 +48,7 @@ pip install -r requirements.txt
 We support a range of input datasets. In this example, we will use the [UltraChat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.

 ```bash
-python prepare_input_conversations/make_dataset.py -f prepare_input_conversations/example_data_config.yaml --full-conversations
+python ../dataset/make_dataset.py -f ../dataset/example_data_config.yaml --full-conversations
 ```

 See [other-datasets](#other-datasets) section for other dataset options and instruction for user-provided data.
@@ -203,7 +203,7 @@ See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/RE

 ### Other Datasets

-In addition to the default dataset, we support adding several other commonly used datasets in `prepare_input_conversations/make_dataset.py`:
+In addition to the default dataset, we support adding several other commonly used datasets in `../dataset/make_dataset.py`:

 - MTBench (for debugging)
 - ShareGPT
@@ -232,10 +232,10 @@ For large-scale training we provide dedicated scripts for NVIDIA's Nemotron Post

 ```bash
 # Synthetic data generation (~3.3M rows):
-python prepare_input_conversations/make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
+python ../dataset/make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen

 # Direct SFT training mix (~1.9M rows):
-python prepare_input_conversations/make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
+python ../dataset/make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
 ```

 Covers: `stem`, `chat`, `math`, `code` + 5 multilingual splits (ja/de/it/es/fr, capped at 100K each).
@@ -244,19 +244,19 @@ Covers: `stem`, `chat`, `math`, `code` + 5 multilingual splits (ja/de/it/es/fr,

 ```bash
 # Synthetic data generation (~3.4M rows):
-python prepare_input_conversations/make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen
+python ../dataset/make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

 # Direct SFT training mix (~3.9M rows, includes agentic/tool-use datasets):
-python prepare_input_conversations/make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train
+python ../dataset/make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train
 ```

-Covers: math, code, science, instruction-following, agentic/tool-use, safety, finance, and multilingual data. The dataset mix and per-split row caps are configurable via `prepare_input_conversations/nemotron_ptv3_datasets.yaml`.
+Covers: math, code, science, instruction-following, agentic/tool-use, safety, finance, and multilingual data. The dataset mix and per-split row caps are configurable via `../dataset/nemotron_ptv3_datasets.yaml`.

-**Augmentation** (generate mode only) is controlled by `prepare_input_conversations/augmentations.yaml`. By default it includes 12 language-redirect variants and several style/format hints. The `/no_think` system-prompt variant is disabled by default (enable it for models that support it, e.g. Qwen3):
+**Augmentation** (generate mode only) is controlled by `../dataset/augmentations.yaml`. By default it includes 12 language-redirect variants and several style/format hints. The `/no_think` system-prompt variant is disabled by default (enable it for models that support it, e.g. Qwen3):

 ```bash
 # Custom augmentation config:
-python prepare_input_conversations/make_nemotron_ptv2_dataset.py \
+python ../dataset/make_nemotron_ptv2_dataset.py \
     --augmentations-config my_augs.yaml --output-dir /tmp/ptv2_gen
 ```
````