Open
44 commits
847375c
plan: add skip_when for conditional column generation (#479)
nabinchha Mar 30, 2026
3f415c7
plan: remove HopChain example from skip_when plan
nabinchha Mar 30, 2026
6ee4c70
plan: replace HopChain example with generic product review example
nabinchha Mar 30, 2026
14ab39b
plan: add open questions on skip sentinel value and row filtering
nabinchha Mar 30, 2026
351bc29
plan: major revision — SkipConfig model, sync engine support, decoupl…
nabinchha Apr 1, 2026
b481e73
Merge branch 'main' into nmulepati/feat/479-skip-when-plan
nabinchha Apr 6, 2026
2219cd6
updates
nabinchha Apr 6, 2026
9790f95
plan: document get_required_columns for skip propagation
nabinchha Apr 6, 2026
7046378
plan: centralize __skipped__ handling in skip_provenance
nabinchha Apr 6, 2026
a157a7f
plan: align doc title with SkipConfig / skip.when
nabinchha Apr 6, 2026
c5d2dbc
plan: address review — delimiter validation, centralized error handli…
nabinchha Apr 6, 2026
75943d5
plan: add get_side_effect_columns accessor to execution graph spec
nabinchha Apr 6, 2026
d9e3f2d
Merge branch 'nmulepati/feat/479-skip-when-plan' into nmulepati/feat/…
nabinchha Apr 6, 2026
f94a847
add skip.when conditional column generation
nabinchha Apr 6, 2026
8ceb76e
Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-impl…
nabinchha Apr 6, 2026
df3dd67
fix review findings for skip.when implementation
nabinchha Apr 6, 2026
d399065
add conditional generation with skip recipe and refactor skip helpers
nabinchha Apr 6, 2026
2c6cb3d
updates
nabinchha Apr 7, 2026
875b4bc
fixes
nabinchha Apr 7, 2026
c1aa139
remove recipe > inject conditional gen into existing tutorial
nabinchha Apr 7, 2026
569fb16
regen colab notebooks
nabinchha Apr 7, 2026
8fceabe
fix: handle missing execution graph in _column_can_skip
nabinchha Apr 7, 2026
cbe76a4
Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-impl…
nabinchha Apr 7, 2026
634901e
parametrize some tests
nabinchha Apr 7, 2026
4948fe2
public before private
nabinchha Apr 7, 2026
4d04842
slight refactor for readability
nabinchha Apr 7, 2026
a5b905e
parametrize some tests
nabinchha Apr 7, 2026
30632fa
minor fixes
nabinchha Apr 7, 2026
3bb51b1
rename internal skip tracker key name
nabinchha Apr 7, 2026
989f946
clarify intent in comment
nabinchha Apr 7, 2026
3bf9187
when skipped _run_cell should return skipped value even though the co…
nabinchha Apr 7, 2026
e98ec9b
remove inline import
nabinchha Apr 7, 2026
bd5e2bf
minor refactor for clarity
nabinchha Apr 7, 2026
1affce9
fix: preserve skip metadata across replace_buffer and exclude allow_r…
nabinchha Apr 8, 2026
7a12b20
Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-impl…
nabinchha Apr 8, 2026
4e7ee2d
Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-impl…
nabinchha Apr 9, 2026
bf6332e
address review comments on skip.when PR (#502)
nabinchha Apr 10, 2026
9fbf971
Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-impl…
nabinchha Apr 10, 2026
4f4c461
address cr feedback
nabinchha Apr 10, 2026
8931b45
Fix issue with full column generating messing up order of skipped rows
nabinchha Apr 10, 2026
4bc63b8
add skip conditional generation edge case tests
nabinchha Apr 13, 2026
d33a692
Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-impl…
nabinchha Apr 13, 2026
296aa62
Merge remote-tracking branch 'origin/main' into nmulepati/feat/479-sk…
nabinchha Apr 14, 2026
efbd5fe
fix: make expression jinja validator private
nabinchha Apr 14, 2026
24 changes: 22 additions & 2 deletions architecture/dataset-builders.md
@@ -40,10 +40,11 @@ Preparation (`_prepare_async_run`):
### Execution Graph

`ExecutionGraph` (in `dataset_builders/utils/execution_graph.py`) models column dependencies:
- Upstream/downstream sets derived from `required_columns` and side-effect columns
- Upstream/downstream sets derived from `required_columns`, side-effect columns, and `skip.when` references
- `GenerationStrategy` per column (CELL_BY_CELL or FULL_COLUMN)
- Kahn topological sort for execution order
- `split_upstream_by_strategy` — separates batch-level from cell-level dependencies
- Skip metadata per column — `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` — queried at runtime by both engines to evaluate skip decisions
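
The ordering idea above can be sketched in a few lines. This is a minimal illustration of Kahn's algorithm, not the `ExecutionGraph` implementation: the function `kahn_order`, the column names, and the dict-of-sets input shape are all assumptions made for the example. It shows why a `skip.when` reference has to count as an upstream edge even when the gated column does not otherwise require that column.

```python
from collections import deque


def kahn_order(upstream: dict[str, set[str]]) -> list[str]:
    """Return columns so that every upstream column comes before its dependents."""
    # Number of unresolved upstream dependencies per column.
    indegree = {col: len(deps) for col, deps in upstream.items()}
    # Inverted mapping: finishing a column can unlock its downstream columns.
    downstream: dict[str, set[str]] = {col: set() for col in upstream}
    for col, deps in upstream.items():
        for dep in deps:
            downstream[dep].add(col)

    ready = deque(sorted(col for col, n in indegree.items() if n == 0))
    order: list[str] = []
    while ready:
        col = ready.popleft()
        order.append(col)
        for child in sorted(downstream[col]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(upstream):
        raise ValueError("cycle detected in column dependencies")
    return order


# Hypothetical columns: "summary" requires "review", and its skip gate reads "is_spam".
required = {"review": set(), "is_spam": set(), "summary": {"review"}}
skip_when_refs = {"summary": {"is_spam"}}

# Upstream set = required columns plus columns referenced by the skip expression.
upstream = {col: required[col] | skip_when_refs.get(col, set()) for col in required}
order = kahn_order(upstream)
```

Without the `skip_when_refs` edge, `is_spam` and `summary` would be unordered relative to each other, and the gate could be evaluated before the value it reads exists.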

### CompletionTracker

@@ -53,9 +54,28 @@ Tracks per-row-group, per-column completion state:
- **Frontier**: computes ready tasks when backed by `ExecutionGraph`
- Handles dropped rows and downstream task enqueuing

### Conditional Generation (Skip)

Columns can be conditionally skipped per-row via `SkipConfig` (defined in `data_designer.config.base`). Two mechanisms control skipping:

1. **Expression gate** — `skip=SkipConfig(when="{{ expr }}")` on a `SingleColumnConfig`. The Jinja2 expression is evaluated per-row; when truthy, the column is skipped for that row and the configured `value` (default `None`) is written instead of calling the generator.
2. **Skip propagation** — when an upstream column was skipped, downstream columns auto-skip unless they set `propagate_skip=False`. Propagation checks `required_columns` against the row's `__internal_skipped_columns` set.
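
A rough sketch of how the two mechanisms could combine into one per-row decision. Everything here is a stand-in: `ToySkipConfig` and `ToyColumn` are hypothetical classes, `when` is a plain Python predicate rather than a Jinja2 expression, and the order in which the two checks run is an assumption rather than something the document specifies.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

SKIPPED_KEY = "__internal_skipped_columns"


@dataclass
class ToySkipConfig:
    # Stand-in for SkipConfig: a Python predicate replaces the Jinja2 string.
    when: Callable[[dict], bool]
    value: Any = None


@dataclass
class ToyColumn:
    name: str
    required_columns: set[str] = field(default_factory=set)
    skip: Optional[ToySkipConfig] = None
    propagate_skip: bool = True


def should_skip(column: ToyColumn, record: dict) -> bool:
    skipped = record.get(SKIPPED_KEY, set())
    # Propagation: any required column already skipped for this row.
    if column.propagate_skip and column.required_columns & skipped:
        return True
    # Expression gate: the column's own condition is truthy for this row.
    return column.skip is not None and bool(column.skip.when(record))


review = ToyColumn("review", {"rating"}, skip=ToySkipConfig(when=lambda r: r["rating"] < 3))
summary = ToyColumn("summary", {"review"})
audit = ToyColumn("audit", {"review"}, propagate_skip=False)

row = {"rating": 1}
gate_hit = should_skip(review, row)       # the gate fires on the low rating
row["review"] = review.skip.value         # write the configured skip value
row[SKIPPED_KEY] = {"review"}             # record the skip for downstream checks
summary_skipped = should_skip(summary, row)
audit_skipped = should_skip(audit, row)
```

In this toy run `summary` inherits the skip from `review`, while `audit` opts out and would still be generated.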

Skip evaluation is handled by two utility modules:

- **`skip_evaluator.py`** — `evaluate_skip_when` renders the expression in a `NativeSandboxedEnvironment` (native Python types, `StrictUndefined`). `should_skip_by_propagation` checks set intersection between required columns and skipped columns.
- **`skip_tracker.py`** — manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet (called by `DatasetBatchManager`, `RowGroupBufferManager`, and inline in both engines).
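
The bookkeeping described for `skip_tracker.py` might look roughly like the following. The function names, signatures, and the side-effect column name are assumptions for illustration, and "clearing" a side-effect column is assumed to mean writing `None`; the real helpers may differ.

```python
from typing import Any, Iterable

SKIPPED_KEY = "__internal_skipped_columns"


def apply_skip(
    record: dict,
    column: str,
    value: Any = None,
    side_effect_columns: Iterable[str] = (),
) -> None:
    """Mark `column` as skipped on this record and write the skip value."""
    record.setdefault(SKIPPED_KEY, set()).add(column)
    record[column] = value
    for name in side_effect_columns:
        record[name] = None  # assumption: clearing means writing None


def strip_skip_metadata(records: list[dict]) -> list[dict]:
    """Return copies of the records without the internal tracking key."""
    return [{k: v for k, v in rec.items() if k != SKIPPED_KEY} for rec in records]


records = [{"rating": 1}, {"rating": 5, "review": "Great"}]
# "review__trace" is a made-up side-effect column name.
apply_skip(records[0], "review", side_effect_columns=["review__trace"])
clean = strip_skip_metadata(records)
```

Stripping returns new dicts here so the in-flight records keep their metadata; whether the real helper copies or mutates is not stated in the document.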

Both execution modes integrate skip at the same points:

- **Sequential**: `_run_full_column_generator` and the fan-out methods (`_fan_out_with_threads`, `_fan_out_with_async`) call `_should_skip_cell` per record. Skipped rows are excluded from the generator input, then merged back with skip metadata preserved. A fast `_column_can_skip` check short-circuits the per-record evaluation when no skip config or propagation applies.
- **Async**: `_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record` with the same logic. Skipped cells report as skipped (not success) in progress tracking.
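
The "exclude, generate, merge back" step for full-column generators can be sketched as below. This is an illustration under assumptions: `run_full_column` is a hypothetical name, and the generator is modeled as a callable that takes a list of records and returns one value per record. The point is that remembering the original indices keeps row order stable after the merge.

```python
from typing import Any, Callable

SKIPPED_KEY = "__internal_skipped_columns"


def run_full_column(
    records: list[dict],
    column: str,
    generate: Callable[[list[dict]], list[Any]],
    should_skip: Callable[[dict], bool],
    skip_value: Any = None,
) -> list[dict]:
    active_idx: list[int] = []
    for i, rec in enumerate(records):
        if should_skip(rec):
            # Skipped rows get the skip value and never reach the generator.
            rec.setdefault(SKIPPED_KEY, set()).add(column)
            rec[column] = skip_value
        else:
            active_idx.append(i)

    if active_idx:
        outputs = generate([records[i] for i in active_idx])
        if len(outputs) != len(active_idx):
            raise ValueError("generator returned an unexpected number of rows")
        # Merge by original index so skipped rows keep their positions.
        for i, value in zip(active_idx, outputs):
            records[i][column] = value
    return records


batch_sizes: list[int] = []


def fake_generator(batch: list[dict]) -> list[str]:
    batch_sizes.append(len(batch))
    return [f"review for rating {rec['rating']}" for rec in batch]


rows = [{"rating": 5}, {"rating": 1}, {"rating": 4}]
rows = run_full_column(rows, "review", fake_generator, lambda r: r["rating"] < 3)
```

The middle row is skipped, the generator sees only two records, and the three rows come back in their original order.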

DAG edges are added for `skip.when` column references (both in `dag.py` and `ExecutionGraph.create`) so skip-gate columns are generated before the gated column.

### DAG (Config-Level)

`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns` and side-effect columns, returns a topological ordering. Used by both execution modes for initial column ordering.
`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns`, side-effect columns, and `skip.when` references, returns a topological ordering. Used by both execution modes for initial column ordering.

### DatasetBatchManager

66 changes: 33 additions & 33 deletions docs/colab_notebooks/1-the-basics.ipynb
@@ -2,15 +2,15 @@
"cells": [
{
"cell_type": "markdown",
"id": "527e4e8f",
"id": "9a82d43a",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/NVIDIA-NeMo/DataDesigner/blob/main/docs/colab_notebooks/1-the-basics.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "e58e7a85",
"id": "4d30e1c7",
"metadata": {},
"source": [
"# 🎨 Data Designer Tutorial: The Basics\n",
@@ -22,7 +22,7 @@
},
{
"cell_type": "markdown",
"id": "199610f6",
"id": "f2a53a3c",
"metadata": {},
"source": [
"### 📦 Import Data Designer\n",
@@ -34,7 +34,7 @@
},
{
"cell_type": "markdown",
"id": "9b738a91",
"id": "c442284e",
"metadata": {},
"source": [
"### ⚡ Colab Setup\n",
@@ -45,7 +45,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "af2bf6f9",
"id": "dac0f01a",
"metadata": {},
"outputs": [],
"source": [
@@ -56,7 +56,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "0640f8fd",
"id": "5da0d2a0",
"metadata": {},
"outputs": [],
"source": [
@@ -74,7 +74,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "e84edbc3",
"id": "d20ed2cb",
"metadata": {},
"outputs": [],
"source": [
@@ -84,7 +84,7 @@
},
{
"cell_type": "markdown",
"id": "32d99706",
"id": "fdb97cba",
"metadata": {},
"source": [
"### ⚙️ Initialize the Data Designer interface\n",
@@ -97,7 +97,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "e998f3ae",
"id": "fc8f76cf",
"metadata": {},
"outputs": [],
"source": [
@@ -106,7 +106,7 @@
},
{
"cell_type": "markdown",
"id": "6afad397",
"id": "67bc996a",
"metadata": {},
"source": [
"### 🎛️ Define model configurations\n",
@@ -123,7 +123,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "14ab53fd",
"id": "70d5cddd",
"metadata": {},
"outputs": [],
"source": [
@@ -153,7 +153,7 @@
},
{
"cell_type": "markdown",
"id": "a73fef78",
"id": "e8e02b51",
"metadata": {},
"source": [
"### 🏗️ Initialize the Data Designer Config Builder\n",
@@ -168,7 +168,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5367d8c8",
"id": "2fdd1312",
"metadata": {},
"outputs": [],
"source": [
@@ -177,7 +177,7 @@
},
{
"cell_type": "markdown",
"id": "c9bec24b",
"id": "4056cfd3",
"metadata": {},
"source": [
"## 🎲 Getting started with sampler columns\n",
@@ -194,7 +194,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "b5551231",
"id": "85dd5c70",
"metadata": {},
"outputs": [],
"source": [
@@ -203,7 +203,7 @@
},
{
"cell_type": "markdown",
"id": "9076cef7",
"id": "541b2033",
"metadata": {},
"source": [
"Let's start designing our product review dataset by adding product category and subcategory columns.\n"
@@ -212,7 +212,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5fdddb75",
"id": "356b361e",
"metadata": {},
"outputs": [],
"source": [
@@ -293,7 +293,7 @@
},
{
"cell_type": "markdown",
"id": "282a37f8",
"id": "287cf22a",
"metadata": {},
"source": [
"Next, let's add samplers to generate data related to the customer and their review.\n"
@@ -302,7 +302,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5d018574",
"id": "282d9074",
"metadata": {},
"outputs": [],
"source": [
@@ -339,7 +339,7 @@
},
{
"cell_type": "markdown",
"id": "5d4dee03",
"id": "102c1634",
"metadata": {},
"source": [
"## 🦜 LLM-generated columns\n",
@@ -354,7 +354,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "f6e79f81",
"id": "20dc1332",
"metadata": {},
"outputs": [],
"source": [
@@ -390,7 +390,7 @@
},
{
"cell_type": "markdown",
"id": "f562005f",
"id": "a983bf8a",
"metadata": {},
"source": [
"### 🔁 Iteration is key – preview the dataset!\n",
@@ -407,7 +407,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "70d761cd",
"id": "6b019b6e",
"metadata": {},
"outputs": [],
"source": [
@@ -417,7 +417,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "6b1c75a5",
"id": "82ab36be",
"metadata": {},
"outputs": [],
"source": [
@@ -428,7 +428,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "77d0530c",
"id": "a75256be",
"metadata": {},
"outputs": [],
"source": [
@@ -438,7 +438,7 @@
},
{
"cell_type": "markdown",
"id": "9c22fe3a",
"id": "d6d92058",
"metadata": {},
"source": [
"### 📊 Analyze the generated data\n",
@@ -451,7 +451,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "8619efbb",
"id": "63b19fbc",
"metadata": {},
"outputs": [],
"source": [
@@ -461,7 +461,7 @@
},
{
"cell_type": "markdown",
"id": "7d538cfd",
"id": "66073eb7",
"metadata": {},
"source": [
"### 🆙 Scale up!\n",
@@ -474,7 +474,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "af3702b8",
"id": "9270d1fc",
"metadata": {},
"outputs": [],
"source": [
@@ -484,7 +484,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "c862c183",
"id": "77b7dec0",
"metadata": {},
"outputs": [],
"source": [
@@ -497,7 +497,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5228e949",
"id": "831eed73",
"metadata": {},
"outputs": [],
"source": [
@@ -509,14 +509,14 @@
},
{
"cell_type": "markdown",
"id": "1d434fd7",
"id": "54a22faf",
"metadata": {},
"source": [
"## ⏭️ Next Steps\n",
"\n",
"Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n",
"\n",
"- [Structured outputs and jinja expressions](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n",
"- [Structured outputs, jinja expressions, and conditional generation](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n",
"\n",
"- [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n",
"\n",