
Commit a9af365

nabinchha and claude authored
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)

  Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: remove HopChain example from skip_when plan

* plan: replace HopChain example with generic product review example

* plan: add open questions on skip sentinel value and row filtering

* plan: major revision — SkipConfig model, sync engine support, decouple propagation

  - Introduce SkipConfig(when, value) as a nested model on SingleColumnConfig
  - Move propagate_skip to SingleColumnConfig as an independent field, fixing a bug where columns with no SkipConfig couldn't participate in propagation
  - Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths
  - Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async)
  - Simplify architecture diagrams for readability
  - Update all references, design decisions, verification plan

  Made-with: Cursor

* updates

* plan: document get_required_columns for skip propagation

  - Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan
  - Point async _run_cell at get_required_columns for parity with sync
  - Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording
  - Extend DAG/graph verification with gating_col regression case (Refs #479)

* plan: centralize __skipped__ handling in skip_provenance

  - Document new skip_provenance.py (key constant, read/write/strip API)
  - Point sync builder, async scheduler, and batch buffers at shared helpers
  - Strip metadata before every DataFrame from buffer dicts, including FULL_COLUMN active subsets
  - Split §3 into skip_evaluator vs skip_provenance; extend verification (Refs #479)

* plan: align doc title with SkipConfig / skip.when

  Drop legacy skip_when naming in headings and the #362 cross-reference. (Refs #479)

* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization

  - SkipConfig._validate_when_syntax now checks find_undeclared_variables is non-empty, rejecting expressions without {{ }} delimiters that would silently skip every row
  - evaluate_skip_when centralizes try/except so both sync and async engines get identical fail-safe behavior on eval errors
  - evaluate_skip_when takes a single pre-deserialized record; the caller runs deserialize_json_values once and passes the result to both skip eval and the generator (no double deserialization, no redundant parameter)
  - Update _should_skip_cell, async _run_cell, the Files Modified table, and the verification section accordingly (Refs #479)

* plan: add get_side_effect_columns accessor to execution graph spec

  Document the _side_effects_by_producer inverse map and get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both the Step 2b metadata section and the Files Modified table. The __skipped__ leak into active_df (greptile's other P1) was already fixed in 7046378 via strip_skip_metadata_from_records. (Refs #479)

* add skip.when conditional column generation

  Introduce SkipConfig on SingleColumnConfig to gate column generation with a Jinja2 expression. Columns can be skipped by expression or by upstream propagation (propagate_skip flag).

  - SkipConfig: Pydantic model with config-time syntax/delimiter/variable validation and cached column extraction from the Jinja2 AST
  - skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment with fail-safe error handling (skip on expected failures)
  - skip_provenance: centralized __skipped__ record tracking shared by the sync builder, async scheduler, and buffer managers
  - DAG/ExecutionGraph: skip.columns wired as dependency edges in both the topological sort and the static execution graph
  - Validation: validate_skip_references checks reference existence, sampler/seed scope, and allow_resize conflicts
  - Sync builder: cell-by-cell and full-column skip with merge-back
  - Async scheduler: cell and batch skip with live-buffer provenance

* fix review findings for skip.when implementation

  - Add skip evaluation to _fan_out_with_async (was missing, causing skipped rows to still be sent to the LLM)
  - Preserve __skipped__ provenance on non-skipped records after full-column generation so multi-hop propagation works
  - Use a single live-buffer reference in the _run_batch skip loop for consistency with _run_cell
  - Move the Template import to TYPE_CHECKING and reorder import blocks
  - Replace the O(n²) sum() with itertools.chain in dag.py
  - Add set_required_columns/set_propagate_skip/set_skip_config setters to ExecutionGraph for symmetry with the existing API

* add conditional generation with skip recipe and refactor skip helpers

  Add a new recipe demonstrating skip.when patterns (expression gate, propagation, opt-out) with a customer support ticket pipeline. Also extract _should_skip_record in async_scheduler, remove the redundant propagate_skip param from should_skip_by_propagation, and pass a precomputed all_side_effects set through the DAG sort.

* updates

* fixes

* remove recipe > inject conditional gen into existing tutorial

* regen colab notebooks

* fix: handle missing execution graph in _column_can_skip

  Return False when the graph has not been initialized instead of raising, since skip logic cannot apply before generators are set up.

* parametrize some tests

* public before private

* slight refactor for readability

* parametrize some tests

* minor fixes

* rename internal skip tracker key name

* clarify intent in comment

* when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it

* remove inline import

* minor refactor for clarity

* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch

  Two bugs in the sequential engine's _run_full_column_generator:

  1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False fallthrough), breaking propagate_skip for downstream columns when an independent FULL_COLUMN generator ran between skip-setting and propagating columns.
  2. _column_can_skip returned True for allow_resize=True columns via propagation, causing the skip-aware merge path to raise on the 1:1 row-count check for 1:N generators.

  - Add restore_skip_metadata helper to skip_tracker.py
  - Guard _column_can_skip against allow_resize=True columns
  - Refactor _run_full_column_generator into three focused methods
  - Remove dead allow_resize / _log_resize_if_changed from the skip path
  - Remove redundant _require_graph() calls in skip helpers
  - Add single_column_config_by_name cached property
  - Add integration tests for both bugs and unit tests for the helper

* address review comments on skip.when PR (#502)

  - Extract shared skip decision logic (_should_skip_cell / _should_skip_record) into should_skip_column_for_record() in skip_evaluator.py so both sync and async engines call the same function (andreatgretel review comment)
  - Extend SkipConfig self-reference validation to cover side-effect columns (e.g. review__trace on the review column) — previously only self.name was checked, now self.name | self.side_effect_columns
  - Add async engine integration tests for skip paths: cell-by-cell with propagation and full-column batch skip (exercises _run_cell / _run_batch)
  - Fix test_allow_resize_column_not_blocked_by_upstream_skip to use the default propagate_skip=True so it actually exercises the allow_resize guard
  - Move get_skipped_column_names from skip_tracker to skip_evaluator (its sole production consumer)

* address cr feedback

* fix issue with full-column generation messing up the order of skipped rows

* add skip conditional generation edge case tests

  - test_skip_evaluator: parametrized should_skip_column_for_record covering propagation, expression gates, short-circuiting, and disabled propagation
  - test_execution_graph: skip metadata accessors (get_skip_config, should_propagate_skip, get_required_columns, get_side_effect_columns, resolve_side_effect, skip.when DAG edges)
  - test_dataset_builder: chained transitive propagation (4 levels), two independent skip gates, custom skip.value, row count preservation

* fix: make expression jinja validator private

  Rename assert_expression_valid_jinja to _assert_expression_valid_jinja to match the private naming convention used by other model validators.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
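The delimiter validation described in the commits above (rejecting `skip.when` expressions without `{{ }}` delimiters, and extracting referenced columns to wire DAG edges) can be sketched in plain Python. This is an illustrative stand-in, not the project's Pydantic model: the real `SkipConfig` walks the Jinja2 AST via `find_undeclared_variables`, whereas the regex-based extraction here is an assumption made for the sketch.

```python
import re
from dataclasses import dataclass

_EXPR_RE = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)
_IDENT_RE = re.compile(r"\b[a-zA-Z_]\w*\b")
_JINJA_KEYWORDS = {"and", "or", "not", "in", "is", "if", "else",
                   "true", "false", "none", "True", "False", "None"}


@dataclass(frozen=True)
class SkipConfig:
    """Gate a column's generation on a per-row expression (sketch)."""

    when: str             # e.g. "{{ rating < 3 }}"
    value: object = None  # written to the cell when the row is skipped

    def __post_init__(self) -> None:
        # Reject expressions without {{ }} delimiters: a bare "rating < 3"
        # would render as literal (truthy) text and silently skip every row.
        if not self.columns:
            raise ValueError(
                f"skip.when must reference at least one column inside "
                f"{{{{ }}}} delimiters, got: {self.when!r}"
            )

    @property
    def columns(self) -> frozenset:
        # Identifiers referenced inside {{ ... }} become DAG edges so the
        # gating columns are generated first. (The real implementation
        # extracts these from the Jinja2 AST, not a regex.)
        idents = set()
        for body in _EXPR_RE.findall(self.when):
            idents.update(_IDENT_RE.findall(body))
        return frozenset(idents - _JINJA_KEYWORDS)
```

With this shape, `SkipConfig(when="{{ rating < 3 }}").columns` yields `frozenset({'rating'})`, while a delimiter-free expression fails at config time rather than silently skipping everything.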
1 parent f267e19 commit a9af365

31 files changed: +2605 −245 lines

architecture/dataset-builders.md — 22 additions & 2 deletions
@@ -40,10 +40,11 @@ Preparation (`_prepare_async_run`):

 ### Execution Graph

 `ExecutionGraph` (in `dataset_builders/utils/execution_graph.py`) models column dependencies:
-- Upstream/downstream sets derived from `required_columns` and side-effect columns
+- Upstream/downstream sets derived from `required_columns`, side-effect columns, and `skip.when` references
 - `GenerationStrategy` per column (CELL_BY_CELL or FULL_COLUMN)
 - Kahn topological sort for execution order
 - `split_upstream_by_strategy` — separates batch-level from cell-level dependencies
+- Skip metadata per column — `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` — queried at runtime by both engines to evaluate skip decisions

 ### CompletionTracker

@@ -53,9 +54,28 @@ Tracks per-row-group, per-column completion state:
 - **Frontier**: computes ready tasks when backed by `ExecutionGraph`
 - Handles dropped rows and downstream task enqueuing

+### Conditional Generation (Skip)
+
+Columns can be conditionally skipped per-row via `SkipConfig` (defined in `data_designer.config.base`). Two mechanisms control skipping:
+
+1. **Expression gate** — `skip=SkipConfig(when="{{ expr }}")` on a `SingleColumnConfig`. The Jinja2 expression is evaluated per-row; when truthy, the column is skipped for that row and the configured `value` (default `None`) is written instead of calling the generator.
+2. **Skip propagation** — when an upstream column was skipped, downstream columns auto-skip unless they set `propagate_skip=False`. Propagation checks `required_columns` against the row's `__internal_skipped_columns` set.
+
+Skip evaluation is handled by two utility modules:
+
+- **`skip_evaluator.py`** — `evaluate_skip_when` renders the expression in a `NativeSandboxedEnvironment` (native Python types, `StrictUndefined`). `should_skip_by_propagation` checks set intersection between required columns and skipped columns.
+- **`skip_tracker.py`** — manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet (called by `DatasetBatchManager`, `RowGroupBufferManager`, and inline in both engines).
+
+Both execution modes integrate skip at the same points:
+
+- **Sequential**: `_run_full_column_generator` and the fan-out methods (`_fan_out_with_threads`, `_fan_out_with_async`) call `_should_skip_cell` per record. Skipped rows are excluded from the generator input, then merged back with skip metadata preserved. A fast `_column_can_skip` check short-circuits the per-record evaluation when no skip config or propagation applies.
+- **Async**: `_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record` with the same logic. Skipped cells report as skipped (not success) in progress tracking.
+
+DAG edges are added for `skip.when` column references (both in `dag.py` and `ExecutionGraph.create`) so skip-gate columns are generated before the gated column.
+
 ### DAG (Config-Level)

-`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns` and side-effect columns, returns a topological ordering. Used by both execution modes for initial column ordering.
+`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns`, side-effect columns, and `skip.when` references, returns a topological ordering. Used by both execution modes for initial column ordering.

 ### DatasetBatchManager
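The expression gate and propagation mechanisms documented in this file reduce to a small per-record decision. The sketch below is stdlib-only and hypothetical in its exact signatures: the Jinja2 evaluation is stubbed by a plain callable, and only the `__internal_skipped_columns` key name comes from the commit itself.

```python
from typing import Any, Callable, Optional

SKIPPED_KEY = "__internal_skipped_columns"  # per-record skip provenance


def should_skip_by_propagation(required_columns: set, record: dict) -> bool:
    # A column auto-skips when any upstream column it requires was itself
    # skipped for this row (set intersection, as described above).
    return bool(required_columns & set(record.get(SKIPPED_KEY, ())))


def should_skip_column_for_record(
    record: dict,
    required_columns: set,
    propagate_skip: bool = True,
    when: Optional[Callable[[dict], bool]] = None,
) -> bool:
    # Propagation is checked first and short-circuits expression evaluation.
    if propagate_skip and should_skip_by_propagation(required_columns, record):
        return True
    return bool(when(record)) if when is not None else False


def apply_skip_to_record(record: dict, column: str, value: Any = None) -> None:
    # Write the configured skip value and record provenance so downstream
    # columns can propagate; side-effect clearing is omitted in this sketch.
    record[column] = value
    record.setdefault(SKIPPED_KEY, set()).add(column)
```

Stripping `SKIPPED_KEY` from each record before DataFrame construction (the `strip_skip_metadata_from_records` role) would then be a one-line dict comprehension per record.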

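The point above about `skip.when` references becoming graph edges can be illustrated with the stdlib's `graphlib` (the project itself uses NetworkX and a Kahn sort; the `order_columns` helper and column names here are made up for the sketch):

```python
from graphlib import TopologicalSorter


def order_columns(required: dict, skip_refs: dict) -> list:
    # Predecessors per column = data dependencies plus skip.when references,
    # so a gating column is always generated before the column it gates.
    ts = TopologicalSorter()
    for col in required.keys() | skip_refs.keys():
        ts.add(col, *(required.get(col, set()) | skip_refs.get(col, set())))
    return list(ts.static_order())


# "review" reads "rating" as data and its skip gate reads "is_spam";
# both must therefore be generated before "review".
order = order_columns(
    required={"rating": set(), "is_spam": set(), "review": {"rating"}},
    skip_refs={"review": {"is_spam"}},
)
```

Without the `skip_refs` edges, `is_spam` could legally sort after `review`, and the gate would evaluate against a missing value.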
docs/colab_notebooks/1-the-basics.ipynb — 33 additions & 33 deletions
@@ -2,15 +2,15 @@
  "cells": [
  {
  "cell_type": "markdown",
- "id": "527e4e8f",
+ "id": "9a82d43a",
  "metadata": {},
  "source": [
  "<a href=\"https://colab.research.google.com/github/NVIDIA-NeMo/DataDesigner/blob/main/docs/colab_notebooks/1-the-basics.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
  ]
  },
  {
  "cell_type": "markdown",
- "id": "e58e7a85",
+ "id": "4d30e1c7",
  "metadata": {},
  "source": [
  "# 🎨 Data Designer Tutorial: The Basics\n",
@@ -22,7 +22,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "199610f6",
+ "id": "f2a53a3c",
  "metadata": {},
  "source": [
  "### 📦 Import Data Designer\n",
@@ -34,7 +34,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "9b738a91",
+ "id": "c442284e",
  "metadata": {},
  "source": [
  "### ⚡ Colab Setup\n",
@@ -45,7 +45,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "af2bf6f9",
+ "id": "dac0f01a",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -56,7 +56,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "0640f8fd",
+ "id": "5da0d2a0",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -74,7 +74,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "e84edbc3",
+ "id": "d20ed2cb",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -84,7 +84,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "32d99706",
+ "id": "fdb97cba",
  "metadata": {},
  "source": [
  "### ⚙️ Initialize the Data Designer interface\n",
@@ -97,7 +97,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "e998f3ae",
+ "id": "fc8f76cf",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -106,7 +106,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "6afad397",
+ "id": "67bc996a",
  "metadata": {},
  "source": [
  "### 🎛️ Define model configurations\n",
@@ -123,7 +123,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "14ab53fd",
+ "id": "70d5cddd",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -153,7 +153,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "a73fef78",
+ "id": "e8e02b51",
  "metadata": {},
  "source": [
  "### 🏗️ Initialize the Data Designer Config Builder\n",
@@ -168,7 +168,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5367d8c8",
+ "id": "2fdd1312",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -177,7 +177,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "c9bec24b",
+ "id": "4056cfd3",
  "metadata": {},
  "source": [
  "## 🎲 Getting started with sampler columns\n",
@@ -194,7 +194,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "b5551231",
+ "id": "85dd5c70",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -203,7 +203,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "9076cef7",
+ "id": "541b2033",
  "metadata": {},
  "source": [
  "Let's start designing our product review dataset by adding product category and subcategory columns.\n"
@@ -212,7 +212,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5fdddb75",
+ "id": "356b361e",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -293,7 +293,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "282a37f8",
+ "id": "287cf22a",
  "metadata": {},
  "source": [
  "Next, let's add samplers to generate data related to the customer and their review.\n"
@@ -302,7 +302,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5d018574",
+ "id": "282d9074",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -339,7 +339,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "5d4dee03",
+ "id": "102c1634",
  "metadata": {},
  "source": [
  "## 🦜 LLM-generated columns\n",
@@ -354,7 +354,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "f6e79f81",
+ "id": "20dc1332",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -390,7 +390,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "f562005f",
+ "id": "a983bf8a",
  "metadata": {},
  "source": [
  "### 🔁 Iteration is key – preview the dataset!\n",
@@ -407,7 +407,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "70d761cd",
+ "id": "6b019b6e",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -417,7 +417,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "6b1c75a5",
+ "id": "82ab36be",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -428,7 +428,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "77d0530c",
+ "id": "a75256be",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -438,7 +438,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "9c22fe3a",
+ "id": "d6d92058",
  "metadata": {},
  "source": [
  "### 📊 Analyze the generated data\n",
@@ -451,7 +451,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "8619efbb",
+ "id": "63b19fbc",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -461,7 +461,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "7d538cfd",
+ "id": "66073eb7",
  "metadata": {},
  "source": [
  "### 🆙 Scale up!\n",
@@ -474,7 +474,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "af3702b8",
+ "id": "9270d1fc",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -484,7 +484,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "c862c183",
+ "id": "77b7dec0",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -497,7 +497,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5228e949",
+ "id": "831eed73",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -509,14 +509,14 @@
  },
  {
  "cell_type": "markdown",
- "id": "1d434fd7",
+ "id": "54a22faf",
  "metadata": {},
  "source": [
  "## ⏭️ Next Steps\n",
  "\n",
  "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n",
  "\n",
- "- [Structured outputs and jinja expressions](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n",
+ "- [Structured outputs, jinja expressions, and conditional generation](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n",
  "\n",
  "- [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n",
  "\n",
