
Commit a9af365

nabinchha and claude authored
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)

  Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: remove HopChain example from skip_when plan

* plan: replace HopChain example with generic product review example

* plan: add open questions on skip sentinel value and row filtering

* plan: major revision — SkipConfig model, sync engine support, decouple propagation

  - Introduce SkipConfig(when, value) as a nested model on SingleColumnConfig
  - Move propagate_skip to SingleColumnConfig as an independent field, fixing a bug where columns with no SkipConfig couldn't participate in propagation
  - Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths
  - Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async)
  - Simplify architecture diagrams for readability
  - Update all references, design decisions, verification plan

  Made-with: Cursor

* updates

* plan: document get_required_columns for skip propagation

  - Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan
  - Point async _run_cell at get_required_columns for parity with sync
  - Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording
  - Extend DAG/graph verification with gating_col regression case (Refs #479)

* plan: centralize __skipped__ handling in skip_provenance

  - Document new skip_provenance.py (key constant, read/write/strip API)
  - Point sync builder, async scheduler, and batch buffers at shared helpers
  - Strip metadata before every DataFrame from buffer dicts, including FULL_COLUMN active subsets
  - Split §3 into skip_evaluator vs skip_provenance; extend verification (Refs #479)

* plan: align doc title with SkipConfig / skip.when

  Drop legacy skip_when naming in headings and the #362 cross-reference. (Refs #479)

* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization

  - SkipConfig._validate_when_syntax now checks find_undeclared_variables is non-empty, rejecting expressions without {{ }} delimiters that would silently skip every row
  - evaluate_skip_when centralizes try/except so both sync and async engines get identical fail-safe behavior on eval errors
  - evaluate_skip_when takes a single pre-deserialized record; the caller runs deserialize_json_values once and passes the result to both skip eval and the generator (no double deserialization, no redundant parameter)
  - Update _should_skip_cell, async _run_cell, the Files Modified table, and the verification section accordingly (Refs #479)

* plan: add get_side_effect_columns accessor to execution graph spec

  Document the _side_effects_by_producer inverse map and get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both the Step 2b metadata section and the Files Modified table. The __skipped__ leak into active_df (greptile's other P1) was already fixed in 7046378 via strip_skip_metadata_from_records. (Refs #479)

* add skip.when conditional column generation

  Introduce SkipConfig on SingleColumnConfig to gate column generation with a Jinja2 expression. Columns can be skipped by expression or by upstream propagation (propagate_skip flag).

  - SkipConfig: Pydantic model with config-time syntax/delimiter/variable validation and cached column extraction from the Jinja2 AST
  - skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment with fail-safe error handling (skip on expected failures)
  - skip_provenance: centralized __skipped__ record tracking shared by the sync builder, async scheduler, and buffer managers
  - DAG/ExecutionGraph: skip.columns wired as dependency edges in both the topological sort and the static execution graph
  - Validation: validate_skip_references checks reference existence, sampler/seed scope, and allow_resize conflicts
  - Sync builder: cell-by-cell and full-column skip with merge-back
  - Async scheduler: cell and batch skip with live-buffer provenance

* fix review findings for skip.when implementation

  - Add skip evaluation to _fan_out_with_async (was missing, causing skipped rows to still be sent to the LLM)
  - Preserve __skipped__ provenance on non-skipped records after full-column generation so multi-hop propagation works
  - Use a single live-buffer reference in the _run_batch skip loop for consistency with _run_cell
  - Move the Template import to TYPE_CHECKING and reorder import blocks
  - Replace the O(n²) sum() with itertools.chain in dag.py
  - Add set_required_columns/set_propagate_skip/set_skip_config setters to ExecutionGraph for symmetry with the existing API

* add conditional generation with skip recipe and refactor skip helpers

  Add a new recipe demonstrating skip.when patterns (expression gate, propagation, opt-out) with a customer support ticket pipeline. Also extract _should_skip_record in async_scheduler, remove the redundant propagate_skip param from should_skip_by_propagation, and pass a precomputed all_side_effects set through the DAG sort.

* updates

* fixes

* remove recipe > inject conditional gen into existing tutorial

* regen colab notebooks

* fix: handle missing execution graph in _column_can_skip

  Return False when the graph has not been initialized instead of raising, since skip logic cannot apply before generators are set up.

* parametrize some tests

* public before private

* slight refactor for readability

* parametrize some tests

* minor fixes

* rename internal skip tracker key name

* clarify intent in comment

* when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it

* remove inline import

* minor refactor for clarity

* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch

  Two bugs in the sequential engine's _run_full_column_generator:

  1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False fallthrough), breaking propagate_skip for downstream columns when an independent FULL_COLUMN generator ran between skip-setting and propagating columns.
  2. _column_can_skip returned True for allow_resize=True columns via propagation, causing the skip-aware merge path to raise on the 1:1 row-count check for 1:N generators.

  - Add restore_skip_metadata helper to skip_tracker.py
  - Guard _column_can_skip against allow_resize=True columns
  - Refactor _run_full_column_generator into three focused methods
  - Remove dead allow_resize / _log_resize_if_changed from the skip path
  - Remove redundant _require_graph() calls in skip helpers
  - Add single_column_config_by_name cached property
  - Add integration tests for both bugs and unit tests for the helper

* address review comments on skip.when PR (#502)

  - Extract shared skip decision logic (_should_skip_cell / _should_skip_record) into should_skip_column_for_record() in skip_evaluator.py so both sync and async engines call the same function (andreatgretel review comment)
  - Extend SkipConfig self-reference validation to cover side-effect columns (e.g. review__trace on the review column) — previously only self.name was checked, now self.name | self.side_effect_columns
  - Add async engine integration tests for skip paths: cell-by-cell with propagation and full-column batch skip (exercises _run_cell / _run_batch)
  - Fix test_allow_resize_column_not_blocked_by_upstream_skip to use the default propagate_skip=True so it actually exercises the allow_resize guard
  - Move get_skipped_column_names from skip_tracker to skip_evaluator (its sole production consumer)

* address cr feedback

* fix issue with full-column generation messing up the order of skipped rows

* add skip conditional generation edge case tests

  - test_skip_evaluator: parametrized should_skip_column_for_record covering propagation, expression gates, short-circuiting, and disabled propagation
  - test_execution_graph: skip metadata accessors (get_skip_config, should_propagate_skip, get_required_columns, get_side_effect_columns, resolve_side_effect, skip.when DAG edges)
  - test_dataset_builder: chained transitive propagation (4 levels), two independent skip gates, custom skip.value, row count preservation

* fix: make expression jinja validator private

  Rename assert_expression_valid_jinja to _assert_expression_valid_jinja to match the private naming convention used by other model validators.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
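The delimiter validation described in the commits above (rejecting `skip.when` expressions without `{{ }}` delimiters, and extracting referenced columns to wire DAG edges) can be sketched in plain Python. This is an illustrative stand-in, not the project's Pydantic model: the real `SkipConfig` walks the Jinja2 AST via `find_undeclared_variables`, whereas the regex-based extraction here is an assumption made for the sketch.

```python
import re
from dataclasses import dataclass

_EXPR_RE = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)
_IDENT_RE = re.compile(r"\b[a-zA-Z_]\w*\b")
_JINJA_KEYWORDS = {"and", "or", "not", "in", "is", "if", "else",
                   "true", "false", "none", "True", "False", "None"}


@dataclass(frozen=True)
class SkipConfig:
    """Gate a column's generation on a per-row expression (sketch)."""

    when: str             # e.g. "{{ rating < 3 }}"
    value: object = None  # written to the cell when the row is skipped

    def __post_init__(self) -> None:
        # Reject expressions without {{ }} delimiters: a bare "rating < 3"
        # would render as literal (truthy) text and silently skip every row.
        if not self.columns:
            raise ValueError(
                f"skip.when must reference at least one column inside "
                f"{{{{ }}}} delimiters, got: {self.when!r}"
            )

    @property
    def columns(self) -> frozenset:
        # Identifiers referenced inside {{ ... }} become DAG edges so the
        # gating columns are generated first. (The real implementation
        # extracts these from the Jinja2 AST, not a regex.)
        idents = set()
        for body in _EXPR_RE.findall(self.when):
            idents.update(_IDENT_RE.findall(body))
        return frozenset(idents - _JINJA_KEYWORDS)
```

With this shape, `SkipConfig(when="{{ rating < 3 }}").columns` yields `frozenset({'rating'})`, while a delimiter-free expression fails at config time rather than silently skipping everything.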
1 parent f267e19 commit a9af365

31 files changed: +2605 −245 lines

architecture/dataset-builders.md — 22 additions & 2 deletions
@@ -40,10 +40,11 @@ Preparation (`_prepare_async_run`):

 ### Execution Graph

 `ExecutionGraph` (in `dataset_builders/utils/execution_graph.py`) models column dependencies:
-- Upstream/downstream sets derived from `required_columns` and side-effect columns
+- Upstream/downstream sets derived from `required_columns`, side-effect columns, and `skip.when` references
 - `GenerationStrategy` per column (CELL_BY_CELL or FULL_COLUMN)
 - Kahn topological sort for execution order
 - `split_upstream_by_strategy` — separates batch-level from cell-level dependencies
+- Skip metadata per column — `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` — queried at runtime by both engines to evaluate skip decisions

 ### CompletionTracker

@@ -53,9 +54,28 @@ Tracks per-row-group, per-column completion state:
 - **Frontier**: computes ready tasks when backed by `ExecutionGraph`
 - Handles dropped rows and downstream task enqueuing

+### Conditional Generation (Skip)
+
+Columns can be conditionally skipped per-row via `SkipConfig` (defined in `data_designer.config.base`). Two mechanisms control skipping:
+
+1. **Expression gate** — `skip=SkipConfig(when="{{ expr }}")` on a `SingleColumnConfig`. The Jinja2 expression is evaluated per-row; when truthy, the column is skipped for that row and the configured `value` (default `None`) is written instead of calling the generator.
+2. **Skip propagation** — when an upstream column was skipped, downstream columns auto-skip unless they set `propagate_skip=False`. Propagation checks `required_columns` against the row's `__internal_skipped_columns` set.
+
+Skip evaluation is handled by two utility modules:
+
+- **`skip_evaluator.py`** — `evaluate_skip_when` renders the expression in a `NativeSandboxedEnvironment` (native Python types, `StrictUndefined`). `should_skip_by_propagation` checks set intersection between required columns and skipped columns.
+- **`skip_tracker.py`** — manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet (called by `DatasetBatchManager`, `RowGroupBufferManager`, and inline in both engines).
+
+Both execution modes integrate skip at the same points:
+
+- **Sequential**: `_run_full_column_generator` and the fan-out methods (`_fan_out_with_threads`, `_fan_out_with_async`) call `_should_skip_cell` per record. Skipped rows are excluded from the generator input, then merged back with skip metadata preserved. A fast `_column_can_skip` check short-circuits the per-record evaluation when no skip config or propagation applies.
+- **Async**: `_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record` with the same logic. Skipped cells report as skipped (not success) in progress tracking.
+
+DAG edges are added for `skip.when` column references (both in `dag.py` and `ExecutionGraph.create`) so skip-gate columns are generated before the gated column.
+
 ### DAG (Config-Level)

-`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns` and side-effect columns, returns a topological ordering. Used by both execution modes for initial column ordering.
+`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns`, side-effect columns, and `skip.when` references, returns a topological ordering. Used by both execution modes for initial column ordering.

 ### DatasetBatchManager
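The expression gate and propagation mechanisms documented in this file reduce to a small per-record decision. The sketch below is stdlib-only and hypothetical in its exact signatures: the Jinja2 evaluation is stubbed by a plain callable, and only the `__internal_skipped_columns` key name comes from the commit itself.

```python
from typing import Any, Callable, Optional

SKIPPED_KEY = "__internal_skipped_columns"  # per-record skip provenance


def should_skip_by_propagation(required_columns: set, record: dict) -> bool:
    # A column auto-skips when any upstream column it requires was itself
    # skipped for this row (set intersection, as described above).
    return bool(required_columns & set(record.get(SKIPPED_KEY, ())))


def should_skip_column_for_record(
    record: dict,
    required_columns: set,
    propagate_skip: bool = True,
    when: Optional[Callable[[dict], bool]] = None,
) -> bool:
    # Propagation is checked first and short-circuits expression evaluation.
    if propagate_skip and should_skip_by_propagation(required_columns, record):
        return True
    return bool(when(record)) if when is not None else False


def apply_skip_to_record(record: dict, column: str, value: Any = None) -> None:
    # Write the configured skip value and record provenance so downstream
    # columns can propagate; side-effect clearing is omitted in this sketch.
    record[column] = value
    record.setdefault(SKIPPED_KEY, set()).add(column)
```

Stripping `SKIPPED_KEY` from each record before DataFrame construction (the `strip_skip_metadata_from_records` role) would then be a one-line dict comprehension per record.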

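The point above about `skip.when` references becoming graph edges can be illustrated with the stdlib's `graphlib` (the project itself uses NetworkX and a Kahn sort; the `order_columns` helper and column names here are made up for the sketch):

```python
from graphlib import TopologicalSorter


def order_columns(required: dict, skip_refs: dict) -> list:
    # Predecessors per column = data dependencies plus skip.when references,
    # so a gating column is always generated before the column it gates.
    ts = TopologicalSorter()
    for col in required.keys() | skip_refs.keys():
        ts.add(col, *(required.get(col, set()) | skip_refs.get(col, set())))
    return list(ts.static_order())


# "review" reads "rating" as data and its skip gate reads "is_spam";
# both must therefore be generated before "review".
order = order_columns(
    required={"rating": set(), "is_spam": set(), "review": {"rating"}},
    skip_refs={"review": {"is_spam"}},
)
```

Without the `skip_refs` edges, `is_spam` could legally sort after `review`, and the gate would evaluate against a missing value.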
docs/colab_notebooks/1-the-basics.ipynb — 33 additions & 33 deletions
@@ -2,15 +2,15 @@
  "cells": [
  {
  "cell_type": "markdown",
- "id": "527e4e8f",
+ "id": "9a82d43a",
  "metadata": {},
  "source": [
  "<a href=\"https://colab.research.google.com/github/NVIDIA-NeMo/DataDesigner/blob/main/docs/colab_notebooks/1-the-basics.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
  ]
  },
  {
  "cell_type": "markdown",
- "id": "e58e7a85",
+ "id": "4d30e1c7",
  "metadata": {},
  "source": [
  "# 🎨 Data Designer Tutorial: The Basics\n",
@@ -22,7 +22,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "199610f6",
+ "id": "f2a53a3c",
  "metadata": {},
  "source": [
  "### 📦 Import Data Designer\n",
@@ -34,7 +34,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "9b738a91",
+ "id": "c442284e",
  "metadata": {},
  "source": [
  "### ⚡ Colab Setup\n",
@@ -45,7 +45,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "af2bf6f9",
+ "id": "dac0f01a",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -56,7 +56,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "0640f8fd",
+ "id": "5da0d2a0",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -74,7 +74,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "e84edbc3",
+ "id": "d20ed2cb",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -84,7 +84,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "32d99706",
+ "id": "fdb97cba",
  "metadata": {},
  "source": [
  "### ⚙️ Initialize the Data Designer interface\n",
@@ -97,7 +97,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "e998f3ae",
+ "id": "fc8f76cf",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -106,7 +106,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "6afad397",
+ "id": "67bc996a",
  "metadata": {},
  "source": [
  "### 🎛️ Define model configurations\n",
@@ -123,7 +123,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "14ab53fd",
+ "id": "70d5cddd",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -153,7 +153,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "a73fef78",
+ "id": "e8e02b51",
  "metadata": {},
  "source": [
  "### 🏗️ Initialize the Data Designer Config Builder\n",
@@ -168,7 +168,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5367d8c8",
+ "id": "2fdd1312",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -177,7 +177,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "c9bec24b",
+ "id": "4056cfd3",
  "metadata": {},
  "source": [
  "## 🎲 Getting started with sampler columns\n",
@@ -194,7 +194,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "b5551231",
+ "id": "85dd5c70",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -203,7 +203,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "9076cef7",
+ "id": "541b2033",
  "metadata": {},
  "source": [
  "Let's start designing our product review dataset by adding product category and subcategory columns.\n"
@@ -212,7 +212,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5fdddb75",
+ "id": "356b361e",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -293,7 +293,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "282a37f8",
+ "id": "287cf22a",
  "metadata": {},
  "source": [
  "Next, let's add samplers to generate data related to the customer and their review.\n"
@@ -302,7 +302,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5d018574",
+ "id": "282d9074",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -339,7 +339,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "5d4dee03",
+ "id": "102c1634",
  "metadata": {},
  "source": [
  "## 🦜 LLM-generated columns\n",
@@ -354,7 +354,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "f6e79f81",
+ "id": "20dc1332",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -390,7 +390,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "f562005f",
+ "id": "a983bf8a",
  "metadata": {},
  "source": [
  "### 🔁 Iteration is key – preview the dataset!\n",
@@ -407,7 +407,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "70d761cd",
+ "id": "6b019b6e",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -417,7 +417,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "6b1c75a5",
+ "id": "82ab36be",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -428,7 +428,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "77d0530c",
+ "id": "a75256be",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -438,7 +438,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "9c22fe3a",
+ "id": "d6d92058",
  "metadata": {},
  "source": [
  "### 📊 Analyze the generated data\n",
@@ -451,7 +451,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "8619efbb",
+ "id": "63b19fbc",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -461,7 +461,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "7d538cfd",
+ "id": "66073eb7",
  "metadata": {},
  "source": [
  "### 🆙 Scale up!\n",
@@ -474,7 +474,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "af3702b8",
+ "id": "9270d1fc",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -484,7 +484,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "c862c183",
+ "id": "77b7dec0",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -497,7 +497,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "5228e949",
+ "id": "831eed73",
  "metadata": {},
  "outputs": [],
  "source": [
@@ -509,14 +509,14 @@
  },
  {
  "cell_type": "markdown",
- "id": "1d434fd7",
+ "id": "54a22faf",
  "metadata": {},
  "source": [
  "## ⏭️ Next Steps\n",
  "\n",
  "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n",
  "\n",
- "- [Structured outputs and jinja expressions](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n",
+ "- [Structured outputs, jinja expressions, and conditional generation](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n",
  "\n",
  "- [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n",
  "\n",
