docs: add skip column config option for conditional column generation (#479)#480
Conversation
Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile Summary

This PR adds a detailed implementation plan for `skip_when` conditional column generation.
| Filename | Overview |
|---|---|
| plans/479/skip-when-conditional-generation.md | Implementation plan for skip_when conditional generation; mostly sound after prior revisions, but one P1 correctness bug in the FULL_COLUMN merge-back silently erases upstream skipped provenance |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Column A (CELL_BY_CELL)\nskipped for row 3"] -->|"record.__skipped__ = {'A'}"| B
    B["Column B (FULL_COLUMN)\nnot skipped for row 3"] --> C["active_records built\n(row 3 included)"]
    C --> D["strip_skip_metadata_from_records\nrow 3 loses __skipped__"]
    D --> E["generator.generate(active_df)"]
    E --> F["result_df.to_dict()\nno __skipped__ key"]
    F --> G["merged.append(next(result_iter))\n❌ __skipped__ LOST for row 3"]
    G --> H["Column C\npropagate_skip=True, required_columns=['A']"]
    H --> I["get_skipped_column_names = set()\npropagation check: False"]
    I --> J["❌ Column C generates for row 3\nshould have been auto-skipped"]
```
Greptile review comment on `plans/479/skip-when-conditional-generation.md`, lines 486-494:
**`__skipped__` provenance erased for non-skipped rows after FULL_COLUMN generator**
In the merge-back loop, non-skipped rows are replaced wholesale with records from `result_df.to_dict(orient="records")`. Those result records were derived from `active_df`, which was built via `strip_skip_metadata_from_records(active_records)` — so they carry no `__skipped__` key. Any upstream skip provenance (e.g., column A was skipped for row 3) that existed on the original buffer record is silently dropped.
Concrete failure: column A is CELL_BY_CELL-skipped for row 3 → `record.__skipped__ = {'column_A'}`. Column B is a FULL_COLUMN generator that does **not** skip row 3. After the merge-back, row 3's record is replaced with `next(result_iter)`, which has no `__skipped__`. Column C runs next with `propagate_skip=True` and `required_columns=['column_A']`; `get_skipped_column_names(record)` returns `set()`, so propagation does not fire — C generates for row 3 even though it should have been auto-skipped.
Fix: carry the existing `__skipped__` set forward into each merged result record:
```python
for i, record in enumerate(batch):
if i in skip_indices:
merged.append(record)
else:
result_record = next(result_iter)
if SKIPPED_COLUMNS_RECORD_KEY in record:
result_record[SKIPPED_COLUMNS_RECORD_KEY] = record[SKIPPED_COLUMNS_RECORD_KEY]
merged.append(result_record)
```
The same fix is needed in the async `_run_batch()` merge-back (Step 5c).
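The proposed fix can be exercised standalone. A minimal sketch (records and column names invented for illustration; `SKIPPED_COLUMNS_RECORD_KEY` taken from the plan):

```python
# Sketch of the merge-back fix: carry `__skipped__` forward for rows that
# were regenerated by a FULL_COLUMN generator. Data here is invented.
SKIPPED_COLUMNS_RECORD_KEY = "__skipped__"

# Index 1 carries upstream skip provenance from column A.
batch = [
    {"column_A": "x"},
    {"column_A": None, SKIPPED_COLUMNS_RECORD_KEY: {"column_A"}},
]
skip_indices = set()  # column B skipped no rows

# Result records from the generator carry no __skipped__ key, because
# active_df was built from stripped records.
result_iter = iter([
    {"column_A": "x", "column_B": 1},
    {"column_A": None, "column_B": 2},
])

merged = []
for i, record in enumerate(batch):
    if i in skip_indices:
        merged.append(record)
    else:
        result_record = next(result_iter)
        if SKIPPED_COLUMNS_RECORD_KEY in record:
            result_record[SKIPPED_COLUMNS_RECORD_KEY] = record[SKIPPED_COLUMNS_RECORD_KEY]
        merged.append(result_record)

print(merged[1][SKIPPED_COLUMNS_RECORD_KEY])  # provenance survives the merge-back
```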
Reviews (6): Last reviewed commit: "plan: add get_side_effect_columns access..."
worth noting the overlap with #362 (keep failed-parse fields as null). Both features need a "cell became unavailable, propagate through the DAG" mechanism.
johnnygreco
left a comment
looks great! only question to sort out is the skip value
…e propagation

- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan

Made-with: Cursor
- Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case

Refs #479
Made-with: Cursor
```python
    self, column_name: str, record: dict
) -> bool:
    skipped_cols: set[str] = record.get("__skipped__", set())
```
_should_skip_cell() calls self._graph.get_required_columns(column_name) but the Files Modified table only adds get_skip_config() and should_propagate_skip() to ExecutionGraph - I don't think this method exists yet. get_upstream_columns() wouldn't work as a stand-in since after Step 2b it includes skip.columns edges too, and propagation should only consider template dependencies. Probably needs a _required_columns dict + accessor on the graph.
```python
def _validate_when_syntax(cls, v: str) -> str:
    from jinja2.sandbox import ImmutableSandboxedEnvironment

    ImmutableSandboxedEnvironment().parse(v)
    return v
```
env.parse("in_stock == 0") (no {{ }}) is valid Jinja2 - it's just literal text. columns would come back empty, no DAG edges get created, and the runtime renders the truthy string "in_stock == 0", silently skipping every row. This feels like the most likely user mistake. Worth checking that find_undeclared_variables() is non-empty or at least requiring {{ to be present.
Good catch — addressed in c5d2dbc. _validate_when_syntax now checks find_undeclared_variables(ast) is non-empty and raises ValueError with a hint about {{ }} delimiters if the expression references no columns.
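The failure mode and the guard can be reproduced directly with Jinja2 (a standalone sketch; `SkipConfig` itself is not needed):

```python
from jinja2.meta import find_undeclared_variables
from jinja2.sandbox import ImmutableSandboxedEnvironment

env = ImmutableSandboxedEnvironment()

# Missing {{ }}: parses fine, but the whole expression is literal text,
# so no column names are discovered -- the non-empty check rejects it.
ast = env.parse("in_stock == 0")
assert find_undeclared_variables(ast) == set()

# With delimiters, the referenced column is visible to the DAG builder.
ast = env.parse("{{ in_stock == 0 }}")
assert find_undeclared_variables(ast) == {"in_stock"}

# The rendered value shows why the bare form is dangerous: the template
# renders to its own source text, a non-empty (truthy) string, so every
# row would be skipped.
assert env.from_string("in_stock == 0").render(in_stock=5) == "in_stock == 0"
```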
- `record.setdefault("__skipped__", set()).add(task.column)` — records skip provenance.
- `record[task.column] = skip_config.value if skip_config else None` — the **column key must be present** in the record dict, not absent. Downstream `skip.when` expressions and Jinja2 templates may reference skipped columns (e.g., `{{ col is none }}`); an absent key would cause `UndefinedError`. Propagation-only skips (no `SkipConfig`) use `None`.
- `record[se_col] = None` for each side-effect column — side-effect columns (`__trace`, `__reasoning_content`, etc.) are always written as `None`, regardless of the parent column's `skip.value`. A skipped cell has no trace or reasoning.
- Return `None`.
this gives async a clear policy for eval errors (log + skip), but the sync path in Steps 4a-4c would let the exception bubble up and crash the column. Same typo in skip.when would silently skip rows in async but abort in sync. Probably worth moving the try/except into evaluate_skip_when() itself so both engines get the same behavior.
Agreed — addressed in c5d2dbc. The try/except is now inside evaluate_skip_when() itself (Step 3a), so both sync and async callers get identical fail-safe behavior. Removed the per-engine error handling from Step 5b.
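A self-contained sketch of what a centralized `evaluate_skip_when()` might look like after this change. The thread describes the error policy as "log + skip", so an evaluation error here logs and skips the cell; the exact signature, the logging text, and the truthiness handling of the rendered string are assumptions, not the final plan text:

```python
import logging
from functools import lru_cache

from jinja2 import Template
from jinja2.sandbox import ImmutableSandboxedEnvironment

logger = logging.getLogger(__name__)
_env = ImmutableSandboxedEnvironment()


@lru_cache(maxsize=64)
def _compile_skip_template(expression: str) -> Template:
    return _env.from_string(expression)


def evaluate_skip_when(expression: str, record: dict) -> bool:
    """Evaluate a skip.when expression against an already-deserialized
    record. The try/except lives here so sync and async engines get
    identical fail-safe behavior on eval errors."""
    try:
        rendered = _compile_skip_template(expression).render(**record)
    except Exception:
        logger.warning("skip.when %r failed to evaluate; skipping cell", expression)
        return True  # assumption: follow the thread's "log + skip" policy
    # Jinja2 renders to a string; map common falsy renderings to False.
    return rendered.strip().lower() not in ("", "false", "none", "0")
```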
```python
@lru_cache(maxsize=64)
def _compile_skip_template(expression: str) -> Template:
    return _env.from_string(expression)
```
nit: evaluate_skip_when calls deserialize_json_values(record) per row, and the generator does it again when it runs. For records with a lot of nested JSON that could add up. Worth deserializing once at the dispatch layer and passing the result to both.
Addressed in c5d2dbc. Went a step further than the optional kwarg — evaluate_skip_when now takes a single record that the caller must have already deserialized. The dispatch layer (_fan_out_with_threads, _run_cell) runs deserialize_json_values once and passes the result to both skip eval and the generator. No redundant parameter, no double work.
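The caller-owns-deserialization contract, sketched below. The helper name `deserialize_json_values` comes from the thread, but this implementation is an assumption and the real helper's semantics may differ:

```python
import json


def deserialize_json_values(record: dict) -> dict:
    """Parse JSON-encoded string values once, at the dispatch layer.
    (Sketch -- strings that are not valid JSON pass through unchanged.)"""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            try:
                out[key] = json.loads(value)
            except (ValueError, TypeError):
                out[key] = value
        else:
            out[key] = value
    return out


# Dispatch layer: deserialize once, pass the same dict to both skip
# evaluation and the generator -- no double work per row.
raw = {"scores": "[1, 2, 3]", "name": "ada"}
record = deserialize_json_values(raw)
# evaluate_skip_when(skip_config.when, record)  # skip eval sees parsed values
# generator call receives the same `record`
print(record["scores"])  # [1, 2, 3]
```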
```python
        return evaluate_skip_when(skip_config.when, record)
    return False
```
#### 4b. Helper: `_write_skip_to_record()`

When a cell is skipped, write provenance and the skip value into the record in-place. The skip value comes from `SkipConfig.value` if the column has one, otherwise `None` (propagation-only skips always use `None`):

```python
def _write_skip_to_record(
    self, column_name: str, record: dict
) -> None:
    skip_config = self._graph.get_skip_config(column_name)
    skip_value = skip_config.value if skip_config is not None else None
```
get_side_effect_columns() not defined — runtime AttributeError
_write_skip_to_record() calls self._graph.get_side_effect_columns(column_name) (line 366), but this method does not exist on ExecutionGraph. The existing graph only has set_side_effect() (maps a side-effect column to its producer) and resolve_side_effect() (reverse lookup) — neither iterates side-effect columns for a given producer.
The Files Modified section for execution_graph.py lists get_skip_config(), should_propagate_skip(), and (from the resolved Q6 thread) get_required_columns() as new accessors, but omits get_side_effect_columns(). Without it, every call to _write_skip_to_record() — in both sync and async engines — will raise AttributeError at runtime.
The implementation will need an inverse map and a new accessor, e.g.:

```python
# In ExecutionGraph.__init__:
self._side_effects_by_producer: dict[str, list[str]] = {}

# In set_side_effect():
self._side_effects_by_producer.setdefault(producer, []).append(side_effect_col)

# New accessor:
def get_side_effect_columns(self, column: str) -> list[str]:
    return list(self._side_effects_by_producer.get(column, []))
```

This accessor should also be added to the Files Modified table.
Good catch — addressed in 75943d5. Added _side_effects_by_producer: dict[str, list[str]] (inverse map populated inside set_side_effect()) and get_side_effect_columns(column) -> list[str] accessor to both the Step 2b metadata section and the Files Modified table for execution_graph.py.
```python
if self._should_skip_cell(column_name, record):
    self._write_skip_to_record(column_name, record)
    self.batch_manager.update_record(i, record)
    skip_indices.add(i)

# Build DataFrame excluding skipped rows
batch = self.batch_manager.get_current_batch(as_dataframe=False)
active_records = [r for i, r in enumerate(batch) if i not in skip_indices]
```
__skipped__ key leaks into active_df passed to FULL_COLUMN generators
In Step 4d, `active_records` comes from `get_current_batch(as_dataframe=False)`, which returns the raw buffer list — including `__skipped__` keys (Python `set` objects) that previous columns have written. The DataFrame is then built directly:

```python
active_df = lazy.pd.DataFrame(active_records)
result_df = generator.generate(active_df)
```

The serialization boundary in Step 8 strips `__skipped__` only in `get_current_batch(as_dataframe=True)` and `write()`, not in this ad-hoc `pd.DataFrame()` call. Generators therefore receive an unexpected column containing `set` objects, which can trigger pandas dtype-inference errors or silently expose the internal sentinel to plugin generators.

Fix by stripping `__skipped__` when building `active_df`:

```python
active_df = lazy.pd.DataFrame(
    [{k: v for k, v in r.items() if k != "__skipped__"} for r in active_records]
)
```

The same strip is needed in Step 5c (`_run_batch`) when constructing the filtered `batch_df` for `generator.agenerate()`.
This was already addressed in 7046378. Step 4d builds active_df via strip_skip_metadata_from_records(active_records) (line 478-479), and Step 5c / the usage contract in §3b require the same for the async path. The inline {k: v ...} pattern you suggested is intentionally avoided — all stripping goes through skip_provenance.py helpers.
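A sketch of the `skip_provenance.py` stripping helper referenced here. The module, helper, and constant names come from the thread; the implementation below is an assumption:

```python
SKIPPED_COLUMNS_RECORD_KEY = "__skipped__"


def strip_skip_metadata_from_records(records: list[dict]) -> list[dict]:
    """Return copies of the records without the internal __skipped__
    sentinel, so DataFrames handed to generators never see it."""
    return [
        {k: v for k, v in r.items() if k != SKIPPED_COLUMNS_RECORD_KEY}
        for r in records
    ]


records = [
    {"a": 1, SKIPPED_COLUMNS_RECORD_KEY: {"a"}},
    {"a": 2},
]
clean = strip_skip_metadata_from_records(records)
print(clean)  # [{'a': 1}, {'a': 2}]
# The originals keep their provenance for the merge-back step.
assert SKIPPED_COLUMNS_RECORD_KEY in records[0]
```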
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification

Refs #479
Made-with: Cursor
…ng, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
Document _side_effects_by_producer inverse map and get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both Step 2b metadata section and Files Modified table.

The __skipped__ leak into active_df (greptile's other P1) was already fixed in 7046378 via strip_skip_metadata_from_records.

Refs #479
Made-with: Cursor
johnnygreco
left a comment
love this one @nabinchha !!
📋 Summary

Implementation plan for `skip_when` — a new field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy for a row, the cell is set to `None` and the generator is skipped. Skips auto-propagate through the DAG to downstream columns.

Closes #479

🔄 Changes

✨ Added

`plans/479/skip-when-conditional-generation.md` — detailed implementation plan covering config, DAG, sync engine, async engine, validation, and test strategy

Plan Highlights

- `skip_when` field + Jinja2 validator + `skip_when_columns` property on `SingleColumnConfig`
- `skip_when`-referenced columns become edges ensuring correct execution order
- Columns whose `required_columns` include a skipped cell auto-skip
- Skip value (`None` vs dedicated type) and auto-removal of skipped rows

🤖 Generated with AI