docs: add skip column config option for conditional column generation (#479)#480
Conversation
Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile Summary

This PR adds a detailed implementation plan for `skip_when` conditional column generation.
| Filename | Overview |
|---|---|
| plans/479/skip-when-conditional-generation.md | Implementation plan for skip_when conditional generation; mostly sound after prior revisions, but one P1 correctness bug in the FULL_COLUMN merge-back silently erases upstream skipped provenance |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Column A (CELL_BY_CELL)\nskipped for row 3"] -->|"record.__skipped__ = {'A'}"| B
    B["Column B (FULL_COLUMN)\nnot skipped for row 3"] --> C["active_records built\n(row 3 included)"]
    C --> D["strip_skip_metadata_from_records\nrow 3 loses __skipped__"]
    D --> E["generator.generate(active_df)"]
    E --> F["result_df.to_dict()\nno __skipped__ key"]
    F --> G["merged.append(next(result_iter))\n❌ __skipped__ LOST for row 3"]
    G --> H["Column C\npropagate_skip=True, required_columns=['A']"]
    H --> I["get_skipped_column_names = set()\npropagation check: False"]
    I --> J["❌ Column C generates for row 3\nshould have been auto-skipped"]
```
Greptile review comment on `plans/479/skip-when-conditional-generation.md`, lines 486-494:
**`__skipped__` provenance erased for non-skipped rows after FULL_COLUMN generator**
In the merge-back loop, non-skipped rows are replaced wholesale with records from `result_df.to_dict(orient="records")`. Those result records were derived from `active_df`, which was built via `strip_skip_metadata_from_records(active_records)` — so they carry no `__skipped__` key. Any upstream skip provenance (e.g., column A was skipped for row 3) that existed on the original buffer record is silently dropped.
Concrete failure: column A is CELL_BY_CELL-skipped for row 3 → `record.__skipped__ = {'column_A'}`. Column B is a FULL_COLUMN generator that does **not** skip row 3. After the merge-back, row 3's record is replaced with `next(result_iter)`, which has no `__skipped__`. Column C runs next with `propagate_skip=True` and `required_columns=['column_A']`; `get_skipped_column_names(record)` returns `set()`, so propagation does not fire — C generates for row 3 even though it should have been auto-skipped.
Fix: carry the existing `__skipped__` set forward into each merged result record:
```python
for i, record in enumerate(batch):
if i in skip_indices:
merged.append(record)
else:
result_record = next(result_iter)
if SKIPPED_COLUMNS_RECORD_KEY in record:
result_record[SKIPPED_COLUMNS_RECORD_KEY] = record[SKIPPED_COLUMNS_RECORD_KEY]
merged.append(result_record)
```
The same fix is needed in the async `_run_batch()` merge-back (Step 5c).
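The proposed fix can be exercised standalone. A minimal sketch (records and column names invented for illustration; `SKIPPED_COLUMNS_RECORD_KEY` taken from the plan):

```python
# Sketch of the merge-back fix: carry `__skipped__` forward for rows that
# were regenerated by a FULL_COLUMN generator. Data here is invented.
SKIPPED_COLUMNS_RECORD_KEY = "__skipped__"

# Index 1 carries upstream skip provenance from column A.
batch = [
    {"column_A": "x"},
    {"column_A": None, SKIPPED_COLUMNS_RECORD_KEY: {"column_A"}},
]
skip_indices = set()  # column B skipped no rows

# Result records from the generator carry no __skipped__ key, because
# active_df was built from stripped records.
result_iter = iter([
    {"column_A": "x", "column_B": 1},
    {"column_A": None, "column_B": 2},
])

merged = []
for i, record in enumerate(batch):
    if i in skip_indices:
        merged.append(record)
    else:
        result_record = next(result_iter)
        if SKIPPED_COLUMNS_RECORD_KEY in record:
            result_record[SKIPPED_COLUMNS_RECORD_KEY] = record[SKIPPED_COLUMNS_RECORD_KEY]
        merged.append(result_record)

print(merged[1][SKIPPED_COLUMNS_RECORD_KEY])  # provenance survives the merge-back
```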
Reviews (6): Last reviewed commit: "plan: add get_side_effect_columns access..."
worth noting the overlap with #362 (keep failed-parse fields as null). Both features need a "cell became unavailable, propagate through the DAG" mechanism.
johnnygreco
left a comment
looks great! only question to sort out is the skip value
…e propagation

- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan

Made-with: Cursor
- Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case

Refs #479
Made-with: Cursor
```python
    self, column_name: str, record: dict
) -> bool:
    skipped_cols: set[str] = record.get("__skipped__", set())
```
_should_skip_cell() calls self._graph.get_required_columns(column_name) but the Files Modified table only adds get_skip_config() and should_propagate_skip() to ExecutionGraph - I don't think this method exists yet. get_upstream_columns() wouldn't work as a stand-in since after Step 2b it includes skip.columns edges too, and propagation should only consider template dependencies. Probably needs a _required_columns dict + accessor on the graph.
```python
def _validate_when_syntax(cls, v: str) -> str:
    from jinja2.sandbox import ImmutableSandboxedEnvironment

    ImmutableSandboxedEnvironment().parse(v)
    return v
```
env.parse("in_stock == 0") (no {{ }}) is valid Jinja2 - it's just literal text. columns would come back empty, no DAG edges get created, and the runtime renders the truthy string "in_stock == 0", silently skipping every row. This feels like the most likely user mistake. Worth checking that find_undeclared_variables() is non-empty or at least requiring {{ to be present.
Good catch — addressed in c5d2dbc. _validate_when_syntax now checks find_undeclared_variables(ast) is non-empty and raises ValueError with a hint about {{ }} delimiters if the expression references no columns.
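The failure mode and the guard can be reproduced directly with Jinja2 (a standalone sketch; `SkipConfig` itself is not needed):

```python
from jinja2.meta import find_undeclared_variables
from jinja2.sandbox import ImmutableSandboxedEnvironment

env = ImmutableSandboxedEnvironment()

# Missing {{ }}: parses fine, but the whole expression is literal text,
# so no column names are discovered -- the non-empty check rejects it.
ast = env.parse("in_stock == 0")
assert find_undeclared_variables(ast) == set()

# With delimiters, the referenced column is visible to the DAG builder.
ast = env.parse("{{ in_stock == 0 }}")
assert find_undeclared_variables(ast) == {"in_stock"}

# The rendered value shows why the bare form is dangerous: the template
# renders to its own source text, a non-empty (truthy) string, so every
# row would be skipped.
assert env.from_string("in_stock == 0").render(in_stock=5) == "in_stock == 0"
```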
- `record.setdefault("__skipped__", set()).add(task.column)` — records skip provenance.
- `record[task.column] = skip_config.value if skip_config else None` — the **column key must be present** in the record dict, not absent. Downstream `skip.when` expressions and Jinja2 templates may reference skipped columns (e.g., `{{ col is none }}`); an absent key would cause `UndefinedError`. Propagation-only skips (no `SkipConfig`) use `None`.
- `record[se_col] = None` for each side-effect column — side-effect columns (`__trace`, `__reasoning_content`, etc.) are always written as `None`, regardless of the parent column's `skip.value`. A skipped cell has no trace or reasoning.
- Return `None`.
this gives async a clear policy for eval errors (log + skip), but the sync path in Steps 4a-4c would let the exception bubble up and crash the column. Same typo in skip.when would silently skip rows in async but abort in sync. Probably worth moving the try/except into evaluate_skip_when() itself so both engines get the same behavior.
Agreed — addressed in c5d2dbc. The try/except is now inside evaluate_skip_when() itself (Step 3a), so both sync and async callers get identical fail-safe behavior. Removed the per-engine error handling from Step 5b.
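A self-contained sketch of what a centralized `evaluate_skip_when()` might look like after this change. The thread describes the error policy as "log + skip", so an evaluation error here logs and skips the cell; the exact signature, the logging text, and the truthiness handling of the rendered string are assumptions, not the final plan text:

```python
import logging
from functools import lru_cache

from jinja2 import Template
from jinja2.sandbox import ImmutableSandboxedEnvironment

logger = logging.getLogger(__name__)
_env = ImmutableSandboxedEnvironment()


@lru_cache(maxsize=64)
def _compile_skip_template(expression: str) -> Template:
    return _env.from_string(expression)


def evaluate_skip_when(expression: str, record: dict) -> bool:
    """Evaluate a skip.when expression against an already-deserialized
    record. The try/except lives here so sync and async engines get
    identical fail-safe behavior on eval errors."""
    try:
        rendered = _compile_skip_template(expression).render(**record)
    except Exception:
        logger.warning("skip.when %r failed to evaluate; skipping cell", expression)
        return True  # assumption: follow the thread's "log + skip" policy
    # Jinja2 renders to a string; map common falsy renderings to False.
    return rendered.strip().lower() not in ("", "false", "none", "0")
```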
```python
@lru_cache(maxsize=64)
def _compile_skip_template(expression: str) -> Template:
    return _env.from_string(expression)
```
nit: evaluate_skip_when calls deserialize_json_values(record) per row, and the generator does it again when it runs. For records with a lot of nested JSON that could add up. Worth deserializing once at the dispatch layer and passing the result to both.
Addressed in c5d2dbc. Went a step further than the optional kwarg — evaluate_skip_when now takes a single record that the caller must have already deserialized. The dispatch layer (_fan_out_with_threads, _run_cell) runs deserialize_json_values once and passes the result to both skip eval and the generator. No redundant parameter, no double work.
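The caller-owns-deserialization contract, sketched below. The helper name `deserialize_json_values` comes from the thread, but this implementation is an assumption and the real helper's semantics may differ:

```python
import json


def deserialize_json_values(record: dict) -> dict:
    """Parse JSON-encoded string values once, at the dispatch layer.
    (Sketch -- strings that are not valid JSON pass through unchanged.)"""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            try:
                out[key] = json.loads(value)
            except (ValueError, TypeError):
                out[key] = value
        else:
            out[key] = value
    return out


# Dispatch layer: deserialize once, pass the same dict to both skip
# evaluation and the generator -- no double work per row.
raw = {"scores": "[1, 2, 3]", "name": "ada"}
record = deserialize_json_values(raw)
# evaluate_skip_when(skip_config.when, record)  # skip eval sees parsed values
# generator call receives the same `record`
print(record["scores"])  # [1, 2, 3]
```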
```python
        return evaluate_skip_when(skip_config.when, record)
    return False
```
#### 4b. Helper: `_write_skip_to_record()`

When a cell is skipped, write provenance and the skip value into the record in-place. The skip value comes from `SkipConfig.value` if the column has one, otherwise `None` (propagation-only skips always use `None`):

```python
def _write_skip_to_record(
    self, column_name: str, record: dict
) -> None:
    skip_config = self._graph.get_skip_config(column_name)
    skip_value = skip_config.value if skip_config is not None else None
```
get_side_effect_columns() not defined — runtime AttributeError
_write_skip_to_record() calls self._graph.get_side_effect_columns(column_name) (line 366), but this method does not exist on ExecutionGraph. The existing graph only has set_side_effect() (maps a side-effect column to its producer) and resolve_side_effect() (reverse lookup) — neither iterates side-effect columns for a given producer.
The Files Modified section for execution_graph.py lists get_skip_config(), should_propagate_skip(), and (from the resolved Q6 thread) get_required_columns() as new accessors, but omits get_side_effect_columns(). Without it, every call to _write_skip_to_record() — in both sync and async engines — will raise AttributeError at runtime.
The implementation will need an inverse map and a new accessor, e.g.:

```python
# In ExecutionGraph.__init__:
self._side_effects_by_producer: dict[str, list[str]] = {}

# In set_side_effect():
self._side_effects_by_producer.setdefault(producer, []).append(side_effect_col)

# New accessor:
def get_side_effect_columns(self, column: str) -> list[str]:
    return list(self._side_effects_by_producer.get(column, []))
```

This accessor should also be added to the Files Modified table.
Good catch — addressed in 75943d5. Added _side_effects_by_producer: dict[str, list[str]] (inverse map populated inside set_side_effect()) and get_side_effect_columns(column) -> list[str] accessor to both the Step 2b metadata section and the Files Modified table for execution_graph.py.
```python
if self._should_skip_cell(column_name, record):
    self._write_skip_to_record(column_name, record)
    self.batch_manager.update_record(i, record)
    skip_indices.add(i)

# Build DataFrame excluding skipped rows
batch = self.batch_manager.get_current_batch(as_dataframe=False)
active_records = [r for i, r in enumerate(batch) if i not in skip_indices]
```
__skipped__ key leaks into active_df passed to FULL_COLUMN generators
In Step 4d, `active_records` comes from `get_current_batch(as_dataframe=False)`, which returns the raw buffer list — including `__skipped__` keys (Python `set` objects) that previous columns have written. The DataFrame is then built directly:

```python
active_df = lazy.pd.DataFrame(active_records)
result_df = generator.generate(active_df)
```

The serialization boundary in Step 8 strips `__skipped__` only in `get_current_batch(as_dataframe=True)` and `write()`, not in this ad-hoc `pd.DataFrame()` call. Generators therefore receive an unexpected column containing `set` objects, which can trigger pandas dtype-inference errors or silently expose the internal sentinel to plugin generators.

Fix by stripping `__skipped__` when building `active_df`:

```python
active_df = lazy.pd.DataFrame(
    [{k: v for k, v in r.items() if k != "__skipped__"} for r in active_records]
)
```

The same strip is needed in Step 5c (`_run_batch`) when constructing the filtered `batch_df` for `generator.agenerate()`.
This was already addressed in 7046378. Step 4d builds active_df via strip_skip_metadata_from_records(active_records) (line 478-479), and Step 5c / the usage contract in §3b require the same for the async path. The inline {k: v ...} pattern you suggested is intentionally avoided — all stripping goes through skip_provenance.py helpers.
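A sketch of the `skip_provenance.py` stripping helper referenced here. The module, helper, and constant names come from the thread; the implementation below is an assumption:

```python
SKIPPED_COLUMNS_RECORD_KEY = "__skipped__"


def strip_skip_metadata_from_records(records: list[dict]) -> list[dict]:
    """Return copies of the records without the internal __skipped__
    sentinel, so DataFrames handed to generators never see it."""
    return [
        {k: v for k, v in r.items() if k != SKIPPED_COLUMNS_RECORD_KEY}
        for r in records
    ]


records = [
    {"a": 1, SKIPPED_COLUMNS_RECORD_KEY: {"a"}},
    {"a": 2},
]
clean = strip_skip_metadata_from_records(records)
print(clean)  # [{'a': 1}, {'a': 2}]
# The originals keep their provenance for the merge-back step.
assert SKIPPED_COLUMNS_RECORD_KEY in records[0]
```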
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification

Refs #479
Made-with: Cursor
…ng, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
Document _side_effects_by_producer inverse map and get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both Step 2b metadata section and Files Modified table.

The __skipped__ leak into active_df (greptile's other P1) was already fixed in 7046378 via strip_skip_metadata_from_records.

Refs #479
Made-with: Cursor
johnnygreco
left a comment
love this one @nabinchha !!
📋 Summary

Implementation plan for `skip_when` — a new field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy for a row, the cell is set to `None` and the generator is skipped. Skips auto-propagate through the DAG to downstream columns.

Closes #479

🔄 Changes

✨ Added

`plans/479/skip-when-conditional-generation.md` — detailed implementation plan covering config, DAG, sync engine, async engine, validation, and test strategy

Plan Highlights

- `skip_when` field + Jinja2 validator + `skip_when_columns` property on `SingleColumnConfig`
- `skip_when`-referenced columns become edges ensuring correct execution order
- Columns whose `required_columns` include a skipped cell auto-skip
- Skip value (`None` vs dedicated type) and auto-removal of skipped rows

🤖 Generated with AI