docs: add skip column config option for conditional column generation (#479)#480

Merged
nabinchha merged 12 commits into main from nmulepati/feat/479-skip-when-plan
Apr 6, 2026

Conversation

@nabinchha
Contributor

📋 Summary

Implementation plan for skip_when — a new field on SingleColumnConfig that enables conditional column generation. When the Jinja2 expression evaluates truthy for a row, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns.

Closes #479

🔄 Changes

✨ Added

Plan Highlights

  • Config: skip_when field + Jinja2 validator + skip_when_columns property on SingleColumnConfig
  • DAG: skip_when referenced columns become edges ensuring correct execution order
  • Engine: Skip checks inserted before cell dispatch in both sync and async paths
  • Auto-propagation: Downstream columns whose required_columns include a skipped cell auto-skip
  • Open questions: Sentinel value (None vs dedicated type) and auto-removal of skipped rows
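
The planned row-level behavior can be sketched in plain Python with Jinja2 (a hypothetical illustration, not the library's real API — `apply_skip` and the `__skipped__` key layout are assumptions based on this plan):

```python
# Hypothetical sketch of the planned per-row skip behavior: when the
# expression renders truthy, the cell is written as None and skip
# provenance is recorded under the __skipped__ key.
from jinja2.sandbox import ImmutableSandboxedEnvironment

_env = ImmutableSandboxedEnvironment()

def apply_skip(column: str, expression: str, record: dict) -> None:
    rendered = _env.from_string(expression).render(**record).strip()
    if rendered.lower() not in ("", "false", "none", "0"):
        record[column] = None  # key stays present for downstream templates
        record.setdefault("__skipped__", set()).add(column)

row = {"in_stock": 0}
apply_skip("description", "{{ in_stock == 0 }}", row)
# row now holds description=None plus __skipped__ == {"description"}
```

Downstream columns whose dependencies include a skipped column would then consult `__skipped__` to auto-skip, per the auto-propagation bullet above.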

🤖 Generated with AI

nabinchha and others added 4 commits March 30, 2026 14:19
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nabinchha nabinchha requested a review from a team as a code owner March 30, 2026 20:26
@nabinchha nabinchha changed the title from "plan: add skip_when for conditional column generation (#479)" to "docs: add skip_when for conditional column generation (#479)" on Mar 30, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR adds a detailed implementation plan for skip_when / SkipConfig conditional column generation, covering config modeling, DAG ordering, sync/async engine integration, and test strategy. The plan has been significantly revised (commits 351bc29, 9790f95) and all previously raised concerns are addressed. One correctness gap remains: in _run_full_column_generator() (Step 4d) and its async counterpart (Step 5c), the merge-back replaces non-skipped buffer records wholesale with records derived from the generator's output DataFrame. That DataFrame was built from strip_skip_metadata_from_records() and therefore carries no __skipped__ key, so the merge-back silently erases upstream skip provenance for those rows and breaks propagate_skip=True propagation in any pipeline that mixes CELL_BY_CELL and FULL_COLUMN generators.

Confidence Score: 4/5

Plan is nearly ready to implement but has one P1 correctness bug in the FULL_COLUMN merge-back that would silently break transitive propagation in mixed-generator pipelines

All eight previously raised issues are addressed. One new P1 remains: the FULL_COLUMN merge-back in Step 4d and Step 5c replaces non-skipped buffer records with generator output records stripped of skipped, breaking downstream propagate_skip logic whenever a FULL_COLUMN generator is interleaved between a skipped column and its propagation targets.

plans/479/skip-when-conditional-generation.md — specifically Step 4d (_run_full_column_generator merge-back, lines 486-494) and Step 5c (async _run_batch merge-back)

Important Files Changed

Filename Overview
plans/479/skip-when-conditional-generation.md Implementation plan for skip_when conditional generation; mostly sound after prior revisions, but one P1 correctness bug in the FULL_COLUMN merge-back silently erases upstream skipped provenance

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Column A (CELL_BY_CELL)\nskipped for row 3"] -->|"record.__skipped__ = {'A'}"| B
    B["Column B (FULL_COLUMN)\nnot skipped for row 3"] --> C["active_records built\n(row 3 included)"]
    C --> D["strip_skip_metadata_from_records\nrow 3 loses __skipped__"]
    D --> E["generator.generate(active_df)"]
    E --> F["result_df.to_dict()\nno __skipped__ key"]
    F --> G["merged.append(next(result_iter))\n❌ __skipped__ LOST for row 3"]
    G --> H["Column C\npropagate_skip=True, required_columns=['A']"]
    H --> I["get_skipped_column_names = set()\npropagation check: False"]
    I --> J["❌ Column C generates for row 3\nshould have been auto-skipped"]
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: plans/479/skip-when-conditional-generation.md
Line: 486-494

Comment:
**`__skipped__` provenance erased for non-skipped rows after FULL_COLUMN generator**

In the merge-back loop, non-skipped rows are replaced wholesale with records from `result_df.to_dict(orient="records")`. Those result records were derived from `active_df`, which was built via `strip_skip_metadata_from_records(active_records)` — so they carry no `__skipped__` key. Any upstream skip provenance (e.g., column A was skipped for row 3) that existed on the original buffer record is silently dropped.

Concrete failure: column A is CELL_BY_CELL-skipped for row 3 → `record.__skipped__ = {'column_A'}`. Column B is a FULL_COLUMN generator that does **not** skip row 3. After the merge-back, row 3's record is replaced with `next(result_iter)`, which has no `__skipped__`. Column C runs next with `propagate_skip=True` and `required_columns=['column_A']`; `get_skipped_column_names(record)` returns `set()`, so propagation does not fire — C generates for row 3 even though it should have been auto-skipped.

Fix: carry the existing `__skipped__` set forward into each merged result record:

```python
for i, record in enumerate(batch):
    if i in skip_indices:
        merged.append(record)
    else:
        result_record = next(result_iter)
        if SKIPPED_COLUMNS_RECORD_KEY in record:
            result_record[SKIPPED_COLUMNS_RECORD_KEY] = record[SKIPPED_COLUMNS_RECORD_KEY]
        merged.append(result_record)
```

The same fix is needed in the async `_run_batch()` merge-back (Step 5c).

How can I resolve this? If you propose a fix, please make it concise.

Reviews (6): Last reviewed commit: "plan: add get_side_effect_columns access..."

@nabinchha nabinchha added the plan Agent-assisted development plan label Mar 31, 2026
@andreatgretel
Contributor

plans/479/skip-when-conditional-generation.md:172-195 (async engine sections)

worth noting the overlap with #362 (keep failed-parse fields as null). Both features need a "cell became unavailable, propagate through the DAG" mechanism - skip_when is intentional/config-driven, #362 is runtime failure. Might be worth designing one shared propagation path (especially on the async side where CompletionTracker already has dependency-aware state) instead of building two separate cascades. Not blocking for the plan, just something to keep in mind before the implementation PRs land.

Contributor

@johnnygreco johnnygreco left a comment


looks great! only question to sort out is the skip value

…e propagation

- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
  bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
  _fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
  and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan

Made-with: Cursor
- Explain why propagation must not use get_upstream_columns() once
  skip.when adds DAG edges; add _required_columns and
  get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
  DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case

Refs #479

Made-with: Cursor
```python
    self, column_name: str, record: dict
) -> bool:
    skipped_cols: set[str] = record.get("__skipped__", set())
```

Contributor


_should_skip_cell() calls self._graph.get_required_columns(column_name) but the Files Modified table only adds get_skip_config() and should_propagate_skip() to ExecutionGraph - I don't think this method exists yet. get_upstream_columns() wouldn't work as a stand-in since after Step 2b it includes skip.columns edges too, and propagation should only consider template dependencies. Probably needs a _required_columns dict + accessor on the graph.

Contributor Author


This was already addressed in 9790f95 (Step 2b lines 258–264 and Files Modified table line 591 both document _required_columns + get_required_columns()). Confirmed still present in c5d2dbc.


```python
def _validate_when_syntax(cls, v: str) -> str:
    from jinja2.sandbox import ImmutableSandboxedEnvironment

    ImmutableSandboxedEnvironment().parse(v)
    return v
```
Contributor


env.parse("in_stock == 0") (no {{ }}) is valid Jinja2 - it's just literal text. columns would come back empty, no DAG edges get created, and the runtime renders the truthy string "in_stock == 0", silently skipping every row. This feels like the most likely user mistake. Worth checking that find_undeclared_variables() is non-empty or at least requiring {{ to be present.

Contributor Author


Good catch — addressed in c5d2dbc. _validate_when_syntax now checks find_undeclared_variables(ast) is non-empty and raises ValueError with a hint about {{ }} delimiters if the expression references no columns.

- `record.setdefault("__skipped__", set()).add(task.column)` — records skip provenance.
- `record[task.column] = skip_config.value if skip_config else None` — the **column key must be present** in the record dict, not absent. Downstream `skip.when` expressions and Jinja2 templates may reference skipped columns (e.g., `{{ col is none }}`); an absent key would cause `UndefinedError`. Propagation-only skips (no `SkipConfig`) use `None`.
- `record[se_col] = None` for each side-effect column — side-effect columns (`__trace`, `__reasoning_content`, etc.) are always written as `None`, regardless of the parent column's `skip.value`. A skipped cell has no trace or reasoning.
- Return `None`.
Contributor


this gives async a clear policy for eval errors (log + skip), but the sync path in Steps 4a-4c would let the exception bubble up and crash the column. Same typo in skip.when would silently skip rows in async but abort in sync. Probably worth moving the try/except into evaluate_skip_when() itself so both engines get the same behavior.

Contributor Author


Agreed — addressed in c5d2dbc. The try/except is now inside evaluate_skip_when() itself (Step 3a), so both sync and async callers get identical fail-safe behavior. Removed the per-engine error handling from Step 5b.
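
The centralized fail-safe could look roughly like this (a sketch under the assumptions in this thread — eval errors log and skip, per the async policy noted above; the real signature may differ):

```python
# Hypothetical evaluate_skip_when with the try/except inside the helper,
# so sync and async callers get identical fail-safe behavior: any
# evaluation error is logged and the cell is skipped.
import logging
from jinja2.sandbox import ImmutableSandboxedEnvironment

logger = logging.getLogger(__name__)
_env = ImmutableSandboxedEnvironment()

def evaluate_skip_when(expression: str, record: dict) -> bool:
    try:
        rendered = _env.from_string(expression).render(**record).strip()
        return rendered.lower() not in ("", "false", "none", "0")
    except Exception as exc:  # fail safe: log and skip
        logger.warning("skip expression %r failed: %s", expression, exc)
        return True
```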


```python
@lru_cache(maxsize=64)
def _compile_skip_template(expression: str) -> Template:
    return _env.from_string(expression)
```
Contributor


nit: evaluate_skip_when calls deserialize_json_values(record) per row, and the generator does it again when it runs. For records with a lot of nested JSON that could add up. Worth deserializing once at the dispatch layer and passing the result to both.

Contributor Author


Addressed in c5d2dbc. Went a step further than the optional kwarg — evaluate_skip_when now takes a single record that the caller must have already deserialized. The dispatch layer (_fan_out_with_threads, _run_cell) runs deserialize_json_values once and passes the result to both skip eval and the generator. No redundant parameter, no double work.
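
The caller-owns-deserialization contract can be illustrated with a stand-in deserializer (hypothetical — the real deserialize_json_values lives in the library; this toy version only parses string values that hold JSON):

```python
# Sketch of "deserialize once at dispatch": the dispatch layer parses
# JSON-bearing string values a single time, then hands the same dict to
# both skip evaluation and the generator — no double work.
import json

def deserialize_json_values(record: dict) -> dict:
    out = {}
    for k, v in record.items():
        if isinstance(v, str):
            try:
                out[k] = json.loads(v)
            except ValueError:
                out[k] = v  # plain string, keep as-is
        else:
            out[k] = v
    return out

raw = {"meta": '{"in_stock": 0}', "name": "widget"}
record = deserialize_json_values(raw)  # done once, at the dispatch layer
# record is then passed to both evaluate_skip_when(...) and the generator
```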

Comment on lines +354 to +368
```python
        return evaluate_skip_when(skip_config.when, record)

    return False
```

#### 4b. Helper: `_write_skip_to_record()`

When a cell is skipped, write provenance and the skip value into the record in-place. The skip value comes from `SkipConfig.value` if the column has one, otherwise `None` (propagation-only skips always use `None`):

```python
def _write_skip_to_record(
    self, column_name: str, record: dict
) -> None:
    skip_config = self._graph.get_skip_config(column_name)
    skip_value = skip_config.value if skip_config is not None else None
```
Contributor


P1 get_side_effect_columns() not defined — runtime AttributeError

_write_skip_to_record() calls self._graph.get_side_effect_columns(column_name) (line 366), but this method does not exist on ExecutionGraph. The existing graph only has set_side_effect() (maps a side-effect column to its producer) and resolve_side_effect() (reverse lookup) — neither iterates side-effect columns for a given producer.

The Files Modified section for execution_graph.py lists get_skip_config(), should_propagate_skip(), and (from the resolved Q6 thread) get_required_columns() as new accessors, but omits get_side_effect_columns(). Without it, every call to _write_skip_to_record() — in both sync and async engines — will raise AttributeError at runtime.

The implementation will need an inverse map and a new accessor, e.g.:

```python
# In ExecutionGraph.__init__:
self._side_effects_by_producer: dict[str, list[str]] = {}

# In set_side_effect():
self._side_effects_by_producer.setdefault(producer, []).append(side_effect_col)

# New accessor:
def get_side_effect_columns(self, column: str) -> list[str]:
    return list(self._side_effects_by_producer.get(column, []))
```

This accessor should also be added to the Files Modified table.


Contributor Author


Good catch — addressed in 75943d5. Added _side_effects_by_producer: dict[str, list[str]] (inverse map populated inside set_side_effect()) and get_side_effect_columns(column) -> list[str] accessor to both the Step 2b metadata section and the Files Modified table for execution_graph.py.

Comment on lines +407 to +414
```python
if self._should_skip_cell(column_name, record):
    self._write_skip_to_record(column_name, record)
    self.batch_manager.update_record(i, record)
    skip_indices.add(i)

# Build DataFrame excluding skipped rows
batch = self.batch_manager.get_current_batch(as_dataframe=False)
active_records = [r for i, r in enumerate(batch) if i not in skip_indices]
```
Contributor


P1 __skipped__ key leaks into active_df passed to FULL_COLUMN generators

In Step 4d, active_records comes from get_current_batch(as_dataframe=False), which returns the raw buffer list — including __skipped__ keys (Python set objects) that previous columns have written. The DataFrame is then built directly:

```python
active_df = lazy.pd.DataFrame(active_records)
result_df = generator.generate(active_df)
```

The serialization boundary in Step 8 strips __skipped__ only in get_current_batch(as_dataframe=True) and write(), not in this ad-hoc pd.DataFrame() call. Generators therefore receive an unexpected column containing set objects, which can trigger pandas dtype-inference errors or silently expose the internal sentinel to plugin generators.

Fix by stripping __skipped__ when building active_df:

```python
active_df = lazy.pd.DataFrame(
    [{k: v for k, v in r.items() if k != "__skipped__"} for r in active_records]
)
```

The same strip is needed in Step 5c (_run_batch) when constructing the filtered batch_df for generator.agenerate().


Contributor Author


This was already addressed in 7046378. Step 4d builds active_df via strip_skip_metadata_from_records(active_records) (line 478-479), and Step 5c / the usage contract in §3b require the same for the async path. The inline {k: v ...} pattern you suggested is intentionally avoided — all stripping goes through skip_provenance.py helpers.
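
The skip_provenance.py helpers referenced here can be sketched as follows (function and constant names come from this thread; the exact implementation is an assumption):

```python
# Minimal sketch of the shared skip-provenance helpers: a single key
# constant plus read/write/strip functions, so every buffer-to-DataFrame
# boundary strips the internal sentinel the same way.
SKIPPED_COLUMNS_RECORD_KEY = "__skipped__"

def mark_skipped(record: dict, column: str) -> None:
    record.setdefault(SKIPPED_COLUMNS_RECORD_KEY, set()).add(column)

def get_skipped_column_names(record: dict) -> set:
    return record.get(SKIPPED_COLUMNS_RECORD_KEY, set())

def strip_skip_metadata_from_records(records: list) -> list:
    # Drop the sentinel before building DataFrames from buffer dicts,
    # including FULL_COLUMN active subsets.
    return [
        {k: v for k, v in r.items() if k != SKIPPED_COLUMNS_RECORD_KEY}
        for r in records
    ]
```

Routing all stripping through one module avoids the inline dict-comprehension pattern being repeated (and drifting) across the sync and async engines.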

- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
  FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification

Refs #479

Made-with: Cursor
Drop legacy skip_when naming in headings and #362 cross-reference.

Refs #479

Made-with: Cursor
@nabinchha nabinchha changed the title from "docs: add skip_when for conditional column generation (#479)" to "docs: add skip column config option for conditional column generation (#479)" on Apr 6, 2026
…ng, caller-owns-deserialization

- SkipConfig._validate_when_syntax now checks find_undeclared_variables
  is non-empty, rejecting expressions without {{ }} delimiters that
  would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
  engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
  runs deserialize_json_values once and passes to both skip eval and
  generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
  and verification section accordingly

Refs #479

Made-with: Cursor
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.

The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 7046378 via strip_skip_metadata_from_records.

Refs #479

Made-with: Cursor
@nabinchha nabinchha requested a review from andreatgretel April 6, 2026 19:56
Contributor

@johnnygreco johnnygreco left a comment


love this one @nabinchha !!

@nabinchha nabinchha merged commit d4443d7 into main Apr 6, 2026
47 checks passed
@nabinchha nabinchha deleted the nmulepati/feat/479-skip-when-plan branch April 6, 2026 21:22

@andy9trider69-bit andy9trider69-bit left a comment


Help me


Labels

plan Agent-assisted development plan


Development

Successfully merging this pull request may close these issues.

feat: support simple conditional column generation

4 participants