fix: persist side-effect columns to row buffer in async engine by nabinchha · Pull Request #524 · NVIDIA-NeMo/DataDesigner

nabinchha · 2026-04-09T23:22:27Z

📋 Summary

The async engine's _run_cell, _run_batch, and _run_from_scratch only wrote columns tracked in _instance_to_columns back to the RowGroupBufferManager, silently dropping side-effect columns produced by generators (e.g. __trace, __reasoning_content). Downstream columns referencing these values would fail with missing column errors.

This fix introduces an explicit _instance_to_write_columns map that extends _instance_to_columns with side-effect columns declared by each config. The original _instance_to_columns remains unchanged for completion tracking and dispatch dedup, while _instance_to_write_columns is used at all three buffer write-back sites.

🔗 Related Issue

Fixes #523

🔄 Changes

🔧 Changed

dataset_builder.py: In _prepare_async_run, collect a side_effect_map (side-effect column → primary column) from each generator's config and pass it to AsyncTaskScheduler
async_scheduler.py: Accept side_effect_map in __init__, build _instance_to_write_columns by extending _instance_to_columns with side-effect entries. Use _instance_to_write_columns at the three buffer write sites (_run_from_scratch, _run_cell, _run_batch)

🧪 Tests

test_async_scheduler.py: Added MockCellGeneratorWithSideEffect and test_scheduler_side_effect_columns_written_to_buffer — verifies side-effect columns are persisted to the buffer and available to downstream generators. Config uses extract_reasoning_content=True so the ExecutionGraph properly registers the side-effect alias.

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

async_scheduler.py — _instance_to_write_columns construction: New map that extends _instance_to_columns with side-effect columns. _instance_to_columns is intentionally left unchanged for CompletionTracker and dispatch dedup.

🧪 Testing

make test passes
Unit tests added/updated
E2E tests added/updated — N/A, covered by unit test on the scheduler

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated — N/A, no architectural change

Made with Cursor

`_run_cell` and `_run_batch` only wrote columns tracked in `_instance_to_columns` back to the buffer, silently dropping side-effect columns like `__trace` and `__reasoning_content`. Downstream columns referencing these values would fail with missing column errors. After writing tracked output columns, now also persist any new keys from the generator result that weren't in the input row data. Made-with: Cursor

github-actions · 2026-04-09T23:25:14Z

Code Review: PR #524 — fix: persist side-effect columns to row buffer in async engine

Summary

This PR fixes a bug where the async engine's _run_cell and _run_batch methods silently dropped side-effect columns (e.g., __trace, __reasoning_content) produced by generators but not tracked in _instance_to_columns. Downstream columns referencing these values would fail with missing-column errors.

The fix adds a second write pass in both _run_cell and _run_batch that persists any new keys from the generator result that were not present in the input row data. A unit test is included for the cell-by-cell code path.

Files changed: 2 (86 additions, 1 deletion)

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py

CI status: Lint passes. Most test/E2E jobs still pending at review time. Config tests all pass.

Findings

1. `_run_from_scratch` not updated — potential silent drop (Medium)

File: async_scheduler.py:757-775

The _run_from_scratch method has the same pattern as _run_cell and _run_batch — it only writes output_cols to the buffer. If a from-scratch generator ever produces side-effect columns, they will be silently dropped, exactly the same bug being fixed here for the other two paths.

This may be intentional (from-scratch generators might not produce side-effects today), but it creates an asymmetry across the three execution paths. At minimum, a comment explaining why _run_from_scratch doesn't need the same treatment would prevent a future reader from filing the same bug again.

2. No test coverage for `_run_batch` side-effect handling (Medium)

The test (test_scheduler_side_effect_columns_written_to_buffer) only exercises cell-by-cell generation via MockCellGeneratorWithSideEffect. The parallel change to _run_batch (lines 834-840) has no corresponding test. Since the batch path has different mechanics (iterating over a DataFrame vs. a dict, tracking existing_cols from the batch input), a dedicated test would reduce regression risk.

3. Side-effect column collision semantics are implicit (Low)

In _run_cell, the filter col not in row_data means that if two sequential generators both produce the same side-effect column name, the second generator's value is silently dropped (the first writer wins). This is likely fine given the naming convention (column__reasoning_content), but a brief comment noting the first-writer-wins behavior would help future maintainers understand the design choice.

4. Minor: `batch_df is not None` check may be unnecessary (Nit)

File: async_scheduler.py:837

existing_cols = set(batch_df.columns) if batch_df is not None else set()

batch_df is assigned from self._buffer_manager.get_dataframe(...) just above, and we are inside the if self._buffer_manager is not None guard. If get_dataframe can never return None, this defensive check is dead code. If it can, this is fine. Minor either way.

Positives

Focused, minimal fix. Only the necessary lines are changed, no unrelated refactoring.
Good comments. The inline comments explaining what side-effect columns are and why they need persisting add valuable context.
Clean test design. MockCellGeneratorWithSideEffect is a clear, well-scoped test helper. The test verifies both presence and content of the side-effect column across all rows.
Correct set operations. Using result.keys() - written for the cell path is idiomatic and efficient.

Verdict

Approve with suggestions. The core fix is correct and well-implemented. The two medium findings (missing _run_from_scratch symmetry and missing batch test) are worth addressing to complete the fix, but are not blockers for merge. The low/nit items are optional improvements.

greptile-apps · 2026-04-09T23:26:18Z

Greptile Summary

This PR fixes a silent data-loss bug in the async scheduler where side-effect columns produced by generators (e.g. __trace, __reasoning_content) were never written back to the RowGroupBufferManager, causing downstream columns that reference them to fail with missing-column errors.

The fix introduces a separate _instance_to_write_columns dict (a superset of _instance_to_columns) that includes side-effect columns alongside primary ones, so all three write-back paths (_run_from_scratch, _run_cell, _run_batch) now persist them. _instance_to_columns is intentionally left unchanged for completion tracking and dispatch dedup.

Confidence Score: 5/5

This PR is safe to merge — the fix is targeted, correct, and well-tested.

All three write-back paths are consistently updated. The two-dict design (_instance_to_columns for completion tracking, _instance_to_write_columns for buffer writes) is clean and avoids breaking dispatch dedup. The dataset_builder.py side-effect map construction correctly handles both MultiColumnConfig and single-column paths. No P0/P1 findings.

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py	Adds `side_effect_map` parameter and `_instance_to_write_columns` dict; all three write-back paths correctly switched from `_instance_to_columns` to `_instance_to_write_columns`.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py	Correctly builds `side_effect_map` from `gen.config.side_effect_columns` / `sub.side_effect_columns` and passes it to `AsyncTaskScheduler`; handles both `MultiColumnConfig` and single-column cases.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py	New test verifies that side-effect columns are written to the buffer after a cell-by-cell generator run; covers the core bug scenario with `answer__reasoning_content`.

Sequence Diagram

sequenceDiagram
    participant DB as DatasetBuilder
    participant AS as AsyncTaskScheduler
    participant Gen as ColumnGenerator
    participant BM as RowGroupBufferManager

    DB->>DB: build side_effect_map
    DB->>AS: AsyncTaskScheduler(..., side_effect_map)
    AS->>AS: build _instance_to_write_columns

    loop Cell-by-cell task for answer
        AS->>BM: get_row(rg, ri)
        AS->>Gen: agenerate(row_data)
        Gen-->>AS: result with answer + answer__reasoning_content
        AS->>BM: update_cell answer
        AS->>BM: update_cell answer__reasoning_content
    end

    loop Cell-by-cell task for judge
        AS->>BM: get_row includes answer__reasoning_content
        AS->>Gen: agenerate(row_data)
    end

_{Reviews (2): Last reviewed commit: "try a better fix" | Re-trigger Greptile}

andreatgretel · 2026-04-10T13:50:38Z

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py

                strategies[gen.config.name] = gen.get_generation_strategy()
                gen_map[gen.config.name] = gen
+                for se_col in gen.config.side_effect_columns:
+                    side_effect_map[se_col] = gen.config.name


ExecutionGraph.create() on line 342 already builds the same {side_effect_col: primary_col} mapping internally (_side_effect_map). might be worth exposing that as a read-only property on the graph and reading it here instead of building it independently - keeps the graph as the single source of truth and avoids the two copies drifting apart if side-effect logic gets more complex later.

andreatgretel · 2026-04-10T13:50:40Z

packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py

+        row_groups=row_groups,
+        buffer_manager=buffer_manager,
+        side_effect_map={side_effect_col: "answer"},
+    )


nit: this test exercises the scheduler directly, which proves the mechanism works, but doesn't go through _prepare_async_run() where the production wiring happens. if someone refactors the builder and forgets to pass side_effect_map, this test would still pass. maybe worth an integration-level test through build_preview() too?

andreatgretel · 2026-04-10T13:50:42Z

packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py

+        "seed": MockSeedGenerator(config=_expr_config("seed"), resource_provider=provider),
+        "answer": answer_gen,
+        "judge": MockCellGenerator(config=_expr_config("judge"), resource_provider=provider),
+    }


nit: the judge mock doesn't actually consume answer__reasoning_content from its input data, so this doesn't fully prove that downstream columns can read the side-effect value from the buffer. maybe worth swapping in a mock that asserts data[side_effect_col] is present? current test still catches the original bug though, so no blocker.

andreatgretel · 2026-04-10T13:50:44Z

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py

        tracker: CompletionTracker,
        row_groups: list[tuple[int, int]],
        buffer_manager: RowGroupBufferManager | None = None,
+        side_effect_map: dict[str, str] | None = None,


minor: side_effect_map is positional here but all the other optional config params are keyword-only (after *). might want to move it after * for consistency?

andreatgretel · 2026-04-10T14:49:58Z

Hey @nabinchha - heads up that #509 addresses the same root cause and also fixes a related bug in ExecutionGraph.set_side_effect (first-writer-wins semantics to prevent false DAGCircularDependencyError when multiple stages declare the same side-effect). The scheduler fix in #509 also avoids the extra wiring in dataset_builder.py by reading side-effect columns directly from generator configs inside AsyncTaskScheduler.__init__.

Want to sync up so we're not duplicating effort? Happy to merge the approaches - the _instance_to_write_columns split is the right call either way.

nabinchha · 2026-04-10T15:39:34Z

@andreatgretel you are right! This is a duplicate of #509. Will close.

nabinchha requested a review from a team as a code owner April 9, 2026 23:22

try a better fix

00ead6c

andreatgretel reviewed Apr 10, 2026

View reviewed changes

nabinchha closed this Apr 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: persist side-effect columns to row buffer in async engine#524

fix: persist side-effect columns to row buffer in async engine#524
nabinchha wants to merge 2 commits intomainfrom
nmulepati/fix/523-async-side-effect-columns-not-persisted

nabinchha commented Apr 9, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 9, 2026

Uh oh!

greptile-apps bot commented Apr 9, 2026 •

edited

Loading

Confidence Score: 5/5

Sequence Diagram

Uh oh!

andreatgretel Apr 10, 2026

Uh oh!

andreatgretel Apr 10, 2026

Uh oh!

andreatgretel Apr 10, 2026

Uh oh!

andreatgretel Apr 10, 2026

Uh oh!

andreatgretel commented Apr 10, 2026

Uh oh!

nabinchha commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nabinchha commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📋 Summary

🔗 Related Issue

🔄 Changes

🔧 Changed

🧪 Tests

🔍 Attention Areas

🧪 Testing

✅ Checklist

Uh oh!

github-actions bot commented Apr 9, 2026

Code Review: PR #524 — fix: persist side-effect columns to row buffer in async engine

Summary

Findings

1. _run_from_scratch not updated — potential silent drop (Medium)

2. No test coverage for _run_batch side-effect handling (Medium)

3. Side-effect column collision semantics are implicit (Low)

4. Minor: batch_df is not None check may be unnecessary (Nit)

Positives

Verdict

Uh oh!

greptile-apps bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

andreatgretel Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

andreatgretel Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

andreatgretel Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

andreatgretel Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

andreatgretel commented Apr 10, 2026

Uh oh!

nabinchha commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nabinchha commented Apr 9, 2026 •

edited

Loading

1. `_run_from_scratch` not updated — potential silent drop (Medium)

2. No test coverage for `_run_batch` side-effect handling (Medium)

4. Minor: `batch_df is not None` check may be unnecessary (Nit)

greptile-apps bot commented Apr 9, 2026 •

edited

Loading