Skip to content

fix: persist side-effect columns to row buffer in async engine#524

Closed
nabinchha wants to merge 2 commits intomainfrom
nmulepati/fix/523-async-side-effect-columns-not-persisted
Closed

fix: persist side-effect columns to row buffer in async engine#524
nabinchha wants to merge 2 commits intomainfrom
nmulepati/fix/523-async-side-effect-columns-not-persisted

Conversation

@nabinchha
Copy link
Copy Markdown
Contributor

@nabinchha nabinchha commented Apr 9, 2026

📋 Summary

The async engine's _run_cell, _run_batch, and _run_from_scratch only wrote columns tracked in _instance_to_columns back to the RowGroupBufferManager, silently dropping side-effect columns produced by generators (e.g. __trace, __reasoning_content). Downstream columns referencing these values would fail with missing column errors.

This fix introduces an explicit _instance_to_write_columns map that extends _instance_to_columns with side-effect columns declared by each config. The original _instance_to_columns remains unchanged for completion tracking and dispatch dedup, while _instance_to_write_columns is used at all three buffer write-back sites.

🔗 Related Issue

Fixes #523

🔄 Changes

🔧 Changed

  • dataset_builder.py: In _prepare_async_run, collect a side_effect_map (side-effect column → primary column) from each generator's config and pass it to AsyncTaskScheduler
  • async_scheduler.py: Accept side_effect_map in __init__, build _instance_to_write_columns by extending _instance_to_columns with side-effect entries. Use _instance_to_write_columns at the three buffer write sites (_run_from_scratch, _run_cell, _run_batch)

🧪 Tests

  • test_async_scheduler.py: Added MockCellGeneratorWithSideEffect and test_scheduler_side_effect_columns_written_to_buffer — verifies side-effect columns are persisted to the buffer and available to downstream generators. Config uses extract_reasoning_content=True so the ExecutionGraph properly registers the side-effect alias.

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

🧪 Testing

  • make test passes
  • Unit tests added/updated
  • E2E tests added/updated — N/A, covered by unit test on the scheduler

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A, no architectural change

Made with Cursor

`_run_cell` and `_run_batch` only wrote columns tracked in
`_instance_to_columns` back to the buffer, silently dropping side-effect
columns like `__trace` and `__reasoning_content`. Downstream columns
referencing these values would fail with missing column errors.

After writing tracked output columns, now also persist any new keys from
the generator result that weren't in the input row data.

Made-with: Cursor
@nabinchha nabinchha requested a review from a team as a code owner April 9, 2026 23:22
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Code Review: PR #524 — fix: persist side-effect columns to row buffer in async engine

Summary

This PR fixes a bug where the async engine's _run_cell and _run_batch methods silently dropped side-effect columns (e.g., __trace, __reasoning_content) produced by generators but not tracked in _instance_to_columns. Downstream columns referencing these values would fail with missing-column errors.

The fix adds a second write pass in both _run_cell and _run_batch that persists any new keys from the generator result that were not present in the input row data. A unit test is included for the cell-by-cell code path.

Files changed: 2 (86 additions, 1 deletion)

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py

CI status: Lint passes. Most test/E2E jobs still pending at review time. Config tests all pass.


Findings

1. _run_from_scratch not updated — potential silent drop (Medium)

File: async_scheduler.py:757-775

The _run_from_scratch method has the same pattern as _run_cell and _run_batch — it only writes output_cols to the buffer. If a from-scratch generator ever produces side-effect columns, they will be silently dropped, exactly the same bug being fixed here for the other two paths.

This may be intentional (from-scratch generators might not produce side-effects today), but it creates an asymmetry across the three execution paths. At minimum, a comment explaining why _run_from_scratch doesn't need the same treatment would prevent a future reader from filing the same bug again.

2. No test coverage for _run_batch side-effect handling (Medium)

The test (test_scheduler_side_effect_columns_written_to_buffer) only exercises cell-by-cell generation via MockCellGeneratorWithSideEffect. The parallel change to _run_batch (lines 834-840) has no corresponding test. Since the batch path has different mechanics (iterating over a DataFrame vs. a dict, tracking existing_cols from the batch input), a dedicated test would reduce regression risk.

3. Side-effect column collision semantics are implicit (Low)

In _run_cell, the filter col not in row_data means that if two sequential generators both produce the same side-effect column name, the second generator's value is silently dropped (the first writer wins). This is likely fine given the naming convention (column__reasoning_content), but a brief comment noting the first-writer-wins behavior would help future maintainers understand the design choice.

4. Minor: batch_df is not None check may be unnecessary (Nit)

File: async_scheduler.py:837

existing_cols = set(batch_df.columns) if batch_df is not None else set()

batch_df is assigned from self._buffer_manager.get_dataframe(...) just above, and we are inside the if self._buffer_manager is not None guard. If get_dataframe can never return None, this defensive check is dead code. If it can, this is fine. Minor either way.


Positives

  • Focused, minimal fix. Only the necessary lines are changed, no unrelated refactoring.
  • Good comments. The inline comments explaining what side-effect columns are and why they need persisting add valuable context.
  • Clean test design. MockCellGeneratorWithSideEffect is a clear, well-scoped test helper. The test verifies both presence and content of the side-effect column across all rows.
  • Correct set operations. Using result.keys() - written for the cell path is idiomatic and efficient.

Verdict

Approve with suggestions. The core fix is correct and well-implemented. The two medium findings (missing _run_from_scratch symmetry and missing batch test) are worth addressing to complete the fix, but are not blockers for merge. The low/nit items are optional improvements.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 9, 2026

Greptile Summary

This PR fixes a silent data-loss bug in the async scheduler where side-effect columns produced by generators (e.g. __trace, __reasoning_content) were never written back to the RowGroupBufferManager, causing downstream columns that reference them to fail with missing-column errors.

The fix introduces a separate _instance_to_write_columns dict (a superset of _instance_to_columns) that includes side-effect columns alongside primary ones, so all three write-back paths (_run_from_scratch, _run_cell, _run_batch) now persist them. _instance_to_columns is intentionally left unchanged for completion tracking and dispatch dedup.

Confidence Score: 5/5

This PR is safe to merge — the fix is targeted, correct, and well-tested.

All three write-back paths are consistently updated. The two-dict design (_instance_to_columns for completion tracking, _instance_to_write_columns for buffer writes) is clean and avoids breaking dispatch dedup. The dataset_builder.py side-effect map construction correctly handles both MultiColumnConfig and single-column paths. No P0/P1 findings.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py Adds side_effect_map parameter and _instance_to_write_columns dict; all three write-back paths correctly switched from _instance_to_columns to _instance_to_write_columns.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Correctly builds side_effect_map from gen.config.side_effect_columns / sub.side_effect_columns and passes it to AsyncTaskScheduler; handles both MultiColumnConfig and single-column cases.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py New test verifies that side-effect columns are written to the buffer after a cell-by-cell generator run; covers the core bug scenario with answer__reasoning_content.

Sequence Diagram

sequenceDiagram
    participant DB as DatasetBuilder
    participant AS as AsyncTaskScheduler
    participant Gen as ColumnGenerator
    participant BM as RowGroupBufferManager

    DB->>DB: build side_effect_map
    DB->>AS: AsyncTaskScheduler(..., side_effect_map)
    AS->>AS: build _instance_to_write_columns

    loop Cell-by-cell task for answer
        AS->>BM: get_row(rg, ri)
        AS->>Gen: agenerate(row_data)
        Gen-->>AS: result with answer + answer__reasoning_content
        AS->>BM: update_cell answer
        AS->>BM: update_cell answer__reasoning_content
    end

    loop Cell-by-cell task for judge
        AS->>BM: get_row includes answer__reasoning_content
        AS->>Gen: agenerate(row_data)
    end
Loading

Reviews (2): Last reviewed commit: "try a better fix" | Re-trigger Greptile

strategies[gen.config.name] = gen.get_generation_strategy()
gen_map[gen.config.name] = gen
for se_col in gen.config.side_effect_columns:
side_effect_map[se_col] = gen.config.name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecutionGraph.create() on line 342 already builds the same {side_effect_col: primary_col} mapping internally (_side_effect_map). might be worth exposing that as a read-only property on the graph and reading it here instead of building it independently - keeps the graph as the single source of truth and avoids the two copies drifting apart if side-effect logic gets more complex later.

row_groups=row_groups,
buffer_manager=buffer_manager,
side_effect_map={side_effect_col: "answer"},
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this test exercises the scheduler directly, which proves the mechanism works, but doesn't go through _prepare_async_run() where the production wiring happens. if someone refactors the builder and forgets to pass side_effect_map, this test would still pass. maybe worth an integration-level test through build_preview() too?

"seed": MockSeedGenerator(config=_expr_config("seed"), resource_provider=provider),
"answer": answer_gen,
"judge": MockCellGenerator(config=_expr_config("judge"), resource_provider=provider),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the judge mock doesn't actually consume answer__reasoning_content from its input data, so this doesn't fully prove that downstream columns can read the side-effect value from the buffer. maybe worth swapping in a mock that asserts data[side_effect_col] is present? current test still catches the original bug though, so no blocker.

tracker: CompletionTracker,
row_groups: list[tuple[int, int]],
buffer_manager: RowGroupBufferManager | None = None,
side_effect_map: dict[str, str] | None = None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: side_effect_map is positional here but all the other optional config params are keyword-only (after *). might want to move it after * for consistency?

@andreatgretel
Copy link
Copy Markdown
Contributor

Hey @nabinchha - heads up that #509 addresses the same root cause and also fixes a related bug in ExecutionGraph.set_side_effect (first-writer-wins semantics to prevent false DAGCircularDependencyError when multiple stages declare the same side-effect). The scheduler fix in #509 also avoids the extra wiring in dataset_builder.py by reading side-effect columns directly from generator configs inside AsyncTaskScheduler.__init__.

Want to sync up so we're not duplicating effort? Happy to merge the approaches - the _instance_to_write_columns split is the right call either way.

@nabinchha
Copy link
Copy Markdown
Contributor Author

@andreatgretel you are right! This is a duplicate of #509. Will close.

@nabinchha nabinchha closed this Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Async engine: side-effect columns not persisted to row buffer

2 participants