refactor: remove sync engine#767
Conversation
|
Fern preview: https://nvidia-preview-pr-767.docs.buildwithfern.com/nemo/datadesigner
|
johnnygreco
left a comment
There was a problem hiding this comment.
Nice work on this one, @andreatgretel — removing the old engine is a big simplification, and the core builder path looks much easier to reason about now.
Summary
This PR removes the legacy sync dataset-builder path and makes generation/check-model readiness use the async scheduler and async model clients by default. The implementation matches the stated intent for generation, but I found one public helper regression around direct model use and one stale public doc reference to the removed opt-out.
Findings
Warnings — Worth addressing
Design issues, missing error handling, test gaps, or violations of project standards that could cause problems later.
packages/data-designer/src/data_designer/interface/data_designer.py:690 — get_models() now returns facades whose sync methods are unusable
- What:
_create_resource_provider()now always passesClientConcurrencyMode.ASYNC, andget_models()calls that helper unchanged at lines 617-619. Those facades are plainModelFacades, not the_AsyncBridgedModelFacadeused inside async custom columns, so a directmodel.generate()call goes throughModelRequestExecutor.completion()to an async-mode HTTP client and hitsSyncClientUnavailableErrorfrom_get_sync_client(). - Why:
get_models()is a public helper specifically documented for testing custom-column functions outside the full pipeline, and the docs still showresult = my_generator({"name": "Alice"}, None, models)where the generator examples callmodels["..."].generate(...). This breaks that workflow even though the sync dataset engine removal itself is intentional. - Suggestion: Could we keep
get_models()on sync-mode clients, e.g. let_create_resource_provider()accept aclient_concurrency_modeoverride and passClientConcurrencyMode.SYNCfromget_models()? If we want direct testing to be async-only instead, we should update the helper/docs together and add coverage for the new expected usage.
README.md:29 — README still advertises the removed sync-engine fallback
- What: The root README still tells users they can set
DATA_DESIGNER_ASYNC_ENGINE=0to fall back to the legacy sync engine, but this PR deletesengine/flags.pyand removes the code paths that read that environment variable. - Why: After this lands in the major release, users following the README will think a rollback path exists when it no longer does; worse, setting the env var is now silently ignored by generation.
- Suggestion: Update this section to say the async engine is the only execution path, and remove the
DATA_DESIGNER_ASYNC_ENGINE=0fallback language.
Suggestions — Take it or leave it
Style improvements, minor simplifications, or optional enhancements that would improve code quality.
scripts/benchmarks/benchmark_engine_v2.py:667 — Benchmark compare mode still toggles the removed env var
- What: The benchmark script still uses
DATA_DESIGNER_ASYNC_ENGINEto label subprocesses as sync vs async, and--engine syncnow just runs the async engine with a different label. - Why: This can produce misleading speedup numbers by comparing async against async, especially because the script name and output still frame it as a dual-engine comparison.
- Suggestion: Consider removing sync compare mode, making
--engine syncfail with a clear message, or repurposing the script as an async-only benchmark.
What Looks Good
- The core
DatasetBuilderpath is much cleaner with the sync branch removed, and the async scheduler wiring is now the obvious path throughbuild()andbuild_preview(). - Resume behavior kept the important crash-window safeguards: filesystem-derived row-group progress, immutable original targets, and terminal handling after
process_after_generation(). - The docs under
architecture/and Fern mostly track the new async-only execution model, and the focused tests around builder/resume/readiness give good confidence in the main path.
Verdict
Needs changes — I’d address the get_models() regression and README fallback reference before merge. The benchmark cleanup is optional but worth doing while this context is fresh.
This review was generated by an AI assistant.
231a656 to
2c2eb90
Compare
|
@johnnygreco thanks for the review. I pushed
Greptile is green on the latest commit and the active checks are passing. |
📋 Summary
Removes the legacy sync dataset-builder engine and the DATA_DESIGNER_ASYNC_ENGINE opt-out so generation always uses the async scheduler. This is stacked on #766.
🔗 Related Issue
Stacked on #766.
🔄 Changes
🧪 Testing
Ran:
✅ Checklist