refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511
refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511przemekboruta wants to merge 2 commits intoNVIDIA-NeMo:mainfrom
Conversation
Greptile SummaryThis PR consolidates two overlapping DAG implementations by deleting
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py | Adds topologically_sort_column_configs as a module-level function using a correct inline Kahn's algorithm; the existing ExecutionGraph class is unchanged. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py | Import updated from the deleted dag module to execution_graph; no logic changes. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py | File deleted as part of the consolidation; functionality migrated to execution_graph.py. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py | Import path updated to execution_graph; test logic is identical to the pre-PR version. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["config_compiler.py\ncompile_dataset_builder_column_configs()"]
B["execution_graph.py\ntopologically_sort_column_configs()"]
C["ExecutionGraph.create()"]
D["ExecutionGraph.get_topological_order()\n(Kahn's algorithm)"]
E["DAGCircularDependencyError"]
A -- "calls" --> B
B -- "builds upstream/downstream\ndicts + Kahn's sort" --> D
C -- "calls after building edges" --> D
D -- "raises on cycle" --> E
B -- "raises on cycle" --> E
style B fill:#d4edda,stroke:#28a745
style E fill:#f8d7da,stroke:#dc3545
Reviews (3): Last reviewed commit: "test: relax non-deterministic ordering a..." | Re-trigger Greptile
89857f7 to
b25ebf6
Compare
…ution_graph.py Eliminates dag.py and its networkx dependency by moving topologically_sort_column_configs into execution_graph.py as a module-level function. Side-effect resolution is now O(1) via a side_effect_map dict (previously O(n²) linear scan). Kahn's algorithm is reused in-place rather than leaning on networkx.topological_sort. Closes NVIDIA-NeMo#510 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b25ebf6 to
775f147
Compare
test_judge and test_code_and_depends_on_validation_reasoning_traces have no mutual dependency and reach in-degree 0 simultaneously in Kahn's algorithm. Set iteration order varies with PYTHONHASHSEED, making the strict list assertion flaky. Assert only the topological invariants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@przemekboruta there's a larger PR in flight that touches these DAG abstractions. Let's wait until that merges before this one can re-base and merge! |
Summary
Closes #510
dag.pyand movestopologically_sort_column_configsintoexecution_graph.pyas a module-level functionnetworkx.topological_sortwith an inline Kahn's algorithm, consistent withExecutionGraph.get_topological_orderside_effect_mapdict — the previous implementation did a linear scan oversum(side_effect_dict.values(), [])which was O(n²)config_compiler.pyandtest_dag.pyDesign note
The function is intentionally a module-level function, not a
@classmethodonExecutionGraph.ExecutionGraphis an execution abstraction (requires strategies, manages task scheduling); this function is a compilation step that works on rawColumnConfigTwithout strategies. Mixing the two responsibilities would require either dummy strategies or a significant signature change toadd_column.Test plan
test_dag.pytests (test_dag_construction,test_circular_dependencies) cover the migrated function with updated import pathruff checkpasses on all modified files