architecture/dataset-builders.md
+18 -43 (18 additions & 43 deletions)
@@ -1,30 +1,21 @@
1
1
# Dataset Builders
2
2
3
-
The dataset builder subsystem orchestrates the end-to-end generation of a dataset from compiled column configs. It supports two execution modes: a sequential batch loop and an async DAG-based scheduler.
3
+
The dataset builder subsystem orchestrates the end-to-end generation of a dataset from compiled column configs using an async DAG-based scheduler.
-
`DatasetBuilder` is the central orchestrator. It receives a compiled `DataDesignerConfig`, instantiates column generators from the registry, and executes them in dependency order. The execution mode is selected by the `DATA_DESIGNER_ASYNC_ENGINE` environment variable.
9
+
`DatasetBuilder` is the central orchestrator. It receives a compiled `DataDesignerConfig`, instantiates column generators from the registry, and executes them through `AsyncTaskScheduler`.
10
10
11
-
Both modes produce the same output: batched parquet files managed by `DatasetBatchManager`, with post-generation processing and profiling.
11
+
The scheduler produces row-group parquet files managed by `RowGroupBufferManager`, with post-generation processing and profiling.
- `GenerationStrategy` per column (CELL_BY_CELL or FULL_COLUMN)
45
36
- Kahn topological sort for execution order
46
37
- `split_upstream_by_strategy` - separates batch-level from cell-level dependencies
47
-
- Skip metadata per column — `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` — queried at runtime by both engines to evaluate skip decisions
38
+
- Skip metadata per column — `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` — queried at runtime to evaluate skip decisions
48
39
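The Kahn ordering mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `topologically_sort_column_configs`: the `kahn_sort` name and the `deps` mapping (column name to the set of columns it reads) are hypothetical, and a `skip.when` reference is modelled simply as one more dependency edge.

```python
from collections import deque


def kahn_sort(deps: dict[str, set[str]]) -> list[str]:
    """Order columns so each one comes after the columns it depends on.

    `deps` maps a column name to the columns it reads (hypothetical shape);
    every referenced column is assumed to appear as a key.
    """
    indegree = {col: len(required) for col, required in deps.items()}
    dependents: dict[str, list[str]] = {col: [] for col in deps}
    for col, required in deps.items():
        for upstream in required:
            dependents[upstream].append(col)

    # Start from columns with no dependencies; sorting keeps the order stable.
    ready = deque(sorted(col for col, n in indegree.items() if n == 0))
    order: list[str] = []
    while ready:
        col = ready.popleft()
        order.append(col)
        for downstream in sorted(dependents[col]):
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                ready.append(downstream)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


columns = {
    "topic": set(),
    "text": {"topic"},
    "is_long": {"text"},
    # "judge" reads "text"; its skip.when expression references "is_long",
    # so that reference is added as an extra edge.
    "judge": {"text", "is_long"},
}
order = kahn_sort(columns)
```

With these inputs the gate column `is_long` is ordered before `judge`, which is the property the skip edges are meant to guarantee.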
49
40
### CompletionTracker
50
41
@@ -64,21 +55,18 @@ Columns can be conditionally skipped per-row via `SkipConfig` (defined in `data_
64
55
Skip evaluation is handled by two utility modules:
65
56
66
57
- **`skip_evaluator.py`** - `evaluate_skip_when` renders the expression in a `NativeSandboxedEnvironment` (native Python types, `StrictUndefined`). `should_skip_by_propagation` checks set intersection between required columns and skipped columns.
67
-
- **`skip_tracker.py`** - manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet (called by `DatasetBatchManager`, `RowGroupBufferManager`, and inline in both engines).
58
+
- **`skip_tracker.py`** - manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet.
68
59
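The bookkeeping described for the skip utilities can be illustrated with plain dicts. The sketch below is an assumption-laden stand-in, not the real `skip_tracker.py` or `skip_evaluator.py`: the function names and signatures are hypothetical, and "clears side-effect columns" is assumed to mean resetting them to `None`.

```python
SKIPPED_KEY = "__internal_skipped_columns"


def apply_skip(record: dict, column: str, skip_value=None, side_effects=()) -> None:
    """Mark `column` as skipped on one record (illustrative helper only)."""
    record.setdefault(SKIPPED_KEY, set()).add(column)
    record[column] = skip_value
    for name in side_effects:
        record[name] = None  # assumption: clearing means resetting to None


def skipped_by_propagation(record: dict, required: set[str]) -> bool:
    """True when any required column was already skipped for this row."""
    return bool(required & record.get(SKIPPED_KEY, set()))


def strip_skip_metadata(records: list[dict]) -> list[dict]:
    """Drop the bookkeeping key so it never reaches the DataFrame."""
    return [{k: v for k, v in r.items() if k != SKIPPED_KEY} for r in records]


row = {"text": "short"}
apply_skip(row, "summary", skip_value=None, side_effects=["summary_trace"])
propagates = skipped_by_propagation(row, required={"summary"})
clean = strip_skip_metadata([row])
```

Keeping the skipped set on the record itself means a downstream column only needs a set intersection to decide whether to propagate the skip.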
69
-
Both execution modes integrate skip at the same points:
60
+
`_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record`. Skipped cells report as skipped (not success) in progress tracking.
70
61
71
-
- **Sequential**: `_run_full_column_generator` and the fan-out methods (`_fan_out_with_threads`, `_fan_out_with_async`) call `_should_skip_cell` per record. Skipped rows are excluded from the generator input, then merged back with skip metadata preserved. A fast `_column_can_skip` check short-circuits the per-record evaluation when no skip config or propagation applies.
72
-
- **Async**: `_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record` with the same logic. Skipped cells report as skipped (not success) in progress tracking.
62
+
DAG edges are added for `skip.when` column references in both `topologically_sort_column_configs` (compile-time sort) and `ExecutionGraph.create` (runtime graph) so skip-gate columns are generated before the gated column.
73
63
74
-
DAG edges are added for `skip.when` column references in both `topologically_sort_column_configs` (compile-time sort) and `ExecutionGraph.create` (async runtime) so skip-gate columns are generated before the gated column.
64
+
### RowGroupBufferManager
75
65
76
-
### DatasetBatchManager
77
-
78
-
Manages in-memory row buffers and persistence:
79
-
- `finish_batch` → writes parquet via `ArtifactStorage`
80
-
- Updates dataset metadata between batches
81
-
- The async path uses `RowGroupBufferManager` for per-row-group DataFrames and checkpointing
66
+
Manages per-row-group DataFrames and persistence:
67
+
- `checkpoint_row_group` → writes parquet via `ArtifactStorage`
68
+
- Updates dataset metadata between row groups
69
+
- Tracks dropped rows and actual record counts for resume
82
70
83
71
### Resume Checkpointing
84
72
@@ -90,7 +78,7 @@ Manages in-memory row buffers and persistence:
90
78
91
79
Checkpoint state lives in `metadata.json`. Each metadata write includes the config fingerprint (`config_hash`, `config_hash_algo`, and `config_hash_version`) so compatibility checks do not need to deserialize `builder_config.json` for the common path. `builder_config.json` remains the human-readable record of the run configuration and the fallback for older datasets.
92
80
93
-
Both engines resume the same way: they scan `parquet-files/batch_*.parquet` and read parquet metadata to recover the completed row-group IDs and their actual persisted row counts. `metadata.json` remains the source of truth for the run *configuration* (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), but the filesystem is the source of truth for *progress* (`num_completed_batches`, `actual_num_records`). Splitting the two sources is what lets resume survive a crash between writing a batch parquet and updating metadata — the filesystem reflects the durable state even when metadata lags by a step. Reading actual row counts also matters for async early-shutdown salvage, where a completed parquet file can contain fewer rows than the requested row-group size. The async engine tolerates non-contiguous IDs because row groups can complete out of order; the sync engine writes batches sequentially and rejects holes (likely external mutation or a directory written by an incompatible engine).
81
+
Resume scans `parquet-files/batch_*.parquet` and reads parquet metadata to recover the completed row-group IDs and their actual persisted row counts. `metadata.json` remains the source of truth for the run *configuration* (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), but the filesystem is the source of truth for *progress* (`num_completed_batches`, `actual_num_records`). Splitting the two sources is what lets resume survive a crash between writing a row-group parquet and updating metadata - the filesystem reflects the durable state even when metadata lags by a step. Reading actual row counts also matters for early-shutdown salvage, where a completed parquet file can contain fewer rows than the requested row-group size. Resume tolerates non-contiguous IDs because row groups can complete out of order.
94
82
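The filesystem-as-progress idea can be sketched with the standard library alone. This is a simplified illustration, not the project's resume code: it assumes files are named `batch_<integer>.parquet` with zero-based IDs, the helper names are hypothetical, and it omits reading parquet metadata for the actual row counts.

```python
import re
import tempfile
from pathlib import Path

# Assumption: files are named batch_<integer>.parquet; the real naming may differ.
BATCH_FILE = re.compile(r"batch_(\d+)\.parquet$")


def completed_row_group_ids(dataset_dir: Path) -> list[int]:
    """Return the row-group IDs that have a parquet file on disk, sorted."""
    ids = []
    for path in (dataset_dir / "parquet-files").glob("batch_*.parquet"):
        match = BATCH_FILE.search(path.name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def pending_row_group_ids(completed: list[int], total: int) -> list[int]:
    """IDs still to generate; holes are fine because groups finish out of order."""
    done = set(completed)
    return [i for i in range(total) if i not in done]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "parquet-files").mkdir()
    for i in (0, 1, 3):  # row group 2 never finished before the crash
        (root / "parquet-files" / f"batch_{i}.parquet").touch()
    completed = completed_row_group_ids(root)
    pending = pending_row_group_ids(completed, total=5)
```

Because progress is derived from what is on disk, a stale `metadata.json` does not cause completed row groups to be regenerated.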
95
83
Resume relies on stable row-group boundaries within a run. It treats datasets that have completed `process_after_generation()` as terminal: after-generation processors operate on the whole dataset and can re-chunk rows or change schema, invalidating row-group identity for later resume/extension. The terminal-state check raises a clear `DatasetGenerationError` (not a `TypeError`) when the persisted metadata is missing required fields such as `target_num_records`.
96
84
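The atomic metadata write pattern named in this document (`tmp` file + `fsync` + `os.replace`) can be sketched as below. The helper name, the `.tmp` suffix, and the payload values are assumptions for illustration; the project's actual writer may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    tmp_path = path.with_name(path.name + ".tmp")  # hypothetical temp-file name
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
        handle.flush()
        os.fsync(handle.fileno())  # force bytes to disk before the rename
    os.replace(tmp_path, path)  # atomic rename when both paths share a filesystem


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata.json"
    write_json_atomically(target, {"target_num_records": 100, "config_hash": "abc"})
    write_json_atomically(target, {"target_num_records": 100, "config_hash": "def"})
    loaded = json.loads(target.read_text(encoding="utf-8"))
    leftovers = sorted(p.name for p in Path(tmp).iterdir())
```

Writing to a sibling file and renaming keeps a crash mid-write from leaving a truncated `metadata.json` behind.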
@@ -102,19 +90,6 @@ Metadata writes are atomic (`tmp` file + `fsync` + `os.replace`) because `metada
@@ -136,15 +111,15 @@ When request admission is available, async scheduling may use request-pressure s
136
111
137
112
## Design Decisions
138
113
139
-
- **Dual execution engines behind one API.** The sequential engine is simpler and easier to debug; the async engine adds row-group parallelism for throughput. Users switch via an environment variable without changing their code.
114
+
- **One execution engine behind the API.** The async scheduler handles row-group parallelism, DAG-aware dispatch, resume, and checkpointing for all generation runs.
140
115
- **DAG-driven ordering** ensures columns with dependencies (e.g., a judge column that depends on a text column) are generated in the correct order, regardless of the order they appear in the config.
141
116
- **Fair async admission with bounded borrow by default** keeps the scheduler flowing across ready columns and model groups. `FairTaskQueue.select_next(...)` chooses eligible ready work, `TaskAdmissionController` leases scheduler resources before spawn, and `FairTaskQueue.commit(...)` removes the selected task only after admission succeeds. The default `BoundedBorrowTaskAdmissionPolicyConfig` computes a strict per-group share, lets solo groups borrow only up to a capacity-derived reserve, and makes borrowed groups yield when eligible peer pressure appears. Passing `bounded_borrow=None` selects strict-fair admission for tests and benchmark comparisons. Per-group virtual-time ordering prevents a large ready frontier from degenerating into a column-by-column wave, and scheduler-resource accounting remains separate from provider/model request admission.
142
-
- **Salvage rounds in async mode** retry failed tasks after all other tasks in a round complete, improving resilience against transient LLM failures without blocking the entire generation.
117
+
- **Salvage rounds** retry failed tasks after all other tasks in a round complete, improving resilience against transient LLM failures without blocking the entire generation.
143
118
- **Unified DAG construction.** `topologically_sort_column_configs` (in `execution_graph.py`) determines column ordering using Kahn's algorithm; the runtime `ExecutionGraph` adds strategy-aware dependency tracking for the async scheduler.
144
119
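The salvage-round idea from the design decisions above can be sketched with `asyncio`. This is a toy model rather than `AsyncTaskScheduler`: the `run_with_salvage` name, the `tasks` mapping, and the `max_salvage_rounds` parameter are hypothetical. Each round runs to completion, and only the tasks that failed are retried in the next round.

```python
import asyncio


async def run_with_salvage(tasks: dict, max_salvage_rounds: int = 2) -> tuple[dict, dict]:
    """Run every task, then retry only the failures once the round has drained.

    `tasks` maps a task id to a zero-argument coroutine function (hypothetical shape).
    """
    results, failures = {}, {}
    pending = dict(tasks)
    for _ in range(1 + max_salvage_rounds):
        ids = list(pending)
        # return_exceptions=True lets one failure finish alongside its peers
        # instead of aborting the whole round.
        outcomes = await asyncio.gather(
            *(pending[task_id]() for task_id in ids), return_exceptions=True
        )
        failures = {}
        for task_id, outcome in zip(ids, outcomes):
            if isinstance(outcome, Exception):
                failures[task_id] = outcome
            else:
                results[task_id] = outcome
        pending = {task_id: tasks[task_id] for task_id in failures}
        if not pending:
            break
    return results, failures


attempts = {"flaky": 0}


async def steady() -> str:
    return "ok"


async def flaky() -> str:
    # Fails on the first attempt only, standing in for a transient model error.
    attempts["flaky"] += 1
    if attempts["flaky"] < 2:
        raise RuntimeError("transient model error")
    return "recovered"


results, failures = asyncio.run(run_with_salvage({"a": steady, "b": flaky}))
```

Deferring retries to the end of a round keeps healthy tasks from waiting on a failing one.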
145
120
## Cross-References
146
121
147
-
-[System Architecture](overview.md)— end-to-end data flow
122
+
- [System Architecture](overview.md) - end-to-end data flow
148
123
- [Engine Layer](engine.md) - compilation and generator hierarchy
149
124
- [Models](models.md) - how generators access LLMs
150
125
- [Config Layer](config.md) - column configs and dependency declarations
| `MCPFacade` / `MCPRegistry` | `data-designer-engine` | Tool execution via Model Context Protocol |
35
35
| `ColumnGeneratorRegistry` | `data-designer-engine` | Maps column types to generator implementations |
@@ -42,9 +42,7 @@ Users declare what their data should look like through config objects (columns,
42
42
43
43
2. **Compilation** - `compile_data_designer_config` enriches the config (seed columns, internal UUID column), runs static validation (Jinja references, code columns, processors), and produces a compiled column order via topological sort.
44
44
45
-
3. **Generation** - `DatasetBuilder` instantiates column generators from the registry, then executes one of two paths:
46
-
   - **Sequential** (default): batch loop over columns in topological order. Each generator produces its column via `CELL_BY_CELL` (threaded fan-out) or `FULL_COLUMN` strategy.
47
-
   - **Async** (`DATA_DESIGNER_ASYNC_ENGINE=1`): builds an `ExecutionGraph`, partitions rows into groups, and dispatches tasks via `AsyncTaskScheduler` with `FairTaskQueue` selection, `TaskAdmissionController` scheduler-resource leases, salvage rounds, and per-row-group checkpointing.
45
+
3. **Generation** - `DatasetBuilder` instantiates column generators from the registry, builds an `ExecutionGraph`, partitions rows into groups, and dispatches tasks via `AsyncTaskScheduler` with `FairTaskQueue` selection, `TaskAdmissionController` scheduler-resource leases, salvage rounds, and per-row-group checkpointing.
@@ -54,15 +52,15 @@ Users declare what their data should look like through config objects (columns,
54
52
55
53
- **PEP 420 namespace packages** allow the three packages to be installed independently while sharing the `data_designer` namespace. This enables lighter installs (e.g., config-only for validation tooling) without import conflicts.
56
54
- **Lazy imports throughout** - `__getattr__`-based lazy loading in `data_designer.config` and `data_designer.interface`, plus `lazy_heavy_imports` for numpy/pandas, keep startup fast.
57
-
- **Dual execution engines** share the same `DatasetBuilder` API. The async engine adds row-group parallelism and DAG-aware scheduling without changing the public interface.
55
+
- **Async-only execution** gives `DatasetBuilder` one scheduling path with row-group parallelism and DAG-aware dispatch behind the public interface.
58
56
- **`TaskRegistry` subclasses: one instance per class** - `TaskRegistry.__new__` (`registry/base.py`) ensures a single instance of each concrete registry (column generators, profilers, processors). **`ModelRegistry`** and **`MCPRegistry`** are ordinary classes, constructed per run with injected dependencies. **`PluginRegistry`** (`plugins/registry.py`) uses `__new__` so entry points are discovered once per process.
59
57
60
58
## Cross-References
61
59
62
60
- [Config Layer](config.md) - builder API, column types, model configs, plugin system