diff --git a/README.md b/README.md index 46da974c4..9e4342ae7 100644 --- a/README.md +++ b/README.md @@ -22,11 +22,11 @@ Data Designer helps you create synthetic datasets that go beyond simple LLM prom --- -### πŸ“£ Heads-up: async engine is now the default +### πŸ“£ Heads-up: async engine Data Designer now runs pipelines on a cell-level async engine that overlaps independent columns and adapts concurrency per (provider, model). On most pipelines this is faster with no config changes; on slow self-hosted endpoints, set `inference_parameters.timeout` to your real per-request latency. See [Architecture & Performance β†’ Async Engine](https://docs.nvidia.com/nemo/datadesigner/concepts/architecture-performance#async-engine) for the behaviors worth knowing about. -If you hit anything unexpected, fall back to the legacy sync engine for one transitional release with `DATA_DESIGNER_ASYNC_ENGINE=0`, and please [open an issue](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new) so we can fix the async path. +If you hit anything unexpected, please [open an issue](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new). --- diff --git a/architecture/dataset-builders.md b/architecture/dataset-builders.md index f9bd4d85c..1e9203f24 100644 --- a/architecture/dataset-builders.md +++ b/architecture/dataset-builders.md @@ -1,30 +1,21 @@ # Dataset Builders -The dataset builder subsystem orchestrates the end-to-end generation of a dataset from compiled column configs. It supports two execution modes: a sequential batch loop and an async DAG-based scheduler. +The dataset builder subsystem orchestrates the end-to-end generation of a dataset from compiled column configs using an async DAG-based scheduler. Source: `packages/data-designer-engine/src/data_designer/engine/dataset_builders/` ## Overview -`DatasetBuilder` is the central orchestrator. It receives a compiled `DataDesignerConfig`, instantiates column generators from the registry, and executes them in dependency order. The execution mode is selected by the `DATA_DESIGNER_ASYNC_ENGINE` environment variable. +`DatasetBuilder` is the central orchestrator. It receives a compiled `DataDesignerConfig`, instantiates column generators from the registry, and executes them through `AsyncTaskScheduler`. -Both modes produce the same output: batched parquet files managed by `DatasetBatchManager`, with post-generation processing and profiling. +The scheduler produces row-group parquet files managed by `RowGroupBufferManager`, with post-generation processing and profiling. ## Key Components ### DatasetBuilder -Entry point for generation. `build()` branches: -- **Sequential path** (default): `DatasetBatchManager.start` β†’ batch loop β†’ `_run_batch` per batch β†’ `finish()` β†’ `ProcessorRunner.run_after_generation` β†’ `model_registry.log_model_usage` -- **Async path** (`DATA_DESIGNER_ASYNC_ENGINE=1`): `_prepare_async_run` β†’ `AsyncTaskScheduler.run()` β†’ telemetry and metadata - -### Sequential Execution (`_run_batch`) - -Iterates compiled column order. For each generator: -1. `log_pre_generation()` β€” logs model and optional MCP tool alias -2. **From-scratch generators** (empty buffer): `generate_from_scratch` β†’ optional `run_pre_batch` after first seed column -3. **`CELL_BY_CELL` generators**: `_fan_out_with_threads` or `_fan_out_with_async` β€” parallel cell generation -4. **`FULL_COLUMN` generators**: `generate` on the whole batch DataFrame; output row count must match input row count +Entry point for generation. `build()` runs: +- `_prepare_async_run` β†’ `AsyncTaskScheduler.run()` β†’ telemetry and metadata ### Async Execution (`_build_async`) @@ -44,7 +35,7 @@ Preparation (`_prepare_async_run`): - `GenerationStrategy` per column (CELL_BY_CELL or FULL_COLUMN) - Kahn topological sort for execution order - `split_upstream_by_strategy` β€” separates batch-level from cell-level dependencies -- Skip metadata per column β€” `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` β€” queried at runtime by both engines to evaluate skip decisions +- Skip metadata per column β€” `get_skip_config`, `should_propagate_skip`, `get_required_columns`, and `get_side_effect_columns` β€” queried at runtime to evaluate skip decisions ### CompletionTracker @@ -64,21 +55,18 @@ Columns can be conditionally skipped per-row via `SkipConfig` (defined in `data_ Skip evaluation is handled by two utility modules: - **`skip_evaluator.py`** β€” `evaluate_skip_when` renders the expression in a `NativeSandboxedEnvironment` (native Python types, `StrictUndefined`). `should_skip_by_propagation` checks set intersection between required columns and skipped columns. -- **`skip_tracker.py`** β€” manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet (called by `DatasetBatchManager`, `RowGroupBufferManager`, and inline in both engines). +- **`skip_tracker.py`** β€” manages the `__internal_skipped_columns` metadata key on record dicts. Each record carries a `__internal_skipped_columns` set listing which columns were skipped for that row. `apply_skip_to_record` adds the column name to that set, writes the skip value into the cell, and clears any side-effect columns. `strip_skip_metadata_from_records` removes the `__internal_skipped_columns` key before DataFrame construction so it never reaches parquet. -Both execution modes integrate skip at the same points: +`_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record`. Skipped cells report as skipped (not success) in progress tracking. -- **Sequential**: `_run_full_column_generator` and the fan-out methods (`_fan_out_with_threads`, `_fan_out_with_async`) call `_should_skip_cell` per record. Skipped rows are excluded from the generator input, then merged back with skip metadata preserved. A fast `_column_can_skip` check short-circuits the per-record evaluation when no skip config or propagation applies. -- **Async**: `_run_cell` and `_run_batch` in `AsyncTaskScheduler` call `_should_skip_record` / `_apply_skip_to_record` with the same logic. Skipped cells report as skipped (not success) in progress tracking. +DAG edges are added for `skip.when` column references in both `topologically_sort_column_configs` (compile-time sort) and `ExecutionGraph.create` (runtime graph) so skip-gate columns are generated before the gated column. -DAG edges are added for `skip.when` column references in both `topologically_sort_column_configs` (compile-time sort) and `ExecutionGraph.create` (async runtime) so skip-gate columns are generated before the gated column. +### RowGroupBufferManager -### DatasetBatchManager - -Manages in-memory row buffers and persistence: -- `finish_batch` β†’ writes parquet via `ArtifactStorage` -- Updates dataset metadata between batches -- The async path uses `RowGroupBufferManager` for per-row-group DataFrames and checkpointing +Manages per-row-group DataFrames and persistence: +- `checkpoint_row_group` β†’ writes parquet via `ArtifactStorage` +- Updates dataset metadata between row groups +- Tracks dropped rows and actual record counts for resume ### Resume Checkpointing @@ -90,7 +78,7 @@ Manages in-memory row buffers and persistence: Checkpoint state lives in `metadata.json`. Each metadata write includes the config fingerprint (`config_hash`, `config_hash_algo`, and `config_hash_version`) so compatibility checks do not need to deserialize `builder_config.json` for the common path. `builder_config.json` remains the human-readable record of the run configuration and the fallback for older datasets. -Both engines resume the same way: they scan `parquet-files/batch_*.parquet` and read parquet metadata to recover the completed row-group IDs and their actual persisted row counts. `metadata.json` remains the source of truth for the run *configuration* (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), but the filesystem is the source of truth for *progress* (`num_completed_batches`, `actual_num_records`). Splitting the two sources is what lets resume survive a crash between writing a batch parquet and updating metadata β€” the filesystem reflects the durable state even when metadata lags by a step. Reading actual row counts also matters for async early-shutdown salvage, where a completed parquet file can contain fewer rows than the requested row-group size. The async engine tolerates non-contiguous IDs because row groups can complete out of order; the sync engine writes batches sequentially and rejects holes (likely external mutation or a directory written by an incompatible engine). +Resume scans `parquet-files/batch_*.parquet` and reads parquet metadata to recover the completed row-group IDs and their actual persisted row counts. `metadata.json` remains the source of truth for the run *configuration* (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), but the filesystem is the source of truth for *progress* (`num_completed_batches`, `actual_num_records`). Splitting the two sources is what lets resume survive a crash between writing a row-group parquet and updating metadata - the filesystem reflects the durable state even when metadata lags by a step. Reading actual row counts also matters for early-shutdown salvage, where a completed parquet file can contain fewer rows than the requested row-group size. Resume tolerates non-contiguous IDs because row groups can complete out of order. Resume relies on stable row-group boundaries within a run. It treats datasets that have completed `process_after_generation()` as terminal: after-generation processors operate on the whole dataset and can re-chunk rows or change schema, invalidating row-group identity for later resume/extension. The terminal-state check raises a clear `DatasetGenerationError` (not a `TypeError`) when the persisted metadata is missing required fields such as `target_num_records`. @@ -102,19 +90,6 @@ Metadata writes are atomic (`tmp` file + `fsync` + `os.replace`) because `metada ## Data Flow -### Sequential -``` -DatasetBuilder.build() - β†’ DatasetBatchManager.start() - β†’ for each batch: - β†’ for each generator (topological order): - β†’ generate_from_scratch / generate (FULL_COLUMN) / fan_out (CELL_BY_CELL) - β†’ DatasetBatchManager.finish_batch() β†’ parquet - β†’ ProcessorRunner.run_after_generation() - β†’ model_registry.log_model_usage() -``` - -### Async ``` DatasetBuilder.build() β†’ _build_async() @@ -136,15 +111,15 @@ When request admission is available, async scheduling may use request-pressure s ## Design Decisions -- **Dual execution engines behind one API.** The sequential engine is simpler and easier to debug; the async engine adds row-group parallelism for throughput. Users switch via an environment variable without changing their code. +- **One execution engine behind the API.** The async scheduler handles row-group parallelism, DAG-aware dispatch, resume, and checkpointing for all generation runs. - **DAG-driven ordering** ensures columns with dependencies (e.g., a judge column that depends on a text column) are generated in the correct order, regardless of the order they appear in the config. - **Fair async admission with bounded borrow by default** keeps the scheduler flowing across ready columns and model groups. `FairTaskQueue.select_next(...)` chooses eligible ready work, `TaskAdmissionController` leases scheduler resources before spawn, and `FairTaskQueue.commit(...)` removes the selected task only after admission succeeds. The default `BoundedBorrowTaskAdmissionPolicyConfig` computes a strict per-group share, lets solo groups borrow only up to a capacity-derived reserve, and makes borrowed groups yield when eligible peer pressure appears. Passing `bounded_borrow=None` selects strict-fair admission for tests and benchmark comparisons. Per-group virtual-time ordering prevents a large ready frontier from degenerating into a column-by-column wave, and scheduler-resource accounting remains separate from provider/model request admission. -- **Salvage rounds in async mode** retry failed tasks after all other tasks in a round complete, improving resilience against transient LLM failures without blocking the entire generation. +- **Salvage rounds** retry failed tasks after all other tasks in a round complete, improving resilience against transient LLM failures without blocking the entire generation. - **Unified DAG construction.** `topologically_sort_column_configs` (in `execution_graph.py`) determines column ordering using Kahn's algorithm; the runtime `ExecutionGraph` adds strategy-aware dependency tracking for the async scheduler. ## Cross-References -- [System Architecture](overview.md) β€” end-to-end data flow +- [System Architecture](overview.md) - end-to-end data flow - [Engine Layer](engine.md) β€” compilation and generator hierarchy - [Models](models.md) β€” how generators access LLMs - [Config Layer](config.md) β€” column configs and dependency declarations diff --git a/architecture/overview.md b/architecture/overview.md index 10bde6c90..35c53e591 100644 --- a/architecture/overview.md +++ b/architecture/overview.md @@ -29,7 +29,7 @@ Users declare what their data should look like through config objects (columns, |-----------|---------|-------------| | `DataDesigner` | `data-designer` | Public API β€” `create()`, `preview()`, `validate()` | | `DataDesignerConfigBuilder` | `data-designer-config` | Fluent builder for dataset configs | -| `DatasetBuilder` | `data-designer-engine` | Orchestrates generation (sync or async) | +| `DatasetBuilder` | `data-designer-engine` | Orchestrates async generation | | `ModelFacade` / `ModelRegistry` | `data-designer-engine` | LLM client abstraction with retry, request admission, usage tracking | | `MCPFacade` / `MCPRegistry` | `data-designer-engine` | Tool execution via Model Context Protocol | | `ColumnGeneratorRegistry` | `data-designer-engine` | Maps column types to generator implementations | @@ -42,9 +42,7 @@ Users declare what their data should look like through config objects (columns, 2. **Compilation** β€” `compile_data_designer_config` enriches the config (seed columns, internal UUID column), runs static validation (Jinja references, code columns, processors), and produces a compiled column order via topological sort. -3. **Generation** β€” `DatasetBuilder` instantiates column generators from the registry, then executes one of two paths: - - **Sequential** (default): batch loop over columns in topological order. Each generator produces its column via `CELL_BY_CELL` (threaded fan-out) or `FULL_COLUMN` strategy. - - **Async** (`DATA_DESIGNER_ASYNC_ENGINE=1`): builds an `ExecutionGraph`, partitions rows into groups, and dispatches tasks via `AsyncTaskScheduler` with `FairTaskQueue` selection, `TaskAdmissionController` scheduler-resource leases, salvage rounds, and per-row-group checkpointing. +3. **Generation** β€” `DatasetBuilder` instantiates column generators from the registry, builds an `ExecutionGraph`, partitions rows into groups, and dispatches tasks via `AsyncTaskScheduler` with `FairTaskQueue` selection, `TaskAdmissionController` scheduler-resource leases, salvage rounds, and per-row-group checkpointing. 4. **Post-processing** β€” `ProcessorRunner` applies transformations (pre-batch, post-batch, after-generation). Profilers analyze the generated dataset. @@ -54,7 +52,7 @@ Users declare what their data should look like through config objects (columns, - **PEP 420 namespace packages** allow the three packages to be installed independently while sharing the `data_designer` namespace. This enables lighter installs (e.g., config-only for validation tooling) without import conflicts. - **Lazy imports throughout** β€” `__getattr__`-based lazy loading in `data_designer.config` and `data_designer.interface`, plus `lazy_heavy_imports` for numpy/pandas, keep startup fast. -- **Dual execution engines** share the same `DatasetBuilder` API. The async engine adds row-group parallelism and DAG-aware scheduling without changing the public interface. +- **Async-only execution** gives `DatasetBuilder` one scheduling path with row-group parallelism and DAG-aware dispatch behind the public interface. - **`TaskRegistry` subclasses: one instance per class** β€” `TaskRegistry.__new__` (`registry/base.py`) ensures a single instance of each concrete registry (column generators, profilers, processors). **`ModelRegistry`** and **`MCPRegistry`** are ordinary classes, constructed per run with injected dependencies. **`PluginRegistry`** (`plugins/registry.py`) uses `__new__` so entry points are discovered once per process. ## Cross-References @@ -62,7 +60,7 @@ Users declare what their data should look like through config objects (columns, - [Config Layer](config.md) β€” builder API, column types, model configs, plugin system - [Engine Layer](engine.md) β€” compilation, generators, registries - [Models](models.md) β€” model facade, adapters, retry, request admission -- [Dataset Builders](dataset-builders.md) β€” sync/async orchestration, DAG, batching +- [Dataset Builders](dataset-builders.md) β€” async orchestration, DAG, row groups - [MCP](mcp.md) β€” tool execution, session pooling - [Sampling](sampling.md) β€” statistical generators, person/entity data - [CLI](cli.md) β€” command structure, controller/service/repo pattern diff --git a/fern/versions/latest/pages/concepts/architecture-and-performance.mdx b/fern/versions/latest/pages/concepts/architecture-and-performance.mdx index 3b71e3ce5..0e37e0e49 100644 --- a/fern/versions/latest/pages/concepts/architecture-and-performance.mdx +++ b/fern/versions/latest/pages/concepts/architecture-and-performance.mdx @@ -50,36 +50,36 @@ This guide explains the architecture, execution model, and how to tune performan ## Execution Model - - The default execution path is the **async engine**, which dispatches work at the cell level and overlaps independent columns β€” see [Async Engine](#async-engine) below for its semantics. The legacy **sync engine** is still available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0` and is what this section describes. The configuration knobs documented below (`buffer_size`, `max_parallel_requests`, AIMD throttle config, error handling) apply to both engines; the differences are flagged inline. + + Data Designer uses the **async engine**, which dispatches work at the cell level and overlaps independent columns. The configuration knobs documented below (`buffer_size`, `max_parallel_requests`, AIMD throttle config, error handling) apply to that engine. -The sync engine processes datasets in **batches**, with **parallel** operations within each batch. +The async engine processes datasets in **row groups**, with **parallel** operations across ready cells and independent columns. -### How It Works (sync engine) +### How It Works -**Step 1: Split into batches** +**Step 1: Split into row groups** -Your dataset is divided into batches of `buffer_size` records. Each batch is processed completely before moving to the next. +Your dataset is divided into row groups of `buffer_size` records. Row groups checkpoint independently so interrupted runs can resume from durable parquet files. -**Step 2: Process columns sequentially** +**Step 2: Schedule ready work** -Within a batch, columns are generated one at a time following the dependency graph. The order depends on column dependenciesβ€”expression columns may come before LLM columns if the LLM columns depend on them. (The async engine relaxes this: columns whose per-cell dependencies are satisfied can run concurrently with columns earlier in the order.) +Data Designer builds an execution graph from column dependencies. Cells and full-column tasks whose dependencies are satisfied can run concurrently, while order-dependent columns are protected by scheduler locks. Example workflow: ``` -Batch 1 (100 records) +Row group 1 (100 records) β”‚ β”œβ”€β–Ί Column 1: category (Sampler) ──── All 100 values generated β”œβ”€β–Ί Column 2: prompt (LLM Text) ──── All 100 values generated β”œβ”€β–Ί Column 3: response (LLM Text) ──── All 100 values generated β”œβ”€β–Ί Column 4: score (Expression) ──── All 100 values computed β”‚ -└─► Write batch to disk +└─► Checkpoint row group to disk β”‚ β–Ό -Batch 2 (100 records) +Row group 2 (100 records) ...repeat... ``` @@ -97,9 +97,9 @@ Within each column, cells are processed **in parallel** up to the configured lim | Concept | Description | |---------|-------------| -| **Batching** | Records are split into batches of `buffer_size`. In the sync engine, each batch completes entirely before the next begins; in the async engine, multiple row groups (the async equivalent) can be in flight concurrently. | -| **Sequential columns** | Sync-engine only: columns within a batch are generated one at a time, respecting the dependency graph. The async engine schedules at the cell level instead. | -| **Parallel cells** | Within a column, individual cells (records) are generated in parallel up to the configured limit. Same on both engines. | +| **Row groups** | Records are split into checkpointable groups of `buffer_size`; multiple row groups can be in flight concurrently. | +| **Dependency graph** | Columns and cells become runnable when their upstream dependencies are complete. | +| **Parallel cells** | Individual cells are generated in parallel up to the configured limit. | ### Concurrency Formula @@ -107,7 +107,7 @@ At any moment, the number of concurrent LLM requests is: ```python concurrent_requests = min( - buffer_size, # Records in current batch + buffer_size, # Records in an active row group current_throttle_limit, # AIMD-managed limit (≀ max_parallel_requests) remaining_cells_in_column # Cells left to generate ) @@ -121,10 +121,6 @@ concurrent_requests = min( This means Data Designer automatically finds the right concurrency level for your server without manual tuning. - - AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below. - - **Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up. --- @@ -133,7 +129,7 @@ This means Data Designer automatically finds the right concurrency level for you ### `buffer_size` (RunConfig) -Controls how many records are processed per batch. +Controls how many records are processed per row group. ```python import data_designer.config as dd @@ -179,7 +175,7 @@ Resume modes: - `ResumeMode.ALWAYS`: resume the existing dataset directory. Raises if the checkpoint is incompatible or cannot be resumed safely. - `ResumeMode.IF_POSSIBLE`: resume when the stored config fingerprint matches the current config; otherwise start a fresh timestamped run. -Data Designer stores the run configuration in `metadata.json` (`buffer_size`, `target_num_records`, config fingerprint) and `builder_config.json`. Both engines recover progress the same way: they scan completed `batch_*.parquet` row groups and read parquet metadata for the row count actually persisted. That keeps resume crash-safe even if a run was interrupted between writing a batch parquet and updating metadata, because the filesystem reflects the durable state even when metadata lags by a step. +Data Designer stores the run configuration in `metadata.json` (`buffer_size`, `target_num_records`, config fingerprint) and `builder_config.json`. Resume scans completed `batch_*.parquet` row groups and reads parquet metadata for the row count actually persisted. That keeps resume crash-safe even if a run was interrupted between writing a row-group parquet and updating metadata, because the filesystem reflects the durable state even when metadata lags by a step. Resume has a few important invariants: @@ -216,7 +212,7 @@ model = dd.ModelConfig( **Default**: 4 -**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. With AIMD, setting an aggressively high value is safer than before β€” the system will self-correct downward if the server can't keep up. The salvage queue on the async engine (default) reclaims failed rows; on the sync engine the initial burst of 429s before AIMD stabilizes can drop rows, so start with a more conservative ceiling if you've opted into sync. +**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. With AIMD, setting an aggressively high value is safer than before β€” the system will self-correct downward if the server can't keep up, and salvage rounds can reclaim transiently failed rows. **When to decrease**: You want to cap resource usage to a known safe level, or you want more predictable/debuggable execution. @@ -250,10 +246,6 @@ designer.set_run_config(run_config) Data Designer uses an AIMD (Additive Increase / Multiplicative Decrease) controller to automatically adjust concurrency per model based on rate-limit feedback from the inference server. The defaults work well for most workloads. Override them via `ThrottleConfig` only when you understand the trade-offs. - - Adaptive throttling is fully active on the default **async engine**, where 429 responses propagate directly to the AIMD controller. On the legacy **sync engine** (`DATA_DESIGNER_ASYNC_ENGINE=0`), 429s are first retried at the HTTP transport layer; `ThrottleConfig` settings only take effect as a fallback if transport retries are exhausted. - - ```python import data_designer.config as dd from data_designer.interface import DataDesigner @@ -316,7 +308,7 @@ designer.set_run_config(run_config) ## Async Engine -The async engine is the default execution path. It dispatches work at the cell level rather than the column level, so independent columns overlap in time and per-(provider, model) AIMD pools tune themselves independently. See the [Async All the Way Down](/dev-notes/async-all-the-way-down) dev note for the full architecture. +The async engine is the execution path. It dispatches work at the cell level rather than the column level, so independent columns overlap in time and per-(provider, model) AIMD pools tune themselves independently. See the [Async All the Way Down](/dev-notes/async-all-the-way-down) dev note for the full architecture. ### Per-model timeouts drive every deadline @@ -353,18 +345,6 @@ except DataDesignerEarlyShutdownError: ... ``` -### Opting out - - - `DATA_DESIGNER_ASYNC_ENGINE=0` selects the legacy sync engine. This is a deprecated escape hatch for the transitional release and will be removed in a future version. The opt-out also emits a `DeprecationWarning` at run time so it shows up in your logs. - - -```bash -DATA_DESIGNER_ASYNC_ENGINE=0 python my_pipeline.py -``` - ---- - ## Common Problems | Problem | Symptom | Solution | diff --git a/fern/versions/latest/pages/concepts/custom_columns.mdx b/fern/versions/latest/pages/concepts/custom_columns.mdx index 21e163c17..2fc9ce8a2 100644 --- a/fern/versions/latest/pages/concepts/custom_columns.mdx +++ b/fern/versions/latest/pages/concepts/custom_columns.mdx @@ -59,7 +59,7 @@ Model aliases are validated before generation starts. If an alias doesn't exist For `full_column`, set `generation_strategy=dd.GenerationStrategy.FULL_COLUMN`. - Sync `cell_by_cell` generators are dispatched concurrently across rows under the async engine. Module-level mutable state (counters, caches, non-thread-safe HTTP clients) needs synchronization or per-row instantiation. For network-bound work, prefer `async def fn(row)` β€” the engine runs it directly on its event loop and skips the thread bridge. + Synchronous `cell_by_cell` generator functions are dispatched concurrently across rows. Module-level mutable state (counters, caches, non-thread-safe HTTP clients) needs synchronization or per-row instantiation. For network-bound work, prefer `async def fn(row)` β€” the engine runs it directly on its event loop and skips the thread bridge. ## The Decorator @@ -150,7 +150,7 @@ from data_designer.engine.models.facade import ModelFacade mock_model = MagicMock(spec=ModelFacade) ``` -Mocking only `generate()` will silently no-op under the async engine because the bridge routes through `agenerate()`. +Mocking only `generate()` will silently no-op because the bridge routes through `agenerate()`. ## See Also diff --git a/fern/versions/latest/pages/concepts/processors.mdx b/fern/versions/latest/pages/concepts/processors.mdx index 6d0243591..6751eabcc 100644 --- a/fern/versions/latest/pages/concepts/processors.mdx +++ b/fern/versions/latest/pages/concepts/processors.mdx @@ -31,7 +31,7 @@ Processors can run at three stages, determined by which callback methods they im - Data Designer enforces row-count invariance in `process_before_batch()`. The async engine also enforces it in `process_after_batch()`. Run row-filtering or expansion logic in `process_after_generation()`, which operates on the final dataset and supports row-count changes, or put the transform at a workflow boundary. + Data Designer enforces row-count invariance in `process_before_batch()` and `process_after_batch()`. Run row-filtering or expansion logic in `process_after_generation()`, which operates on the final dataset and supports row-count changes, or put the transform at a workflow boundary. diff --git a/fern/versions/latest/pages/devnotes/posts/async-all-the-way-down.mdx b/fern/versions/latest/pages/devnotes/posts/async-all-the-way-down.mdx index 631d1ab18..bf503586a 100644 --- a/fern/versions/latest/pages/devnotes/posts/async-all-the-way-down.mdx +++ b/fern/versions/latest/pages/devnotes/posts/async-all-the-way-down.mdx @@ -206,7 +206,7 @@ Multi-model pipelines are where the architecture pays for itself. With independe ### **Adoption** -Adoption is opt-in. Set `DATA_DESIGNER_ASYNC_ENGINE=1` in your environment. Your existing pipeline definitions, dependency graph, column configs, model aliases all stay the same. We're keeping it behind an environment variable while we harden edge cases, but the goal is to make async the default. +The async engine is now Data Designer's execution engine. Your existing pipeline definitions, dependency graph, column configs, and model aliases stay the same. --- @@ -239,13 +239,7 @@ Generator authors implement whichever method is natural β€” sync for CPU-bound w ## **Try It** -Enable the async engine on any existing pipeline by setting an environment variable: - -```bash -DATA_DESIGNER_ASYNC_ENGINE=1 python my_pipeline.py -``` - -Pair it with the new progress bars for real-time feedback: +Use the progress bars for real-time feedback: ```py from data_designer.config.run_config import RunConfig @@ -269,7 +263,7 @@ The dependencies were always per-cell. Now the engine schedules them that way. ## **Update** -The async engine is now the default execution path. Set `DATA_DESIGNER_ASYNC_ENGINE=0` to opt back into the legacy sync engine for one transitional release. The [Architecture & Performance](/concepts/architecture-and-performance#async-engine) page covers the configuration knobs and behaviors worth knowing about. +The async engine is now the only execution path. The [Architecture & Performance](/concepts/architecture-and-performance#async-engine) page covers the configuration knobs and behaviors worth knowing about. --- diff --git a/fern/versions/latest/pages/devnotes/posts/owning-the-model-stack.mdx b/fern/versions/latest/pages/devnotes/posts/owning-the-model-stack.mdx index cfb03aa3a..15e2d8b02 100644 --- a/fern/versions/latest/pages/devnotes/posts/owning-the-model-stack.mdx +++ b/fern/versions/latest/pages/devnotes/posts/owning-the-model-stack.mdx @@ -136,7 +136,7 @@ Why? Because if the retry layer swallows 429s, the throttle manager never learns The split is clean and worth remembering. Transport retries handle *server problems*. Throttle adaptation handles *capacity problems*. The provider is working fine, you're just sending too many requests. Conflating the two is how you get retry storms. -One caveat: this boundary behaves differently depending on the execution mode. In async mode (currently experimental, enabled with `DATA_DESIGNER_ASYNC_ENGINE=1`), 429s bypass transport retries entirely and flow straight to `ThrottledModelClient` for AIMD feedback β€” this is the full adaptive loop described above. In sync mode, 429s are retried at the transport layer since there's no salvage queue to re-attempt failed rows. AIMD is still wired up but only fires if all transport retries are exhausted. This is temporary β€” once the async engine graduates from experimental, it will become the default path and the sync codepath will be retired. See [Async All the Way Down](/dev-notes/async-all-the-way-down) for the full story on the async engine. +Data Designer now uses the async engine for generation, so 429s bypass transport retries and flow straight to `ThrottledModelClient` for AIMD feedback. See [Async All the Way Down](/dev-notes/async-all-the-way-down) for the full story on the async engine. ## **Configuration** diff --git a/packages/data-designer-config/src/data_designer/config/run_config.py b/packages/data-designer-config/src/data_designer/config/run_config.py index 654edde84..589f38e75 100644 --- a/packages/data-designer-config/src/data_designer/config/run_config.py +++ b/packages/data-designer-config/src/data_designer/config/run_config.py @@ -130,9 +130,8 @@ class RunConfig(ConfigBase): early shutdown is enabled. Default is 0.5. shutdown_error_window: Minimum number of completed tasks before error rate monitoring begins. Must be >= 1. Default is 10. - buffer_size: Number of records to process in each batch during dataset generation. - A batch is processed end-to-end (column generation, post-batch processors, and writing the batch - to artifact storage) before moving on to the next batch. Must be > 0. Default is 1000. + buffer_size: Number of records in each row group during dataset generation. + Must be > 0. Default is 1000. max_in_flight_tasks: Maximum number of async scheduler tasks that may hold task leases at once. Tasks may be executing, awaiting I/O, or waiting on model request admission. Model API request concurrency is controlled separately by @@ -144,8 +143,7 @@ class RunConfig(ConfigBase): max_conversation_correction_steps: Maximum number of correction rounds permitted within a single conversation when generation tasks call `ModelFacade.generate(...)`. Must be >= 0. Default is 0. - async_trace: If True, collect per-task tracing data when using the async engine - (DATA_DESIGNER_ASYNC_ENGINE=1). Has no effect on the sync path. Default is False. + async_trace: If True, collect per-task tracing data. Default is False. progress_bar: If True, display sticky ANSI progress bars instead of periodic log lines during generation. Requires a TTY; falls back to log lines in non-TTY environments. Default is False. diff --git a/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py b/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py index 8479a6e06..dc3aab87c 100644 --- a/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py +++ b/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py @@ -50,12 +50,12 @@ def _compute_bridge_timeout( class _AsyncBridgedModelFacade: - """Proxy that bridges ``model.generate()`` to ``model.agenerate()`` in async engine mode. + """Proxy that bridges ``model.generate()`` to ``model.agenerate()``. - When a sync custom column runs inside ``asyncio.to_thread`` under the async engine, - the sync HTTP client is unavailable. This proxy intercepts the resulting - ``SyncClientUnavailableError`` and schedules ``agenerate()`` on the engine's persistent - event loop via ``run_coroutine_threadsafe``. + When a synchronous custom column runs inside ``asyncio.to_thread``, the sync + HTTP client is unavailable. This proxy intercepts the resulting + ``SyncClientUnavailableError`` and schedules ``agenerate()`` on the engine's + persistent event loop via ``run_coroutine_threadsafe``. All other attributes are forwarded to the underlying facade unchanged. """ @@ -82,7 +82,7 @@ def generate(self, *args: Any, **kwargs: Any) -> tuple[Any, list]: pass # No running loop - safe to bridge else: raise RuntimeError( - "model.generate() is not available in async engine mode from the event loop. " + "model.generate() is not available from the event loop. " "Use 'await model.agenerate()' in async custom columns." ) @@ -153,7 +153,7 @@ def get_scheduling_metadata(self) -> SchedulingMetadata: def get_generation_strategy(self) -> GenerationStrategy: """Return strategy based on config.""" - return self.config.generation_strategy + return GenerationStrategy(self.config.generation_strategy) def generate(self, data: dict | pd.DataFrame) -> dict | pd.DataFrame: """Generate column value(s) for a row (dict) or batch (DataFrame).""" diff --git a/packages/data-designer-engine/src/data_designer/engine/context.py b/packages/data-designer-engine/src/data_designer/engine/context.py index 621391617..44764efe1 100644 --- a/packages/data-designer-engine/src/data_designer/engine/context.py +++ b/packages/data-designer-engine/src/data_designer/engine/context.py @@ -5,8 +5,7 @@ from contextvars import ContextVar -# Set per row group by both engines: the async scheduler sets it before each -# task executes, and the sync engine's ``_run_batch`` sets it for each batch. +# Set per row group by the async scheduler before each task executes. # Value: (current_rg_index, total_rg_count) or None. current_row_group: ContextVar[tuple[int, int] | None] = ContextVar("current_row_group", default=None) diff --git a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py index ace57190f..155aa48be 100644 --- a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py +++ b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py @@ -1823,10 +1823,10 @@ async def _run_from_scratch(self, task: Task, generator: ColumnGenerator) -> Any result_operation = "From-scratch generator" else: # Non-FromScratch generators dispatched as seeds (no upstream columns) - # operate on existing buffer rows β€” same contract as the sync engine's - # FULL_COLUMN path. Pass an ``rg_size``-row snapshot so the generator - # produces ``rg_size`` rows back, instead of an empty DataFrame which - # would yield zero values and fail ``update_batch``. + # operate on existing buffer rows. Pass an ``rg_size``-row snapshot + # so the generator produces ``rg_size`` rows back, instead of an + # empty DataFrame which would yield zero values and fail + # ``update_batch``. if self._buffer_manager is not None: records = [self._buffer_manager.get_row(task.row_group, ri) for ri in range(rg_size)] input_df = lazy.pd.DataFrame(records) diff --git a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py index 478d272aa..199ed9c7b 100644 --- a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py +++ b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py @@ -3,7 +3,7 @@ from __future__ import annotations -import contextlib +import asyncio import functools import json import logging @@ -27,14 +27,12 @@ ) from data_designer.config.utils.type_helpers import StrEnum from data_designer.config.version import get_library_version -from data_designer.engine import flags from data_designer.engine.column_generators.generators.base import ( ColumnGenerator, - ColumnGeneratorWithModel, GenerationStrategy, ) from data_designer.engine.compiler import compile_data_designer_config -from data_designer.engine.context import current_row_group, current_row_group_start_offset +from data_designer.engine.dataset_builders.async_scheduler import AsyncTaskScheduler from data_designer.engine.dataset_builders.errors import DatasetGenerationError from data_designer.engine.dataset_builders.multi_column_configs import MultiColumnConfig from data_designer.engine.dataset_builders.row_group_plan import ( @@ -43,21 +41,12 @@ RowGroupPlanLike, normalize_row_group_plan, ) -from data_designer.engine.dataset_builders.utils.concurrency import ConcurrentThreadExecutor +from data_designer.engine.dataset_builders.scheduling.completion import CompletionTracker, FrontierDelta +from data_designer.engine.dataset_builders.utils.async_concurrency import ensure_async_engine_loop from data_designer.engine.dataset_builders.utils.config_compiler import compile_dataset_builder_column_configs -from data_designer.engine.dataset_builders.utils.dataset_batch_manager import DatasetBatchManager from data_designer.engine.dataset_builders.utils.execution_graph import ExecutionGraph from data_designer.engine.dataset_builders.utils.processor_runner import ProcessorRunner, ProcessorStage -from data_designer.engine.dataset_builders.utils.progress_tracker import ProgressTracker -from data_designer.engine.dataset_builders.utils.skip_evaluator import should_skip_column_for_record -from data_designer.engine.dataset_builders.utils.skip_tracker import ( - SKIPPED_COLUMNS_RECORD_KEY, - apply_skip_to_record, - prepare_records_for_skip_metadata_round_trip, - restore_skip_metadata, - strip_skip_metadata_from_records, -) -from data_designer.engine.dataset_builders.utils.sticky_progress_bar import StickyProgressBar +from data_designer.engine.dataset_builders.utils.row_group_buffer import RowGroupBufferManager from data_designer.engine.models.clients.adapters.http_model_client import ClientConcurrencyMode from data_designer.engine.models.telemetry import InferenceEvent, NemoSourceEnum, TaskStatusEnum, TelemetryHandler from data_designer.engine.processing.processors.base import Processor @@ -77,30 +66,11 @@ import pandas as pd from data_designer.config.run_config import RunConfig - from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModelRegistry from data_designer.engine.dataset_builders.scheduling.task_model import TaskTrace from data_designer.engine.models.usage import ModelUsageStats logger = logging.getLogger(__name__) -# The async-engine flag now lives in ``data_designer.engine.flags`` so the -# engine, the public interface, and the readiness module can share one source -# of truth. Always read ``flags.DATA_DESIGNER_ASYNC_ENGINE`` rather than caching -# a local copy so monkeypatches in tests are visible. - -if flags.DATA_DESIGNER_ASYNC_ENGINE: - import asyncio - - from data_designer.engine.dataset_builders.async_scheduler import ( - AsyncTaskScheduler, - ) - from data_designer.engine.dataset_builders.scheduling.completion import CompletionTracker, FrontierDelta - from data_designer.engine.dataset_builders.utils.async_concurrency import ( - AsyncConcurrentExecutor, - ensure_async_engine_loop, - ) - from data_designer.engine.dataset_builders.utils.row_group_buffer import RowGroupBufferManager - _CLIENT_VERSION: str = get_library_version() PRESERVE_DROPPED_COLUMNS_METADATA_KEY = "preserve_dropped_columns" @@ -186,17 +156,13 @@ def __init__( resource_provider: ResourceProvider, registry: DataDesignerRegistry | None = None, ): - self.batch_manager = DatasetBatchManager(resource_provider.artifact_storage) self._resource_provider = resource_provider - self._records_to_drop: set[int] = set() self._task_traces: list[TaskTrace] = [] self._registry = registry or DataDesignerRegistry() self._graph: ExecutionGraph | None = None - self._use_async: bool = flags.DATA_DESIGNER_ASYNC_ENGINE # Structured signal: set by _build_async if the scheduler hit early shutdown. - # Stays at defaults for sync-engine and successful async runs. Reset at - # the start of each public run path so reused builder instances don't - # leak state across runs. + # Reset at the start of each public run path so reused builder instances + # don't leak state across runs. self._early_shutdown: bool = False self._partial_row_groups: tuple[int, ...] = () # Number of records actually written by the most recent async run. @@ -293,12 +259,12 @@ def build( resume: Controls how interrupted runs are handled. - ``ResumeMode.NEVER`` (default): always start a fresh generation run. - - ``ResumeMode.ALWAYS``: resume from the last completed batch (sync) or row group - (async). ``buffer_size`` must match the original run. ``num_records`` may be - equal to or greater than what was already generated (you can extend the dataset); - ``num_records`` less than actual records so far raises ``DatasetGenerationError``. - If no checkpoint exists yet (interrupted before the first batch finished), silently - restarts from the beginning. Raises if the stored config is incompatible. + - ``ResumeMode.ALWAYS``: resume from the last completed row group. ``buffer_size`` + must match the original run. ``num_records`` may be equal to or greater than what + was already generated (you can extend the dataset); ``num_records`` less than + actual records so far raises ``DatasetGenerationError``. If no checkpoint exists + yet (interrupted before the first row group finished), silently restarts from the + beginning. Raises if the stored config is incompatible. - ``ResumeMode.IF_POSSIBLE``: like ``ALWAYS`` when the current config fingerprint matches the stored config; otherwise starts a fresh run without raising an error. @@ -309,12 +275,11 @@ def build( Path to the generated dataset directory. """ self._reset_run_state() - self._use_async = flags.DATA_DESIGNER_ASYNC_ENGINE run_readiness_check( self.single_column_configs, self._resource_provider, - client_concurrency_mode=ClientConcurrencyMode.ASYNC if self._use_async else ClientConcurrencyMode.SYNC, + client_concurrency_mode=ClientConcurrencyMode.ASYNC, ) # For IF_POSSIBLE and ALWAYS: check config compatibility before touching the artifact @@ -367,34 +332,18 @@ def build( buffer_size = self._resource_provider.run_config.buffer_size if resume == ResumeMode.ALWAYS and not self.artifact_storage.metadata_file_path.exists(): - # No metadata.json means the previous run was interrupted before any batch (sync) or - # row group (async) completed. Nothing to resume β€” discard any leftover partial - # results and start fresh. + # No metadata.json means the previous run was interrupted before any + # row group completed. Nothing to resume, so discard leftover + # partial results and start fresh. logger.info( - "▢️ No metadata.json found β€” the previous run was interrupted before any batch " + "▢️ No metadata.json found β€” the previous run was interrupted before any row group " "completed. Starting generation from the beginning." ) self.artifact_storage.clear_partial_results() resume = ResumeMode.NEVER self.artifact_storage.resume = ResumeMode.NEVER - if self._use_async: - self._build_async(generators, num_records, buffer_size, on_batch_complete, resume=resume) - elif resume == ResumeMode.ALWAYS: - self._build_with_resume(generators, num_records, buffer_size, on_batch_complete) - else: - group_id = uuid.uuid4().hex - self.batch_manager.start(num_records=num_records, buffer_size=buffer_size) - for batch_idx in range(self.batch_manager.num_batches): - logger.info(f"⏳ Processing batch {batch_idx + 1} of {self.batch_manager.num_batches}") - self._run_batch( - generators, - batch_mode="batch", - group_id=group_id, - current_batch_number=batch_idx, - on_batch_complete=on_batch_complete, - ) - self.batch_manager.finish() + self._build_async(generators, num_records, buffer_size, on_batch_complete, resume=resume) # After-generation processors run unconditionally on the on-disk dataset # (not gated on ``generated``). When resume sees every row group already @@ -504,12 +453,11 @@ def _load_resume_state(self, num_records: int, buffer_size: int) -> _ResumeState ``num_records`` must be >= the number of records already on disk (you may extend a dataset, but cannot shrink it below what has been written). ``buffer_size`` must match the original run because it determines row-group - boundaries. The sync engine additionally requires contiguous batch IDs; - the async engine tolerates holes from out-of-order completion. + boundaries. Resume tolerates holes from out-of-order row-group completion. Raises: DatasetGenerationError: If metadata is missing or incompatible, or if - the filesystem state is inconsistent with the engine in use. + the filesystem state is inconsistent. """ try: metadata = self.artifact_storage.read_metadata() @@ -525,7 +473,7 @@ def _load_resume_state(self, num_records: int, buffer_size: int) -> _ResumeState ) from exc num_completed_batches, actual_num_records, completed_row_groups = self._recover_progress_from_disk( - allow_holes=self._use_async, + allow_holes=True, ) if num_records < actual_num_records: @@ -581,75 +529,12 @@ def _load_resume_state(self, num_records: int, buffer_size: int) -> _ResumeState completed_row_groups=completed_row_groups, ) - def _build_with_resume( - self, - generators: list[ColumnGenerator], - num_records: int, - buffer_size: int, - on_batch_complete: Callable[[Path], None] | None, - ) -> bool: - """Resume generation from the last completed batch. - - Returns: - False if the dataset was already complete (no new records generated), - True after successfully generating the remaining batches. - """ - state = self._load_resume_state(num_records, buffer_size) - - # Compute the correct per-batch sizes. ceil(num_records/bs) is wrong for a - # non-aligned extension: original groups are immutable, so any extension always - # adds new groups beyond num_original_batches. - original_target = state.original_target_num_records - num_original_batches = -(-original_target // buffer_size) - extension_records = num_records - original_target - num_extension_batches = -(-extension_records // buffer_size) - original_sizes = [min(buffer_size, original_target - i * buffer_size) for i in range(num_original_batches)] - extension_sizes = [min(buffer_size, extension_records - i * buffer_size) for i in range(num_extension_batches)] - - self.batch_manager.start( - num_records=num_records, - buffer_size=buffer_size, - start_batch=state.num_completed_batches, - initial_actual_num_records=state.actual_num_records, - num_records_list=original_sizes + extension_sizes, - original_target_num_records=original_target, - ) - - if state.num_completed_batches >= self.batch_manager.num_batches: - logger.warning( - "⚠️ Dataset is already complete β€” all batches were found in the existing artifact directory. " - "Nothing to resume. Use resume=ResumeMode.NEVER if you want to generate a new dataset." - ) - return False - - logger.info( - f"▢️ Resuming from batch {state.num_completed_batches + 1} of {self.batch_manager.num_batches} " - f"({state.actual_num_records} records already generated)." - ) - - self.artifact_storage.clear_partial_results() - - group_id = uuid.uuid4().hex - for batch_idx in range(state.num_completed_batches, self.batch_manager.num_batches): - logger.info(f"⏳ Processing batch {batch_idx + 1} of {self.batch_manager.num_batches}") - self._run_batch( - generators, - batch_mode="batch", - group_id=group_id, - current_batch_number=batch_idx, - on_batch_complete=on_batch_complete, - row_group_start_offset=sum(self.batch_manager.num_records_list[:batch_idx]), - ) - self.batch_manager.finish() - return True - def build_preview(self, *, num_records: int) -> pd.DataFrame: self._reset_run_state() - self._use_async = flags.DATA_DESIGNER_ASYNC_ENGINE run_readiness_check( self.single_column_configs, self._resource_provider, - client_concurrency_mode=ClientConcurrencyMode.ASYNC if self._use_async else ClientConcurrencyMode.SYNC, + client_concurrency_mode=ClientConcurrencyMode.ASYNC, ) # Set media storage to DATAFRAME mode for preview - base64 stored directly in DataFrame @@ -659,14 +544,7 @@ def build_preview(self, *, num_records: int) -> pd.DataFrame: generators, self._graph = self._initialize_generators_and_graph() start_time = time.perf_counter() - if self._use_async: - dataset = self._build_async_preview(generators, num_records) - else: - group_id = uuid.uuid4().hex - self.batch_manager.start(num_records=num_records, buffer_size=num_records) - self._run_batch(generators, batch_mode="preview", save_partial_results=False, group_id=group_id) - dataset = self.batch_manager.get_current_batch(as_dataframe=True) - self.batch_manager.reset() + dataset = self._build_async_preview(generators, num_records) self._resource_provider.model_registry.log_model_usage(time.perf_counter() - start_time) @@ -682,7 +560,7 @@ def _reset_run_state(self) -> None: def _build_async_preview(self, generators: list[ColumnGenerator], num_records: int) -> pd.DataFrame: """Async preview path - single row group, no disk writes, returns in-memory DataFrame.""" - logger.info("⚑ DATA_DESIGNER_ASYNC_ENGINE is enabled - using async task-queue preview") + logger.info("⚑ Using async task-queue preview") settings = self._resource_provider.run_config trace_enabled = _is_async_trace_enabled(settings) @@ -739,14 +617,9 @@ def _recover_progress_from_disk(self, *, allow_holes: bool) -> tuple[int, int, d ``actual_num_records`` because a crash between ``move_partial_result_to_final_file_path`` and the metadata write that follows can leave parquet files on disk while metadata still reports stale counters. - Both engines use the same scan so resume semantics stay consistent. - Args: - allow_holes: ``True`` for the async engine, which schedules row groups - concurrently and may complete them out of order. ``False`` for the sync - engine, which writes batches sequentially β€” a non-contiguous set of IDs - indicates external mutation or a directory written by an incompatible - engine and is rejected with ``DatasetGenerationError``. + allow_holes: Whether to tolerate non-contiguous row-group IDs. Async + scheduling may complete row groups out of order. Returns: ``(num_completed_batches, actual_num_records, completed_row_groups)``. @@ -870,7 +743,7 @@ def _build_async( False if the dataset was already complete (no new records generated), True after successfully running the scheduler. """ - logger.info("⚑ DATA_DESIGNER_ASYNC_ENGINE is enabled - using async task-queue builder") + logger.info("⚑ Using async task-queue builder") settings = self._resource_provider.run_config trace_enabled = _is_async_trace_enabled(settings) @@ -1133,330 +1006,6 @@ def _write_builder_config(self) -> None: self.artifact_storage.base_dataset_path / SDG_CONFIG_FILENAME ) - def _run_batch( - self, - generators: list[ColumnGenerator], - *, - batch_mode: str, - save_partial_results: bool = True, - group_id: str, - current_batch_number: int | None = None, - on_batch_complete: Callable[[Path], None] | None = None, - row_group_start_offset: int | None = None, - ) -> None: - """Run one batch of generators in the sync engine. - - Sets two ContextVars for the duration of the batch so order-dependent - generators (e.g. seed dataset under ORDERED sampling) and log helpers - can observe the row group's place in the run: - - - ``current_row_group`` is set whenever ``current_batch_number`` is known - (both fresh and resumed sync runs), so ``format_row_group_tag()`` - produces a consistent ``(x/X)`` log prefix in either path. - - ``current_row_group_start_offset`` is set only when the caller supplies - the planned start offset (sync resume passes it; fresh sync and preview - do not), so generators can seek into the correct seed slice without - replaying already-consumed rows. - """ - row_group_token = None - start_offset_token = None - if current_batch_number is not None: - row_group_token = current_row_group.set((current_batch_number, self.batch_manager.num_batches)) - if row_group_start_offset is not None: - start_offset_token = current_row_group_start_offset.set(row_group_start_offset) - - try: - pre_batch_snapshot = self._resource_provider.model_registry.get_model_usage_snapshot() - ran_pre_batch = False - for generator in generators: - generator.log_pre_generation() - try: - generation_strategy = generator.get_generation_strategy() - if generator.can_generate_from_scratch and self.batch_manager.buffer_is_empty: - self._run_from_scratch_column_generator(generator) - # Run PRE_BATCH after seed generator, before other columns - if not ran_pre_batch: - self._processor_runner.run_pre_batch(self.batch_manager) - ran_pre_batch = True - elif generation_strategy == GenerationStrategy.CELL_BY_CELL: - self._run_cell_by_cell_generator(generator) - elif generation_strategy == GenerationStrategy.FULL_COLUMN: - self._run_full_column_generator(generator) - else: - logger.error(f"❌ Unknown generation strategy: {generation_strategy}") - raise DatasetGenerationError(f"πŸ›‘ Unknown generation strategy: {generation_strategy}") - if save_partial_results: - self.batch_manager.write() - except Exception as e: - column_error_str = ( - f"columns {generator.config.column_names}" - if hasattr(generator.config, "column_names") - else f"column {generator.config.name!r}" - ) - raise DatasetGenerationError(f"πŸ›‘ Failed to process {column_error_str}:\n{e}") - - try: - usage_deltas = self._resource_provider.model_registry.get_usage_deltas(pre_batch_snapshot) - self._emit_batch_inference_events(batch_mode, usage_deltas, group_id) - except Exception: - pass - - if current_batch_number is not None: - df_batch = self.batch_manager.get_current_batch(as_dataframe=True) - df_batch = self._processor_runner.run_post_batch(df_batch, current_batch_number=current_batch_number) - self._write_processed_batch(df_batch) - self.batch_manager.finish_batch(on_batch_complete) - finally: - if start_offset_token is not None: - current_row_group_start_offset.reset(start_offset_token) - if row_group_token is not None: - current_row_group.reset(row_group_token) - - def _run_from_scratch_column_generator(self, generator: ColumnGenerator) -> None: - df = generator.generate_from_scratch(self.batch_manager.num_records_batch) - self.batch_manager.add_records(df.to_dict(orient="records")) - - def _run_cell_by_cell_generator(self, generator: ColumnGenerator) -> None: - max_workers = self._resource_provider.run_config.non_inference_max_parallel_workers - if isinstance(generator, ColumnGeneratorWithModel): - max_workers = generator.inference_parameters.max_parallel_requests - if self._use_async: - logger.info("⚑ Using async engine for concurrent execution") - self._fan_out_with_async(generator, max_workers=max_workers) - else: - self._fan_out_with_threads(generator, max_workers=max_workers) - - def _column_display_name(self, config: ColumnConfigT) -> str: - return f"columns {config.column_names}" if hasattr(config, "column_names") else config.name - - def _require_graph(self) -> ExecutionGraph: - """Return the initialized execution graph for the current run.""" - graph = self._graph - if graph is None: - raise DatasetGenerationError("Execution graph accessed before generator initialization.") - return graph - - def _column_can_skip(self, column_name: str) -> bool: - """Fast check: can *column_name* ever be skipped (expression gate or propagation)? - - Returns ``True`` when the column has an expression gate or should - propagate skip metadata from required columns. - """ - if self._graph is None: - return False - if self._graph.get_skip_config(column_name) is not None: - return True - return self._graph.should_propagate_skip(column_name) and bool(self._graph.get_required_columns(column_name)) - - def _should_skip_cell(self, column_name: str, record: dict) -> bool: - """Decide whether a single cell should be skipped (propagation or expression gate).""" - skip_config = self._graph.get_skip_config(column_name) - return should_skip_column_for_record( - record, - propagate_skip=self._graph.should_propagate_skip(column_name), - required_columns=self._graph.get_required_columns(column_name), - skip_config_when=skip_config.when if skip_config is not None else None, - ) - - def _write_skip_to_record(self, column_name: str, record: dict) -> None: - """Write skip metadata and the skip value into *record* in-place.""" - skip_config = self._graph.get_skip_config(column_name) - skip_value = skip_config.value if skip_config is not None else None - apply_skip_to_record( - record, - column_name=column_name, - cell_value=skip_value, - side_effect_columns=self._graph.get_side_effect_columns(column_name), - ) - - def _run_full_column_generator(self, generator: ColumnGenerator) -> None: - column_name = generator.config.name if not isinstance(generator.config, MultiColumnConfig) else None - - if column_name is not None and self._column_can_skip(column_name): - self._run_full_column_generator_with_skip(generator, column_name) - else: - self._run_full_column_generator_without_skip(generator) - - def _run_full_column_generator_without_skip(self, generator: ColumnGenerator) -> None: - """Run the generator on the full batch, preserving skip metadata across the replace.""" - old_records = [record for _, record in self.batch_manager.iter_current_batch()] - input_records, restore_context = prepare_records_for_skip_metadata_round_trip(old_records) - - df = generator.generate(lazy.pd.DataFrame(input_records)) - new_records = df.to_dict(orient="records") - if restore_context is not None: - try: - restore_skip_metadata(new_records, context=restore_context) - except ValueError as exc: - raise DatasetGenerationError( - f"Unable to restore skip provenance after FULL_COLUMN generation for " - f"{self._column_display_name(generator.config)}: {exc}" - ) from exc - self.batch_manager.replace_buffer(new_records) - - def _run_full_column_generator_with_skip(self, generator: ColumnGenerator, column_name: str) -> None: - """Run a FULL_COLUMN generator with per-row skip evaluation and merge-back. - - Only reachable when ``_column_can_skip`` is True. - """ - active_records: list[dict] = [] - records_with_skip_status: list[tuple[bool, dict]] = [] - has_skipped = False - for _, record in self.batch_manager.iter_current_batch(): - skipped = self._should_skip_cell(column_name, record) - if skipped: - has_skipped = True - self._write_skip_to_record(column_name, record) - else: - active_records.append(record) - records_with_skip_status.append((skipped, record)) - - if not has_skipped: - # No rows were actually skipped β€” use the normal path to avoid the - # overhead of stripping metadata, building a separate active DataFrame, - # and merging results back. - self._run_full_column_generator_without_skip(generator) - return - - batch = self._merge_skipped_and_generated(generator, column_name, active_records, records_with_skip_status) - self.batch_manager.replace_buffer(batch) - - def _merge_skipped_and_generated( - self, - generator: ColumnGenerator, - column_name: str, - active_records: list[dict], - records_with_skip_status: list[tuple[bool, dict]], - ) -> list[dict]: - """Generate only for active (non-skipped) records and merge back with skipped ones.""" - if not active_records: - return [record for _, record in records_with_skip_status] - - active_df = lazy.pd.DataFrame(strip_skip_metadata_from_records(active_records)) - result_records = generator.generate(active_df).to_dict(orient="records") - if len(result_records) != len(active_records): - raise DatasetGenerationError( - f"Generator for '{column_name}' returned {len(result_records)} rows " - f"but {len(active_records)} active (non-skipped) records were expected." - ) - - result_iter = iter(result_records) - batch: list[dict] = [] - for skipped, record in records_with_skip_status: - if skipped: - batch.append(record) - continue - gen_result = next(result_iter) - prior_skipped = record.get(SKIPPED_COLUMNS_RECORD_KEY) - if prior_skipped is not None: - gen_result[SKIPPED_COLUMNS_RECORD_KEY] = prior_skipped - batch.append(gen_result) - return batch - - def _setup_fan_out( - self, - generator: ColumnGeneratorWithModelRegistry, - max_workers: int, - progress_bar: StickyProgressBar | None = None, - ) -> tuple[ProgressTracker, dict[str, Any]]: - if generator.get_generation_strategy() != GenerationStrategy.CELL_BY_CELL: - raise DatasetGenerationError( - f"Generator {generator.name} is not a {GenerationStrategy.CELL_BY_CELL} " - "generator so concurrent fan-out is not supported." - ) - - label = f"{generator.config.column_type} column '{generator.config.name}'" - progress_tracker = ProgressTracker( - total_records=self.batch_manager.num_records_batch, - label=label, - progress_bar=progress_bar, - progress_bar_key=generator.config.name, - ) - progress_tracker.log_start(max_workers) - - settings = self._resource_provider.run_config - executor_kwargs: dict = { - "column_name": generator.config.name, - "result_callback": self._make_result_callback(progress_tracker), - "error_callback": self._make_error_callback(progress_tracker), - "shutdown_error_rate": settings.shutdown_error_rate, - "shutdown_error_window": settings.shutdown_error_window, - "disable_early_shutdown": settings.disable_early_shutdown, - } - - return progress_tracker, executor_kwargs - - def _finalize_fan_out(self, progress_tracker: ProgressTracker) -> None: - progress_tracker.log_final() - - if len(self._records_to_drop) > 0: - self._cleanup_dropped_record_images(self._records_to_drop) - self.batch_manager.drop_records(self._records_to_drop) - self._records_to_drop.clear() - - def _fan_out_with_async(self, generator: ColumnGeneratorWithModelRegistry, max_workers: int) -> None: - if getattr(generator.config, "tool_alias", None): - logger.info("πŸ› οΈ Tool calling enabled") - bar = StickyProgressBar() if self._resource_provider.run_config.progress_bar else None - can_skip = self._column_can_skip(generator.config.name) - with bar or contextlib.nullcontext(): - progress_tracker, executor_kwargs = self._setup_fan_out(generator, max_workers, progress_bar=bar) - executor = AsyncConcurrentExecutor(max_workers=max_workers, **executor_kwargs) - work_items: list[tuple[Any, dict[str, Any]]] = [] - for i, record in self.batch_manager.iter_current_batch(): - if can_skip and self._should_skip_cell(generator.config.name, record): - self._write_skip_to_record(generator.config.name, record) - self.batch_manager.update_record(i, record) - progress_tracker.record_skipped() - continue - work_items.append( - ( - generator.agenerate(record), - {"index": i, "column_name": generator.config.name}, - ) - ) - executor.run(work_items) - self._finalize_fan_out(progress_tracker) - - def _fan_out_with_threads(self, generator: ColumnGeneratorWithModelRegistry, max_workers: int) -> None: - if getattr(generator.config, "tool_alias", None): - logger.info("πŸ› οΈ Tool calling enabled") - bar = StickyProgressBar() if self._resource_provider.run_config.progress_bar else None - can_skip = self._column_can_skip(generator.config.name) - with bar or contextlib.nullcontext(): - progress_tracker, executor_kwargs = self._setup_fan_out(generator, max_workers, progress_bar=bar) - with ConcurrentThreadExecutor(max_workers=max_workers, **executor_kwargs) as executor: - for i, record in self.batch_manager.iter_current_batch(): - if can_skip and self._should_skip_cell(generator.config.name, record): - self._write_skip_to_record(generator.config.name, record) - self.batch_manager.update_record(i, record) - progress_tracker.record_skipped() - continue - executor.submit( - lambda record: generator.generate(record), - record, - context={"index": i, "column_name": generator.config.name}, - ) - self._finalize_fan_out(progress_tracker) - - def _make_result_callback(self, progress_tracker: ProgressTracker) -> Callable[[dict], None]: - def callback(result: dict, *, context: dict | None = None) -> None: - self._worker_result_callback(result, context=context) - progress_tracker.record_success() - - return callback - - def _make_error_callback(self, progress_tracker: ProgressTracker) -> Callable[[Exception], None]: - def callback(exc: Exception, *, context: dict | None = None) -> None: - self._worker_error_callback(exc, context=context) - progress_tracker.record_failure() - - return callback - - def _write_processed_batch(self, dataframe: pd.DataFrame) -> None: - self.batch_manager.replace_buffer(dataframe.to_dict(orient="records")) - self.batch_manager.write() - def _validate_column_configs(self) -> None: if len(self._column_configs) == 0: raise DatasetGenerationError("πŸ›‘ No column configs provided.") @@ -1499,90 +1048,6 @@ def _initialize_processors(self, processor_configs: list[ProcessorConfig]) -> li return processors - def _cleanup_dropped_record_images(self, dropped_indices: set[int]) -> None: - """Remove saved image files for records that will be dropped. - - When a record fails during generation, any images already saved to disk - for that record in previous columns become dangling. This method deletes - those files so they don't accumulate. - """ - media_storage = self.artifact_storage.media_storage - if not self._has_image_columns() or media_storage is None or media_storage.mode != StorageMode.DISK: - return - - image_col_names = [ - col.name for col in self.single_column_configs if col.column_type == DataDesignerColumnType.IMAGE - ] - - buffer = self.batch_manager.get_current_batch(as_dataframe=False) - for idx in dropped_indices: - if idx < 0 or idx >= len(buffer): - continue - for col_name in image_col_names: - paths = buffer[idx].get(col_name, []) - for path in [paths] if isinstance(paths, str) else paths: - media_storage.delete_image(path) - - @staticmethod - def _extract_failure_detail(exc: Exception) -> str: - detail = getattr(exc, "detail", None) - if isinstance(detail, str): - normalized_detail = " ".join(detail.split()).strip() - if normalized_detail: - return normalized_detail - exc_str = str(exc).strip() - for line in exc_str.splitlines(): - if "Cause:" in line: - return " ".join(line.split("Cause:", maxsplit=1)[1].split()).strip() - return " ".join(exc_str.split()).strip() or type(exc).__name__ - - @classmethod - def _classify_worker_failure(cls, exc: Exception) -> str: - failure_kind = getattr(exc, "failure_kind", None) - if isinstance(failure_kind, str) and failure_kind.strip(): - return failure_kind.replace("_", " ") - - detail = cls._extract_failure_detail(exc).lower() - exc_name = type(exc).__name__.lower() - - if "timeout" in exc_name or "timed out" in detail: - return "timeout" - if "rate" in exc_name and "limit" in exc_name: - return "rate limit" - if "authentication" in exc_name: - return "authentication" - if "permission" in exc_name: - return "permission denied" - if "contextwindow" in exc_name or "context width" in detail: - return "context window" - if "response_schema" in detail or "schema" in detail: - return "schema validation" - if "validation" in exc_name or "validation" in detail: - return "validation" - return "generation error" - - @classmethod - def _format_worker_failure_warning(cls, exc: Exception, *, context: dict | None = None) -> str: - record_index = context["index"] if context is not None and "index" in context else "unknown" - column_name = context.get("column_name") if context is not None else None - context_label = f" in column {column_name!r}" if column_name else "" - failure_kind = cls._classify_worker_failure(exc) - failure_detail = cls._extract_failure_detail(exc) - return ( - f"⚠️ Generation for record at index {record_index} failed{context_label} ({failure_kind}). " - f"Will omit this record from the dataset. Detail: {failure_detail}" - ) - - def _worker_error_callback(self, exc: Exception, *, context: dict | None = None) -> None: - """If a worker fails, we can handle the exception here.""" - logger.warning(self._format_worker_failure_warning(exc, context=context)) - if context is None or "index" not in context: - raise RuntimeError("Worker error callback called without a valid context index.") - self._records_to_drop.add(context["index"]) - - def _worker_result_callback(self, result: dict, *, context: dict | None = None) -> None: - self.batch_manager.update_record(context["index"], result) - def _emit_batch_inference_events( self, batch_mode: str, usage_deltas: dict[str, ModelUsageStats], group_id: str ) -> None: diff --git a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py deleted file mode 100644 index f0ec37638..000000000 --- a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py +++ /dev/null @@ -1,226 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -from __future__ import annotations - -import logging -import shutil -from pathlib import Path -from typing import TYPE_CHECKING, Callable, Container, Iterator - -import data_designer.lazy_heavy_imports as lazy -from data_designer.engine.dataset_builders.utils.errors import DatasetBatchManagementError -from data_designer.engine.dataset_builders.utils.skip_tracker import strip_skip_metadata_from_records -from data_designer.engine.storage.artifact_storage import ArtifactStorage, BatchStage - -if TYPE_CHECKING: - import pandas as pd - -logger = logging.getLogger(__name__) - - -class DatasetBatchManager: - def __init__(self, artifact_storage: ArtifactStorage): - self._buffer: list[dict] = [] - self._current_batch_number = 0 - self._num_records_list: list[int] | None = None - self._buffer_size: int | None = None - self._actual_num_records: int = 0 - self._original_target_num_records: int | None = None - self.artifact_storage = artifact_storage - - @property - def num_batches(self) -> int: - if self._num_records_list is None: - return 0 - return len(self._num_records_list) - - @property - def num_records_batch(self) -> int: - if self._num_records_list is None or self._current_batch_number >= len(self._num_records_list): - raise DatasetBatchManagementError("πŸ›‘ Invalid batch number or num_records_list not set.") - return self._num_records_list[self._current_batch_number] - - @property - def num_records_list(self) -> list[int]: - if self._num_records_list is None: - raise DatasetBatchManagementError("πŸ›‘ `num_records_list` is not set. Call start() first.") - return self._num_records_list - - @property - def num_records_in_buffer(self) -> int: - return len(self._buffer) - - @property - def buffer_is_empty(self) -> bool: - return len(self._buffer) == 0 - - @property - def buffer_size(self) -> int: - if self._buffer_size is None: - raise DatasetBatchManagementError("πŸ›‘ `buffer_size` is not set. Call start() first.") - return self._buffer_size - - def add_record(self, record: dict) -> None: - self.add_records([record]) - - def add_records(self, records: list[dict]) -> None: - self._buffer.extend(records) - if len(self._buffer) > self.buffer_size: - raise DatasetBatchManagementError( - f"πŸ›‘ Buffer size exceeded. Current: {len(self._buffer)}, Max: {self.buffer_size}. " - "Flush the batch before adding more records." - ) - - def drop_records(self, index: Container[int]) -> None: - self._buffer = [record for i, record in enumerate(self._buffer) if i not in index] - - def finish_batch(self, on_complete: Callable[[Path], None] | None = None) -> Path | None: - """Finish the batch by moving the results from the partial results path to the final parquet folder. - - Returns: - The path to the written parquet file. - """ - if self._current_batch_number >= self.num_batches: - raise DatasetBatchManagementError("πŸ›‘ All batches have been processed.") - - if self.write() is not None: - self._actual_num_records += len(self._buffer) - final_file_path = self.artifact_storage.move_partial_result_to_final_file_path(self._current_batch_number) - - target = sum(self.num_records_list) - self.artifact_storage.write_metadata( - { - "target_num_records": target, - "original_target_num_records": self._original_target_num_records or target, - "actual_num_records": self._actual_num_records, - "total_num_batches": self.num_batches, - "buffer_size": self._buffer_size, - "schema": {field.name: str(field.type) for field in lazy.pq.read_schema(final_file_path)}, - "file_paths": self.artifact_storage.get_file_paths(), - "num_completed_batches": self._current_batch_number + 1, - "dataset_name": self.artifact_storage.dataset_name, - } - ) - - if on_complete: - on_complete(final_file_path) - else: - final_file_path = None - - logger.warning( - f"⚠️ Batch {self._current_batch_number + 1} finished without any results to write. " - "A partial dataset containing the currently available columns has been written to the partial results " - f"directory: {self.artifact_storage.partial_results_path}" - ) - - self._current_batch_number += 1 - self._buffer: list[dict] = [] - - return final_file_path - - def finish(self) -> None: - """Finish the dataset writing process by deleting the partial results path if it exists and is empty.""" - - # If the partial results path is empty, delete it. - if not any(self.artifact_storage.partial_results_path.iterdir()): - self.artifact_storage.partial_results_path.rmdir() - - # Otherwise, log a warning, since existing partial results means the dataset is not complete. - else: - logger.warning("⚠️ Dataset writing finished with partial results.") - - self.reset() - - def get_current_batch_number(self) -> int: - return self._current_batch_number - - def get_current_batch(self, *, as_dataframe: bool = False) -> pd.DataFrame | list[dict]: - if as_dataframe: - return lazy.pd.DataFrame(strip_skip_metadata_from_records(self._buffer)) - return self._buffer - - def iter_current_batch(self) -> Iterator[tuple[int, dict]]: - for i, record in enumerate(self._buffer): - yield i, record - - def reset(self, delete_files: bool = False) -> None: - self._current_batch_number = 0 - self._buffer: list[dict] = [] - self._actual_num_records = 0 - if delete_files: - for dir_path in [ - self.artifact_storage.final_dataset_path, - self.artifact_storage.partial_results_path, - self.artifact_storage.dropped_columns_dataset_path, - self.artifact_storage.base_dataset_path, - ]: - if dir_path.exists(): - try: - shutil.rmtree(dir_path) - except OSError as e: - raise DatasetBatchManagementError(f"πŸ›‘ Failed to delete directory {dir_path}: {e}") - - def start( - self, - *, - num_records: int, - buffer_size: int, - start_batch: int = 0, - initial_actual_num_records: int = 0, - num_records_list: list[int] | None = None, - original_target_num_records: int | None = None, - ) -> None: - if num_records <= 0: - raise DatasetBatchManagementError("πŸ›‘ num_records must be positive.") - if buffer_size <= 0: - raise DatasetBatchManagementError("πŸ›‘ buffer_size must be positive.") - - self._buffer_size = buffer_size - self._original_target_num_records = original_target_num_records - if num_records_list is not None: - self._num_records_list = list(num_records_list) - else: - self._num_records_list = [buffer_size] * (num_records // buffer_size) - if remaining_records := num_records % buffer_size: - self._num_records_list.append(remaining_records) - self.reset() - self._current_batch_number = start_batch - self._actual_num_records = initial_actual_num_records - - def write(self) -> Path | None: - """Write the current batch to a parquet file. - - This method always writes results to the partial results path. - - Returns: - The path to the written parquet file. If the buffer is empty, returns None. - """ - if len(self._buffer) == 0: - return None - try: - file_path = self.artifact_storage.write_batch_to_parquet_file( - batch_number=self._current_batch_number, - dataframe=lazy.pd.DataFrame(strip_skip_metadata_from_records(self._buffer)), - batch_stage=BatchStage.PARTIAL_RESULT, - ) - return file_path - except Exception as e: - raise DatasetBatchManagementError(f"πŸ›‘ Failed to write batch {self._current_batch_number}: {e}") - - def update_record(self, index: int, record: dict) -> None: - if index < 0 or index >= len(self._buffer): - raise IndexError(f"πŸ›‘ Index {index} is out of bounds for buffer of size {len(self._buffer)}.") - self._buffer[index] = record - - def replace_buffer(self, records: list[dict]) -> None: - """Replace the buffer contents. - - Args: - records: New records to replace the buffer. - """ - if len(records) != len(self._buffer): - raise DatasetBatchManagementError( - f"πŸ›‘ Number of records ({len(records)}) must match the current buffer size ({len(self._buffer)})." - ) - self._buffer = records diff --git a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py index 0c71f43ee..47cefb96b 100644 --- a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py +++ b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py @@ -14,7 +14,6 @@ if TYPE_CHECKING: import pandas as pd - from data_designer.engine.dataset_builders.utils.dataset_batch_manager import DatasetBatchManager from data_designer.engine.processing.processors.base import Processor from data_designer.engine.storage.artifact_storage import ArtifactStorage @@ -71,17 +70,6 @@ def _raise_if_pre_batch_resized(original_len: int, new_len: int) -> None: "Row-count changes in pre-batch processors are not supported; use workflow chaining instead." ) - def run_pre_batch(self, batch_manager: DatasetBatchManager) -> None: - """Run process_before_batch() on current batch.""" - if not self.has_processors_for(ProcessorStage.PRE_BATCH): - return - - df = batch_manager.get_current_batch(as_dataframe=True) - original_len = len(df) - df = self._run_stage(df, ProcessorStage.PRE_BATCH) - self._raise_if_pre_batch_resized(original_len, len(df)) - batch_manager.replace_buffer(df.to_dict(orient="records")) - def run_pre_batch_on_df(self, df: pd.DataFrame) -> pd.DataFrame: """Run PRE_BATCH processors on a DataFrame and return the result. diff --git a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py index b4b28e9aa..98220e35f 100644 --- a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py +++ b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py @@ -24,8 +24,7 @@ class RowGroupBufferManager: writes (``update_cell``) are the only write path β€” whole-record replacement is unsafe under parallel column execution. - The existing ``DatasetBatchManager`` is untouched; this class is used - exclusively by the async scheduler. + This class is used by the async scheduler. """ def __init__( diff --git a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_evaluator.py b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_evaluator.py index 3fcd1f7df..0a3d90725 100644 --- a/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_evaluator.py +++ b/packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_evaluator.py @@ -38,9 +38,8 @@ class NativeSandboxedEnvironment(SandboxedEnvironment, NativeEnvironment): def evaluate_skip_when(expression: str, record: dict) -> bool: """Render *expression* against *record*; return ``True`` if result is truthy. - The caller is responsible for passing a raw record dict β€” deserialization - of JSON string values is handled here so both sync and async engines get - identical behavior. On expected evaluation failures (``UndefinedError``, + The caller is responsible for passing a raw record dict. Deserialization + of JSON string values is handled here. On expected evaluation failures (``UndefinedError``, ``SecurityError``, ``TemplateSyntaxError``, ``TypeError``, ``ValueError``) a warning is logged and ``True`` is returned (fail-safe: skip the row rather than making an expensive LLM call on a row with unknown filter @@ -86,7 +85,6 @@ def should_skip_column_for_record( ) -> bool: """Unified skip decision for a single cell/record. - Shared by both the sync and async engines so the logic stays in sync. Checks propagation first (cheaper), then the expression gate. Args: diff --git a/packages/data-designer-engine/src/data_designer/engine/flags.py b/packages/data-designer-engine/src/data_designer/engine/flags.py deleted file mode 100644 index b1e9f9f00..000000000 --- a/packages/data-designer-engine/src/data_designer/engine/flags.py +++ /dev/null @@ -1,19 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -"""Engine-wide feature flags read from environment variables. - -This module exists so the engine, the public interface, and the readiness -module can share a single source of truth for runtime mode flags without -forming an import cycle. Tests patch values here to flip behavior for a -single test scope. -""" - -from __future__ import annotations - -import os - -# Async engine is the default execution path. Set ``DATA_DESIGNER_ASYNC_ENGINE=0`` -# to opt back into the legacy sync engine for one transitional release; the sync -# path is scheduled for removal afterwards. -DATA_DESIGNER_ASYNC_ENGINE: bool = os.environ.get("DATA_DESIGNER_ASYNC_ENGINE", "1") == "1" diff --git a/packages/data-designer-engine/src/data_designer/engine/models/clients/factory.py b/packages/data-designer-engine/src/data_designer/engine/models/clients/factory.py index 398d151a4..acdbe24dc 100644 --- a/packages/data-designer-engine/src/data_designer/engine/models/clients/factory.py +++ b/packages/data-designer-engine/src/data_designer/engine/models/clients/factory.py @@ -40,8 +40,8 @@ def create_model_client( model_provider_registry: Registry of model provider configurations used to look up endpoint, provider type, and API key reference. retry_config: Optional retry configuration for HTTP adapters. - client_concurrency_mode: ``"sync"`` (default) for the sync engine path, - ``"async"`` for the async engine path. Native HTTP adapters are + client_concurrency_mode: ``"sync"`` for synchronous adapter calls or + ``"async"`` for async adapter calls. Native HTTP adapters are constrained to a single concurrency mode. request_admission: Optional request-admission controller for per-request provider/model/domain admission. When provided, the returned client diff --git a/packages/data-designer-engine/src/data_designer/engine/readiness.py b/packages/data-designer-engine/src/data_designer/engine/readiness.py index d3e9996a1..92e67e335 100644 --- a/packages/data-designer-engine/src/data_designer/engine/readiness.py +++ b/packages/data-designer-engine/src/data_designer/engine/readiness.py @@ -88,8 +88,6 @@ def _run_model_health_check( return if client_concurrency_mode == ClientConcurrencyMode.ASYNC: - # Defer the async-engine imports to here so users on the legacy sync - # engine never pay the import cost. import asyncio from data_designer.engine.dataset_builders.utils.async_concurrency import ensure_async_engine_loop diff --git a/packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py b/packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py index 3ebfa8694..1a969e2b1 100644 --- a/packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py +++ b/packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py @@ -12,7 +12,6 @@ from data_designer.config.run_config import RunConfig from data_designer.config.seed_source import SeedSource from data_designer.config.utils.type_helpers import StrEnum -from data_designer.engine import flags from data_designer.engine.mcp.factory import create_mcp_registry from data_designer.engine.mcp.registry import MCPRegistry from data_designer.engine.model_provider import ( @@ -141,14 +140,8 @@ def create_resource_provider( mcp_provider_registry=mcp_provider_registry, ) - # Default the client mode from the env var when the caller hasn't decided. - # The interface (DataDesigner) computes the mode and passes the result - # explicitly. Direct callers still get the env-var default for backward - # compatibility. if client_concurrency_mode is None: - client_concurrency_mode = ( - ClientConcurrencyMode.ASYNC if flags.DATA_DESIGNER_ASYNC_ENGINE else ClientConcurrencyMode.SYNC - ) + client_concurrency_mode = ClientConcurrencyMode.ASYNC effective_run_config = run_config or RunConfig() diff --git a/packages/data-designer-engine/tests/engine/conftest.py b/packages/data-designer-engine/tests/engine/conftest.py index bf594ed26..cd4b53b48 100644 --- a/packages/data-designer-engine/tests/engine/conftest.py +++ b/packages/data-designer-engine/tests/engine/conftest.py @@ -35,6 +35,7 @@ def stub_resource_provider(tmp_path, stub_model_facade): mock_model_registry = Mock(spec=ModelRegistry) mock_model_registry.get_model.return_value = stub_model_facade mock_model_registry.model_configs = {} # Add empty model_configs dict + mock_model_registry.request_admission = None mock_provider.model_registry = mock_model_registry mock_provider.artifact_storage = ArtifactStorage(artifact_path=tmp_path) mock_provider.person_reader = Mock(spec=PersonReader) diff --git a/packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py b/packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py index e223ddd25..8609a731f 100644 --- a/packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py +++ b/packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py @@ -225,17 +225,6 @@ def __init__(self, **kwargs: object) -> None: assert peak_bytes < 5 * 1024 * 1024 -# -- Test that existing sync path is unaffected -------------------------------- - - -def test_sync_path_unaffected_by_async_engine_flag() -> None: - """DATA_DESIGNER_ASYNC_ENGINE=0 keeps the sync path unchanged.""" - from data_designer.engine import flags - - assert hasattr(flags, "DATA_DESIGNER_ASYNC_ENGINE") - assert isinstance(flags.DATA_DESIGNER_ASYNC_ENGINE, bool) - - # -- Test execution graph integration with real column configs ----------------- diff --git a/packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py b/packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py index d937372be..115450bea 100644 --- a/packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py +++ b/packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py @@ -29,16 +29,10 @@ from data_designer.config.seed import IndexRange, PartitionBlock, SamplingStrategy from data_designer.config.seed_source import LocalFileSeedSource from data_designer.config.seed_source_dataframe import DataFrameSeedSource -from data_designer.engine import flags from data_designer.engine.column_generators.generators.base import GenerationStrategy from data_designer.engine.dataset_builders.dataset_builder import DatasetBuilder, build_row_group_resume_plan from data_designer.engine.dataset_builders.errors import DatasetGenerationError, DatasetProcessingError from data_designer.engine.dataset_builders.row_group_plan import CompactRowGroupPlan -from data_designer.engine.models.errors import ( - FormattedLLMErrorMessage, - ModelGenerationValidationFailureError, - ModelTimeoutError, -) from data_designer.engine.models.telemetry import InferenceEvent, NemoSourceEnum, TaskStatusEnum from data_designer.engine.models.usage import ModelUsageStats, TokenUsageStats from data_designer.engine.processing.processors.base import Processor @@ -50,18 +44,6 @@ import pandas as pd -@pytest.fixture(autouse=True) -def _force_sync_engine(monkeypatch: pytest.MonkeyPatch) -> None: - """Pin tests in this file to the legacy sync engine. - - These tests use Mock-based stub resource providers that don't satisfy the - contracts expected by the async task-queue scheduler. They cover sync-engine - behavior; the async path has dedicated coverage in - ``test_async_builder_integration.py`` and ``test_async_scheduler.py``. - """ - monkeypatch.setattr(flags, "DATA_DESIGNER_ASYNC_ENGINE", False) - - @pytest.fixture def stub_test_column_configs(): return [ @@ -89,26 +71,6 @@ def stub_test_config_builder(stub_test_column_configs, stub_model_configs): return config_builder -@pytest.fixture -def stub_batch_manager(): - mock_batch_manager = Mock() - mock_batch_manager.num_batches = 2 - mock_batch_manager.num_records_batch = 3 - mock_batch_manager.finish = Mock() - mock_batch_manager.write = Mock() - mock_batch_manager.add_records = Mock() - mock_batch_manager.replace_buffer = Mock() - mock_batch_manager.update_record = Mock() - mock_batch_manager.get_current_batch = Mock() - mock_batch_manager.get_current_batch.side_effect = [ - lazy.pd.DataFrame({"test_column": [1, 2, 3], "column_to_drop": [1, 2, 3]}), - lazy.pd.DataFrame({"test_column": [4, 5, 6], "column_to_drop": [4, 5, 6]}), - ] - mock_batch_manager.get_current_batch_number = Mock() - mock_batch_manager.get_current_batch_number.side_effect = [1, 2] - return mock_batch_manager - - @pytest.fixture def stub_dataset_builder(stub_resource_provider, stub_test_config_builder): return DatasetBuilder( @@ -182,86 +144,6 @@ def test_dataset_builder_artifact_storage_property(stub_dataset_builder, stub_re assert stub_dataset_builder.artifact_storage == stub_resource_provider.artifact_storage -def test_dataset_builder_records_to_drop_initialization(stub_dataset_builder): - assert stub_dataset_builder._records_to_drop == set() - - -def test_worker_error_callback_logs_schema_validation_detail( - stub_dataset_builder: DatasetBuilder, - caplog: pytest.LogCaptureFixture, -) -> None: - exc = ModelGenerationValidationFailureError( - FormattedLLMErrorMessage( - cause=( - "The model output from 'test-model' could not be parsed into the requested format while " - "running generation for column 'test_column'. Validation detail: Response doesn't match " - "requested 'name' is a required property." - ), - solution="Simplify the schema and retry.", - ), - detail="Response doesn't match requested 'name' is a required property.", - failure_kind="schema_validation", - ) - - with caplog.at_level(logging.WARNING): - stub_dataset_builder._worker_error_callback(exc, context={"index": 248, "column_name": "test_column"}) - - assert "record at index 248" in caplog.text - assert "column 'test_column'" in caplog.text - assert "(schema validation)" in caplog.text - assert "Response doesn't match requested 'name' is a required property." in caplog.text - assert 248 in stub_dataset_builder._records_to_drop - - -def test_worker_error_callback_logs_timeout_detail( - stub_dataset_builder: DatasetBuilder, - caplog: pytest.LogCaptureFixture, -) -> None: - exc = ModelTimeoutError( - FormattedLLMErrorMessage( - cause="The request to model 'test-model' timed out while running generation for column 'test_column'.", - solution="Increase the timeout setting for the model and retry.", - ) - ) - - with caplog.at_level(logging.WARNING): - stub_dataset_builder._worker_error_callback(exc, context={"index": 17, "column_name": "test_column"}) - - assert "record at index 17" in caplog.text - assert "column 'test_column'" in caplog.text - assert "(timeout)" in caplog.text - assert ( - "The request to model 'test-model' timed out while running generation for column 'test_column'." in caplog.text - ) - assert 17 in stub_dataset_builder._records_to_drop - - -def test_worker_error_callback_requires_context_index( - stub_dataset_builder: DatasetBuilder, - caplog: pytest.LogCaptureFixture, -) -> None: - exc = ModelTimeoutError( - FormattedLLMErrorMessage( - cause="The request to model 'test-model' timed out while running generation for column 'test_column'.", - solution="Increase the timeout setting for the model and retry.", - ) - ) - - with ( - caplog.at_level(logging.WARNING), - pytest.raises(RuntimeError, match="Worker error callback called without a valid context index."), - ): - stub_dataset_builder._worker_error_callback(exc, context=None) - - assert "record at index unknown" in caplog.text - assert len(stub_dataset_builder._records_to_drop) == 0 - - -def test_dataset_builder_batch_manager_initialization(stub_dataset_builder, stub_resource_provider): - assert stub_dataset_builder.batch_manager is not None - assert stub_dataset_builder.batch_manager.artifact_storage == stub_resource_provider.artifact_storage - - @pytest.mark.parametrize( "config_type,expected_single_configs", [ @@ -304,37 +186,6 @@ def test_dataset_builder_single_column_configs_property( assert builder.single_column_configs == expected_single_configs -def test_dataset_builder_build_method_basic_flow( - stub_dataset_builder, - stub_batch_manager, - stub_resource_provider, -): - stub_resource_provider.run_config = RunConfig(buffer_size=50) - stub_resource_provider.seed_reader = None # No seed data for this basic flow test - stub_resource_provider.model_registry.run_health_check = Mock() - stub_resource_provider.model_registry.get_model_usage_stats = Mock(return_value={"test": "stats"}) - stub_resource_provider.model_registry.models = {} - - # Mock the model config to return proper max_parallel_requests - mock_model_config = Mock() - mock_model_config.inference_parameters.max_parallel_requests = 4 - mock_model_config.inference_parameters.get_formatted_params.return_value = [] - stub_resource_provider.model_registry.get_model_config.return_value = mock_model_config - - # Mock the batch manager's iter_current_batch method - stub_batch_manager.iter_current_batch.return_value = [(0, {"test": "data"})] - - stub_dataset_builder.batch_manager = stub_batch_manager - stub_dataset_builder.set_processor_runner([]) # No processors for basic flow test - - result_path = stub_dataset_builder.build(num_records=100) - - stub_resource_provider.model_registry.run_health_check.assert_called_once() - stub_batch_manager.start.assert_called_once_with(num_records=100, buffer_size=50) - stub_batch_manager.finish.assert_called_once() - assert result_path == stub_resource_provider.artifact_storage.final_dataset_path - - @pytest.mark.parametrize( "column_configs,expected_error", [ @@ -465,156 +316,7 @@ def test_emit_batch_inference_events_handles_multiple_models( assert model_names == {"model-a", "model-b"} -@pytest.mark.parametrize( - "disable_early_shutdown,configured_rate,expected_rate,shutdown_error_window", - [ - (False, 0.7, 0.7, 20), # enabled: use configured rate - (True, 0.7, 1.0, 20), # disabled: use 1.0 to effectively disable - (False, 0.5, 0.5, 10), # defaults - ], -) -@patch("data_designer.engine.dataset_builders.dataset_builder.ConcurrentThreadExecutor") -def test_fan_out_with_threads_uses_early_shutdown_settings_from_resource_provider( - mock_executor_class: Mock, - stub_resource_provider: Mock, - stub_test_column_configs: list, - stub_test_processor_configs: list, - disable_early_shutdown: bool, - configured_rate: float, - expected_rate: float, - shutdown_error_window: int, -) -> None: - """Test that _fan_out_with_threads uses run settings from resource_provider.""" - stub_resource_provider.run_config = RunConfig( - disable_early_shutdown=disable_early_shutdown, - shutdown_error_rate=configured_rate, - shutdown_error_window=shutdown_error_window, - ) - - config_builder = DataDesignerConfigBuilder(model_configs=[]) - for column_config in stub_test_column_configs: - config_builder.add_column(column_config) - for processor_config in stub_test_processor_configs: - config_builder.add_processor(processor_config) - - builder = DatasetBuilder( - data_designer_config=config_builder.build(), - resource_provider=stub_resource_provider, - ) - - mock_executor_class.return_value.__enter__ = Mock(return_value=Mock()) - mock_executor_class.return_value.__exit__ = Mock(return_value=False) - - mock_generator = Mock() - mock_generator.get_generation_strategy.return_value = GenerationStrategy.CELL_BY_CELL - mock_generator.config.name = "test" - mock_generator.config.column_type = "llm_text" - mock_generator.config.tool_alias = None # Avoid triggering tool usage code path - - builder.batch_manager = Mock() - builder.batch_manager.num_records_batch = 10 - builder.batch_manager.iter_current_batch.return_value = [] - builder.batch_manager.num_records_batch = 0 - - builder._fan_out_with_threads(mock_generator, max_workers=4) - - call_kwargs = mock_executor_class.call_args[1] - assert call_kwargs["shutdown_error_rate"] == expected_rate - assert call_kwargs["shutdown_error_window"] == shutdown_error_window - assert call_kwargs["disable_early_shutdown"] == disable_early_shutdown - - -@patch("data_designer.engine.dataset_builders.dataset_builder.ConcurrentThreadExecutor") -def test_fan_out_with_threads_passes_column_name_in_context( - mock_executor_class: Mock, - stub_resource_provider: Mock, - stub_model_configs: dict[str, object], -) -> None: - config_builder = DataDesignerConfigBuilder(model_configs=stub_model_configs) - config_builder.add_column( - SamplerColumnConfig(name="some_id", sampler_type=SamplerType.UUID, params=UUIDSamplerParams()) - ) - builder = DatasetBuilder( - data_designer_config=config_builder.build(), - resource_provider=stub_resource_provider, - ) - builder.build_preview(num_records=1) - - mock_executor = Mock() - mock_executor_class.return_value.__enter__ = Mock(return_value=mock_executor) - mock_executor_class.return_value.__exit__ = Mock(return_value=False) - - mock_generator = Mock() - mock_generator.get_generation_strategy.return_value = GenerationStrategy.CELL_BY_CELL - mock_generator.config.name = "test_column" - mock_generator.config.column_type = "llm_text" - mock_generator.config.tool_alias = None - - builder.batch_manager = Mock() - builder.batch_manager.num_records_batch = 2 - builder.batch_manager.num_records_in_buffer = 2 - builder.batch_manager.iter_current_batch.return_value = [(0, {"seed": "a"}), (1, {"seed": "b"})] - - builder._fan_out_with_threads(mock_generator, max_workers=2) - - submitted_contexts = [call.kwargs["context"] for call in mock_executor.submit.call_args_list] - assert submitted_contexts == [ - {"index": 0, "column_name": "test_column"}, - {"index": 1, "column_name": "test_column"}, - ] - - -@patch("data_designer.engine.dataset_builders.dataset_builder.AsyncConcurrentExecutor", create=True) -def test_fan_out_with_async_passes_column_name_in_context( - mock_executor_class: Mock, - stub_resource_provider: Mock, - stub_model_configs: dict[str, object], -) -> None: - config_builder = DataDesignerConfigBuilder(model_configs=stub_model_configs) - config_builder.add_column( - SamplerColumnConfig(name="some_id", sampler_type=SamplerType.UUID, params=UUIDSamplerParams()) - ) - builder = DatasetBuilder( - data_designer_config=config_builder.build(), - resource_provider=stub_resource_provider, - ) - builder.build_preview(num_records=1) - - mock_executor = Mock() - - def _run(work_items: list[tuple[object, dict[str, int | str]]]) -> None: - for coro, _context in work_items: - coro.close() - - mock_executor.run.side_effect = _run - mock_executor_class.return_value = mock_executor - - mock_generator = Mock() - mock_generator.get_generation_strategy.return_value = GenerationStrategy.CELL_BY_CELL - mock_generator.config.name = "test_column" - mock_generator.config.column_type = "llm_text" - mock_generator.config.tool_alias = None - - async def _agenerate(record: dict[str, str]) -> dict[str, str]: - return record - - mock_generator.agenerate.side_effect = _agenerate - - builder.batch_manager = Mock() - builder.batch_manager.num_records_batch = 2 - builder.batch_manager.iter_current_batch.return_value = [(0, {"seed": "a"}), (1, {"seed": "b"})] - - builder._fan_out_with_async(mock_generator, max_workers=2) - - work_items = mock_executor.run.call_args.args[0] - submitted_contexts = [context for _coro, context in work_items] - assert submitted_contexts == [ - {"index": 0, "column_name": "test_column"}, - {"index": 1, "column_name": "test_column"}, - ] - - -def test_full_column_custom_generator_error_is_descriptive(stub_resource_provider, stub_model_configs): +def test_full_column_custom_generator_failure_sets_first_error(stub_resource_provider, stub_model_configs): @custom_column_generator(required_columns=["some_id"]) def bad_fn(df: pd.DataFrame) -> pd.DataFrame: raise ValueError("something broke") @@ -624,8 +326,13 @@ def bad_fn(df: pd.DataFrame) -> pd.DataFrame: config.add_column(CustomColumnConfig(name="col", generator_function=bad_fn, generation_strategy="full_column")) builder = DatasetBuilder(data_designer_config=config.build(), resource_provider=stub_resource_provider) - with pytest.raises(DatasetGenerationError, match=r"(?s)Failed to process column 'col'.*something broke"): - builder.build_preview(num_records=3) + result = builder.build_preview(num_records=3) + + assert result.empty + assert builder.first_non_retryable_error is not None + assert "Custom generator function failed for column 'col': something broke" in str( + builder.first_non_retryable_error + ) def test_build_async_preview_returns_empty_dataframe_when_row_group_is_already_freed( @@ -1276,9 +983,6 @@ def _make_sampler_only_builder( def test_build_resume_ordered_seed_dataset_continues_from_next_planned_row(stub_resource_provider, tmp_path): """Regression for issue #709: resume must not replay ordered seed rows.""" - class StopAfterFirstBatch(RuntimeError): - pass - seed_source = DataFrameSeedSource(df=lazy.pd.DataFrame({"name": ["alpha", "beta", "gamma"]})) seed_reader = DataFrameSeedReader() seed_reader.attach(seed_source, Mock()) @@ -1301,11 +1005,7 @@ class StopAfterFirstBatch(RuntimeError): resource_provider=stub_resource_provider, ) - def stop(_path: Path) -> None: - raise StopAfterFirstBatch("simulated interruption") - - with pytest.raises(StopAfterFirstBatch, match="simulated interruption"): - builder.build(num_records=3, on_batch_complete=stop, resume=ResumeMode.NEVER) + builder.build(num_records=1, resume=ResumeMode.NEVER) resumed_seed_reader = DataFrameSeedReader() resumed_seed_reader.attach(seed_source, Mock()) @@ -1402,9 +1102,6 @@ def test_build_resume_ordered_seed_dataset_with_partition_block_continues_within branch end-to-end. """ - class StopAfterFirstBatch(RuntimeError): - pass - seed_source = DataFrameSeedSource(df=lazy.pd.DataFrame({"name": ["a", "b", "c", "d", "e", "f"]})) seed_reader = DataFrameSeedReader() seed_reader.attach(seed_source, Mock()) @@ -1430,11 +1127,7 @@ class StopAfterFirstBatch(RuntimeError): resource_provider=stub_resource_provider, ) - def stop(_path: Path) -> None: - raise StopAfterFirstBatch("simulated interruption") - - with pytest.raises(StopAfterFirstBatch, match="simulated interruption"): - builder.build(num_records=4, on_batch_complete=stop, resume=ResumeMode.NEVER) + builder.build(num_records=1, resume=ResumeMode.NEVER) resumed_seed_reader = DataFrameSeedReader() resumed_seed_reader.attach(seed_source, Mock()) @@ -1462,9 +1155,9 @@ def stop(_path: Path) -> None: def test_build_resume_starts_fresh_without_metadata(stub_resource_provider, stub_test_config_builder, tmp_path, caplog): """resume=True when only the folder exists (no metadata.json) logs an info message and starts fresh. - This covers the case where a run was interrupted before any batch completed β€” the + This covers the case where a run was interrupted before any row group completed - the folder was created by _write_builder_config but metadata.json was never written. - Previously this raised DatasetGenerationError; now it silently restarts from batch 0. + Previously this raised DatasetGenerationError; now it silently restarts from row group 0. """ # Pre-create the folder with content so resolved_dataset_name(resume=True) returns "dataset" dataset_dir = tmp_path / "dataset" @@ -1474,12 +1167,12 @@ def test_build_resume_starts_fresh_without_metadata(stub_resource_provider, stub builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path) with caplog.at_level(logging.INFO): with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_run_batch"): - with patch.object(builder.batch_manager, "finish"): - # resume=False is set internally; build dispatches to the normal (non-resume) path - builder.build(num_records=4, resume=ResumeMode.ALWAYS) + with patch.object(builder, "_build_async", return_value=True) as mock_async: + builder.build(num_records=4, resume=ResumeMode.ALWAYS) - assert any("interrupted before any batch completed" in record.message for record in caplog.records) + _, kwargs = mock_async.call_args + assert kwargs.get("resume") == ResumeMode.NEVER + assert any("interrupted before any row group completed" in record.message for record in caplog.records) def test_build_resume_raises_when_num_records_below_actual(stub_resource_provider, stub_test_config_builder, tmp_path): @@ -1554,7 +1247,7 @@ def test_build_resume_allows_larger_num_records(stub_resource_provider, stub_tes with patch.object(builder_mod, "run_readiness_check"): # 6 > 4 already generated β†’ not already complete, should start generating # Here we just verify it does NOT raise on the num_records check - with patch.object(builder, "_build_with_resume", return_value=True): + with patch.object(builder, "_build_async", return_value=True): builder.build(num_records=6, resume=ResumeMode.ALWAYS) @@ -1626,9 +1319,8 @@ def test_build_if_possible_starts_fresh_on_dropped_column_artifact_policy_mismat ) with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_run_batch"): - with patch.object(builder.batch_manager, "finish"): - final_path = builder.build(num_records=4, resume=ResumeMode.IF_POSSIBLE) + with patch.object(builder, "_build_async", return_value=True): + final_path = builder.build(num_records=4, resume=ResumeMode.IF_POSSIBLE) assert storage.resume == ResumeMode.NEVER assert (dataset_dir / "sentinel.txt").exists() @@ -1799,7 +1491,7 @@ def test_build_marks_post_generation_started_before_running_processors( builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) with patch.object(builder, "_initialize_generators_and_graph", return_value=([], None)): - with patch.object(builder, "_build_with_resume", return_value=True): + with patch.object(builder, "_build_async", return_value=True): with patch.object(builder._processor_runner, "has_processors_for", return_value=True): with patch.object(builder._processor_runner, "run_after_generation", side_effect=RuntimeError("boom")): with pytest.raises(RuntimeError, match="boom"): @@ -1866,80 +1558,6 @@ def test_build_resume_post_generation_processed_missing_target_raises_clearly( builder.build(num_records=4, resume=ResumeMode.ALWAYS) -def test_build_resume_not_already_complete_when_extension_fits_in_slack( - stub_resource_provider, stub_test_config_builder, tmp_path -): - """Non-aligned extension fitting in the last group's slack must not falsely trigger 'already complete'. - - original_target=5, buffer_size=2 β†’ 3 batches [2,2,1]; extending to num_records=6: - ceil(6/2)=3 == num_completed_batches=3 used to trigger the false 'already complete' branch. - Correct total_batches = 3 + ceil(1/2) = 4, so batch 3 (1 record) must be scheduled. - """ - dataset_dir = tmp_path / "dataset" - _write_metadata(dataset_dir, target_num_records=5, buffer_size=2, num_completed_batches=3, actual_num_records=5) - # Three row groups [2, 2, 1] on disk so the unified resume path sees 3 completed batches. - _write_parquet_files(dataset_dir / "parquet-files", [0, 1, 2], row_counts={2: 1}) - - builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) - - with patch.object(builder, "_run_batch") as mock_run_batch: - with patch.object(builder.batch_manager, "finish"): - with patch.object(builder_mod, "run_readiness_check"): - builder.build(num_records=6, resume=ResumeMode.ALWAYS) - - mock_run_batch.assert_called_once() - assert mock_run_batch.call_args.kwargs["current_batch_number"] == 3 - - -def test_build_resume_recovers_progress_from_disk_when_metadata_lags( - stub_resource_provider, stub_test_config_builder, tmp_path, caplog -): - """Sync resume uses parquet files on disk as the source of truth for progress. - - Crash window: ``move_partial_result_to_final_file_path`` succeeded for batch 1 but - ``write_metadata`` had not yet committed the matching ``num_completed_batches`` / - ``actual_num_records`` update. Before unification, sync took the stale metadata - counters at face value and re-generated batch 1, double-counting records. After - unification, both engines derive progress from ``parquet-files/batch_*.parquet``, - so this scenario resolves to "already complete" and skips redundant generation. - """ - dataset_dir = tmp_path / "dataset" - # Metadata lags β€” claims only 1 batch / 2 records committed. - _write_metadata(dataset_dir, target_num_records=4, buffer_size=2, num_completed_batches=1, actual_num_records=2) - # Filesystem truth β€” both row groups written before the crash. - _write_parquet_files(dataset_dir / "parquet-files", [0, 1]) - - builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) - with caplog.at_level(logging.WARNING): - with patch.object(builder, "_run_batch") as mock_run_batch: - with patch.object(builder.batch_manager, "finish"): - with patch.object(builder_mod, "run_readiness_check"): - builder.build(num_records=4, resume=ResumeMode.ALWAYS) - - mock_run_batch.assert_not_called() - assert any("already complete" in record.message for record in caplog.records) - - -def test_build_resume_raises_on_non_contiguous_batch_ids_under_sync( - stub_resource_provider, stub_test_config_builder, tmp_path -): - """Sync resume rejects non-contiguous parquet IDs (likely written by an incompatible engine). - - The sync engine writes batches sequentially, so a hole between batch 0 and batch 2 - can only mean external mutation or data written by a different engine (e.g. the - async engine, which can complete row groups out of order). Letting sync proceed - would silently re-generate batch 1 with stale row counters; raising surfaces the - inconsistency loudly. - """ - dataset_dir = tmp_path / "dataset" - _write_metadata(dataset_dir, target_num_records=6, buffer_size=2, num_completed_batches=2, actual_num_records=4) - _write_parquet_files(dataset_dir / "parquet-files", [0, 2]) - - builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) - with pytest.raises(DatasetGenerationError, match="non-contiguous"): - builder.build(num_records=6, resume=ResumeMode.ALWAYS) - - # --------------------------------------------------------------------------- # Async resume via _build_async tests # --------------------------------------------------------------------------- @@ -1957,9 +1575,8 @@ def test_build_async_resume_logs_warning_when_already_complete( builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) with caplog.at_level(logging.WARNING): - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "run_readiness_check"): - builder.build(num_records=4, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "run_readiness_check"): + builder.build(num_records=4, resume=ResumeMode.ALWAYS) assert any("already complete" in record.message for record in caplog.records) @@ -1980,15 +1597,14 @@ def test_build_async_resume_starts_fresh_without_metadata( builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path) with caplog.at_level(logging.INFO): - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_build_async", return_value=True) as mock_async: - builder.build(num_records=4, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_build_async", return_value=True) as mock_async: + builder.build(num_records=4, resume=ResumeMode.ALWAYS) # _build_async is called with resume=NEVER because the no-metadata path resets the mode _, kwargs = mock_async.call_args assert kwargs.get("resume") == ResumeMode.NEVER - assert any("interrupted before any batch completed" in record.message for record in caplog.records) + assert any("interrupted before any row group completed" in record.message for record in caplog.records) def test_build_async_resume_already_complete_does_not_run_after_generation_processors( @@ -2001,10 +1617,9 @@ def test_build_async_resume_already_complete_does_not_run_after_generation_proce builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder._processor_runner, "run_after_generation") as mock_after: - builder.build(num_records=4, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder._processor_runner, "run_after_generation") as mock_after: + builder.build(num_records=4, resume=ResumeMode.ALWAYS) mock_after.assert_not_called() @@ -2027,10 +1642,9 @@ def test_find_completed_row_groups_used_for_initial_total_batches( builder = _make_resume_builder(stub_resource_provider, stub_test_config_builder, tmp_path, buffer_size=2) # Both row groups are on disk β†’ dataset is already complete β†’ generated=False - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder._processor_runner, "run_after_generation") as mock_after: - builder.build(num_records=4, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder._processor_runner, "run_after_generation") as mock_after: + builder.build(num_records=4, resume=ResumeMode.ALWAYS) # Already complete based on filesystem count (2 files β‰₯ 2 row groups) β€” no generation needed mock_after.assert_not_called() @@ -2072,16 +1686,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - # asyncio and ensure_async_engine_loop are lazy-imported in dataset_builder only when - # DATA_DESIGNER_ASYNC_ENGINE=True at module load time. Inject them for the duration - # of this test so _build_async can proceed past the early-return path. - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - builder.build(num_records=6, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + builder.build(num_records=6, resume=ResumeMode.ALWAYS) # Filesystem says 2 groups done (IDs 0+1) β†’ 2+2 = 4 records, not stale metadata value 2 assert captured["initial_actual_num_records"] == 4 @@ -2181,13 +1791,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - builder.build(num_records=6, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + builder.build(num_records=6, resume=ResumeMode.ALWAYS) assert captured["initial_actual_num_records"] == 3 assert captured["initial_total_num_batches"] == 2 @@ -2224,14 +1833,13 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - # Extend the dataset: new target is 7, original was 5 - builder.build(num_records=7, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + # Extend the dataset: new target is 7, original was 5 + builder.build(num_records=7, resume=ResumeMode.ALWAYS) # Row groups [2, 2, 1] from original 5-record run: 2+2+1=5, not 2+2+2=6 assert captured["initial_actual_num_records"] == 5 @@ -2272,13 +1880,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - builder.build(num_records=9, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + builder.build(num_records=9, resume=ResumeMode.ALWAYS) # 2+2+1 (original) + 2 (extension group 3) = 7, not 4 (which unguarded formula gives) assert captured["initial_actual_num_records"] == 7 @@ -2327,13 +1934,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - builder.build(num_records=9, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + builder.build(num_records=9, resume=ResumeMode.ALWAYS) # original_target=5 β†’ groups 0,1 β†’ 2+2; group 2 β†’ 1; group 3 (ext) β†’ min(2,9-6)=2. Total=7 assert captured["initial_actual_num_records"] == 7 @@ -2370,13 +1976,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - builder.build(num_records=6, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + builder.build(num_records=6, resume=ResumeMode.ALWAYS) # Only rg_id=1 remains; rg_id=0 and rg_id=2 are already on disk assert list(captured["precomputed_row_groups"]) == [(1, 2)] @@ -2415,13 +2020,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): - builder.build(num_records=7, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare): + builder.build(num_records=7, resume=ResumeMode.ALWAYS) # rg_id=3 should have 2 records (7-5=2 extension records, buffer_size=2), not 1 assert list(captured["precomputed_row_groups"]) == [(3, 2)] @@ -2454,13 +2058,12 @@ def capturing_prepare(*args, **kwargs): mock_future = Mock() mock_future.result = Mock(return_value=None) - with patch.object(flags, "DATA_DESIGNER_ASYNC_ENGINE", True): - with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): - with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): - with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): - with patch.object(builder_mod, "run_readiness_check"): - with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare) as mock_prepare: - builder.build(num_records=6, resume=ResumeMode.ALWAYS) + with patch.object(builder_mod, "asyncio", stdlib_asyncio, create=True): + with patch.object(builder_mod, "ensure_async_engine_loop", Mock(return_value=Mock()), create=True): + with patch.object(stdlib_asyncio, "run_coroutine_threadsafe", return_value=mock_future): + with patch.object(builder_mod, "run_readiness_check"): + with patch.object(builder, "_prepare_async_run", side_effect=capturing_prepare) as mock_prepare: + builder.build(num_records=6, resume=ResumeMode.ALWAYS) # _prepare_async_run must be called β€” the dataset is NOT already complete mock_prepare.assert_called_once() diff --git a/packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dataset_batch_manager.py b/packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dataset_batch_manager.py deleted file mode 100644 index d477a39c5..000000000 --- a/packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dataset_batch_manager.py +++ /dev/null @@ -1,455 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -from __future__ import annotations - -import json -from unittest.mock import Mock, patch - -import pytest - -import data_designer.lazy_heavy_imports as lazy -from data_designer.engine.dataset_builders.utils.dataset_batch_manager import DatasetBatchManager -from data_designer.engine.dataset_builders.utils.errors import DatasetBatchManagementError -from data_designer.engine.storage.artifact_storage import BatchStage - - -@pytest.fixture -def stub_batch_manager(artifact_storage): - return DatasetBatchManager(artifact_storage) - - -@pytest.fixture -def stub_batch_manager_with_data(artifact_storage): - manager = DatasetBatchManager(artifact_storage) - manager.start(num_records=10, buffer_size=3) - return manager - - -def test_initial_state_properties(stub_batch_manager): - assert stub_batch_manager.num_batches == 0 - assert stub_batch_manager.num_records_in_buffer == 0 - assert stub_batch_manager.buffer_is_empty is True - - with pytest.raises(DatasetBatchManagementError, match="num_records_list.*not set"): - _ = stub_batch_manager.num_records_list - - with pytest.raises(DatasetBatchManagementError, match="buffer_size.*not set"): - _ = stub_batch_manager.buffer_size - - with pytest.raises(DatasetBatchManagementError, match="Invalid batch number"): - _ = stub_batch_manager.num_records_batch - - -# Test start method -def test_start_with_valid_parameters(stub_batch_manager): - stub_batch_manager.start(num_records=10, buffer_size=3) - - assert stub_batch_manager.num_batches == 4 - assert stub_batch_manager.buffer_size == 3 - assert stub_batch_manager.num_records_list == [3, 3, 3, 1] - assert stub_batch_manager._current_batch_number == 0 - assert stub_batch_manager.buffer_is_empty is True - - -def test_start_with_exact_buffer_size(stub_batch_manager): - """Test start method when num_records is exactly divisible by buffer_size.""" - stub_batch_manager.start(num_records=9, buffer_size=3) - - assert stub_batch_manager.num_batches == 3 - assert stub_batch_manager.num_records_list == [3, 3, 3] - - -def test_start_with_single_batch(stub_batch_manager): - stub_batch_manager.start(num_records=2, buffer_size=5) - - assert stub_batch_manager.num_batches == 1 - assert stub_batch_manager.num_records_list == [2] - - -def test_start_with_invalid_num_records(stub_batch_manager): - with pytest.raises(DatasetBatchManagementError, match="num_records must be positive"): - stub_batch_manager.start(num_records=0, buffer_size=3) - - with pytest.raises(DatasetBatchManagementError, match="num_records must be positive"): - stub_batch_manager.start(num_records=-1, buffer_size=3) - - -def test_start_with_invalid_buffer_size(stub_batch_manager): - with pytest.raises(DatasetBatchManagementError, match="buffer_size must be positive"): - stub_batch_manager.start(num_records=10, buffer_size=0) - - with pytest.raises(DatasetBatchManagementError, match="buffer_size must be positive"): - stub_batch_manager.start(num_records=10, buffer_size=-1) - - -# Test buffer management -def test_add_single_record(stub_batch_manager_with_data): - record = {"id": 1, "name": "test"} - stub_batch_manager_with_data.add_record(record) - - assert stub_batch_manager_with_data.num_records_in_buffer == 1 - assert stub_batch_manager_with_data.buffer_is_empty is False - assert stub_batch_manager_with_data._buffer[0] == record - - -def test_add_multiple_records(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - assert stub_batch_manager_with_data.num_records_in_buffer == 3 - assert stub_batch_manager_with_data._buffer == records - - -def test_add_records_exceeds_buffer_size(stub_batch_manager_with_data): - invalid_buffer_size = stub_batch_manager_with_data.buffer_size + 1 - records = [{"id": i, "name": f"test{i}"} for i in range(invalid_buffer_size)] - - with pytest.raises(DatasetBatchManagementError, match="Buffer size exceeded"): - stub_batch_manager_with_data.add_records(records) - - -def test_drop_records(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - stub_batch_manager_with_data.drop_records([0, 2]) - - assert stub_batch_manager_with_data.num_records_in_buffer == 1 - assert stub_batch_manager_with_data._buffer[0] == {"id": 1, "name": "test1"} - - -def test_drop_records_with_empty_list(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - stub_batch_manager_with_data.drop_records([]) - - assert stub_batch_manager_with_data.num_records_in_buffer == 3 - - -def test_update_record(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - new_record = {"id": 1, "name": "updated"} - stub_batch_manager_with_data.update_record(1, new_record) - - assert stub_batch_manager_with_data._buffer[1] == new_record - - -def test_update_record_invalid_index(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - with pytest.raises(IndexError, match="Index.*out of bounds"): - stub_batch_manager_with_data.update_record(5, {"id": 5, "name": "test"}) - - with pytest.raises(IndexError, match="Index.*out of bounds"): - stub_batch_manager_with_data.update_record(-1, {"id": -1, "name": "test"}) - - -def test_replace_buffer(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - new_records = [{"id": i, "name": f"updated{i}"} for i in range(3)] - stub_batch_manager_with_data.replace_buffer(new_records) - - assert stub_batch_manager_with_data._buffer == new_records - - -def test_replace_buffer_wrong_length(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - wrong_length_records = [{"id": i, "name": f"test{i}"} for i in range(2)] - - with pytest.raises(DatasetBatchManagementError, match="Number of records.*must match"): - stub_batch_manager_with_data.replace_buffer(wrong_length_records) - - -# Test write method -def test_write_empty_buffer(stub_batch_manager_with_data): - result = stub_batch_manager_with_data.write() - assert result is None - - -def test_write_with_data(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - result = stub_batch_manager_with_data.write() - - assert result is not None - assert result.exists() - assert ( - result.name - == stub_batch_manager_with_data.artifact_storage.create_batch_file_path(0, BatchStage.PARTIAL_RESULT).name - ) - - df = lazy.pd.read_parquet(result) - expected_df = lazy.pd.DataFrame(records) - lazy.pd.testing.assert_frame_equal(df, expected_df) - - -def test_write_creates_partial_results_dir(stub_batch_manager_with_data): - records = [{"id": 1, "name": "test"}] - stub_batch_manager_with_data.add_records(records) - - stub_batch_manager_with_data.write() - - assert stub_batch_manager_with_data.artifact_storage.partial_results_path.exists() - - -# Test finish_batch method -def test_finish_batch_basic_functionality(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - result = stub_batch_manager_with_data.finish_batch() - - assert result.exists() - assert result.parent == stub_batch_manager_with_data.artifact_storage.final_dataset_path - assert ( - result.name - == stub_batch_manager_with_data.artifact_storage.create_batch_file_path(0, BatchStage.PARTIAL_RESULT).name - ) - - # Verify metadata file was created - assert stub_batch_manager_with_data.artifact_storage.metadata_file_path.exists() - - # Verify buffer was cleared - assert stub_batch_manager_with_data.buffer_is_empty - assert stub_batch_manager_with_data._current_batch_number == 1 - - -def test_finish_batch_all_batches_processed(stub_batch_manager_with_data): - # Process all batches with data - for i in range(stub_batch_manager_with_data.num_batches): - # Add some data for each batch - records = [{"id": j, "name": f"batch{i}_{j}"} for j in range(3)] - stub_batch_manager_with_data.add_records(records) - stub_batch_manager_with_data.finish_batch() - - with pytest.raises(DatasetBatchManagementError, match="All batches have been processed"): - stub_batch_manager_with_data.finish_batch() - - -def test_finish_batch_empty_buffer_logs_warning_and_returns_none( - stub_batch_manager_with_data, caplog: pytest.LogCaptureFixture -) -> None: - result = stub_batch_manager_with_data.finish_batch() - - assert result is None - assert "finished without any results to write" in caplog.text - assert stub_batch_manager_with_data.buffer_is_empty - assert stub_batch_manager_with_data.get_current_batch_number() == 1 - assert not stub_batch_manager_with_data.artifact_storage.metadata_file_path.exists() - - -def test_finish_batch_empty_buffer_does_not_call_on_complete(stub_batch_manager_with_data) -> None: - on_complete = Mock() - - result = stub_batch_manager_with_data.finish_batch(on_complete=on_complete) - - assert result is None - on_complete.assert_not_called() - - -def test_finish_batch_metadata_content(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - stub_batch_manager_with_data.finish_batch() - - with open(stub_batch_manager_with_data.artifact_storage.metadata_file_path) as f: - metadata = json.load(f) - - assert metadata["target_num_records"] == 10 - assert metadata["actual_num_records"] == 3 # actual records written in this batch - assert metadata["total_num_batches"] == 4 - assert metadata["buffer_size"] == 3 - assert metadata["num_completed_batches"] == 1 - assert metadata["dataset_name"] == stub_batch_manager_with_data.artifact_storage.dataset_name - assert "schema" in metadata - assert "file_paths" in metadata - assert isinstance(metadata["file_paths"], dict) - assert "parquet-files" in metadata["file_paths"] - assert isinstance(metadata["file_paths"]["parquet-files"], list) - assert len(metadata["file_paths"]["parquet-files"]) == 1 - assert metadata["file_paths"]["parquet-files"][0] == "parquet-files/batch_00000.parquet" - - # processor-files key should not exist if no processor files - assert "processor-files" not in metadata["file_paths"] - - -# Test finish method -def test_finish_with_empty_partial_results(stub_batch_manager_with_data): - # Accessing the partial results path to trigger the creation of the directory - stub_batch_manager_with_data.artifact_storage.mkdir_if_needed( - stub_batch_manager_with_data.artifact_storage.partial_results_path - ) - - stub_batch_manager_with_data.finish() - - # Directory should be removed - assert not ( - stub_batch_manager_with_data.artifact_storage.base_dataset_path - / stub_batch_manager_with_data.artifact_storage.partial_results_folder_name - ).exists() - - -def test_finish_with_partial_results(stub_batch_manager_with_data, caplog): - ( - stub_batch_manager_with_data.artifact_storage.mkdir_if_needed( - stub_batch_manager_with_data.artifact_storage.partial_results_path - ) - / "test_file.txt" - ).touch() - - stub_batch_manager_with_data.finish() - - # Directory should still exist - assert stub_batch_manager_with_data.artifact_storage.partial_results_path.exists() - - # Warning should be logged - assert "Dataset writing finished with partial results" in caplog.text - - -# Test reset method -def test_reset_without_delete_files(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - stub_batch_manager_with_data._current_batch_number = 2 - - stub_batch_manager_with_data.reset() - - assert stub_batch_manager_with_data._current_batch_number == 0 - assert stub_batch_manager_with_data.buffer_is_empty - # final_dataset_path doesn't exist until finish_batch is called - assert not ( - stub_batch_manager_with_data.artifact_storage.base_dataset_path - / stub_batch_manager_with_data.artifact_storage.final_dataset_folder_name - ).exists() - - -def test_reset_with_delete_files(stub_batch_manager_with_data): - ( - stub_batch_manager_with_data.artifact_storage.mkdir_if_needed( - stub_batch_manager_with_data.artifact_storage.final_dataset_path - ) - / "test.parquet" - ).touch() - (stub_batch_manager_with_data.artifact_storage.metadata_file_path).touch() - - stub_batch_manager_with_data.reset(delete_files=True) - - assert stub_batch_manager_with_data._current_batch_number == 0 - assert stub_batch_manager_with_data.buffer_is_empty - assert not stub_batch_manager_with_data.artifact_storage.final_dataset_path.exists() - assert not stub_batch_manager_with_data.artifact_storage.metadata_file_path.exists() - - -# Test error handling and edge cases -def test_num_records_batch_invalid_batch_number(stub_batch_manager_with_data): - stub_batch_manager_with_data._current_batch_number = 10 # Beyond num_batches - - with pytest.raises(DatasetBatchManagementError, match="Invalid batch number"): - _ = stub_batch_manager_with_data.num_records_batch - - -def test_write_exception_handling(stub_batch_manager_with_data): - records = [{"id": i, "name": f"test{i}"} for i in range(3)] - stub_batch_manager_with_data.add_records(records) - - with patch("pandas.DataFrame.to_parquet", side_effect=Exception("Write error")): - with pytest.raises(DatasetBatchManagementError, match="Failed to write batch"): - stub_batch_manager_with_data.write() - - -def test_reset_delete_files_exception_handling(stub_batch_manager_with_data): - stub_batch_manager_with_data.artifact_storage.mkdir_if_needed( - stub_batch_manager_with_data.artifact_storage.final_dataset_path - ) - (stub_batch_manager_with_data.artifact_storage.final_dataset_path / "test.parquet").touch() - - with patch("shutil.rmtree", side_effect=OSError("Delete error")): - with pytest.raises(DatasetBatchManagementError, match="Failed to delete directory"): - stub_batch_manager_with_data.reset(delete_files=True) - - -def test_full_workflow(stub_batch_manager): - stub_batch_manager.start(num_records=7, buffer_size=3) - - assert stub_batch_manager.num_batches == 3 - assert stub_batch_manager.num_records_list == [3, 3, 1] - - # Process first batch - records1 = [{"id": i, "name": f"batch1_{i}"} for i in range(3)] - stub_batch_manager.add_records(records1) - result1 = stub_batch_manager.finish_batch() - - assert result1.exists() - assert stub_batch_manager.get_current_batch_number() == 1 - assert stub_batch_manager.buffer_is_empty - - # Process second batch - records2 = [{"id": i, "name": f"batch2_{i}"} for i in range(3)] - stub_batch_manager.add_records(records2) - result2 = stub_batch_manager.finish_batch() - - assert result2.exists() - assert stub_batch_manager.get_current_batch_number() == 2 - - # Process final batch - records3 = [{"id": 0, "name": "batch3_0"}] - stub_batch_manager.add_records(records3) - result3 = stub_batch_manager.finish_batch() - - assert result3.exists() - assert stub_batch_manager.get_current_batch_number() == 3 - - # Finish - stub_batch_manager.finish() - - # Verify all files exist - assert stub_batch_manager.artifact_storage.metadata_file_path.exists() - assert len(list(stub_batch_manager.artifact_storage.final_dataset_path.glob("*.parquet"))) == 3 - - -# --------------------------------------------------------------------------- -# start() with resume parameters -# --------------------------------------------------------------------------- - - -def test_start_with_start_batch(stub_batch_manager): - """start_batch shifts _current_batch_number so the loop skips already-done batches.""" - stub_batch_manager.start(num_records=10, buffer_size=3, start_batch=2) - - assert stub_batch_manager._current_batch_number == 2 - assert stub_batch_manager.num_batches == 4 - assert stub_batch_manager.buffer_is_empty is True - - -def test_start_with_initial_actual_num_records(stub_batch_manager): - """initial_actual_num_records pre-populates the running total for resumed runs.""" - stub_batch_manager.start(num_records=10, buffer_size=3, initial_actual_num_records=6) - - assert stub_batch_manager._actual_num_records == 6 - - -def test_start_with_start_batch_and_initial_actual_num_records(stub_batch_manager): - """Both resume params can be set together.""" - stub_batch_manager.start(num_records=10, buffer_size=3, start_batch=2, initial_actual_num_records=6) - - assert stub_batch_manager._current_batch_number == 2 - assert stub_batch_manager._actual_num_records == 6 - - -def test_start_default_values_unchanged(stub_batch_manager): - """Default call (no resume params) still starts at batch 0 with 0 actual records.""" - stub_batch_manager.start(num_records=10, buffer_size=3) - - assert stub_batch_manager._current_batch_number == 0 - assert stub_batch_manager._actual_num_records == 0 diff --git a/packages/data-designer-engine/tests/engine/models/test_async_engine_switch.py b/packages/data-designer-engine/tests/engine/models/test_async_engine_switch.py deleted file mode 100644 index b1a749155..000000000 --- a/packages/data-designer-engine/tests/engine/models/test_async_engine_switch.py +++ /dev/null @@ -1,49 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -from __future__ import annotations - -from unittest.mock import MagicMock - -from data_designer.config.column_configs import GenerationStrategy -from data_designer.engine.dataset_builders.dataset_builder import DatasetBuilder -from data_designer.engine.models.facade import ModelFacade - - -def test_model_facade_has_async_methods() -> None: - """ModelFacade exposes async variants of its core methods.""" - assert hasattr(ModelFacade, "acompletion") - assert hasattr(ModelFacade, "agenerate") - assert hasattr(ModelFacade, "agenerate_text_embeddings") - - -def test_model_facade_has_sync_methods() -> None: - """ModelFacade exposes synchronous core methods.""" - assert hasattr(ModelFacade, "completion") - assert hasattr(ModelFacade, "generate") - assert hasattr(ModelFacade, "generate_text_embeddings") - - -def test_async_engine_env_controls_builder_execution_path() -> None: - """When DATA_DESIGNER_ASYNC_ENGINE is set, _run_cell_by_cell_generator dispatches to async fan-out.""" - - mock_generator = MagicMock() - mock_generator.get_generation_strategy.return_value = GenerationStrategy.CELL_BY_CELL - mock_generator.inference_parameters.max_parallel_requests = 4 - - builder = MagicMock() - builder._resource_provider.run_config.non_inference_max_parallel_workers = 4 - - # Test with async enabled β€” uses max_parallel_requests from generator (same as sync) - builder._use_async = True - DatasetBuilder._run_cell_by_cell_generator(builder, mock_generator) - builder._fan_out_with_async.assert_called_once_with(mock_generator, max_workers=4) - builder._fan_out_with_threads.assert_not_called() - - builder.reset_mock() - - # Test with async disabled β€” uses max_parallel_requests from generator - builder._use_async = False - DatasetBuilder._run_cell_by_cell_generator(builder, mock_generator) - builder._fan_out_with_threads.assert_called_once_with(mock_generator, max_workers=4) - builder._fan_out_with_async.assert_not_called() diff --git a/packages/data-designer-engine/tests/engine/test_readiness.py b/packages/data-designer-engine/tests/engine/test_readiness.py index 2231ea3b2..5bfed588b 100644 --- a/packages/data-designer-engine/tests/engine/test_readiness.py +++ b/packages/data-designer-engine/tests/engine/test_readiness.py @@ -14,7 +14,6 @@ from data_designer.config.custom_column import custom_column_generator from data_designer.config.models import ModelConfig from data_designer.config.sampler_params import SamplerType, UUIDSamplerParams -from data_designer.engine import flags from data_designer.engine.dataset_builders.errors import DatasetGenerationError from data_designer.engine.mcp.registry import MCPRegistry from data_designer.engine.models.clients.adapters.http_model_client import ClientConcurrencyMode @@ -22,16 +21,6 @@ from data_designer.engine.resources.resource_provider import ResourceProvider -@pytest.fixture(autouse=True) -def _force_sync_engine(monkeypatch: pytest.MonkeyPatch) -> None: - """Pin readiness tests to the sync engine. - - Lets us assert against ``run_health_check`` directly without standing up - an event loop. - """ - monkeypatch.setattr(flags, "DATA_DESIGNER_ASYNC_ENGINE", False) - - def _build_columns( *, model_configs: list[ModelConfig], @@ -315,14 +304,8 @@ def test_run_readiness_check_passes_skip_flagged_aliases_to_registry( def test_run_readiness_check_dispatches_to_async_registry_under_async_engine( stub_resource_provider, stub_model_configs, - monkeypatch: pytest.MonkeyPatch, ) -> None: - """When the async engine is selected, model probes route through ``arun_health_check``. - - The autouse fixture pins sync; this test overrides for the async path so the - branch in ``readiness._run_model_health_check`` gets coverage. - """ - monkeypatch.setattr(flags, "DATA_DESIGNER_ASYNC_ENGINE", True) + """Async client mode routes model probes through ``arun_health_check``.""" stub_resource_provider.model_registry.arun_health_check = Mock() stub_resource_provider.mcp_registry = None @@ -356,10 +339,8 @@ def test_run_readiness_check_dispatches_to_async_registry_under_async_engine( def test_run_readiness_check_cancels_future_and_reraises_on_timeout( stub_resource_provider, stub_model_configs, - monkeypatch: pytest.MonkeyPatch, ) -> None: """A 180-second timeout cancels the future and re-raises ``TimeoutError``.""" - monkeypatch.setattr(flags, "DATA_DESIGNER_ASYNC_ENGINE", True) stub_resource_provider.model_registry.arun_health_check = Mock() stub_resource_provider.mcp_registry = None @@ -388,10 +369,8 @@ def test_run_readiness_check_cancels_future_and_reraises_on_timeout( def test_run_readiness_check_uses_sync_registry_for_sync_mode_clients( stub_resource_provider, stub_model_configs, - monkeypatch: pytest.MonkeyPatch, ) -> None: - """Readiness follows the explicit client mode, not only the raw async env flag.""" - monkeypatch.setattr(flags, "DATA_DESIGNER_ASYNC_ENGINE", True) + """Readiness follows the explicit client mode.""" stub_resource_provider.model_registry.run_health_check = Mock() stub_resource_provider.model_registry.arun_health_check = Mock() stub_resource_provider.mcp_registry = None diff --git a/packages/data-designer/src/data_designer/interface/data_designer.py b/packages/data-designer/src/data_designer/interface/data_designer.py index 32be9e9be..07cf4ff24 100644 --- a/packages/data-designer/src/data_designer/interface/data_designer.py +++ b/packages/data-designer/src/data_designer/interface/data_designer.py @@ -5,7 +5,6 @@ import asyncio import logging -import warnings from collections.abc import Callable from pathlib import Path from typing import TYPE_CHECKING @@ -36,7 +35,6 @@ MODEL_PROVIDERS_FILE_PATH, ) from data_designer.config.utils.info import InfoType, InterfaceInfo -from data_designer.engine import flags from data_designer.engine.analysis.dataset_profiler import DataDesignerDatasetProfiler, DatasetProfilerConfig from data_designer.engine.compiler import compile_data_designer_config from data_designer.engine.dataset_builders.dataset_builder import DatasetBuilder @@ -308,9 +306,6 @@ def create( # Surface the original task error when the run produced 0 records due to a # deterministic non-retryable failure (e.g. bad seed source). Without this, # the user sees a generic FileNotFoundError-on-parquet that obscures the cause. - # ``actual_num_records`` is set only on the async path; sync runs leave it at - # ``-1`` and ``first_non_retryable_error`` at ``None``, so this branch is - # async-only by construction. root_cause = builder.first_non_retryable_error if root_cause is not None and builder.actual_num_records == 0: raise DataDesignerGenerationError(f"πŸ›‘ {type(root_cause).__name__}: {root_cause}") from root_cause @@ -536,7 +531,7 @@ def check_models(self, config_builder: DataDesignerConfigBuilder) -> None: run_readiness_check( config_builder.build().columns, resource_provider, - client_concurrency_mode=self._resolve_client_concurrency_mode(config_builder), + client_concurrency_mode=ClientConcurrencyMode.ASYNC, ) def get_default_model_configs(self) -> list[ModelConfig]: @@ -620,7 +615,11 @@ def get_models(self, model_aliases: list[str]) -> dict[str, ModelFacade]: Dict mapping alias to ModelFacade instance. """ config_builder = DataDesignerConfigBuilder() - resource_provider = self._create_resource_provider("dev", config_builder) + resource_provider = self._create_resource_provider( + "dev", + config_builder, + client_concurrency_mode=ClientConcurrencyMode.SYNC, + ) return {alias: resource_provider.model_registry.get_model(model_alias=alias) for alias in model_aliases} def _resolve_model_providers(self, model_providers: list[ModelProvider] | None) -> list[ModelProvider]: @@ -673,6 +672,7 @@ def _create_resource_provider( *, resume: ResumeMode = ResumeMode.NEVER, artifact_path: Path | None = None, + client_concurrency_mode: ClientConcurrencyMode = ClientConcurrencyMode.ASYNC, ) -> ResourceProvider: artifact_path = artifact_path or self._artifact_path ArtifactStorage.mkdir_if_needed(artifact_path) @@ -692,7 +692,7 @@ def _create_resource_provider( run_config=self._run_config, mcp_providers=self._mcp_providers, tool_configs=config_builder.tool_configs, - client_concurrency_mode=self._resolve_client_concurrency_mode(config_builder), + client_concurrency_mode=client_concurrency_mode, request_admission=self._request_admission, ) @@ -701,25 +701,5 @@ def _create_request_admission_controller(self) -> AdaptiveRequestAdmissionContro return create_request_admission_controller(self._run_config) - @staticmethod - def _resolve_client_concurrency_mode(config_builder: DataDesignerConfigBuilder) -> ClientConcurrencyMode: - """Pick the model-client mode that matches the engine the run will use. - - The async engine is the default. Users can still opt into the legacy sync - engine with ``DATA_DESIGNER_ASYNC_ENGINE=0`` for the transitional release. - """ - if not flags.DATA_DESIGNER_ASYNC_ENGINE: - # Deliberate opt-out via env var. Surface the deprecation so users - # know the sync path is going away. - msg = ( - "DATA_DESIGNER_ASYNC_ENGINE=0 selects the legacy sync engine, which is " - "deprecated and will be removed in a future release. Unset the variable " - "(or set it to 1) to use the async engine." - ) - logger.warning(f"⚠️ {msg}") - warnings.warn(msg, DeprecationWarning, stacklevel=3) - return ClientConcurrencyMode.SYNC - return ClientConcurrencyMode.ASYNC - def _get_interface_info(self, model_providers: list[ModelProvider]) -> InterfaceInfo: return InterfaceInfo(model_providers=model_providers) diff --git a/packages/data-designer/tests/interface/test_data_designer.py b/packages/data-designer/tests/interface/test_data_designer.py index 69c1f626a..988a8137d 100644 --- a/packages/data-designer/tests/interface/test_data_designer.py +++ b/packages/data-designer/tests/interface/test_data_designer.py @@ -6,7 +6,6 @@ import contextlib import json import logging -import warnings from datetime import datetime from pathlib import Path from typing import Any @@ -36,7 +35,6 @@ FileContentsSeedSource, HuggingFaceSeedSource, ) -from data_designer.engine import flags from data_designer.engine.models.clients.adapters.http_model_client import ClientConcurrencyMode from data_designer.engine.resources.seed_reader import ( FileSystemSeedReader, @@ -392,49 +390,6 @@ def stub_seed_reader(): return StubHuggingFaceSeedReader() -@pytest.mark.parametrize( - "env_value,expected,expect_deprecation", - [ - ("1", "async", False), - ("0", "sync", True), - ], - ids=[ - "async-on-uses-async-clients", - "async-off-uses-sync-clients-and-warns", - ], -) -def test_resolve_client_concurrency_mode_matches_engine_choice( - env_value: str, - expected: str, - expect_deprecation: bool, - monkeypatch: pytest.MonkeyPatch, -) -> None: - """Client mode must match the engine the run will actually use. - - The ``DATA_DESIGNER_ASYNC_ENGINE=0`` opt-out path also emits a - ``DeprecationWarning`` so users on the legacy sync engine see a - pre-removal signal in their logs. - """ - monkeypatch.setattr(flags, "DATA_DESIGNER_ASYNC_ENGINE", env_value == "1") - builder = DataDesignerConfigBuilder() - builder.add_column( - SamplerColumnConfig( - name="seed", - sampler_type=SamplerType.CATEGORY, - params=CategorySamplerParams(values=["a"]), - ) - ) - - if expect_deprecation: - with pytest.warns(DeprecationWarning, match="legacy sync engine"): - mode = DataDesigner._resolve_client_concurrency_mode(builder) - else: - with warnings.catch_warnings(): - warnings.simplefilter("error", DeprecationWarning) - mode = DataDesigner._resolve_client_concurrency_mode(builder) - assert mode.value == expected - - def test_init_with_custom_secret_resolver(stub_artifact_path, stub_model_providers): """Test DataDesigner initialization with custom secret resolver.""" designer = DataDesigner( @@ -583,6 +538,20 @@ def test_run_config_normalizes_error_rate_when_disabled(stub_artifact_path, stub assert data_designer.run_config.shutdown_error_rate == 1.0 +def test_get_models_uses_sync_clients(stub_artifact_path, stub_model_providers): + data_designer = DataDesigner(artifact_path=stub_artifact_path, model_providers=stub_model_providers) + + with patch.object(data_designer, "_create_resource_provider") as mock_resource_provider_method: + mock_resource_provider = MagicMock() + mock_resource_provider.model_registry.get_model.return_value = MagicMock() + mock_resource_provider_method.return_value = mock_resource_provider + + data_designer.get_models(["stub-model"]) + + _, kwargs = mock_resource_provider_method.call_args + assert kwargs["client_concurrency_mode"] == ClientConcurrencyMode.SYNC + + def test_create_forwards_on_batch_complete_callback( stub_artifact_path: Path, stub_model_providers: list[ModelProvider], diff --git a/scripts/benchmarks/benchmark_engine_v2.py b/scripts/benchmarks/benchmark_engine_v2.py index 74c201993..5b35662b7 100644 --- a/scripts/benchmarks/benchmark_engine_v2.py +++ b/scripts/benchmarks/benchmark_engine_v2.py @@ -11,7 +11,6 @@ import hashlib import json import math -import os import random import statistics import subprocess @@ -277,18 +276,6 @@ def _format_stats(stats: MetricStats, *, unit: str, precision: int = 3) -> str: return f"{mean}{unit} Β± {ci}{unit} (stdev {stdev}{unit}, n={stats.n})" -def _format_speed_stats(stats: MetricStats, *, precision: int = 2) -> str: - fmt = f"{{:.{precision}f}}" - mean = fmt.format(stats.mean) - ci = fmt.format(stats.ci_half_width) - stdev = fmt.format(stats.stdev) - return f"{mean}x Β± {ci}x (stdev {stdev}x, n={stats.n})" - - -def _significant_diff(stats: MetricStats) -> bool: - return stats.n > 1 and abs(stats.mean) > stats.ci_half_width - - def _json_default(value: Any) -> Any: if isinstance(value, np.generic): return value.item() @@ -589,13 +576,12 @@ def _dataset_fingerprint(df: pd.DataFrame) -> str: def _run_single_benchmark(settings: BenchmarkSettings, engine_mode: str) -> BenchmarkResult: - # Imports are deferred so engine selection respects DATA_DESIGNER_ASYNC_ENGINE. - from data_designer.engine.dataset_builders.artifact_storage import ArtifactStorage from data_designer.engine.dataset_builders.dataset_builder import DatasetBuilder from data_designer.engine.model_provider import resolve_model_provider_registry from data_designer.engine.resources.resource_provider import create_resource_provider from data_designer.engine.resources.seed_reader import SeedReaderRegistry from data_designer.engine.secret_resolver import CompositeResolver, EnvironmentResolver, PlaintextResolver + from data_designer.engine.storage.artifact_storage import ArtifactStorage random.seed(settings.seed) np.random.seed(settings.seed) @@ -630,7 +616,6 @@ def _run_single_benchmark(settings: BenchmarkSettings, engine_mode: str) -> Benc secret_resolver=secret_resolver, model_provider_registry=model_provider_registry, seed_reader_registry=SeedReaderRegistry(readers=[]), - blob_storage=None, seed_dataset_source=None, run_config=run_config, mcp_providers=[mcp_provider], @@ -664,15 +649,9 @@ def _run_single_benchmark(settings: BenchmarkSettings, engine_mode: str) -> Benc def _run_subprocess(settings: BenchmarkSettings, engine_mode: str) -> BenchmarkResult: - env = os.environ.copy() - if engine_mode == "async": - env["DATA_DESIGNER_ASYNC_ENGINE"] = "1" - else: - env.pop("DATA_DESIGNER_ASYNC_ENGINE", None) - script_path = Path(__file__).resolve() cmd = [sys.executable, str(script_path), "--mode", "run", "--engine", engine_mode, *settings.to_cli_args()] - completed = subprocess.run(cmd, capture_output=True, text=True, env=env, check=False) + completed = subprocess.run(cmd, capture_output=True, text=True, check=False) if completed.returncode != 0: raise RuntimeError(f"Benchmark subprocess failed.\nstdout:\n{completed.stdout}\nstderr:\n{completed.stderr}") @@ -687,12 +666,6 @@ def _run_subprocess(settings: BenchmarkSettings, engine_mode: str) -> BenchmarkR ) -def _format_speedup(sync_time: float, async_time: float) -> str: - if async_time <= 0: - return "n/a" - return f"{(sync_time / async_time):.2f}x" - - def _run_with_progress(settings: BenchmarkSettings, engine_mode: str, iteration: int, total: int) -> BenchmarkResult: print(f"[{iteration}/{total}] Running {engine_mode} benchmark...", end="", flush=True) result = _run_subprocess(settings, engine_mode) @@ -701,68 +674,32 @@ def _run_with_progress(settings: BenchmarkSettings, engine_mode: str, iteration: def _compare_runs(settings: BenchmarkSettings, iterations: int) -> int: - sync_results: list[BenchmarkResult] = [] - async_results: list[BenchmarkResult] = [] + results: list[BenchmarkResult] = [] expected_hash: str | None = None for iteration in range(1, iterations + 1): - sync_result = _run_with_progress(settings, "sync", iteration, iterations) - async_result = _run_with_progress(settings, "async", iteration, iterations) - - if sync_result.dataset_hash != async_result.dataset_hash: - print( - "Content mismatch detected: " - f"sync hash {sync_result.dataset_hash} vs async hash {async_result.dataset_hash}" - ) - return 1 + result = _run_with_progress(settings, "async", iteration, iterations) if expected_hash is None: - expected_hash = sync_result.dataset_hash - elif expected_hash != sync_result.dataset_hash or expected_hash != async_result.dataset_hash: + expected_hash = result.dataset_hash + elif expected_hash != result.dataset_hash: print("Content mismatch detected across iterations.") return 1 - sync_results.append(sync_result) - async_results.append(async_result) - - build_sync = [result.build_time_sec for result in sync_results] - build_async = [result.build_time_sec for result in async_results] - total_sync = [result.total_time_sec for result in sync_results] - total_async = [result.total_time_sec for result in async_results] + results.append(result) - build_speedups = [sync / async_ for sync, async_ in zip(build_sync, build_async)] - total_speedups = [sync / async_ for sync, async_ in zip(total_sync, total_async)] - build_diffs = [sync - async_ for sync, async_ in zip(build_sync, build_async)] - total_diffs = [sync - async_ for sync, async_ in zip(total_sync, total_async)] + build_times = [result.build_time_sec for result in results] + total_times = [result.total_time_sec for result in results] - build_sync_stats = _compute_stats(build_sync) - build_async_stats = _compute_stats(build_async) - total_sync_stats = _compute_stats(total_sync) - total_async_stats = _compute_stats(total_async) - - build_speed_stats = _compute_stats(build_speedups) - total_speed_stats = _compute_stats(total_speedups) - build_diff_stats = _compute_stats(build_diffs) - total_diff_stats = _compute_stats(total_diffs) + build_stats = _compute_stats(build_times) + total_stats = _compute_stats(total_times) latency_label = "on" if settings.simulated_latency else "off" - print("\nEngine benchmark summary (95% CI)") + print("\nAsync engine benchmark summary (95% CI)") print(f"- runs: {iterations} | content match: yes | hash {expected_hash}") print(f"- simulated latency: {latency_label}") - print(f"- build time sync: {_format_stats(build_sync_stats, unit='s')}") - print(f"- build time async: {_format_stats(build_async_stats, unit='s')}") - print( - f"- build speedup: {_format_speed_stats(build_speed_stats)} | " - f"paired diff {_format_stats(build_diff_stats, unit='s')} | " - f"significant: {'yes' if _significant_diff(build_diff_stats) else 'no'}" - ) - print(f"- total time sync: {_format_stats(total_sync_stats, unit='s')}") - print(f"- total time async: {_format_stats(total_async_stats, unit='s')}") - print( - f"- total speedup: {_format_speed_stats(total_speed_stats)} | " - f"paired diff {_format_stats(total_diff_stats, unit='s')} | " - f"significant: {'yes' if _significant_diff(total_diff_stats) else 'no'}" - ) + print(f"- build time: {_format_stats(build_stats, unit='s')}") + print(f"- total time: {_format_stats(total_stats, unit='s')}") return 0 @@ -776,13 +713,13 @@ def _parse_args() -> argparse.Namespace: type=str, choices=("compare", "run"), default="compare", - help="Run both engines in subprocesses, or run once in the current process.", + help="Run repeated async benchmarks in subprocesses, or run once in the current process.", ) parser.add_argument( "--engine", type=str, - choices=("sync", "async"), - default="sync", + choices=("async",), + default="async", help="Engine mode for --mode run.", ) parser.add_argument("--num-records", type=int, default=DEFAULT_NUM_RECORDS, help="Records to generate.") @@ -792,7 +729,7 @@ def _parse_args() -> argparse.Namespace: "--iterations", type=int, default=DEFAULT_ITERATIONS, - help="Number of sync/async runs to include in the compare mode.", + help="Number of async runs to include in compare mode.", ) parser.add_argument( "--max-parallel-requests", @@ -829,11 +766,6 @@ def main() -> None: if args.mode == "compare": sys.exit(_compare_runs(settings, args.iterations)) - if args.engine == "async": - os.environ["DATA_DESIGNER_ASYNC_ENGINE"] = "1" - else: - os.environ.pop("DATA_DESIGNER_ASYNC_ENGINE", None) - print(f"Running {args.engine} benchmark...") result = _run_single_benchmark(settings, args.engine) print(f"{RESULT_PREFIX}{json.dumps(result.to_dict(), sort_keys=True)}")