Commit 47e52e5

authored

chore: plans for async generators and task-queue dataset builder (#347)

* chore: plans for async generators and task-queue dataset builder Part of #346 * address review feedback on async generators plan - Decouple scheduler semaphore (coarse resource guard) from PR #344's adaptive throttle manager (per-key API concurrency) - Add side-effect output column mapping to dependency resolution - Mandate cell-level merge writes, remove unsafe update_record option - Add is_stateful generator property for reentrancy control - Add retry & salvage policy for transient failures - Scope allow_resize out of async v1 (falls back to sync) - Fix asyncio.to_thread reference semantics, require defensive copies - Add new test cases for all above * add symmetric generate/agenerate bridge and plugin compatibility notes - Only one of generate/agenerate needs to be implemented; base class bridges in both directions (to_thread for sync→async, asyncio.run for async→sync) - Document impact on column generator plugins, custom columns, and processor plugins (all backward-compatible, no changes required) * add reference diagrams and clarify statefulness concept - Add plans/346/diagrams.md with 5 Mermaid diagrams: task lifecycle, scheduler main loop, dependency resolution example, concurrency layers, and row group pipelining - Clarify in plan that statefulness (concurrency safety) and sync/async (I/O model) are orthogonal concerns * add edge case handling and open questions - Eager row-drop propagation: failed rows are dropped across all columns to avoid wasting compute on incomplete rows - Out-of-order row group checkpointing via index-based file naming - Pre-batch processor failure skips the entire row group - Salvage rounds get separate error threshold and config knobs - Undersized last row group note - Open questions: thread pool sizing, silent task hangs * refine async scheduler plan safeguards - add submission budget controls to prevent unbounded parked tasks - clarify DAG validation, safe async-to-sync bridging, and row-scoped drop policy - align diagrams and unresolved risk wording with latest design decisions * refine plan with UX considerations and design clarifications - add UX considerations section: progress display strategy, peak memory cap, new config knobs, async custom columns and plugin upgrade path, what stays the same - replace allow_resize silent fallback with explicit DatasetGenerationError at startup; move to Follow-ups section - consolidate all deferred work into Out of scope / Follow-ups subsections - fix five internal inconsistencies: progress tracking in Step 4, missing async_max_concurrent_row_groups in scheduler constructor, annotate _ensure_async_engine_loop as existing, SamplerColumnGenerator dual-wrapper scope (applies to all FromScratchColumnGenerator subclasses), stateful serialization vs row group admission clarification - resolve previously open decisions: asyncio.Event over Condition, task_model as own module, async_max_concurrent_row_groups default 3, async_salvage_max_attempts_per_task dropped in favour of max_rounds+1 semantics, thread pool keep default for v1 - fix CustomColumnGenerator FULL_COLUMN async path (needs own agenerate branching on strategy); note ValidationColumnGenerator internal threading * document relation to PR #269 and fix scheduler diagram - add "Relation to PR #269" section explaining what we adopted (dependency source, trait inference, completion tracker design, statefulness separation) and what we changed (row-group tasks instead of cell-level nodes, ROW_STREAMABLE omitted) - fix scheduler main loop diagram: add async_max_concurrent_row_groups admission step, pre-batch failure path (skip row group + release slot), and loop back to ADMIT after row group completion * add profiling/tracing section to async scheduler plan - TaskTrace dataclass spec in Step 3 (opt-in, zero overhead when disabled) - trace=True param on AsyncTaskScheduler constructor in Step 4 - Step 8 benchmark references trace for timing measurements - New Profiling section: instrumentation points, example output table, usage snippet * refine async scheduler plan from review - Replace dependency map with static ExecutionGraph class (upstream, downstream, strategy, topological_order, critical_path, task_count, to_mermaid accessors) - Use row-group-local indices in CompletionTracker instead of global - Clarify from-scratch columns are FULL_COLUMN with empty upstream deps, not a separate strategy enum value - Remove reference to non-existent ColumnGeneratorCellByCell - Expand PR #269 comparison: ExecutionTraits → GenerationStrategy * fix graph complexity notation and clarify tracker API - Correct "N columns, N edges" to "O(C) nodes, O(C²) edges worst-case" - Add dispatched set param to get_ready_tasks to prevent double-dispatch - Clarify is_row_group_complete drop_row interaction * add PR breakdown, code sketches, and throttle note - Add PR breakdown section with 4 PRs, dependency graph, and "what works after merge" for each - Add code-sketches.md with structural sketches of main components - Reorganize test plan by PR (unit tests per PR, integration in PR 4) - Note that throttle manager (PR #344) is optional; scheduler works without it initially * refine concurrency model and add multi-column handling - Rename scheduler semaphore to execution semaphore for clarity - Split execution semaphore from submission budget as distinct concerns with separate semaphores - Add reacquire step to dispatch pattern after throttle wait - Add multi-column generator handling via instance dedup on the scheduler (graph stays column-level) * add compute-bound generator risk and follow-up GIL contention with CPU-bound custom generators and event loop starvation with native async compute are documented as v1 risks. ProcessPoolExecutor routing via is_cpu_bound noted as follow-up. * clarify compute-bound risk as thread pool starvation Compute-heavy tasks saturating the thread pool starve I/O-bound tasks (LLM calls) from acquiring threads, not just GIL contention. * add async guidance for plugins and custom columns Compute-bound plugins should implement generate(), not agenerate(), to keep CPU work off the event loop. Same rule for custom columns: only use async def for I/O-bound work.

1 parent 982ce79 commit 47e52e5Copy full SHA for 47e52e5

3 files changed

plans/346

Comments

(0)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commit 47e52e5

Uh oh!

File tree

0 commit comments