Commit 47e52e5
authored
chore: plans for async generators and task-queue dataset builder (#347)
* chore: plans for async generators and task-queue dataset builder
Part of #346
* address review feedback on async generators plan
- Decouple scheduler semaphore (coarse resource guard) from PR #344's
adaptive throttle manager (per-key API concurrency)
- Add side-effect output column mapping to dependency resolution
- Mandate cell-level merge writes, remove unsafe update_record option
- Add is_stateful generator property for reentrancy control
- Add retry & salvage policy for transient failures
- Scope allow_resize out of async v1 (falls back to sync)
- Fix asyncio.to_thread reference semantics, require defensive copies
- Add new test cases for all above
* add symmetric generate/agenerate bridge and plugin compatibility notes
- Only one of generate/agenerate needs to be implemented; base class
bridges in both directions (to_thread for sync→async, asyncio.run
for async→sync)
- Document impact on column generator plugins, custom columns, and
processor plugins (all backward-compatible, no changes required)
* add reference diagrams and clarify statefulness concept
- Add plans/346/diagrams.md with 5 Mermaid diagrams: task lifecycle,
scheduler main loop, dependency resolution example, concurrency
layers, and row group pipelining
- Clarify in plan that statefulness (concurrency safety) and sync/async
(I/O model) are orthogonal concerns
* add edge case handling and open questions
- Eager row-drop propagation: failed rows are dropped across all columns
to avoid wasting compute on incomplete rows
- Out-of-order row group checkpointing via index-based file naming
- Pre-batch processor failure skips the entire row group
- Salvage rounds get separate error threshold and config knobs
- Undersized last row group note
- Open questions: thread pool sizing, silent task hangs
* refine async scheduler plan safeguards
- add submission budget controls to prevent unbounded parked tasks
- clarify DAG validation, safe async-to-sync bridging, and row-scoped drop policy
- align diagrams and unresolved risk wording with latest design decisions
* refine plan with UX considerations and design clarifications
- add UX considerations section: progress display strategy, peak memory
cap, new config knobs, async custom columns and plugin upgrade path,
what stays the same
- replace allow_resize silent fallback with explicit DatasetGenerationError
at startup; move to Follow-ups section
- consolidate all deferred work into Out of scope / Follow-ups subsections
- fix five internal inconsistencies: progress tracking in Step 4, missing
async_max_concurrent_row_groups in scheduler constructor, annotate
_ensure_async_engine_loop as existing, SamplerColumnGenerator dual-wrapper
scope (applies to all FromScratchColumnGenerator subclasses),
stateful serialization vs row group admission clarification
- resolve previously open decisions: asyncio.Event over Condition,
task_model as own module, async_max_concurrent_row_groups default 3,
async_salvage_max_attempts_per_task dropped in favour of max_rounds+1
semantics, thread pool keep default for v1
- fix CustomColumnGenerator FULL_COLUMN async path (needs own agenerate
branching on strategy); note ValidationColumnGenerator internal threading
* document relation to PR #269 and fix scheduler diagram
- add "Relation to PR #269" section explaining what we adopted
(dependency source, trait inference, completion tracker design,
statefulness separation) and what we changed (row-group tasks
instead of cell-level nodes, ROW_STREAMABLE omitted)
- fix scheduler main loop diagram: add async_max_concurrent_row_groups
admission step, pre-batch failure path (skip row group + release
slot), and loop back to ADMIT after row group completion
* add profiling/tracing section to async scheduler plan
- TaskTrace dataclass spec in Step 3 (opt-in, zero overhead when disabled)
- trace=True param on AsyncTaskScheduler constructor in Step 4
- Step 8 benchmark references trace for timing measurements
- New Profiling section: instrumentation points, example output table, usage snippet
* refine async scheduler plan from review
- Replace dependency map with static ExecutionGraph class (upstream,
downstream, strategy, topological_order, critical_path, task_count,
to_mermaid accessors)
- Use row-group-local indices in CompletionTracker instead of global
- Clarify from-scratch columns are FULL_COLUMN with empty upstream
deps, not a separate strategy enum value
- Remove reference to non-existent ColumnGeneratorCellByCell
- Expand PR #269 comparison: ExecutionTraits → GenerationStrategy
* fix graph complexity notation and clarify tracker API
- Correct "N columns, N edges" to "O(C) nodes, O(C²) edges worst-case"
- Add dispatched set param to get_ready_tasks to prevent double-dispatch
- Clarify is_row_group_complete drop_row interaction
* add PR breakdown, code sketches, and throttle note
- Add PR breakdown section with 4 PRs, dependency graph, and
"what works after merge" for each
- Add code-sketches.md with structural sketches of main components
- Reorganize test plan by PR (unit tests per PR, integration in PR 4)
- Note that throttle manager (PR #344) is optional; scheduler works
without it initially
* refine concurrency model and add multi-column handling
- Rename scheduler semaphore to execution semaphore for clarity
- Split execution semaphore from submission budget as distinct
concerns with separate semaphores
- Add reacquire step to dispatch pattern after throttle wait
- Add multi-column generator handling via instance dedup on the
scheduler (graph stays column-level)
* add compute-bound generator risk and follow-up
GIL contention with CPU-bound custom generators and event loop
starvation with native async compute are documented as v1 risks.
ProcessPoolExecutor routing via is_cpu_bound noted as follow-up.
* clarify compute-bound risk as thread pool starvation
Compute-heavy tasks saturating the thread pool starve I/O-bound
tasks (LLM calls) from acquiring threads, not just GIL contention.
* add async guidance for plugins and custom columns
Compute-bound plugins should implement generate(), not agenerate(),
to keep CPU work off the event loop. Same rule for custom columns:
only use async def for I/O-bound work.1 parent 982ce79 commit 47e52e5
3 files changed
Lines changed: 1514 additions & 0 deletions
0 commit comments