Priority Level
High (Major functionality broken)
Describe the bug
Very large async DataDesigner jobs can become CPU-bound in scheduler queue observation/selection before LLM endpoint capacity is saturated. In repeated large simulated runs, the scheduler spent most of its wall time rebuilding queue views and selecting tasks, while healthy mock endpoints had spare capacity and record completion throughput collapsed.
This directly affects the async scheduler goals of maximizing throughput, minimizing endpoint idle time, and supporting fire-and-forget large jobs.
Steps/Code to reproduce bug
Use a throwaway internal scheduler harness with mock ColumnGenerator implementations and no product-code changes. Configure:
- 50k to 2M logical records
- 8 to 64 independent LLM-like columns
- active frontiers from 512 to 16,384 tasks
- endpoint/request capacity from 64 to 1,024 slots
- realistic compressed latency with jitter and occasional tails
- AsyncTaskScheduler with tracing disabled unless explicitly testing trace overhead
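For anyone rebuilding the harness, the mock endpoint latency could be sketched roughly as below. This is not the harness used in the investigation: `sample_latency`, `mock_endpoint_call`, and every default value (base latency, jitter fraction, tail probability and multiplier) are hypothetical names and numbers chosen for illustration.

```python
import asyncio
import random


def sample_latency(rng, base_s=0.02, jitter=0.25, tail_prob=0.01, tail_mult=10.0):
    """Return a compressed latency in seconds with jitter and an occasional tail.

    All defaults are illustrative assumptions, not values from the investigation.
    """
    # Uniform jitter of +/- `jitter` (as a fraction) around the base latency.
    latency = base_s * (1.0 + rng.uniform(-jitter, jitter))
    # With probability `tail_prob`, stretch the call into a slow tail.
    if rng.random() < tail_prob:
        latency *= tail_mult
    return latency


async def mock_endpoint_call(semaphore, rng, **latency_kwargs):
    """Hold one endpoint slot for a sampled latency, then return that latency."""
    async with semaphore:  # the semaphore stands in for the endpoint/request cap
        latency = sample_latency(rng, **latency_kwargs)
        await asyncio.sleep(latency)
        return latency
```

An `asyncio.Semaphore(512)` here would play the role of an endpoint cap of 512 slots; a seeded `random.Random` keeps runs repeatable.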
Representative scenarios:
Scenario A: 1,000,000 records x 64 columns, active frontier 4,096, endpoint cap 512
Scenario B: 2,000,000 records x 64 columns, active frontier 2,048, endpoint cap 1,024, low endpoint latency
Scenario C: 250,000 records x 16 columns, active frontier 4,096, 2-12 KB response payloads
Instrument FairTaskQueue.view(), FairTaskQueue.select_next(), completed model calls, endpoint occupancy, event-loop/cancellation timing, CPU time, and RSS.
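One way to collect the per-method wall time without product-code changes is to wrap the methods from the harness. The sketch below assumes the instrumented methods are plain synchronous methods; `timed`, `instrument`, and `timings` are hypothetical helper names, not part of DataDesigner.

```python
import functools
import time
from collections import defaultdict

# Accumulated call counts and wall-clock seconds per instrumented method name.
timings = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def timed(name):
    """Decorator that adds the wall time of each call to timings[name]."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                entry = timings[name]
                entry["calls"] += 1
                entry["seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


def instrument(cls, method_name):
    """Replace cls.method_name with a timed wrapper (synchronous methods only)."""
    original = getattr(cls, method_name)
    setattr(cls, method_name, timed(f"{cls.__name__}.{method_name}")(original))
```

In the harness this would be applied as `instrument(FairTaskQueue, "view")` and `instrument(FairTaskQueue, "select_next")` before the run; dividing the accumulated seconds by total wall time gives a queue-bookkeeping share like the one reported in the table.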
Observed examples from the investigation:
| Scenario | Logical calls | Frontier | Completed calls | Wall time | Endpoint util | Queue bookkeeping | CPU |
|---|---|---|---|---|---|---|---|
| 1M x 64, cap 512 | 64,000,000 | 4,096 | 604 | 32.5s | 21.7% | 84.9% wall | high |
| 2M x 64, cap 1,024 | 128,000,000 | 2,048 | 4,096 | 26.0s | 0.01% | 113.4% wall | 98.9% |
| 250k x 16, payloads | 4,000,000 | 4,096 | 60 | 81.0s | 2.7% | 94.7% wall | 98.8% |
Expected behavior
Large async jobs should be primarily constrained by configured endpoint/request capacity and mock endpoint latency. Queue diagnostics and scheduler bookkeeping should not require full ready-queue scans on hot paths often enough to dominate runtime.
Agent Diagnostic / Prior Investigation
The investigation repeatedly sampled hot paths in queue observation and selection:
- FairTaskQueue.view()
- _record_observed_task_state()
- _dispatch_queued_tasks()
- _main_dispatch_loop()
The strongest signal was that increasing endpoint caps/frontier did not improve throughput proportionally. In several runs, larger frontiers made throughput worse because queue observation consumed the event loop. Cancellation/timebox behavior also degraded because the scheduler could not yield promptly while doing queue bookkeeping.
This appears related to full FairTaskQueue.view() construction inside scheduler hot paths, including _record_observed_task_state() and admission-blocked diagnostics.
Additional context
Suggested direction:
- Maintain queue counts and resource demand incrementally as queue state mutates.
- Avoid full queue-view scans in _record_observed_task_state() on every dispatch/completion.
- Separate expensive diagnostics from hot-path capacity accounting.
- Gate detailed queue diagnostics behind tracing or periodic health snapshots.
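As a rough illustration of the first suggestion, counts and resource demand can be updated at the moment a task enters or leaves the queue, so a snapshot never walks individual tasks. This is a minimal sketch, not FairTaskQueue's actual structure: `Task`, `CountedQueue`, and the `group`/`resource` keys are hypothetical.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    task_id: int
    group: str     # hypothetical fairness key, e.g. a column name
    resource: str  # hypothetical endpoint/resource key


@dataclass
class CountedQueue:
    """Per-group FIFO queues whose counters are maintained on every mutation."""
    _queues: dict = field(default_factory=dict)
    ready_by_group: Counter = field(default_factory=Counter)
    demand_by_resource: Counter = field(default_factory=Counter)
    total_ready: int = 0

    def push(self, task):
        self._queues.setdefault(task.group, deque()).append(task)
        self.ready_by_group[task.group] += 1
        self.demand_by_resource[task.resource] += 1
        self.total_ready += 1

    def pop(self, group):
        """Remove and return the oldest task in `group`, or None if it is empty."""
        queue = self._queues.get(group)
        if not queue:
            return None
        task = queue.popleft()
        self.ready_by_group[group] -= 1
        self.demand_by_resource[task.resource] -= 1
        self.total_ready -= 1
        return task

    def snapshot(self):
        """Summary whose cost scales with groups and resources, not queued tasks."""
        return {
            "total_ready": self.total_ready,
            "ready_by_group": {k: v for k, v in self.ready_by_group.items() if v},
            "demand_by_resource": {k: v for k, v in self.demand_by_resource.items() if v},
        }
```

With this shape, hot-path capacity accounting reads the counters directly, and anything that needs per-task detail can stay behind tracing or a periodic health snapshot.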
Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.
Checklist