Async scheduler: large ready queues burn CPU in queue observation #724

@eric-tramel

Priority Level

High (Major functionality broken)

Describe the bug

Very large async DataDesigner jobs can become CPU-bound in scheduler queue observation and selection before LLM endpoint capacity is saturated. In repeated large simulated runs, the scheduler spent most of its wall time rebuilding queue views and selecting tasks, while healthy mock endpoints still had spare capacity and record-completion throughput collapsed.

This directly affects the async scheduler goals of maximizing throughput, minimizing endpoint idle time, and supporting fire-and-forget large jobs.

Steps/Code to reproduce bug

Use a throwaway internal scheduler harness with mock ColumnGenerator implementations and no product-code changes. Configure:

  • 50k to 2M logical records
  • 8 to 64 independent LLM-like columns
  • active frontiers from 512 to 16,384 tasks
  • endpoint/request capacity from 64 to 1,024 slots
  • realistic compressed latency with jitter and occasional tails
  • AsyncTaskScheduler with tracing disabled unless explicitly testing trace overhead

Representative scenarios:

Scenario A: 1,000,000 records x 64 columns, active frontier 4,096, endpoint cap 512
Scenario B: 2,000,000 records x 64 columns, active frontier 2,048, endpoint cap 1,024, low endpoint latency
Scenario C: 250,000 records x 16 columns, active frontier 4,096, 2-12 KB response payloads

Instrument FairTaskQueue.view(), FairTaskQueue.select_next(), completed model calls, endpoint occupancy, event-loop/cancellation timing, CPU time, and RSS.
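A minimal sketch of how such instrumentation could be attached in a harness without touching product code. The `timed`, `instrument`, and `report` helpers and the `ToyQueue` class are hypothetical stand-ins, not part of DataDesigner. In a real harness the same `instrument(...)` call would target `FairTaskQueue`, assuming `view()` and `select_next()` are plain synchronous methods.

```python
import time
from collections import defaultdict
from functools import wraps

# Accumulated wall time and call counts, keyed by qualified function name.
_totals = defaultdict(float)
_calls = defaultdict(int)


def timed(fn):
    """Wrap a synchronous callable and accumulate its wall time."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            _totals[fn.__qualname__] += time.perf_counter() - start
            _calls[fn.__qualname__] += 1
    return wrapper


def instrument(cls, *method_names):
    """Monkeypatch the named methods on cls in place (harness use only)."""
    for name in method_names:
        setattr(cls, name, timed(getattr(cls, name)))


def report(wall_seconds):
    """Return {name: (calls, seconds, fraction_of_wall)}."""
    return {
        name: (_calls[name], secs, secs / wall_seconds)
        for name, secs in _totals.items()
    }


class ToyQueue:
    """Hypothetical stand-in for the real queue; view() copies every item."""

    def __init__(self, n):
        self.items = list(range(n))

    def view(self):
        return list(self.items)  # O(n) scan on every call

    def select_next(self):
        return self.items.pop() if self.items else None


instrument(ToyQueue, "view", "select_next")
queue = ToyQueue(2_000)
run_start = time.perf_counter()
while queue.select_next() is not None:
    queue.view()  # worst case: a full view after every selection
stats = report(time.perf_counter() - run_start)
for name, (calls, secs, frac) in stats.items():
    print(f"{name}: calls={calls} seconds={secs:.4f} share_of_wall={frac:.1%}")
```

If one instrumented method calls another, the inner time is counted under both names, so per-function shares from a wrapper like this do not necessarily sum to 100% of wall time.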

Observed examples from the investigation:

| Scenario | Logical calls | Frontier | Completed calls | Wall time | Endpoint util | Queue bookkeeping | CPU |
|---|---|---|---|---|---|---|---|
| 1M x 64, cap 512 | 64,000,000 | 4,096 | 604 | 32.5s | 21.7% | 84.9% wall | high |
| 2M x 64, cap 1,024 | 128,000,000 | 2,048 | 4,096 | 26.0s | 0.01% | 113.4% wall | 98.9% |
| 250k x 16, payloads | 4,000,000 | 4,096 | 60 | 81.0s | 2.7% | 94.7% wall | 98.8% |

Expected behavior

Large async jobs should be primarily constrained by configured endpoint/request capacity and mock endpoint latency. Queue diagnostics and scheduler bookkeeping should not require full ready-queue scans on hot paths often enough to dominate runtime.
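As a rough way to quantify "constrained by capacity and latency", the sketch below compares the achieved completion rate from the observed examples (completed calls divided by wall time) with a steady-state ceiling from Little's law, slots divided by mean latency. The mean latency used here is a hypothetical placeholder, since the issue does not report one. Only the completed-call counts, wall times, and endpoint caps are taken from the report, and the third scenario is left out because its endpoint cap is not stated.

```python
def capacity_bound(slots: int, mean_latency_s: float) -> float:
    """Little's law ceiling: completions per second with all slots busy."""
    return slots / mean_latency_s


def achieved_rate(completed_calls: int, wall_seconds: float) -> float:
    """Observed completions per second over the run."""
    return completed_calls / wall_seconds


# Hypothetical mean mock latency; NOT a figure from the investigation.
ASSUMED_MEAN_LATENCY_S = 0.5

# (completed calls, wall seconds, endpoint cap) from the observed examples.
observed = {
    "1M x 64, cap 512": (604, 32.5, 512),
    "2M x 64, cap 1,024": (4096, 26.0, 1024),
}

rates = {}
for name, (completed, wall, cap) in observed.items():
    rate = achieved_rate(completed, wall)
    bound = capacity_bound(cap, ASSUMED_MEAN_LATENCY_S)
    rates[name] = (rate, bound)
    print(f"{name}: achieved {rate:.1f} calls/s, ceiling {bound:.0f} calls/s")
```

Whatever the true latency is, a job limited by the endpoint should sit near the ceiling. A gap of one or more orders of magnitude points at the scheduler rather than the endpoint.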

Agent Diagnostic / Prior Investigation

The investigation repeatedly sampled hot paths in queue observation and selection:

FairTaskQueue.view()
_record_observed_task_state()
_dispatch_queued_tasks()
_main_dispatch_loop()

The strongest signal was that increasing the endpoint cap or frontier size did not improve throughput proportionally. In several runs, larger frontiers made throughput worse because queue observation monopolized the event loop. Cancellation and timebox behavior also degraded because the scheduler could not yield promptly while doing queue bookkeeping.
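The effect on yielding can be made visible with an event-loop lag probe: a task that sleeps for a fixed interval and records how late it wakes up. The sketch below is illustrative only. `measure_loop_lag` and `blocking_bookkeeping` are hypothetical names, and the busy loop is just a stand-in for any synchronous scan that holds the loop without awaiting.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list[float]:
    """Sleep `interval` seconds repeatedly; record how late each wake-up is."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def blocking_bookkeeping(seconds: float) -> None:
    """Hypothetical stand-in for a synchronous queue scan that never yields."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass  # busy-wait: the event loop cannot run anything else


async def demo() -> list[float]:
    probe = asyncio.create_task(measure_loop_lag(interval=0.01, samples=5))
    await asyncio.sleep(0)            # let the probe start its first sleep
    await blocking_bookkeeping(0.1)   # hold the loop for roughly 100 ms
    return await probe


lags = asyncio.run(demo())
print("max wake-up lag (s):", round(max(lags), 3))
```

The first sample wakes late by roughly the length of the blocking section, while later samples return to near zero. Cancellation requests and timeouts suffer the same delay, since they are also delivered through the loop.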

This appears related to full FairTaskQueue.view() construction inside scheduler hot paths, including _record_observed_task_state() and admission-blocked diagnostics.

Additional context

Suggested direction:

  • Maintain queue counts and resource demand incrementally as queue state mutates.
  • Avoid full queue-view scans in _record_observed_task_state() on every dispatch/completion.
  • Separate expensive diagnostics from hot-path capacity accounting.
  • Gate detailed queue diagnostics behind tracing or periodic health snapshots.
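The first and last suggestions could look roughly like the sketch below: aggregate counts are updated on every push and pop, so the hot path reads them in O(1), while detailed diagnostics are built only when a throttle allows it. This is a toy model under assumed names (`Task`, `CountingQueue`, `ThrottledSnapshot`, the `group` and `slots` fields), not the actual `FairTaskQueue` design.

```python
import time
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class Task:
    group: str      # hypothetical fairness key, e.g. a column name
    slots: int = 1  # hypothetical resource demand of the task


@dataclass
class CountingQueue:
    """Toy ready queue that keeps aggregates current on every mutation."""
    _by_group: dict = field(default_factory=dict)
    ready_per_group: Counter = field(default_factory=Counter)
    total_ready: int = 0
    total_demand: int = 0

    def push(self, task: Task) -> None:
        self._by_group.setdefault(task.group, deque()).append(task)
        self.ready_per_group[task.group] += 1
        self.total_ready += 1
        self.total_demand += task.slots

    def pop(self, group: str):
        pending = self._by_group.get(group)
        if not pending:
            return None
        task = pending.popleft()
        self.ready_per_group[group] -= 1
        self.total_ready -= 1
        self.total_demand -= task.slots
        return task

    def summary(self) -> tuple:
        """Hot-path accounting: O(1), no scan of queued tasks."""
        return self.total_ready, self.total_demand


class ThrottledSnapshot:
    """Run an expensive diagnostic at most once per `min_interval` seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._last = None

    def maybe(self, build):
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval:
            return None  # too soon: skip the expensive work
        self._last = now
        return build()


queue = CountingQueue()
for i in range(6):
    queue.push(Task(group=f"col{i % 2}", slots=2))
queue.pop("col0")
print(queue.summary(), dict(queue.ready_per_group))

fake_now = [0.0]  # injectable clock keeps the demo deterministic
snapshots = ThrottledSnapshot(min_interval=5.0, clock=lambda: fake_now[0])
first = snapshots.maybe(lambda: dict(queue.ready_per_group))
fake_now[0] = 1.0
second = snapshots.maybe(lambda: dict(queue.ready_per_group))
fake_now[0] = 5.0
third = snapshots.maybe(lambda: dict(queue.ready_per_group))
```

The tradeoff is that every mutation path must update the counters consistently. A periodic full-scan check in tests or under tracing can catch drift without putting the scan back on the hot path.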

Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.

Checklist

  • I reproduced this issue or provided a minimal example
  • I searched the docs/issues myself, or had my agent do so
  • If I used an agent, I included its diagnostics above

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
