Async scheduler large ready queues burn CPU in queue observation

### Priority Level

High (Major functionality broken)

### Describe the bug

Very large async DataDesigner jobs can become CPU-bound in scheduler queue observation/selection before LLM endpoint capacity is saturated. In repeated large simulated runs, the scheduler spent most of its wall time rebuilding queue views and selecting tasks. Meanwhile, healthy mock endpoints still had spare capacity, and record-completion throughput collapsed.

This directly affects the async scheduler goals of maximizing throughput, minimizing endpoint idle time, and supporting fire-and-forget large jobs.

### Steps/Code to reproduce bug

Use a throwaway internal scheduler harness with mock `ColumnGenerator` implementations and no product-code changes. Configure:

- 50k to 2M logical records
- 8 to 64 independent LLM-like columns
- active frontiers from 512 to 16,384 tasks
- endpoint/request capacity from 64 to 1,024 slots
- realistic compressed latency with jitter and occasional tails
- `AsyncTaskScheduler` with tracing disabled unless explicitly testing trace overhead

Representative scenarios:

```text
Scenario A: 1,000,000 records x 64 columns, active frontier 4,096, endpoint cap 512
Scenario B: 2,000,000 records x 64 columns, active frontier 2,048, endpoint cap 1,024, low endpoint latency
Scenario C: 250,000 records x 16 columns, active frontier 4,096, 2-12 KB response payloads
```

Instrument `FairTaskQueue.view()`, `FairTaskQueue.select_next()`, completed model calls, endpoint occupancy, event-loop/cancellation timing, CPU time, and RSS.
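A minimal sketch of how the per-method wall-time share could be collected in a harness, assuming the instrumented methods are synchronous. `ToyQueue` is a hypothetical stand-in used only so the example runs on its own; in the harness the target would be `FairTaskQueue`, and the helper names `timed` and `instrument` are illustrative, not part of the product code.

```python
import functools
import time
from collections import defaultdict

# Accumulated wall time and call counts, keyed by "Class.method".
wall_seconds = defaultdict(float)
call_counts = defaultdict(int)


def timed(name):
    """Wrap a synchronous callable so each call adds its wall time to wall_seconds[name]."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                wall_seconds[name] += time.perf_counter() - start
                call_counts[name] += 1
        return wrapper
    return decorator


def instrument(cls, method_names):
    """Monkeypatch the named methods of cls with timing wrappers (harness-only)."""
    for method_name in method_names:
        original = getattr(cls, method_name)
        setattr(cls, method_name, timed(f"{cls.__name__}.{method_name}")(original))


class ToyQueue:
    """Hypothetical stand-in for the real queue class."""
    def __init__(self):
        self.items = []

    def view(self):
        # Full copy of the queue, i.e. a scan proportional to queue length.
        return list(self.items)

    def select_next(self):
        return self.items.pop(0) if self.items else None


instrument(ToyQueue, ["view", "select_next"])
```

With this approach, time spent in an instrumented method that is called from another instrumented method is counted in both totals, so summed shares can exceed 100% of wall time unless nesting is accounted for.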

Observed examples from the investigation:

| Scenario | Logical calls | Frontier | Completed calls | Wall time | Endpoint util | Queue bookkeeping | CPU |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1M x 64, cap 512 | 64,000,000 | 4,096 | 604 | 32.5s | 21.7% | 84.9% wall | high |
| 2M x 64, cap 1,024 | 128,000,000 | 2,048 | 4,096 | 26.0s | 0.01% | 113.4% wall | 98.9% |
| 250k x 16, payloads | 4,000,000 | 4,096 | 60 | 81.0s | 2.7% | 94.7% wall | 98.8% |
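
For scale, the completed-call and wall-time columns work out to roughly 18.6, 157.5, and 0.7 completed calls per second for the three rows. The short script below reproduces that arithmetic from the table values; nothing in it is new measurement data.

```python
# (scenario, logical calls, completed calls, wall seconds), copied from the table above.
rows = [
    ("1M x 64, cap 512", 64_000_000, 604, 32.5),
    ("2M x 64, cap 1,024", 128_000_000, 4_096, 26.0),
    ("250k x 16, payloads", 4_000_000, 60, 81.0),
]


def summarize(rows):
    """Return (scenario, completed calls per second, fraction of logical calls completed)."""
    summary = []
    for name, logical, completed, wall_s in rows:
        summary.append((name, completed / wall_s, completed / logical))
    return summary


for name, rate, fraction in summarize(rows):
    print(f"{name}: {rate:.1f} calls/s, {fraction:.6%} of logical calls completed")
```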

### Expected behavior

Large async jobs should be primarily constrained by configured endpoint/request capacity and mock endpoint latency. Queue diagnostics and scheduler bookkeeping should not require full ready-queue scans on hot paths often enough to dominate runtime.

### Agent Diagnostic / Prior Investigation

Profiling samples taken during the investigation repeatedly landed in queue observation and selection hot paths:

```text
FairTaskQueue.view()
_record_observed_task_state()
_dispatch_queued_tasks()
_main_dispatch_loop()
```

The strongest signal was that increasing endpoint caps/frontier did not improve throughput proportionally. In several runs, larger frontiers made throughput worse because queue observation consumed the event loop. Cancellation/timebox behavior also degraded because the scheduler could not yield promptly while doing queue bookkeeping.

This appears related to full `FairTaskQueue.view()` construction inside scheduler hot paths, including `_record_observed_task_state()` and admission-blocked diagnostics.

### Additional context

Suggested direction:

- Maintain queue counts and resource demand incrementally as queue state mutates.
- Avoid full queue-view scans in `_record_observed_task_state()` on every dispatch/completion.
- Separate expensive diagnostics from hot-path capacity accounting.
- Gate detailed queue diagnostics behind tracing or periodic health snapshots.
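
The direction above could look roughly like the sketch below. It is a toy FIFO queue, not the `FairTaskQueue` implementation: `Task`, `CountedQueue`, `SnapshotGate`, and the `column` grouping key are all hypothetical names chosen for illustration. The point is only that counts are updated where the queue mutates, so hot-path accounting never scans the queue, and the full scan is reserved for gated diagnostics.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int
    column: str  # hypothetical grouping key, e.g. the column a task generates


class CountedQueue:
    """Toy FIFO queue that updates per-column counts on every mutation."""

    def __init__(self):
        self._tasks = deque()
        self._per_column = Counter()

    def push(self, task):
        self._tasks.append(task)
        self._per_column[task.column] += 1

    def pop(self):
        if not self._tasks:
            return None
        task = self._tasks.popleft()
        self._per_column[task.column] -= 1
        if self._per_column[task.column] == 0:
            del self._per_column[task.column]
        return task

    def ready_count(self):
        # O(1): no scan of queued tasks.
        return len(self._tasks)

    def demand_by_column(self):
        # Cost scales with the number of columns, not the number of queued tasks.
        return dict(self._per_column)

    def detailed_view(self):
        # Full scan; intended only for tracing or periodic health snapshots.
        return list(self._tasks)


class SnapshotGate:
    """Allow an expensive diagnostic at most once every `every` events."""

    def __init__(self, every):
        self.every = every
        self._seen = 0

    def should_run(self):
        self._seen += 1
        return self._seen % self.every == 0
```

In a dispatch loop, the hot path would read `ready_count()` and `demand_by_column()` on every dispatch or completion, and call `detailed_view()` only when tracing is enabled or `SnapshotGate.should_run()` returns true. A time-based gate would work equally well; the event-count gate is used here only to keep the sketch deterministic.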

Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.

### Checklist

- [x] I reproduced this issue or provided a minimal example
- [x] I searched the docs/issues myself, or had my agent do so
- [x] If I used an agent, I included its diagnostics above


Async scheduler large ready queues burn CPU in queue observation #724
