Priority Level
High (Major functionality broken)
Describe the bug
Huge async jobs preallocate row-group metadata for the entire logical dataset before useful work starts. This creates memory overhead proportional to total row-group count even when the scheduler only admits a small active row-group horizon.
For fire-and-forget large jobs, a user can configure millions of records and small row groups, but memory is consumed up front for row groups that are not yet active.
Steps/Code to reproduce bug
Use a setup path that prepares an async run without needing to complete the full job. Configure:
- hundreds of thousands to millions of row groups
- small row-group sizes, for example 2 rows per group
- around 16 columns
- default/public-style active row-group horizon or a small internal horizon
- no tracing/events required
Representative preparation-only measurements:
| Row groups |
RSS delta |
| 10,000 |
~3.6 MiB |
| 50,000 |
~20 MiB |
| 100,000 |
~43 MiB |
| 250,000 |
~114 MiB |
| 500,000 |
~202 MiB |
| 1,000,000 |
~420 MiB |
The observed slope was roughly hundreds of RSS bytes per row group before active endpoint work began.
Expected behavior
Memory should scale primarily with active admitted row groups, in-flight tasks, retained completed state, and output buffers. It should not scale linearly with all future row groups that have not yet been admitted.
Agent Diagnostic / Prior Investigation
The investigation traced the preallocation to row-group metadata construction and copying across async builder/scheduler setup:
- the dataset builder constructs a full
row_groups list for the logical dataset,
CompletionTracker.with_graph(...) copies that list into a row-group-size dictionary,
- the scheduler stores the full row-group list and derives additional maps from it.
This happens before scheduler execution and before most row groups can become active.
Additional context
Suggested direction:
- Represent row groups lazily or compactly instead of materializing every
(row_group_id, size) pair.
- Let
CompletionTracker validate/derive row-group sizes from a compact range representation.
- Keep per-row-group mutable state only for admitted or recently finalized row groups.
- Add a startup memory regression test for million-row-group configurations.
Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.
Checklist
Priority Level
High (Major functionality broken)
Describe the bug
Huge async jobs preallocate row-group metadata for the entire logical dataset before useful work starts. This creates memory overhead proportional to total row-group count even when the scheduler only admits a small active row-group horizon.
For fire-and-forget large jobs, a user can configure millions of records and small row groups, but memory is consumed up front for row groups that are not yet active.
Steps/Code to reproduce bug
Use a setup path that prepares an async run without needing to complete the full job. Configure:
Representative preparation-only measurements:
The observed slope was roughly hundreds of RSS bytes per row group before active endpoint work began.
Expected behavior
Memory should scale primarily with active admitted row groups, in-flight tasks, retained completed state, and output buffers. It should not scale linearly with all future row groups that have not yet been admitted.
Agent Diagnostic / Prior Investigation
The investigation traced the preallocation to row-group metadata construction and copying across async builder/scheduler setup:
row_groupslist for the logical dataset,CompletionTracker.with_graph(...)copies that list into a row-group-size dictionary,This happens before scheduler execution and before most row groups can become active.
Additional context
Suggested direction:
(row_group_id, size)pair.CompletionTrackervalidate/derive row-group sizes from a compact range representation.Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.
Checklist