Skip to content

Huge async jobs preallocate row-group metadata for the full logical dataset #726

@eric-tramel

Description

@eric-tramel

Priority Level

High (Major functionality broken)

Describe the bug

Huge async jobs preallocate row-group metadata for the entire logical dataset before useful work starts. This creates memory overhead proportional to total row-group count even when the scheduler only admits a small active row-group horizon.

For fire-and-forget large jobs, a user can configure millions of records and small row groups, but memory is consumed up front for row groups that are not yet active.

Steps/Code to reproduce bug

Use a setup path that prepares an async run without needing to complete the full job. Configure:

  • hundreds of thousands to millions of row groups
  • small row-group sizes, for example 2 rows per group
  • around 16 columns
  • default/public-style active row-group horizon or a small internal horizon
  • no tracing/events required

Representative preparation-only measurements:

Row groups RSS delta
10,000 ~3.6 MiB
50,000 ~20 MiB
100,000 ~43 MiB
250,000 ~114 MiB
500,000 ~202 MiB
1,000,000 ~420 MiB

The observed slope was roughly hundreds of RSS bytes per row group before active endpoint work began.

Expected behavior

Memory should scale primarily with active admitted row groups, in-flight tasks, retained completed state, and output buffers. It should not scale linearly with all future row groups that have not yet been admitted.

Agent Diagnostic / Prior Investigation

The investigation traced the preallocation to row-group metadata construction and copying across async builder/scheduler setup:

  • the dataset builder constructs a full row_groups list for the logical dataset,
  • CompletionTracker.with_graph(...) copies that list into a row-group-size dictionary,
  • the scheduler stores the full row-group list and derives additional maps from it.

This happens before scheduler execution and before most row groups can become active.

Additional context

Suggested direction:

  • Represent row groups lazily or compactly instead of materializing every (row_group_id, size) pair.
  • Let CompletionTracker validate/derive row-group sizes from a compact range representation.
  • Keep per-row-group mutable state only for admitted or recently finalized row groups.
  • Add a startup memory regression test for million-row-group configurations.

Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.

Checklist

  • I reproduced this issue or provided a minimal example
  • I searched the docs/issues myself, or had my agent do so
  • If I used an agent, I included its diagnostics above

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions