Epic: Stats and AggregateFns #7707

@gatesn

Goal

Make Vortex statistics pluggable by modeling stats as aggregate-function partials and exposing them through expressions. The concrete success case is demonstrating a Bloom-filter zone-map stat for UTF-8 equality pruning added purely through plugins: a custom aggregate function, scalar function, and rewrite rule, without changing built-in pruning logic.

Direction

Add a nullable stat(expr, AggregateFnRef) expression. It returns the stat value for the current stats scope, or null when unavailable. Falsification should produce normal expressions containing stat(...); simplification/execution decides whether anything is proven.
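As an illustration of the intended shape, here is a minimal, self-contained sketch of a nullable stat(expr, aggregate) node. None of these types are Vortex APIs: `Expr`, `AggregateFn`, `StatsScope`, `falsify`, `eval`, and `can_prune` are hypothetical names invented for this example, and values are restricted to `i64` for brevity. The falsifier only emits an ordinary expression containing a stat node; evaluation against the current scope decides whether anything is proven, and a missing stat evaluates to null (`None`), which proves nothing.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the epic's `AggregateFnRef`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum AggregateFn {
    Min,
    Max,
}

/// A tiny expression language, just large enough to carry `stat(...)` nodes.
#[derive(Clone, Debug)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Stat(Box<Expr>, AggregateFn),
    Gt(Box<Expr>, Box<Expr>),
    LtEq(Box<Expr>, Box<Expr>),
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Bool(bool),
}

/// Stats available in the current scope, keyed by (column name, aggregate).
pub type StatsScope = HashMap<(String, AggregateFn), i64>;

/// Falsify `col > lit` as `stat(col, max) <= lit`.
/// Returns `None` for predicate shapes this sketch does not handle.
pub fn falsify(pred: &Expr) -> Option<Expr> {
    match pred {
        Expr::Gt(lhs, rhs) => match (lhs.as_ref(), rhs.as_ref()) {
            (Expr::Column(_), Expr::Literal(_)) => Some(Expr::LtEq(
                Box::new(Expr::Stat(lhs.clone(), AggregateFn::Max)),
                rhs.clone(),
            )),
            _ => None,
        },
        _ => None,
    }
}

/// Evaluate in a stats scope. `None` models SQL-style null: an unavailable
/// stat makes the whole comparison null.
pub fn eval(expr: &Expr, scope: &StatsScope) -> Option<Value> {
    match expr {
        Expr::Literal(v) => Some(Value::Int(*v)),
        // Row values are not visible in a stats scope.
        Expr::Column(_) => None,
        Expr::Stat(inner, agg) => match inner.as_ref() {
            Expr::Column(name) => scope
                .get(&(name.clone(), agg.clone()))
                .copied()
                .map(Value::Int),
            _ => None,
        },
        Expr::Gt(l, r) => compare(l, r, scope, |a, b| a > b),
        Expr::LtEq(l, r) => compare(l, r, scope, |a, b| a <= b),
    }
}

fn compare(l: &Expr, r: &Expr, scope: &StatsScope, op: fn(i64, i64) -> bool) -> Option<Value> {
    match (eval(l, scope)?, eval(r, scope)?) {
        (Value::Int(a), Value::Int(b)) => Some(Value::Bool(op(a, b))),
        _ => None,
    }
}

/// A scope can be pruned only when the falsified expression evaluates to `true`.
/// Null (`None`) and `false` both mean "not proven".
pub fn can_prune(pred: &Expr, scope: &StatsScope) -> bool {
    falsify(pred)
        .and_then(|f| eval(&f, scope))
        .map_or(false, |v| v == Value::Bool(true))
}
```

With an empty scope, `can_prune` returns false for `x > 5` because the stat is null; once `max(x) = 3` is present, the same falsified expression evaluates to true and the scope can be skipped.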

Keep the first steps small: add the new expression and rewrite APIs beside the existing pruning path. Migrate file stats and zoned stats after the model is tested.

All new stats-facing APIs should live under vortex-array/src/stats/. The scalar function implementation may live with scalar functions, but should be re-exported through vortex_array::stats.

Phase 1: Stat Expressions

Phase 2: Rewrite Registry

Phase 3: Built-In Rewrite Rules

Phase 4: Zoned Layout Migration

Phase 5: Aggregate-Function Zoned Stats

WARNING: This phase changes the ZonedLayout serialized form.

  • Replace new zoned-layout stats configuration with aggregate-function descriptors. (Use aggregate descriptors for zoned stats #7938)
    • Configure stored zone stats with AggregateFnRef, not Stat enum values.
    • Use Display for AggregateFnRef as the descriptor string.
    • Use the descriptor string as the zone-map stats-table column name.
    • Keep Stat only as a compatibility bridge for existing array stats and legacy zoned metadata.
  • Compute per-zone aggregate partials at write time. (Use aggregate descriptors for zoned stats #7938)
    • Build the auxiliary stats table from each aggregate function's partial/state dtype.
    • Use a custom strategy/configuration hook for selecting aggregates before adding broader policy machinery.
  • Add a new zoned metadata format for aggregate-function stats. (Use aggregate descriptors for zoned stats #7938)
    • Current zoned metadata is raw bytes, not protobuf: zone_len followed by a legacy Stat bitset.
    • Add a version/magic marker so new metadata can be recognized as protobuf.
    • Store zone_len and present_aggregates: repeated string.
    • Preserve legacy metadata decoding by translating old Stat bitsets into built-in aggregate descriptor strings.
  • Lower stat(expr, aggregate_fn) at read time by matching aggregate-function descriptors in the zone stats table. (Use aggregate descriptors for zoned stats #7938)
    • Unavailable aggregate stats continue to lower to null; the stat expression stays nullable.
  • Remove zoned-layout schema special cases that are only needed because stats are modeled as Stat enum values. (Use aggregate descriptors for zoned stats #7938)
    • Any auxiliary proof state should be represented by the aggregate partial itself or by a dedicated aggregate.
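The metadata and lowering steps in this phase can be sketched roughly as follows. This is an illustrative model only, not the Vortex implementation: `ZonedMetadata`, `from_legacy`, `ZoneStatsTable`, and `lower_stat` are hypothetical names, the legacy bit order in `LEGACY_STATS` is made up for the example (it is not the real `Stat` enum layout), and partials are simplified to `i64`. It shows a legacy bitset being translated into descriptor strings, and a stat lookup that matches the descriptor against stats-table column names and yields null when the aggregate was not stored.

```rust
/// Hypothetical legacy stats in a hypothetical bit order.
/// This is NOT the actual layout of Vortex's `Stat` bitset.
const LEGACY_STATS: [(u8, &str); 3] = [(0, "min"), (1, "max"), (2, "null_count")];

/// Sketch of the new metadata shape: `zone_len` plus the descriptor strings
/// of the aggregates that were stored (`present_aggregates: repeated string`).
#[derive(Debug, PartialEq)]
pub struct ZonedMetadata {
    pub zone_len: u32,
    pub present_aggregates: Vec<String>,
}

/// Translate legacy metadata (zone_len + Stat bitset) into descriptor strings.
pub fn from_legacy(zone_len: u32, stat_bits: u8) -> ZonedMetadata {
    let present_aggregates = LEGACY_STATS
        .iter()
        .filter(|(bit, _)| (stat_bits & (1u8 << *bit)) != 0)
        .map(|(_, name)| name.to_string())
        .collect();
    ZonedMetadata {
        zone_len,
        present_aggregates,
    }
}

/// Sketch of a zone stats table: one column per aggregate descriptor,
/// one entry per zone. Real partials would use each aggregate's state dtype.
pub struct ZoneStatsTable {
    pub columns: Vec<(String, Vec<Option<i64>>)>,
}

impl ZoneStatsTable {
    /// Lower `stat(expr, aggregate_fn)` for one zone by matching the
    /// descriptor string against column names. A missing column, an
    /// out-of-range zone, or a null entry all lower to `None`.
    pub fn lower_stat(&self, descriptor: &str, zone: usize) -> Option<i64> {
        self.columns
            .iter()
            .find(|(name, _)| name.as_str() == descriptor)
            .and_then(|(_, values)| values.get(zone).copied().flatten())
    }
}
```

Keying the table by descriptor string means a plugin aggregate needs no reader-side schema change: if its column is absent, the lookup simply returns null.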

Phase 6: Plugin Bloom Proof

  • Add a plugin-provided Bloom aggregate for UTF-8 values.
  • Add a plugin-provided bloom_might_contain(filter, value) scalar function.
  • Register a plugin-provided rewrite for UTF-8 equality.
  • Store/load the Bloom stat through the aggregate-function zone-map path.
  • Demonstrate UTF-8 equality pruning without modifying built-in binary-expression pruning.
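For orientation, a minimal sketch of the three plugin pieces might look like the following. It is an assumption-laden toy, not the planned plugin: `BloomPartial` and `can_prune_eq` are hypothetical names, only `bloom_might_contain(filter, value)` is named in this epic, and the hashing uses the standard library's `DefaultHasher`, whose algorithm is not guaranteed stable across Rust releases and so would be unsuitable for a persisted stat. The partial supports accumulate (`insert`) and merge, the scalar function tests membership, and the equality rewrite prunes a zone only when the stat is present and the filter reports the value as definitely absent.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical aggregate partial: a fixed-size Bloom filter over UTF-8 values.
#[derive(Clone, Debug)]
pub struct BloomPartial {
    bits: Vec<u64>,
    num_hashes: u32,
}

impl BloomPartial {
    pub fn new(num_bits: usize, num_hashes: u32) -> Self {
        // Round up to whole 64-bit words, with at least one word.
        let words = (num_bits.max(1) + 63) / 64;
        Self {
            bits: vec![0; words],
            num_hashes,
        }
    }

    /// Derive the bit position for `value` under the given seed.
    /// DefaultHasher is used for brevity only; see the note above.
    fn bit_index(&self, value: &str, seed: u32) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % (self.bits.len() as u64 * 64)) as usize
    }

    /// Accumulate step: record one value in the partial.
    pub fn insert(&mut self, value: &str) {
        for seed in 0..self.num_hashes {
            let i = self.bit_index(value, seed);
            self.bits[i / 64] |= 1u64 << (i % 64);
        }
    }

    /// Merge step: bitwise OR of two partials with the same shape.
    pub fn merge(&mut self, other: &BloomPartial) {
        for (a, b) in self.bits.iter_mut().zip(&other.bits) {
            *a |= *b;
        }
    }
}

/// Scalar function: false means "definitely absent", true means "maybe present".
pub fn bloom_might_contain(filter: &BloomPartial, value: &str) -> bool {
    (0..filter.num_hashes).all(|seed| {
        let i = filter.bit_index(value, seed);
        (filter.bits[i / 64] & (1u64 << (i % 64))) != 0
    })
}

/// Rewrite for `col = value`: prune only when the stat exists and the
/// filter rules the value out. A missing stat (null) never prunes.
pub fn can_prune_eq(zone_stat: Option<&BloomPartial>, value: &str) -> bool {
    zone_stat.map_or(false, |f| !bloom_might_contain(f, value))
}
```

Because a Bloom filter has no false negatives, any value that was inserted always keeps its zone; false positives only cost a missed pruning opportunity, never a wrong result.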

Phase 7: Satisfaction Follow-Up

  • Add satisfaction rewrite APIs and rules as new behavior.
    • Combine independent satisfiers with OR.
  • Teach filtering to use satisfied-zone masks to skip residual predicate evaluation for zones proven true.
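A rough sketch of how satisfied-zone masks could be used, under the assumption that each satisfier yields one boolean per zone. `combine_satisfiers` and `filter_zones` are hypothetical helpers written for this example, and zones are modeled as plain `Vec<i64>` rows. Independent satisfiers are combined with OR, and zones proven true are passed through without evaluating the residual predicate.

```rust
/// OR together per-zone masks from independent satisfiers.
/// A zone counts as satisfied if any satisfier proves it; a mask that is
/// too short is treated as "not proven" for the missing zones.
pub fn combine_satisfiers(masks: &[Vec<bool>], num_zones: usize) -> Vec<bool> {
    (0..num_zones)
        .map(|z| masks.iter().any(|m| m.get(z).copied().unwrap_or(false)))
        .collect()
}

/// Filter each zone, skipping the residual predicate for satisfied zones.
pub fn filter_zones<F>(zones: &[Vec<i64>], satisfied: &[bool], mut predicate: F) -> Vec<Vec<i64>>
where
    F: FnMut(i64) -> bool,
{
    zones
        .iter()
        .enumerate()
        .map(|(z, rows)| {
            if satisfied.get(z).copied().unwrap_or(false) {
                // Every row is proven to match: keep the zone as-is.
                rows.clone()
            } else {
                // Not proven: evaluate the predicate row by row.
                rows.iter().copied().filter(|v| predicate(*v)).collect()
            }
        })
        .collect()
}
```

In this model the predicate runs only for rows in unsatisfied zones, which is the saving the satisfaction path is meant to deliver.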

Phase 8: Cleanup

  • Remove duplicated legacy stat propagation once the new rewrite path is complete.
  • Retire the old StatsCatalog pruning path.
  • Move broader generic stats storage work to a follow-up epic if still needed.

Status

In progress.
