Add UDF's for sketch support #6349

Merged
alexanderbianchi merged 9 commits into quickwit-oss:main from alexanderbianchi:bianchi/sketch-support
Apr 29, 2026
@alexanderbianchi alexanderbianchi commented Apr 27, 2026

Description

Adds DataFusion query support for DDSketch-backed parquet indexes.

This PR adds:

  • dd_sketch(keys, counts, count, min, max, flags), a decomposable aggregate UDF (UDAF) that combines sparse DDSketch bucket arrays and scalar bounds into a single merged sketch state.
  • dd_quantile(sketch, q), a scalar UDF that computes the q-quantile (for example, q = 0.99 for p99) from the merged sketch by rank-scanning the merged buckets and clamping the mapped value to the sketch min/max.
  • aggregate UDF registration support in quickwit-df-core, so runtime plugins can register both scalar UDFs and UDAFs.
  • STORED AS sketches and automatic sketches-* index resolution in quickwit-datafusion.
  • sketch split routing through list_sketch_splits, while keeping existing metrics split routing unchanged.
  • SQL and Substrait integration coverage for sketch quantile queries.
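
To make the "decomposable" property of dd_sketch concrete, here is a minimal standalone sketch of the merge step. It is not the code from this PR: the `SketchState` type, the `i32` key and `u64` count types, and the method names are assumptions chosen for illustration. The point is only that bucket counts are summed per key, `count` is summed, and `min`/`max` are folded, so partial states can be merged in any order.

```rust
use std::collections::BTreeMap;

/// Hypothetical merged-sketch state. Field names mirror the UDAF arguments;
/// the concrete types are assumptions, not the PR's actual layout.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchState {
    /// Bucket key -> accumulated count, kept sorted by key.
    pub buckets: BTreeMap<i32, u64>,
    pub count: u64,
    pub min: f64,
    pub max: f64,
}

impl SketchState {
    /// Identity element for merging: no buckets, inverted bounds.
    pub fn empty() -> Self {
        Self {
            buckets: BTreeMap::new(),
            count: 0,
            min: f64::INFINITY,
            max: f64::NEG_INFINITY,
        }
    }

    /// Fold one input row (sparse key/count arrays plus scalar bounds) into the state.
    pub fn update(
        &mut self,
        keys: &[i32],
        counts: &[u64],
        count: u64,
        min: f64,
        max: f64,
    ) -> Result<(), String> {
        if keys.len() != counts.len() {
            return Err(format!(
                "keys/counts length mismatch: {} vs {}",
                keys.len(),
                counts.len()
            ));
        }
        for (&key, &c) in keys.iter().zip(counts) {
            *self.buckets.entry(key).or_insert(0) += c;
        }
        self.count += count;
        self.min = self.min.min(min);
        self.max = self.max.max(max);
        Ok(())
    }

    /// Merge another partial state into this one (the decomposable step).
    pub fn merge(&mut self, other: &SketchState) {
        for (&key, &c) in &other.buckets {
            *self.buckets.entry(key).or_insert(0) += c;
        }
        self.count += other.count;
        self.min = self.min.min(other.min);
        self.max = self.max.max(other.max);
    }
}
```

Because every operation here is a sum or a min/max, merging two partial states gives the same result as feeding all rows into one state, which is the property a partial/final aggregation plan relies on.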

The intended query shape is:

SELECT
    dd_quantile(dd_sketch(keys, counts, "count", "min", "max", flags), 0.99) AS p99
FROM "sketches-latency"
WHERE metric_name = 'req.latency'
GROUP BY service, time_bin

dd_sketch is the merge step. dd_quantile is only the final projection over the merged sketch. For now, only flags = 0 is accepted; non-zero flags are rejected rather than silently decoded with the wrong layout.
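
The final projection and the flags check can be illustrated the same way. The following is a hedged sketch, not the PR's implementation: the `MergedSketch` type, the `gamma` parameter, the logarithmic key-to-value mapping `2 * gamma^key / (gamma + 1)`, and the zero-based rank `floor(q * (count - 1))` are all assumptions borrowed from common DDSketch descriptions. It also handles only positive-value buckets and ignores any zero or negative store.

```rust
/// Hypothetical merged sketch; `buckets` must be sorted by ascending key.
pub struct MergedSketch {
    pub buckets: Vec<(i32, u64)>,
    pub count: u64,
    pub min: f64,
    pub max: f64,
}

/// Reject any layout other than the single supported one (flags = 0),
/// instead of decoding it with the wrong layout.
pub fn check_flags(flags: u32) -> Result<(), String> {
    if flags != 0 {
        Err(format!("unsupported sketch flags: {flags}"))
    } else {
        Ok(())
    }
}

/// Assumed logarithmic mapping from a bucket key to a representative value.
fn bucket_value(key: i32, gamma: f64) -> f64 {
    2.0 * gamma.powi(key) / (gamma + 1.0)
}

/// Rank-scan the merged buckets and clamp the mapped value to [min, max].
/// Returns None for an empty sketch, an out-of-range `q`, or invalid bounds.
pub fn quantile(sketch: &MergedSketch, q: f64, gamma: f64) -> Option<f64> {
    if sketch.count == 0 || !(0.0..=1.0).contains(&q) || !(sketch.min <= sketch.max) {
        return None;
    }
    // Zero-based target rank among `count` observations (an assumed convention).
    let rank = (q * (sketch.count - 1) as f64).floor() as u64;
    let mut seen = 0u64;
    for &(key, c) in &sketch.buckets {
        seen += c;
        if seen > rank {
            return Some(bucket_value(key, gamma).clamp(sketch.min, sketch.max));
        }
    }
    // Bucket counts summed to less than `count`; fall back to the upper bound.
    Some(sketch.max)
}
```

The clamp matters at the tails: a bucket's representative value can fall slightly outside the observed range, and clamping keeps the reported quantile between the true minimum and maximum.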

How was this PR tested?

cargo test -p quickwit-datafusion sketch_udf --lib
cargo test -p quickwit-datafusion --test sketches -- --nocapture
cargo test -p quickwit-datafusion --test metrics -- --nocapture
cargo test -p quickwit-df-core --lib

Follow-up work intentionally not in this PR

This PR is scoped to making the sketch query path available and validating the UDF/UDAF contract. The items below are valuable, but they are larger planning/runtime/catalog changes and should be handled as follow-up PRs.

  • Implement a grouped accumulator for dd_sketch.

    • Not included because it requires a separate group-indexed state layout and careful parity with state()/merge_batch() behavior.
    • Valuable because it should reduce per-group boxed accumulator overhead for high-cardinality grouped sketch queries.
  • Support non-zero sketch flags and reference DDSketch layout decoding.

    • Not included because this PR supports only one sketch layout; non-zero flags are rejected to avoid silent corruption.
    • Valuable because it would allow multiple sketch schema/layout versions to coexist safely.
  • Add production DataFusion memory policy and query-level memory tracking.

    • Not included because it cuts across endpoint setup, node config, runtime pools, and observability.
    • Valuable because sketch accumulators can grow with group cardinality and should be bounded in production.
  • Unify predicate lowering for split pruning and parquet pruning.

    • Not included because it requires changes to filter analysis, metastore split queries, and tag predicate propagation.
    • Valuable because it would make metric/tag/time filters prune splits while preserving row-level correctness filters for DataFusion/parquet.
  • Add a runtime-scoped parquet index catalog with split planning bounds and parquet metadata caching.

    • Not included because it is a broader planning/runtime architecture change.
    • Valuable because it would avoid repeated metastore work, cap broad scans, and improve footer/page-index reuse.
  • Make DDL aliases and Substrait routing preserve authoritative table metadata.

    • Not included because the current PR keeps the existing name/location inference model.
    • Valuable because it would prevent alias collisions and support Substrait plans whose alias does not encode the index family.
  • Advertise scan ordering only when selected split metadata proves it.

    • Not included because it belongs to the table-provider planning path, not the sketch UDF itself.
    • Valuable because it prevents optimizers from relying on ordering that selected splits may not actually satisfy.
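
For the first follow-up item, a "group-indexed state layout" could look roughly like the sketch below. It is purely illustrative: `GroupedSketchState`, `SketchRow`, and all method names are hypothetical, and the `update_batch` shape is only loosely modeled on the idea behind DataFusion's `GroupsAccumulator` (a batch of rows plus a parallel slice of group indices), without using any DataFusion types. The layout keeps one slot per group in parallel vectors instead of one boxed accumulator per group.

```rust
use std::collections::BTreeMap;

/// Hypothetical input row: sparse bucket arrays plus scalar bounds.
pub struct SketchRow<'a> {
    pub keys: &'a [i32],
    pub counts: &'a [u64],
    pub count: u64,
    pub min: f64,
    pub max: f64,
}

/// Hypothetical group-indexed state: slot `g` of every vector belongs to group `g`.
#[derive(Default)]
pub struct GroupedSketchState {
    buckets: Vec<BTreeMap<i32, u64>>,
    counts: Vec<u64>,
    mins: Vec<f64>,
    maxs: Vec<f64>,
}

impl GroupedSketchState {
    /// Grow every per-group vector so that indices `0..total` are valid.
    fn ensure_groups(&mut self, total: usize) {
        if self.counts.len() < total {
            self.buckets.resize_with(total, BTreeMap::new);
            self.counts.resize(total, 0);
            self.mins.resize(total, f64::INFINITY);
            self.maxs.resize(total, f64::NEG_INFINITY);
        }
    }

    /// Fold each row into the slot named by the matching entry of `group_indices`.
    /// Panics if the slices differ in length or an index is >= `total_num_groups`.
    pub fn update_batch(
        &mut self,
        rows: &[SketchRow<'_>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(rows.len(), group_indices.len());
        self.ensure_groups(total_num_groups);
        for (row, &g) in rows.iter().zip(group_indices) {
            for (&key, &c) in row.keys.iter().zip(row.counts) {
                *self.buckets[g].entry(key).or_insert(0) += c;
            }
            self.counts[g] += row.count;
            self.mins[g] = self.mins[g].min(row.min);
            self.maxs[g] = self.maxs[g].max(row.max);
        }
    }

    /// Total observation count for group `g`.
    pub fn group_count(&self, g: usize) -> u64 {
        self.counts[g]
    }

    /// Accumulated count in bucket `key` of group `g` (0 if the bucket is absent).
    pub fn group_bucket(&self, g: usize, key: i32) -> u64 {
        self.buckets[g].get(&key).copied().unwrap_or(0)
    }

    /// (min, max) bounds for group `g`.
    pub fn group_bounds(&self, g: usize) -> (f64, f64) {
        (self.mins[g], self.maxs[g])
    }
}
```

A real implementation would also need to emit and merge per-group intermediate state in a way that matches the per-group accumulator's state()/merge_batch() behavior, which is the parity concern the item above calls out.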

@alexanderbianchi alexanderbianchi marked this pull request as ready for review April 28, 2026 00:02
@alexanderbianchi alexanderbianchi merged commit 32fbda2 into quickwit-oss:main Apr 29, 2026
5 checks passed