feat(query): Vector type support recluster using the K-Means algorithm #20004
b41sh wants to merge 4 commits
Pull request overview
This PR extends Fuse table clustering/recluster support to handle Vector cluster keys by introducing vector block statistics (centroid + radius) and using a K-Means-based vector clustering step to produce a sortable _vector_cluster_id. It also updates pruning/statistics paths to ignore vector keys where only scalar min/max semantics apply, while enabling vector-stat lower-bound pruning.
Changes:
- Add `BlockMeta.vector_stats` (centroid + radius) generation and expose it via `fuse_block_statistics` as `vector_statistics`.
- Support vector cluster keys in clustering/recluster pipelines via a new `TransformVectorClusterKmeans` stage and updated cluster-stat generation logic.
- Add vector-stat-based lower-bound pruning utilities and integrate them into vector index pruning, plus adjust page/range pruning to drop vector keys.
Reviewed changes
Copilot reviewed 41 out of 42 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/query/storages/fuse/src/table_functions/fuse_block_statistics.rs | Adds vector_statistics output column and serializes vector stats to Variant. |
| src/query/storages/fuse/src/table_functions/clustering_statistics.rs | Excludes vector cluster keys from scalar overlap depth computation. |
| src/query/storages/fuse/src/table_functions/clustering_information.rs | Excludes vector keys from scalar overlap depth computation and handles empty scalar keys. |
| src/query/storages/fuse/src/statistics/reducers.rs | Adjusts min/max reductions to clone results to satisfy ownership. |
| src/query/storages/fuse/src/statistics/mod.rs | Re-exports new vector clustering/stat helpers. |
| src/query/storages/fuse/src/statistics/cluster_statistics.rs | Adds vector cluster metadata/operator helpers; updates cluster stats generation with vector-aware logic. |
| src/query/storages/fuse/src/pruning/vector_index_pruner.rs | Adds vector-stat lower-bound candidate ordering and filter/topN pruning paths. |
| src/query/storages/fuse/src/operations/table_index.rs | Persists merged vector stats into new block metadata during vector index build. |
| src/query/storages/fuse/src/operations/mutation/mutator/recluster_mutator.rs | Adds vector-aware segment selection/refinement based on vector sphere overlap. |
| src/query/storages/fuse/src/operations/common/processors/transform_vector_cluster_kmeans.rs | New accumulating transform that assigns _vector_cluster_id via K-Means. |
| src/query/storages/fuse/src/operations/common/processors/transform_mutation_aggregator.rs | Filters vector keys out when filling missing segment cluster stats. |
| src/query/storages/fuse/src/operations/common/processors/mod.rs | Wires in the new K-Means transform module/exports. |
| src/query/storages/fuse/src/operations/append.rs | Integrates K-Means clustering into append clustering pipeline and refactors sort descriptor creation. |
| src/query/storages/fuse/src/io/write/vector_index_writer.rs | Adds vector statistics generation alongside vector index build; refactors finalize paths. |
| src/query/storages/fuse/src/io/write/stream/cluster_statistics.rs | Uses get_cluster_stats_gen and treats vector keys as scalar-empty for min/max. |
| src/query/storages/fuse/src/io/write/stream/block_builder.rs | Captures and writes vector stats into streamed BlockMeta. |
| src/query/storages/fuse/src/io/write/block_writer.rs | Captures and writes vector stats into non-streamed BlockMeta. |
| src/query/storages/fuse/Cargo.toml | Adds databend-common-vector dependency. |
| src/query/storages/common/table_meta/src/meta/v3/frozen/block_meta.rs | Initializes new vector_stats field for v3 frozen meta conversion. |
| src/query/storages/common/table_meta/src/meta/v2/statistics.rs | Adds VectorDistanceType and VectorColumnStatistics to v2 meta. |
| src/query/storages/common/table_meta/src/meta/v2/segment.rs | Adds vector_stats field to v2 BlockMeta and initializes it. |
| src/query/storages/common/table_meta/src/meta/v2/mod.rs | Re-exports vector statistics types in v2 module. |
| src/query/storages/common/table_meta/src/meta/statistics.rs | Defines StatisticsOfVectorColumns type alias in current meta. |
| src/query/storages/common/table_meta/src/meta/current/mod.rs | Re-exports vector statistics types in current meta. |
| src/query/storages/common/pruner/src/page_pruner.rs | Drops vector cluster keys to keep scalar page-index alignment. |
| src/query/storages/common/index/src/vector.rs | New shared vector distance/stat helpers (bounds, overlap, normalization). |
| src/query/storages/common/index/src/range_index.rs | Treats DataType::Vector domains as full/unprunable for range stats. |
| src/query/storages/common/index/src/page_index.rs | Handles empty min/max page stats safely. |
| src/query/storages/common/index/src/lib.rs | Exposes new vector helper APIs from the common index crate. |
| src/query/storages/common/index/Cargo.toml | Adds databend-common-vector dependency. |
| src/query/storages/common/cache/src/manager.rs | Updates test block meta construction to include vector_stats. |
| src/query/sql/src/planner/expression/expression_parser.rs | Allows vector cluster keys and enforces at most one vector key. |
| src/query/sql/src/planner/binder/ddl/table.rs | Allows vector cluster keys (non-hilbert), enforces at most one vector key, updates signature. |
| src/query/service/tests/it/storages/fuse/statistics.rs | Updates cluster stats generator construction in tests for new signature. |
| src/query/service/tests/it/storages/fuse/operations/mutation/recluster_mutator.rs | Adds vector recluster selection tests and updates constructors. |
| src/query/service/tests/it/storages/fuse/bloom_index_meta_size.rs | Updates test block meta construction to include vector_stats. |
| src/query/service/src/physical_plans/physical_replace_into.rs | Updates get_cluster_stats_gen call signature. |
| src/query/service/src/physical_plans/physical_recluster.rs | Adds K-Means clustering stage for vector keys and updates sort descriptor usage. |
| src/query/service/src/physical_plans/physical_mutation.rs | Updates get_cluster_stats_gen call signature. |
| src/query/service/src/physical_plans/physical_multi_table_insert.rs | Updates get_cluster_stats_gen call signature. |
| src/query/service/src/physical_plans/physical_column_mutation.rs | Updates get_cluster_stats_gen call signature. |
| Cargo.lock | Records new workspace dependency additions. |
Pull request overview
Copilot reviewed 41 out of 42 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/query/storages/fuse/src/statistics/cluster_statistics.rs:279
`ClusterStatistics.pages` are built from all scalar keys excluding `_vector_cluster_id`, but when a vector key exists the actual sort order becomes `(scalar_prefix, _vector_cluster_id, scalar_suffix)`. Scalar suffix keys are no longer monotonic across pages, so including them in `pages` can make `PageIndex` pruning incorrect. Consider only emitting page tuples for scalar keys that appear before the vector key position (or disable `pages` when vector clustering is enabled).
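The concern in this comment can be illustrated with a small, self-contained sketch. It is not code from the PR: the `Row` tuple, `page_mins_without_vector`, `page_mins_prefix_only`, and the fixed page size are all hypothetical names and assumptions chosen only to show why page tuples that skip `_vector_cluster_id` can stop being monotonic.

```rust
/// Hypothetical sorted row: (tenant, _vector_cluster_id, ts).
type Row = (u32, u32, i64);

/// True if the values never decrease from one page to the next.
fn is_monotonic<T: PartialOrd>(values: &[T]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// First row of each page, projected to (tenant, ts), i.e. all scalar keys
/// with the vector cluster id dropped. `page_size` must be non-zero.
fn page_mins_without_vector(rows: &[Row], page_size: usize) -> Vec<(u32, i64)> {
    rows.chunks(page_size)
        .map(|page| (page[0].0, page[0].2))
        .collect()
}

/// First row of each page, projected to the scalar prefix only (tenant).
fn page_mins_prefix_only(rows: &[Row], page_size: usize) -> Vec<u32> {
    rows.chunks(page_size).map(|page| page[0].0).collect()
}
```

With rows sorted by `(tenant, _vector_cluster_id, ts)`, the `(tenant, ts)` projection can go backwards at a cluster boundary, while the prefix-only projection stays ordered. That is the reason the comment suggests emitting page tuples only for keys before the vector key position.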
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
This PR stores per-block vector statistics for indexed vector columns in `BlockMeta` and uses them to support vector-aware recluster selection and pruning.
What Changed
Vector block statistics
Vector statistics store a lightweight sphere summary for each vector column in `BlockMeta`: a centroid and a radius. Together they describe a bounding sphere for the block's vector data. During pruning or recluster selection, Databend can compare two spheres, or compare a query vector with a block sphere, before reading or scoring every vector.
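As a rough illustration of how a centroid-plus-radius summary can be used, here is a minimal sketch assuming an L2 (Euclidean) distance. The struct and function names (`VectorSphere`, `lower_bound_distance`, `spheres_overlap`, `can_skip_block`) are hypothetical and are not the PR's API; other distance types would need their own bounds.

```rust
/// Hypothetical per-block sphere summary; field names are assumptions.
#[derive(Debug, Clone)]
struct VectorSphere {
    centroid: Vec<f32>,
    radius: f32,
}

/// Euclidean distance between two vectors of the same dimension.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// By the triangle inequality, no vector inside the sphere can be closer
/// to `query` than this value (clamped at zero when the query is inside).
fn lower_bound_distance(query: &[f32], block: &VectorSphere) -> f32 {
    (l2_distance(query, &block.centroid) - block.radius).max(0.0)
}

/// Two spheres overlap when their centroids are no farther apart than
/// the sum of their radii.
fn spheres_overlap(a: &VectorSphere, b: &VectorSphere) -> bool {
    l2_distance(&a.centroid, &b.centroid) <= a.radius + b.radius
}

/// A block can be skipped when even its best possible distance exceeds
/// the current threshold (for example, the worst distance in a top-N heap).
fn can_skip_block(query: &[f32], block: &VectorSphere, threshold: f32) -> bool {
    lower_bound_distance(query, block) > threshold
}
```

The lower bound supports pruning (skip blocks that cannot beat the current threshold), and the overlap test is the kind of check recluster selection could use to find blocks whose vector ranges intersect.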
Vector cluster/recluster support
- When handling the `CLUSTER BY` key, if a key is identified as a vector column, the system records information about that vector column, including the column offset, column ID, and distance type.
- In the `recluster`/`append` pipeline, add the `TransformVectorClusterKmeans` transformer.
- `TransformVectorClusterKmeans` reads the vector columns from the input block and aggregates multiple rows of vector data together for `k-means` clustering. After clustering is complete, each row of vectors is assigned to a cluster.
- Append `_vector_cluster_id` to the end of the block; this column stores the `K-means cluster ID` corresponding to each row of vector data.
- When building `sort_desc`, do not sort directly on the original vector column; instead, replace the position of the vector key with the offset of `_vector_cluster_id`. For example, with `CLUSTER BY (tenant, embedding, ts)` the actual sort order becomes `tenant, _vector_cluster_id, ts`.
- Reuse the existing `TransformSortPartial` sort pipeline to sort according to `sort_desc`.
This way, vector recluster does not need to implement standard scalar comparison logic for the vector type, nor does it need to introduce a new `ClusterType`. The vector is only responsible for generating a discrete cluster ID via `k-means`; the actual data reordering still reuses the existing cluster-by sorting and block split/write processes.
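The flow described above (assign a cluster ID per row, then sort on that ID in place of the vector key) can be sketched as follows. This is an illustrative toy, not the `TransformVectorClusterKmeans` implementation: it uses plain Lloyd's k-means with squared L2 distance, seeds centroids from the first `k` rows for determinism, and the `Row` struct, `kmeans_assign`, and `recluster_order` are hypothetical names.

```rust
/// Squared L2 distance; enough for nearest-centroid comparisons.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `point` (ties keep the lower index).
fn nearest(point: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = 0;
    for (i, c) in centroids.iter().enumerate() {
        if sq_dist(point, c) < sq_dist(point, &centroids[best]) {
            best = i;
        }
    }
    best
}

/// Plain Lloyd's k-means. Assumption: centroids are seeded with the first
/// `k` points so the result is deterministic. Returns one cluster id per row.
fn kmeans_assign(points: &[Vec<f32>], k: usize, iters: usize) -> Vec<usize> {
    assert!(k > 0 && k <= points.len());
    let dim = points[0].len();
    let mut centroids: Vec<Vec<f32>> = points[..k].to_vec();
    let mut ids = vec![0usize; points.len()];
    for _ in 0..iters {
        // Assignment step: each row goes to its nearest centroid.
        for (id, p) in ids.iter_mut().zip(points) {
            *id = nearest(p, &centroids);
        }
        // Update step: each centroid moves to the mean of its rows.
        let mut sums = vec![vec![0.0f32; dim]; k];
        let mut counts = vec![0usize; k];
        for (p, &id) in points.iter().zip(&ids) {
            counts[id] += 1;
            for (s, v) in sums[id].iter_mut().zip(p) {
                *s += v;
            }
        }
        for c in 0..k {
            if counts[c] > 0 {
                centroids[c] = sums[c].iter().map(|s| s / counts[c] as f32).collect();
            }
        }
    }
    ids
}

/// Hypothetical row for `CLUSTER BY (tenant, embedding, ts)`.
struct Row {
    tenant: u32,
    embedding: Vec<f32>,
    ts: i64,
}

/// Build sort keys with the vector key replaced by its cluster id, then
/// sort with ordinary scalar tuple comparison.
fn recluster_order(rows: &[Row], k: usize) -> Vec<(u32, usize, i64)> {
    let points: Vec<Vec<f32>> = rows.iter().map(|r| r.embedding.clone()).collect();
    let ids = kmeans_assign(&points, k, 10);
    let mut keys: Vec<(u32, usize, i64)> = rows
        .iter()
        .zip(ids)
        .map(|(r, id)| (r.tenant, id, r.ts))
        .collect();
    keys.sort();
    keys
}
```

Because the sort key is an ordinary tuple of scalars, no ordering on the vector type itself is needed, which mirrors the design choice stated in the description.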