feat(query): Vector type support recluster using the K-Means algorithm #20004

Draft
b41sh wants to merge 4 commits into databendlabs:main from b41sh:feat-kmeans

Conversation

b41sh commented Jun 12, 2026

Member

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR stores per-block vector statistics for indexed vector columns in BlockMeta and uses them to support vector-aware recluster selection and pruning.

What Changed

Vector block statistics

Vector statistics store a lightweight sphere summary for each vector column in BlockMeta:

  • centroid: the center point of all vectors in the block.
  • radius: the maximum distance from the centroid to any vector in the block.

Together they describe a bounding sphere for the block’s vector data. During pruning or recluster selection, Databend can compare two spheres, or compare a query vector with a block sphere, before reading or scoring every vector.
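
As a rough illustration of the bounding-sphere idea, the sketch below computes a centroid and radius for one block and derives two checks from them: a lower bound on the distance from a query vector to any vector in the block, and a sphere-overlap test. It assumes Euclidean (L2) distance; the PR records a distance type per column, and other metrics would need their own bounds. `BlockSphere`, `build_sphere`, `query_lower_bound`, and `spheres_overlap` are hypothetical names, not the types or functions added by this PR.

```rust
/// Hypothetical per-block summary; the field names follow the PR description,
/// not the actual Databend struct.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockSphere {
    pub centroid: Vec<f32>,
    pub radius: f32,
}

/// Euclidean (L2) distance between two vectors of equal length.
pub fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Centroid = component-wise mean; radius = max distance from the centroid
/// to any vector. Returns None for an empty block.
pub fn build_sphere(vectors: &[Vec<f32>]) -> Option<BlockSphere> {
    let dim = vectors.first()?.len();
    let mut centroid = vec![0.0f32; dim];
    for v in vectors {
        assert_eq!(v.len(), dim, "dimension mismatch");
        for (c, x) in centroid.iter_mut().zip(v) {
            *c += *x;
        }
    }
    let n = vectors.len() as f32;
    for c in centroid.iter_mut() {
        *c /= n;
    }
    let radius = vectors
        .iter()
        .map(|v| l2_distance(v, &centroid))
        .fold(0.0f32, f32::max);
    Some(BlockSphere { centroid, radius })
}

/// By the triangle inequality, no vector inside the sphere can be closer to
/// the query than dist(query, centroid) - radius (clamped at zero).
pub fn query_lower_bound(query: &[f32], sphere: &BlockSphere) -> f32 {
    (l2_distance(query, &sphere.centroid) - sphere.radius).max(0.0)
}

/// Two spheres can share points only if their centroids are no farther apart
/// than the sum of their radii.
pub fn spheres_overlap(a: &BlockSphere, b: &BlockSphere) -> bool {
    l2_distance(&a.centroid, &b.centroid) <= a.radius + b.radius
}
```

A block whose lower bound already exceeds the distance threshold of a query can be skipped without reading its vectors, and two blocks whose spheres do not overlap are unlikely to benefit from being reclustered together.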

Vector cluster/recluster support

  1. When parsing the CLUSTER BY keys, if a key is identified as a vector column, the system records that column's offset, column ID, and distance type.
  2. In the recluster/append pipeline, add the TransformVectorClusterKmeans transform.
  3. TransformVectorClusterKmeans reads the vector columns from the input block and accumulates multiple rows of vector data for k-means clustering. Once clustering completes, each row is assigned to a cluster.
  4. The transform appends an internal column named _vector_cluster_id to the end of the block; this column stores the k-means cluster ID for each row.
  5. When constructing the sort description sort_desc, do not sort directly on the original vector column; instead, replace the position of the vector key with the offset of _vector_cluster_id.

For example:

CLUSTER BY (tenant, embedding, ts)

The actual sort order becomes:

tenant, _vector_cluster_id, ts

  6. Subsequent operations continue to use the existing TransformSortPartial sort pipeline to sort according to sort_desc.
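
The key substitution in the steps above can be sketched as follows. This is a minimal illustration, not the PR's code: `ClusterKey` and `sort_offsets` are hypothetical, and the sketch assumes `_vector_cluster_id` has already been appended as the last column of the block.

```rust
/// Hypothetical description of one CLUSTER BY key; not the Databend type.
#[derive(Debug, Clone, PartialEq)]
pub struct ClusterKey {
    pub name: String,
    /// Offset of the key column inside the block.
    pub offset: usize,
    /// True when the key is a vector column.
    pub is_vector: bool,
}

/// Returns the column offsets to sort on, in key order. A vector key keeps
/// its position in the key list, but its offset is replaced by the offset of
/// the appended `_vector_cluster_id` column.
pub fn sort_offsets(keys: &[ClusterKey], vector_cluster_id_offset: usize) -> Vec<usize> {
    keys.iter()
        .map(|k| {
            if k.is_vector {
                vector_cluster_id_offset
            } else {
                k.offset
            }
        })
        .collect()
}
```

For `CLUSTER BY (tenant, embedding, ts)` on a three-column block, the cluster ID column lands at offset 3, so the sort runs on offsets 0, 3, 2, which corresponds to `tenant, _vector_cluster_id, ts`.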

This way, vector recluster needs neither standard scalar comparison logic for the vector type nor a new ClusterType. The vector column is only responsible for generating a discrete cluster ID via k-means; the actual data reordering still reuses the existing cluster-by sorting and block split/write processes.
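
To make the "vector to discrete cluster ID" step concrete, here is a small sketch of Lloyd-style k-means that returns one cluster ID per row. It is an illustration under stated assumptions, not the transform in this PR: it uses squared L2 distance, seeds the centroids with the first `k` rows, and assumes all rows have the same dimension. The function name `kmeans_cluster_ids` and its parameters are hypothetical.

```rust
/// Squared Euclidean distance; enough for nearest-centroid comparisons.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the closest centroid (ties resolve to the lowest index).
fn nearest(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = 0;
    let mut best_d = f32::INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let d = sq_dist(c, v);
        if d < best_d {
            best = i;
            best_d = d;
        }
    }
    best
}

/// Assigns each row a cluster ID in 0..k. Seeding with the first k rows is an
/// assumption made to keep the sketch deterministic.
pub fn kmeans_cluster_ids(rows: &[Vec<f32>], k: usize, max_iters: usize) -> Vec<u64> {
    if rows.is_empty() || k == 0 {
        return Vec::new();
    }
    let k = k.min(rows.len());
    let dim = rows[0].len();
    let mut centroids: Vec<Vec<f32>> = rows[..k].to_vec();
    let mut prev: Vec<usize> = Vec::new();

    for _ in 0..max_iters {
        // Assignment step: attach every row to its nearest centroid.
        let ids: Vec<usize> = rows.iter().map(|r| nearest(&centroids, r)).collect();
        if ids == prev {
            break; // assignments are stable, so the centroids are too
        }
        // Update step: move each centroid to the mean of its members.
        let mut sums = vec![vec![0.0f32; dim]; k];
        let mut counts = vec![0usize; k];
        for (r, &id) in rows.iter().zip(&ids) {
            counts[id] += 1;
            for (s, x) in sums[id].iter_mut().zip(r) {
                *s += *x;
            }
        }
        for (c, (s, &n)) in centroids.iter_mut().zip(sums.into_iter().zip(&counts)) {
            if n > 0 {
                *c = s.into_iter().map(|x| x / n as f32).collect();
            }
        }
        prev = ids;
    }

    // Final assignment against the last centroids.
    rows.iter().map(|r| nearest(&centroids, r) as u64).collect()
}
```

The returned IDs play the role of `_vector_cluster_id`: they are plain integers, so the existing scalar sort can order rows by them without knowing anything about vectors.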

  • fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) on Jun 12, 2026
b41sh requested a review from Copilot on June 12, 2026 12:26

Copilot AI left a comment

Contributor

Pull request overview

This PR extends Fuse table clustering/recluster support to handle Vector cluster keys by introducing vector block statistics (centroid + radius) and using a K-Means-based vector clustering step to produce a sortable _vector_cluster_id. It also updates pruning/statistics paths to ignore vector keys where only scalar min/max semantics apply, while enabling vector-stat lower-bound pruning.

Changes:

  • Add BlockMeta.vector_stats (centroid + radius) generation and expose it via fuse_block_statistics as vector_statistics.
  • Support vector cluster keys in clustering/recluster pipelines via a new TransformVectorClusterKmeans stage and updated cluster-stat generation logic.
  • Add vector-stat-based lower-bound pruning utilities and integrate them into vector index pruning, plus adjust page/range pruning to drop vector keys.
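
One way lower-bound ordering can help a top-N vector search is sketched below. This is not the pruner from the PR: it assumes L2 distance, assumes each block's stored radius really bounds all of its vectors, and uses hypothetical names (`Block`, `top_n_with_pruning`). Blocks are visited in ascending order of their lower bound, and the scan stops once the next block's lower bound is worse than the current N-th best distance.

```rust
/// Hypothetical block: sphere statistics plus the vectors they summarize.
pub struct Block {
    pub centroid: Vec<f32>,
    pub radius: f32,
    pub vectors: Vec<Vec<f32>>,
}

fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Returns the n smallest distances to `query` (ascending) and the number of
/// blocks whose vectors were actually scored.
pub fn top_n_with_pruning(query: &[f32], blocks: &[Block], n: usize) -> (Vec<f32>, usize) {
    if n == 0 {
        return (Vec::new(), 0);
    }
    // Lower bound per block: dist(query, centroid) - radius, clamped at zero.
    let mut order: Vec<(f32, &Block)> = blocks
        .iter()
        .map(|b| ((l2(query, &b.centroid) - b.radius).max(0.0), b))
        .collect();
    order.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut best: Vec<f32> = Vec::new(); // kept sorted, at most n entries
    let mut scanned = 0;
    for (lower, block) in order {
        // Every remaining block has a lower bound >= `lower`, so none of them
        // can improve on the current n-th best distance.
        if best.len() == n && lower > best[n - 1] {
            break;
        }
        scanned += 1;
        for v in &block.vectors {
            best.push(l2(query, v));
        }
        best.sort_by(|a, b| a.total_cmp(b));
        best.truncate(n);
    }
    (best, scanned)
}
```

With two blocks, one near the query and one far away, a top-2 search can be answered from the near block alone; the far block is only scored when the near block cannot supply enough results.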

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/query/storages/fuse/src/table_functions/fuse_block_statistics.rs Adds vector_statistics output column and serializes vector stats to Variant.
src/query/storages/fuse/src/table_functions/clustering_statistics.rs Excludes vector cluster keys from scalar overlap depth computation.
src/query/storages/fuse/src/table_functions/clustering_information.rs Excludes vector keys from scalar overlap depth computation and handles empty scalar keys.
src/query/storages/fuse/src/statistics/reducers.rs Adjusts min/max reductions to clone results to satisfy ownership.
src/query/storages/fuse/src/statistics/mod.rs Re-exports new vector clustering/stat helpers.
src/query/storages/fuse/src/statistics/cluster_statistics.rs Adds vector cluster metadata/operator helpers; updates cluster stats generation with vector-aware logic.
src/query/storages/fuse/src/pruning/vector_index_pruner.rs Adds vector-stat lower-bound candidate ordering and filter/topN pruning paths.
src/query/storages/fuse/src/operations/table_index.rs Persists merged vector stats into new block metadata during vector index build.
src/query/storages/fuse/src/operations/mutation/mutator/recluster_mutator.rs Adds vector-aware segment selection/refinement based on vector sphere overlap.
src/query/storages/fuse/src/operations/common/processors/transform_vector_cluster_kmeans.rs New accumulating transform that assigns _vector_cluster_id via K-Means.
src/query/storages/fuse/src/operations/common/processors/transform_mutation_aggregator.rs Filters vector keys out when filling missing segment cluster stats.
src/query/storages/fuse/src/operations/common/processors/mod.rs Wires in the new K-Means transform module/exports.
src/query/storages/fuse/src/operations/append.rs Integrates K-Means clustering into append clustering pipeline and refactors sort descriptor creation.
src/query/storages/fuse/src/io/write/vector_index_writer.rs Adds vector statistics generation alongside vector index build; refactors finalize paths.
src/query/storages/fuse/src/io/write/stream/cluster_statistics.rs Uses get_cluster_stats_gen and treats vector keys as scalar-empty for min/max.
src/query/storages/fuse/src/io/write/stream/block_builder.rs Captures and writes vector stats into streamed BlockMeta.
src/query/storages/fuse/src/io/write/block_writer.rs Captures and writes vector stats into non-streamed BlockMeta.
src/query/storages/fuse/Cargo.toml Adds databend-common-vector dependency.
src/query/storages/common/table_meta/src/meta/v3/frozen/block_meta.rs Initializes new vector_stats field for v3 frozen meta conversion.
src/query/storages/common/table_meta/src/meta/v2/statistics.rs Adds VectorDistanceType and VectorColumnStatistics to v2 meta.
src/query/storages/common/table_meta/src/meta/v2/segment.rs Adds vector_stats field to v2 BlockMeta and initializes it.
src/query/storages/common/table_meta/src/meta/v2/mod.rs Re-exports vector statistics types in v2 module.
src/query/storages/common/table_meta/src/meta/statistics.rs Defines StatisticsOfVectorColumns type alias in current meta.
src/query/storages/common/table_meta/src/meta/current/mod.rs Re-exports vector statistics types in current meta.
src/query/storages/common/pruner/src/page_pruner.rs Drops vector cluster keys to keep scalar page-index alignment.
src/query/storages/common/index/src/vector.rs New shared vector distance/stat helpers (bounds, overlap, normalization).
src/query/storages/common/index/src/range_index.rs Treats DataType::Vector domains as full/unprunable for range stats.
src/query/storages/common/index/src/page_index.rs Handles empty min/max page stats safely.
src/query/storages/common/index/src/lib.rs Exposes new vector helper APIs from the common index crate.
src/query/storages/common/index/Cargo.toml Adds databend-common-vector dependency.
src/query/storages/common/cache/src/manager.rs Updates test block meta construction to include vector_stats.
src/query/sql/src/planner/expression/expression_parser.rs Allows vector cluster keys and enforces at most one vector key.
src/query/sql/src/planner/binder/ddl/table.rs Allows vector cluster keys (non-hilbert), enforces at most one vector key, updates signature.
src/query/service/tests/it/storages/fuse/statistics.rs Updates cluster stats generator construction in tests for new signature.
src/query/service/tests/it/storages/fuse/operations/mutation/recluster_mutator.rs Adds vector recluster selection tests and updates constructors.
src/query/service/tests/it/storages/fuse/bloom_index_meta_size.rs Updates test block meta construction to include vector_stats.
src/query/service/src/physical_plans/physical_replace_into.rs Updates get_cluster_stats_gen call signature.
src/query/service/src/physical_plans/physical_recluster.rs Adds K-Means clustering stage for vector keys and updates sort descriptor usage.
src/query/service/src/physical_plans/physical_mutation.rs Updates get_cluster_stats_gen call signature.
src/query/service/src/physical_plans/physical_multi_table_insert.rs Updates get_cluster_stats_gen call signature.
src/query/service/src/physical_plans/physical_column_mutation.rs Updates get_cluster_stats_gen call signature.
Cargo.lock Records new workspace dependency additions.

Comment thread src/query/storages/fuse/src/operations/append.rs Outdated
Comment thread src/query/storages/fuse/src/operations/append.rs
Comment thread src/query/storages/fuse/src/io/write/vector_index_writer.rs
Comment thread src/query/storages/fuse/src/statistics/cluster_statistics.rs

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/query/storages/fuse/src/statistics/cluster_statistics.rs:279

  • ClusterStatistics.pages are built from all scalar keys excluding _vector_cluster_id, but when a vector key exists the actual sort order becomes (scalar_prefix, _vector_cluster_id, scalar_suffix). Scalar suffix keys are no longer monotonic across pages, so including them in pages can make PageIndex pruning incorrect. Consider only emitting page tuples for scalar keys that appear before the vector key position (or disable pages when vector clustering is enabled).
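
The first option suggested in this comment, keeping only the scalar keys that precede the vector key, could look roughly like the sketch below. `KeyKind` and `page_key_positions` are hypothetical names used for illustration; the PR's cluster statistics code may be organized differently.

```rust
/// Hypothetical classification of a cluster key.
#[derive(Debug, Clone, PartialEq)]
pub enum KeyKind {
    Scalar,
    Vector,
}

/// Positions of the cluster keys that are still safe to use for page-level
/// min/max tuples: all keys when there is no vector key, otherwise only the
/// scalar keys that sort before the vector key.
pub fn page_key_positions(kinds: &[KeyKind]) -> Vec<usize> {
    match kinds.iter().position(|k| *k == KeyKind::Vector) {
        Some(vector_pos) => (0..vector_pos).collect(),
        None => (0..kinds.len()).collect(),
    }
}
```

For `(tenant, embedding, ts)` this keeps only `tenant`; `ts` is dropped because rows are ordered by `_vector_cluster_id` before `ts`.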

Comment thread src/query/storages/fuse/src/pruning/vector_index_pruner.rs
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
b41sh requested a review from zhyass on June 15, 2026 02:41