[fix](ann-index) Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient rows. #64216
Pull request overview
This PR updates Doris’s BE ANN index build path to improve IVF/PQ recall correctness and reduce unnecessary memory reservation during initialization, while adding safeguards to skip persisting ANN indexes for segments that are too small to train/build reliably.
Changes:
- Switch ANN index building to buffer full-segment vectors and (for train-required indexes) train once using the complete segment data before adding/saving.
- Skip ANN index persistence for empty segments and segments below an effective minimum row threshold (max(min_train_rows, ann_index_build_min_segment_rows)).
- Add/adjust regression + unit tests covering IVF/PQ recall, IVF_ON_DISK minimum training rows, and min-segment-row skip behavior.
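
A minimal sketch of the build flow listed above, to make the skip rule concrete. It is not the Doris implementation: `BufferedAnnWriterSketch`, `BuildOutcome`, and the constructor parameters are hypothetical names, and the train/add/save step is reduced to a comment. Only the buffering pattern and the `max(min_train_rows, ann_index_build_min_segment_rows)` threshold come from the PR summary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch; not a Doris type.
enum class BuildOutcome { kSkipped, kBuilt };

// Hypothetical stand-in for the ANN index writer described in the summary.
class BufferedAnnWriterSketch {
public:
    BufferedAnnWriterSketch(size_t dim, int64_t min_train_rows, int64_t min_segment_rows)
            : _dim(dim), _min_train_rows(min_train_rows), _min_segment_rows(min_segment_rows) {}

    // Append num_rows vectors of _dim floats each. No buffer is reserved at
    // construction time; the vector grows only as rows arrive.
    void add_rows(const float* p, size_t num_rows) {
        assert(p != nullptr || num_rows == 0);
        _buffered_vectors.insert(_buffered_vectors.end(), p, p + num_rows * _dim);
        _total_rows += static_cast<int64_t>(num_rows);
    }

    // Effective threshold: the larger of the index's own training minimum
    // and the configured minimum segment size.
    int64_t effective_min_rows() const { return std::max(_min_train_rows, _min_segment_rows); }

    // Decide whether the segment gets an index at all.
    BuildOutcome finish() const {
        if (_total_rows == 0 || _total_rows < effective_min_rows()) {
            return BuildOutcome::kSkipped; // nothing is persisted for this segment
        }
        // A real writer would train once on the full buffer here, then add
        // the same vectors and save the index.
        return BuildOutcome::kBuilt;
    }

    int64_t total_rows() const { return _total_rows; }

private:
    size_t _dim;
    int64_t _min_train_rows;
    int64_t _min_segment_rows;
    std::vector<float> _buffered_vectors;
    int64_t _total_rows = 0;
};
```

With `min_train_rows = 4` and `min_segment_rows = 3`, a segment of three rows is skipped and a segment of four rows is built, because the effective threshold is 4.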
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| regression-test/suites/ann_index_p0/ivf_pq_recall.groovy | Adds a regression test asserting IVF+PQ recall on two well-separated clusters. |
| regression-test/suites/ann_index_p0/ivf_pq_full_buffer_train_recall.groovy | Adds a regression test ensuring train-required IVF+PQ behavior trains effectively (target appears in top-K). |
| regression-test/suites/ann_index_p0/ivf_on_disk_index_test.groovy | Updates insufficient-training-rows behavior to “skip index build” instead of throwing. |
| regression-test/suites/ann_index_p0/ann_index_build_min_segment_rows.groovy | Adds regression coverage for skipping ANN persistence based on ann_index_build_min_segment_rows. |
| regression-test/data/ann_index_p0/ivf_pq_recall.out | Golden outputs for the new recall test. |
| regression-test/data/ann_index_p0/ivf_pq_full_buffer_train_recall.out | Golden outputs for the new full-buffer training recall test. |
| regression-test/data/ann_index_p0/ivf_on_disk_index_test.out | Updates golden output to include the new insufficient-train-rows query section. |
| be/test/storage/index/ann/ann_index_writer_test.cpp | Expands unit tests to validate no init-time preallocation, full-buffer train/add behavior, min-segment-rows skipping, and IVF_ON_DISK min-train-rows. |
| be/src/storage/index/ann/faiss_ann_index.cpp | Makes IVF_ON_DISK min training rows consistent with IVF by using nlist. |
| be/src/storage/index/ann/ann_index_writer.h | Refactors writer state to buffered vectors + total row tracking; adds helper API for tests. |
| be/src/storage/index/ann/ann_index_writer.cpp | Implements full-segment buffering, effective min-row checks, and build/save logic; removes init-time buffer reservation. |
| be/src/common/config.h | Replaces chunk-size config with ann_index_build_min_segment_rows. |
| be/src/common/config.cpp | Defines/validates ann_index_build_min_segment_rows. |
// The offsets check above guarantees every array row matches the ANN index dimension.
DCHECK(p != nullptr);
_buffered_vectors.insert(_buffered_vectors.end(), p, p + num_rows * dim);
_total_rows += cast_set<int64_t>(num_rows);
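
The quoted lines rely on an earlier offsets check that is not shown in this excerpt. As a rough illustration of what such a check could look like, the sketch below assumes the offsets are cumulative element counts with `num_rows + 1` entries; that layout and the helper name `rows_match_dim` are assumptions, not taken from the Doris source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true when every array row spans exactly `dim`
// elements. Assumes `offsets` holds num_rows + 1 cumulative element counts.
bool rows_match_dim(const std::vector<uint64_t>& offsets, size_t dim) {
    for (size_t i = 1; i < offsets.size(); ++i) {
        // A non-monotonic pair wraps around in unsigned arithmetic and
        // therefore also fails the comparison.
        if (offsets[i] - offsets[i - 1] != dim) {
            return false;
        }
    }
    return true;
}
```

Only when a check of this kind passes is it safe to copy `num_rows * dim` contiguous floats starting at `p`, which is what the buffered insert does.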

friend class TestAnnIndexColumnWriter;
#endif

// VectorIndex should be managed by some cache.
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…ld-buffer reservation, and skip ANN index build for segments with insufficient rows. (apache#64082)
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
Cherry-pick #64082