
[fix](ann-index) Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient rows. #64216

Open: kaka11chen wants to merge 1 commit into apache:branch-4.1 from kaka11chen:cherry-pick-64082_4.1


@kaka11chen
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

Cherry-pick #64082

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@kaka11chen kaka11chen requested a review from yiguolei as a code owner June 8, 2026 08:26
Copilot AI review requested due to automatic review settings June 8, 2026 08:26
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen
Contributor Author

run buildall

Contributor

Copilot AI left a comment


Pull request overview

This PR updates Doris’s BE ANN index build path to improve IVF/PQ recall correctness and reduce unnecessary memory reservation during initialization, while adding safeguards to skip persisting ANN indexes for segments that are too small to train/build reliably.

Changes:

  • Switch ANN index building to buffer full-segment vectors and (for train-required indexes) train once using the complete segment data before adding/saving.
  • Skip ANN index persistence for empty segments and segments below an effective minimum row threshold (max(min_train_rows, ann_index_build_min_segment_rows)).
  • Add/adjust regression + unit tests covering IVF/PQ recall, IVF_ON_DISK minimum training rows, and min-segment-row skip behavior.
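The skip rule in the second bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the function names `effective_min_rows` and `should_build_ann_index` are hypothetical, and `cfg_min_segment_rows` stands in for the `ann_index_build_min_segment_rows` config value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the Doris implementation).
// min_train_rows: rows the index type needs for training (0 if it needs none).
// cfg_min_segment_rows: stand-in for the ann_index_build_min_segment_rows config.
inline int64_t effective_min_rows(int64_t min_train_rows, int64_t cfg_min_segment_rows) {
    assert(min_train_rows >= 0 && cfg_min_segment_rows >= 0);
    return std::max(min_train_rows, cfg_min_segment_rows);
}

// Returns true when a segment with total_rows buffered rows should get a
// persisted ANN index; empty segments and segments below the effective
// threshold are skipped.
inline bool should_build_ann_index(int64_t total_rows, int64_t min_train_rows,
                                   int64_t cfg_min_segment_rows) {
    if (total_rows == 0) {
        return false; // empty segment: nothing to index
    }
    return total_rows >= effective_min_rows(min_train_rows, cfg_min_segment_rows);
}
```

Under this sketch, a segment with 255 rows is skipped when training needs 256 rows, even if the configured minimum is lower, because the larger of the two limits applies.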

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| regression-test/suites/ann_index_p0/ivf_pq_recall.groovy | Adds a regression test asserting IVF+PQ recall on two well-separated clusters. |
| regression-test/suites/ann_index_p0/ivf_pq_full_buffer_train_recall.groovy | Adds a regression test ensuring train-required IVF+PQ behavior trains effectively (target appears in top-K). |
| regression-test/suites/ann_index_p0/ivf_on_disk_index_test.groovy | Updates insufficient-training-rows behavior to “skip index build” instead of throwing. |
| regression-test/suites/ann_index_p0/ann_index_build_min_segment_rows.groovy | Adds regression coverage for skipping ANN persistence based on ann_index_build_min_segment_rows. |
| regression-test/data/ann_index_p0/ivf_pq_recall.out | Golden outputs for the new recall test. |
| regression-test/data/ann_index_p0/ivf_pq_full_buffer_train_recall.out | Golden outputs for the new full-buffer training recall test. |
| regression-test/data/ann_index_p0/ivf_on_disk_index_test.out | Updates golden output to include the new insufficient-train-rows query section. |
| be/test/storage/index/ann/ann_index_writer_test.cpp | Expands unit tests to validate no init-time preallocation, full-buffer train/add behavior, min-segment-rows skipping, and IVF_ON_DISK min-train-rows. |
| be/src/storage/index/ann/faiss_ann_index.cpp | Makes IVF_ON_DISK min training rows consistent with IVF by using nlist. |
| be/src/storage/index/ann/ann_index_writer.h | Refactors writer state to buffered vectors + total row tracking; adds helper API for tests. |
| be/src/storage/index/ann/ann_index_writer.cpp | Implements full-segment buffering, effective min-row checks, and build/save logic; removes init-time buffer reservation. |
| be/src/common/config.h | Replaces chunk-size config with ann_index_build_min_segment_rows. |
| be/src/common/config.cpp | Defines/validates ann_index_build_min_segment_rows. |
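The faiss_ann_index.cpp entry says the IVF_ON_DISK minimum training row count now follows IVF by using nlist. A rough sketch of that idea is below. Every name in it (`AnnIndexType`, `BuildParams`, `min_train_rows_for`) is hypothetical, and the exact formula in the PR may differ; the sketch assumes the minimum is simply nlist, since k-means training needs at least as many points as clusters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration only; not the Doris definitions.
enum class AnnIndexType { HNSW, IVF, IVF_ON_DISK };

struct BuildParams {
    AnnIndexType type;
    int64_t nlist; // number of IVF clusters; unused for HNSW
};

// Assumption: both IVF variants need at least nlist training rows, while
// HNSW needs no training at all.
inline int64_t min_train_rows_for(const BuildParams& params) {
    switch (params.type) {
    case AnnIndexType::IVF:
    case AnnIndexType::IVF_ON_DISK:
        assert(params.nlist > 0);
        return params.nlist;
    case AnnIndexType::HNSW:
        return 0;
    }
    return 0; // unreachable for valid enum values
}
```

Sharing one branch for both IVF variants is the point of the sketch: the in-memory and on-disk indexes cannot drift apart on when a segment is considered trainable.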


Comment on lines +115 to +118
// The offsets check above guarantees every array row matches the ANN index dimension.
DCHECK(p != nullptr);
_buffered_vectors.insert(_buffered_vectors.end(), p, p + num_rows * dim);
_total_rows += cast_set<int64_t>(num_rows);
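The buffering step quoted above can be illustrated in isolation. The following is a minimal sketch, assuming a row-major float buffer and array offsets relative to the batch pointer; `VectorBuffer` and `append` are hypothetical names, not the writer class from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the writer's buffering state.
struct VectorBuffer {
    size_t dim;               // ANN index dimension
    std::vector<float> data;  // row-major vectors for the whole segment
    int64_t total_rows = 0;

    // Nothing is reserved at construction; the buffer grows as rows arrive.
    explicit VectorBuffer(size_t d) : dim(d) {}

    // offsets has num_rows + 1 entries; row i spans [offsets[i], offsets[i + 1])
    // in the array that p points to. Returns false if any row has the wrong
    // dimension, in which case nothing is appended.
    bool append(const float* p, const std::vector<size_t>& offsets) {
        if (offsets.empty()) {
            return false;
        }
        const size_t num_rows = offsets.size() - 1;
        for (size_t i = 0; i < num_rows; ++i) {
            if (offsets[i + 1] - offsets[i] != dim) {
                return false;
            }
        }
        if (num_rows == 0) {
            return true;
        }
        // The offsets check guarantees every row matches dim.
        assert(p != nullptr);
        data.insert(data.end(), p, p + num_rows * dim);
        total_rows += static_cast<int64_t>(num_rows);
        return true;
    }
};
```

Because every row is validated before the single `insert`, a rejected batch leaves both `data` and `total_rows` unchanged.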
friend class TestAnnIndexColumnWriter;
#endif

// VectorIndex should be managed by some cache.
yiguolei previously approved these changes Jun 8, 2026
@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 8, 2026
@github-actions
Contributor

github-actions Bot commented Jun 8, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions Bot commented Jun 8, 2026

PR approved by anyone and no changes requested.

…ld-buffer reservation, and skip ANN index build for segments with insufficient rows. (apache#64082)
@kaka11chen
Contributor Author

run buildall

@github-actions github-actions Bot removed the approved Indicates a PR has been approved by one committer. label Jun 8, 2026