[improvement](be) Train ANN index once instead of on every chunk#64145
[improvement](be) Train ANN index once instead of on every chunk#64145yx-keith wants to merge 1 commit into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found blocking issues in the final version of this PR. The main correctness fix direction is reasonable, but the final commit changes the writer from chunk-bounded buffering to buffering every ANN vector for the entire segment, and the tests still describe/assert the earlier chunked behavior.
Critical checkpoint conclusions:
- Goal: Fix repeated ANN/FAISS training so PQ/SQ codebooks are not retrained after vectors have already been added. The code trains at most once, but it now does so by retaining all vectors until finish(), which introduces a memory regression.
- Scope/minimality: The helper and needs_training() changes are focused, but removing chunk flushing is a larger behavioral change than needed and leaves chunk-size configuration/test assumptions stale.
- Concurrency/lifecycle: No new shared-state concurrency or special static lifecycle issue found in the touched code.
- Configuration: Existing ann_index_build_chunk_size is effectively ignored for ANN writer buffering after this PR, while tests and comments still rely on it.
- Compatibility/storage format: No on-disk format or FE-BE protocol compatibility issue found.
- Parallel paths: The affected build paths go through AnnIndexColumnWriter; no additional ANN build path was found needing the same train-once handling.
- Tests: BE unit tests appear inconsistent with the final implementation, and the new regression no longer exercises multi-chunk behavior because chunk flushing was removed. I attempted
./run-be-ut.sh --run --filter=AnnIndexWriterTest.*, but the runner could not complete in this environment becausethirdparty/installed/bin/protocis missing during gensrc. - Observability/performance/memory: The new all-segment buffer is a significant memory/performance risk for large segments and bypasses the previous chunk-size bound.
- User focus: No additional user-provided review focus was present.
| } | ||
|
|
||
| _float_array.insert(_float_array.end(), p, p + num_rows * dim); | ||
| return Status::OK(); |
There was a problem hiding this comment.
This removes the only bound on _float_array: every add_array_values() call now appends into one buffer and nothing is flushed until finish(). For ANN dimensions this can become very large per segment/build path (load, compaction, schema change, build-index-on-existing-data); e.g. rows * dim * sizeof(float) is no longer capped by ann_index_build_chunk_size. The old code reserved/flushed one chunk, but after this line the config is effectively ignored and memory is not admitted/tracked before the growth. Please keep bounded buffering (or add explicit sampling/training-buffer admission plus a separate bounded add path) rather than retaining the full segment in memory.
|
|
||
| // Index requires training (e.g. IVF/PQ). After the fix train() is invoked only | ||
| // once on the first full chunk; the remainder reuses the already-trained state. | ||
| EXPECT_CALL(*mock_index, needs_training()).WillRepeatedly(testing::Return(true)); |
There was a problem hiding this comment.
These expectations still describe the pre-final chunked behavior. With the final writer implementation, the 12 rows accumulated by add_array_values() are not flushed at chunk size 10; finish() calls _train_once_if_needed(12, ...) and add(12, ...), not train(10), add(10), and add(2). The same stale pattern appears in the later chunk/remainder tests (TestAddMoreThanChunkSizeIVF, TestLargeDataVolume*, TestPQMinTrainRows, TestPQWithSufficientData, TestHnswSkipsTrainCall), so the focused BE UTs will fail once the test binary can be built. Please update the tests to match the final behavior or restore chunked add semantics.
| // PQ-codebook correctness, which is what the bug breaks. | ||
| sql "set ivf_nprobe = ${nlist}" | ||
|
|
||
| setBeConfigTemporary([ann_index_build_chunk_size: chunkSize]) { |
There was a problem hiding this comment.
This regression no longer tests what its comments claim. The final AnnIndexColumnWriter::add_array_values() ignores ann_index_build_chunk_size and buffers all rows until finish(), so lowering this config does not create 10 writer flush chunks or exercise the former retrain-on-each-chunk path. As written, the test can pass while the chunk-size behavior is dead/stale. Please either restore chunked flushing and keep this as the multi-chunk regression, or rewrite the test/comment/config usage around the final all-at-finish behavior.
TPC-H: Total hot run time: 29229 ms |
TPC-DS: Total hot run time: 168557 ms |
Guards the fix in "Train ANN index once instead of on every chunk": shrink ann_index_build_chunk_size so a single 20k-row segment spans 10 chunks, build an IVF+PQ index, and assert recall@10 (vs exact l2_distance brute force) stays high. On a buggy BE the PQ codebook is re-trained on every chunk, so vectors written under earlier codebooks decode against the final codebook and recall collapses to ~0.1, failing the assertion. On the fixed BE the index is trained once and recall stays high (>= 0.5; typically ~0.8-0.95). An IVF+FLAT table loaded from the same data is a positive control: FLAT has no codebook so it is unaffected and must reach near-exact recall (>= 0.95), proving the harness can achieve high recall and that a low PQ recall is specifically the train-reentry bug. An EXPLAIN check asserts the approximate query is pushed into the ANN index so recall is not a trivial brute-force 1.0. Marked nonConcurrent because it temporarily lowers ann_index_build_chunk_size via setBeConfigTemporary.
e0c484a to
24dd1ed
Compare
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Problem
When using IVF indexes with a PQ quantizer (
quantizer=pq), if a segment contains more rows thanann_index_build_chunk_size, the PQ codebook is retrained on every chunk. Because k-means randomly permutes centroid labels on each run, vectors encoded under the old codebook are decoded incorrectly with the new one. The result is near-zero recall with no errors or warnings — the index silently returns wrong results.
This commonly occurs on large tables after compaction, where segments grow beyond
ann_index_build_chunk_size.Affected index types: IVF + PQ quantizer (
quantizer=pqorquantizer=sq)Root Cause
In
add_array_values(), whenever_float_arrayfilled tochunk_size, the code called_train_once_if_needed()followed byadd()then cleared the buffer. The_train_once_if_neededguard (_trainedflag) was set after the first call, so training only happened once — but that one training used only the first chunk of data, not the full segment. A small first chunk produces a poor-quality codebook, which degrades recall for all subsequent vectors.
Fix
Remove the train+add logic from
add_array_values(). The method now only accumulates vectors into_float_array. Training and adding are deferred entirely tofinish(), where the full segment data is available.finish()already handled this path correctly (train + add + save whennum_rows >= min_train_rows), so the change is purely inadd_array_values().The
_float_array.reserve()ininit()is also removed since it was sized tochunk_size * dim, which is no longer meaningful.Test Results
Environment: 400,000 vectors, dim=64, IVF(nlist=256) + PQ(m=8, nbits=8), recall@10, nprobe=200
Before fix (master)
Recall drops 17x when the segment spans multiple chunks.
After fix
Recall is identical across all chunk sizes after the fix.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)