Skip to content

[improvement](be) Train ANN index once instead of on every chunk#64145

Draft
yx-keith wants to merge 1 commit into
apache:masterfrom
yx-keith:fix-ann-ivf-train-once
Draft

[improvement](be) Train ANN index once instead of on every chunk#64145
yx-keith wants to merge 1 commit into
apache:masterfrom
yx-keith:fix-ann-ivf-train-once

Conversation

@yx-keith
Copy link
Copy Markdown
Contributor

@yx-keith yx-keith commented Jun 5, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Problem

When using IVF indexes with a PQ quantizer (quantizer=pq), if a segment contains more rows than ann_index_build_chunk_size, the PQ codebook is retrained on every chunk. Because k-means randomly permutes centroid labels on each run, vectors encoded under the old codebook are d
ecoded incorrectly with the new one. The result is near-zero recall with no errors or warnings — the index silently returns wrong results.

This commonly occurs on large tables after compaction, where segments grow beyond ann_index_build_chunk_size.

Affected index types: IVF + PQ quantizer (quantizer=pq or quantizer=sq)

Root Cause

In add_array_values(), whenever _float_array filled to chunk_size, the code called _train_once_if_needed() followed by add() then cleared the buffer. The _train_once_if_needed guard (_trained flag) was set after the first call, so training only happened once — but t
hat one training used only the first chunk of data, not the full segment. A small first chunk produces a poor-quality codebook, which degrades recall for all subsequent vectors.

Before fix (chunk_size=20,000, segment=400,000 rows):
  chunk 1 (20K rows) → train(20K)  ← codebook fit on 5% of data
  chunk 2 (20K rows) → skip train, add
  ...
  chunk 20 (20K rows) → skip train, add
  finish() → nothing left to train

Fix

Remove the train+add logic from add_array_values(). The method now only accumulates vectors into _float_array. Training and adding are deferred entirely to finish(), where the full segment data is available.

After fix (any chunk_size, segment=400,000 rows):
  add_array_values() × 20 → accumulate all 400K rows into _float_array
  finish() → train(400K) + add(400K) ← codebook fit on full data

finish() already handled this path correctly (train + add + save when num_rows >= min_train_rows), so the change is purely in add_array_values().

The _float_array.reserve() in init() is also removed since it was sized to chunk_size * dim, which is no longer meaningful.

Test Results

Environment: 400,000 vectors, dim=64, IVF(nlist=256) + PQ(m=8, nbits=8), recall@10, nprobe=200

Before fix (master)

ann_index_build_chunk_size Chunks recall@10
Grid A (baseline) 10,000,000 1 0.1445
Grid B (bug) 20,000 20 0.0085

Recall drops 17x when the segment spans multiple chunks.

After fix

ann_index_build_chunk_size Chunks recall@10
Grid A (baseline) 10,000,000 1 0.1445
Grid C 20,000 20 0.1445
Grid C2 30,000 14 0.1445

Recall is identical across all chunk sizes after the fix.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yx-keith
Copy link
Copy Markdown
Contributor Author

yx-keith commented Jun 5, 2026

run buildall

@yx-keith yx-keith marked this pull request as draft June 5, 2026 08:55
@yx-keith
Copy link
Copy Markdown
Contributor Author

yx-keith commented Jun 5, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found blocking issues in the final version of this PR. The main correctness fix direction is reasonable, but the final commit changes the writer from chunk-bounded buffering to buffering every ANN vector for the entire segment, and the tests still describe/assert the earlier chunked behavior.

Critical checkpoint conclusions:

  • Goal: Fix repeated ANN/FAISS training so PQ/SQ codebooks are not retrained after vectors have already been added. The code trains at most once, but it now does so by retaining all vectors until finish(), which introduces a memory regression.
  • Scope/minimality: The helper and needs_training() changes are focused, but removing chunk flushing is a larger behavioral change than needed and leaves chunk-size configuration/test assumptions stale.
  • Concurrency/lifecycle: No new shared-state concurrency or special static lifecycle issue found in the touched code.
  • Configuration: Existing ann_index_build_chunk_size is effectively ignored for ANN writer buffering after this PR, while tests and comments still rely on it.
  • Compatibility/storage format: No on-disk format or FE-BE protocol compatibility issue found.
  • Parallel paths: The affected build paths go through AnnIndexColumnWriter; no additional ANN build path was found needing the same train-once handling.
  • Tests: BE unit tests appear inconsistent with the final implementation, and the new regression no longer exercises multi-chunk behavior because chunk flushing was removed. I attempted ./run-be-ut.sh --run --filter=AnnIndexWriterTest.*, but the runner could not complete in this environment because thirdparty/installed/bin/protoc is missing during gensrc.
  • Observability/performance/memory: The new all-segment buffer is a significant memory/performance risk for large segments and bypasses the previous chunk-size bound.
  • User focus: No additional user-provided review focus was present.

}

_float_array.insert(_float_array.end(), p, p + num_rows * dim);
return Status::OK();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes the only bound on _float_array: every add_array_values() call now appends into one buffer and nothing is flushed until finish(). For ANN dimensions this can become very large per segment/build path (load, compaction, schema change, build-index-on-existing-data); e.g. rows * dim * sizeof(float) is no longer capped by ann_index_build_chunk_size. The old code reserved/flushed one chunk, but after this line the config is effectively ignored and memory is not admitted/tracked before the growth. Please keep bounded buffering (or add explicit sampling/training-buffer admission plus a separate bounded add path) rather than retaining the full segment in memory.


// Index requires training (e.g. IVF/PQ). After the fix train() is invoked only
// once on the first full chunk; the remainder reuses the already-trained state.
EXPECT_CALL(*mock_index, needs_training()).WillRepeatedly(testing::Return(true));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These expectations still describe the pre-final chunked behavior. With the final writer implementation, the 12 rows accumulated by add_array_values() are not flushed at chunk size 10; finish() calls _train_once_if_needed(12, ...) and add(12, ...), not train(10), add(10), and add(2). The same stale pattern appears in the later chunk/remainder tests (TestAddMoreThanChunkSizeIVF, TestLargeDataVolume*, TestPQMinTrainRows, TestPQWithSufficientData, TestHnswSkipsTrainCall), so the focused BE UTs will fail once the test binary can be built. Please update the tests to match the final behavior or restore chunked add semantics.

// PQ-codebook correctness, which is what the bug breaks.
sql "set ivf_nprobe = ${nlist}"

setBeConfigTemporary([ann_index_build_chunk_size: chunkSize]) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regression no longer tests what its comments claim. The final AnnIndexColumnWriter::add_array_values() ignores ann_index_build_chunk_size and buffers all rows until finish(), so lowering this config does not create 10 writer flush chunks or exercise the former retrain-on-each-chunk path. As written, the test can pass while the chunk-size behavior is dead/stale. Please either restore chunked flushing and keep this as the multi-chunk regression, or rewrite the test/comment/config usage around the final all-at-finish behavior.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29229 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e0c484aff9848bb7f54930741dc3b53a0c19d94d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17801	4094	4023	4023
q2	q3	10811	1431	844	844
q4	4735	482	350	350
q5	8040	898	608	608
q6	275	186	138	138
q7	841	860	624	624
q8	10224	1591	1539	1539
q9	6758	4467	4464	4464
q10	6857	1832	1560	1560
q11	433	271	250	250
q12	642	436	291	291
q13	18118	3398	2800	2800
q14	265	268	250	250
q15	q16	814	793	710	710
q17	1008	906	956	906
q18	6905	5729	5682	5682
q19	1205	1226	1032	1032
q20	492	395	268	268
q21	5842	2806	2570	2570
q22	450	375	320	320
Total cold run time: 102516 ms
Total hot run time: 29229 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4850	4680	4661	4661
q2	q3	5182	5193	4627	4627
q4	2168	2202	1388	1388
q5	4795	4852	4678	4678
q6	231	183	126	126
q7	1868	1817	1574	1574
q8	2415	2151	1940	1940
q9	7439	7379	7433	7379
q10	4784	4689	4220	4220
q11	531	386	354	354
q12	737	747	530	530
q13	3082	3329	2808	2808
q14	277	291	253	253
q15	q16	679	702	610	610
q17	1272	1260	1245	1245
q18	7299	6835	7079	6835
q19	1148	1073	1113	1073
q20	2196	2221	1947	1947
q21	5256	4597	4431	4431
q22	514	448	412	412
Total cold run time: 56723 ms
Total hot run time: 51091 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 168557 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e0c484aff9848bb7f54930741dc3b53a0c19d94d, data reload: false

query5	4325	625	480	480
query6	449	193	193	193
query7	4859	573	316	316
query8	364	215	198	198
query9	8753	4006	4029	4006
query10	459	298	252	252
query11	5931	2322	2138	2138
query12	154	104	100	100
query13	1302	618	442	442
query14	6331	5386	5052	5052
query14_1	4409	4358	4358	4358
query15	202	197	174	174
query16	1007	457	446	446
query17	933	708	591	591
query18	2462	488	346	346
query19	197	190	152	152
query20	112	108	107	107
query21	218	146	125	125
query22	13783	13509	13350	13350
query23	17212	16477	16192	16192
query23_1	16300	16272	16319	16272
query24	7623	1783	1320	1320
query24_1	1320	1322	1320	1320
query25	599	493	415	415
query26	1305	331	174	174
query27	2751	572	341	341
query28	4499	2056	2028	2028
query29	1115	638	499	499
query30	309	242	205	205
query31	1111	1083	958	958
query32	114	65	61	61
query33	557	331	262	262
query34	1182	1189	661	661
query35	751	791	684	684
query36	1398	1423	1242	1242
query37	156	112	95	95
query38	3196	3149	3056	3056
query39	918	916	892	892
query39_1	883	862	879	862
query40	231	132	107	107
query41	73	71	66	66
query42	100	97	95	95
query43	322	323	286	286
query44	
query45	196	192	184	184
query46	1080	1194	748	748
query47	2438	2395	2274	2274
query48	397	385	305	305
query49	672	498	378	378
query50	995	361	270	270
query51	4339	4345	4284	4284
query52	94	91	79	79
query53	261	270	195	195
query54	295	235	215	215
query55	80	77	71	71
query56	273	261	237	237
query57	1418	1416	1323	1323
query58	249	236	229	229
query59	1624	1699	1445	1445
query60	323	250	239	239
query61	160	158	153	153
query62	682	651	584	584
query63	235	189	182	182
query64	2598	816	609	609
query65	
query66	1809	465	340	340
query67	29749	29057	29522	29057
query68	
query69	426	298	256	256
query70	994	955	946	946
query71	289	227	219	219
query72	3190	2735	2471	2471
query73	866	745	422	422
query74	5115	4938	4772	4772
query75	2647	2593	2221	2221
query76	2316	1156	736	736
query77	362	380	287	287
query78	12445	12435	11793	11793
query79	1440	1056	781	781
query80	1258	470	412	412
query81	526	281	242	242
query82	623	156	124	124
query83	331	291	245	245
query84	309	143	112	112
query85	901	563	448	448
query86	416	316	294	294
query87	3354	3375	3139	3139
query88	3610	2742	2734	2734
query89	436	385	330	330
query90	1875	178	174	174
query91	175	172	139	139
query92	66	61	61	61
query93	1539	1409	898	898
query94	723	363	319	319
query95	671	392	341	341
query96	1017	794	331	331
query97	2687	2693	2562	2562
query98	211	205	205	205
query99	1141	1170	1040	1040
Total cold run time: 252557 ms
Total hot run time: 168557 ms

Guards the fix in "Train ANN index once instead of on every chunk": shrink
ann_index_build_chunk_size so a single 20k-row segment spans 10 chunks, build an
IVF+PQ index, and assert recall@10 (vs exact l2_distance brute force) stays high.

On a buggy BE the PQ codebook is re-trained on every chunk, so vectors written
under earlier codebooks decode against the final codebook and recall collapses
to ~0.1, failing the assertion. On the fixed BE the index is trained once and
recall stays high (>= 0.5; typically ~0.8-0.95).

An IVF+FLAT table loaded from the same data is a positive control: FLAT has no
codebook so it is unaffected and must reach near-exact recall (>= 0.95), proving
the harness can achieve high recall and that a low PQ recall is specifically the
train-reentry bug. An EXPLAIN check asserts the approximate query is pushed into
the ANN index so recall is not a trivial brute-force 1.0.

Marked nonConcurrent because it temporarily lowers ann_index_build_chunk_size
via setBeConfigTemporary.
@yx-keith yx-keith force-pushed the fix-ann-ivf-train-once branch from e0c484a to 24dd1ed Compare June 8, 2026 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants