[vector] Add unified vector index integration #8174
leaves12138
left a comment
Thanks for adding the IVF-PQ GlobalIndex integration. I found one blocking issue: the new SPI service file is included in the RAT scan but does not carry the ASF license header. This makes both the new ivfpq_test workflow and a local mvn -pl paimon-ivfpq/paimon-ivfpq-jni,paimon-ivfpq/paimon-ivfpq-index -am -DskipTests -DfailIfNoTests=false verify fail with Too many files with unapproved license: 1.
I also noticed that this PR adds the new JNI facade and GlobalIndex implementation without any src/test coverage under paimon-ivfpq. After fixing the RAT blocker, please add at least basic coverage for the SPI wiring / JNI facade behavior, or an executable test that exercises building and searching an IVF-PQ index with the native library available.
| @@ -0,0 +1 @@ | |||
| org.apache.paimon.ivfpq.index.IvfpqVectorGlobalIndexerFactory | |||
This service file needs the standard ASF license header. It is currently reported by RAT as an unapproved file, so mvn ... verify fails before the new modules can be tested. Existing service files in the repository, such as the Lumina and Tantivy GlobalIndexerFactory service files, include the # Licensed to the Apache Software Foundation ... header.
leaves12138
left a comment
Thanks for the quick update. The previous RAT blocker is fixed, and the main IVF-PQ modules compile locally with -Pfast-build. However, the newly added tests do not compile yet. Options does not provide setBoolean(String, boolean) or setDouble(String, double), so the latest ivfpq_test CI fails during paimon-ivfpq-index test compilation. Please switch these calls to options.set(<ConfigOption>, value) (for example options.set(IvfpqVectorIndexOptions.USE_OPQ, true) and options.set(IvfpqVectorIndexOptions.TRAIN_SAMPLE_RATIO, 0.5)) or use supported string setters before re-running the CI.
| options.setString("ivfpq.distance.metric", "l2"); | ||
| options.setInteger("ivfpq.nlist", 128); | ||
| options.setInteger("ivfpq.m", 8); | ||
| options.setBoolean("ivfpq.use_opq", true); |
This test does not compile because Options has no setBoolean(String, boolean) method. The same applies to the setDouble calls below and the setBoolean call in IvfpqVectorGlobalIndexTest. Please use options.set(IvfpqVectorIndexOptions.USE_OPQ, true) / options.set(IvfpqVectorIndexOptions.TRAIN_SAMPLE_RATIO, 0.5) or supported string setters instead.
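The typed-setter pattern suggested above can be sketched without the Paimon classes. This is a minimal stand-in, not Paimon's `Options` or `ConfigOption`: `DemoKey`, `DemoOptions`, and the key string `ivfpq.train.sample_ratio` are hypothetical names used only for illustration (`ivfpq.use_opq` is the key from the failing test). It shows why one generic `set(key, value)` removes the need for `setBoolean` / `setDouble` overloads.

```java
import java.util.HashMap;
import java.util.Map;

final class TypedOptionsDemo {

    /** Hypothetical typed option key; a stand-in, not Paimon's ConfigOption. */
    static final class DemoKey<T> {
        final String name;
        final T defaultValue;

        DemoKey(String name, T defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }
    }

    /** Hypothetical options bag; a stand-in, not Paimon's Options. */
    static final class DemoOptions {
        private final Map<String, String> values = new HashMap<>();

        // One generic setter: the key's type parameter decides which value type compiles.
        <T> DemoOptions set(DemoKey<T> key, T value) {
            values.put(key.name, String.valueOf(value));
            return this;
        }

        boolean getBoolean(DemoKey<Boolean> key) {
            String raw = values.get(key.name);
            return raw == null ? key.defaultValue : Boolean.parseBoolean(raw);
        }

        double getDouble(DemoKey<Double> key) {
            String raw = values.get(key.name);
            return raw == null ? key.defaultValue : Double.parseDouble(raw);
        }
    }

    // "ivfpq.use_opq" appears in the test under review; the second key name is assumed.
    static final DemoKey<Boolean> USE_OPQ = new DemoKey<>("ivfpq.use_opq", false);
    static final DemoKey<Double> TRAIN_SAMPLE_RATIO =
            new DemoKey<>("ivfpq.train.sample_ratio", 1.0);

    public static void main(String[] args) {
        DemoOptions options = new DemoOptions()
                .set(USE_OPQ, true)
                .set(TRAIN_SAMPLE_RATIO, 0.5);
        System.out.println(options.getBoolean(USE_OPQ));
        System.out.println(options.getDouble(TRAIN_SAMPLE_RATIO));
    }
}
```

With typed keys, passing a value of the wrong type (for example a string for `USE_OPQ`) is rejected at compile time rather than at test run time.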
leaves12138
left a comment
Thanks for the updates. I re-reviewed the latest revision and the previous blockers look resolved:
- the new GlobalIndexer SPI service file now carries the ASF license header;
- the tests no longer use unsupported Options#setBoolean/setDouble APIs;
- the vector-index modules compile cleanly locally;
- native end-to-end vector index tests pass after building paimon-vector-index JNI and copying libpaimon_vindex_jni.so into the test resources, matching the new workflow.
Local checks I ran:
mvn -B -ntp -pl paimon-vector/paimon-vector-jni,paimon-vector/paimon-vector-index -am -DskipTests -DfailIfNoTests=false -Dcheckstyle.skip=true -Dspotless.check.skip=true -Drat.skip=false verify
cargo build --release -p paimon-vindex-jni
mvn -B -ntp -pl paimon-vector/paimon-vector-jni,paimon-vector/paimon-vector-index -Dcheckstyle.skip=true -Dspotless.check.skip=true test
The module test run completed with Tests run: 21, Failures: 0, Errors: 0, Skipped: 0 once the native library was available. LGTM.
Integrate apache/paimon-vector-index (pure Rust IVF-PQ) into Paimon's GlobalIndex SPI framework. Follows the paimon-tantivy two-level module pattern: paimon-ivfpq-jni for Java JNI bindings and NativeLoader, paimon-ivfpq-index for Paimon GlobalIndexer integration.
Key features:
- IVF-PQ vector index with identifier "ivfpq"
- Native Roaring bitmap filter pushdown (byte[] format)
- Direct stream I/O via JNI (no adapter classes needed)
- Reservoir sampling for training with configurable sample ratio
- Batched vector insertion for memory efficiency
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
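For readers unfamiliar with the reservoir sampling mentioned in the commit message, here is a minimal sketch of the classic single-pass algorithm (Algorithm R). It is not taken from this PR: the class name, the method signature, and the choice to pass the reservoir size `k` directly (rather than deriving it from a sample ratio) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

final class ReservoirSampleDemo {

    /**
     * Keeps a uniform random sample of at most k items from a stream of unknown
     * length, using O(k) memory. Illustrative only; not the PR's implementation.
     */
    static <T> List<T> reservoirSample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(Math.max(k, 0));
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items are always kept.
                reservoir.add(item);
            } else {
                // Replace phase: draw a slot in [0, seen); the item is kept
                // only if the slot falls inside the reservoir.
                long slot = (long) (rnd.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        // Sample 10 row ids; the seed is fixed only to make runs repeatable.
        System.out.println(reservoirSample(rows.iterator(), 10, new Random(7)));
    }
}
```

When the stream holds fewer than `k` items, every item is returned in its original order.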
4710870 to
3adde86
leaves12138
left a comment
Thanks for the update. I found one blocker in the incremental changes.
VectorGlobalIndexWriter.addVectorsFromTempFile now uses fixed-size batchIds / batchVectors arrays with ADD_BATCH_SIZE = 10000, but the last batch often has thisBatch < ADD_BATCH_SIZE. The code still passes the full arrays to writer.addVectors(batchIds, batchVectors, thisBatch), while the native layer validates vectors.length == count * dimension.
This is already reproduced by the current vector_index_test CI: the native end-to-end tests fail with errors like add_vectors: vector data length 20000 does not match vector count * dimension 6 when a final partial batch is added.
Please trim/copy the arrays for the partial batch before calling native addVectors, or expose/use an API that accepts offset/count without requiring the backing array length to match exactly.
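A minimal sketch of the first suggestion (trimming the arrays for the partial batch), assuming vectors are stored row-major in a flat float[]. `FakeWriter` is a hypothetical stand-in for the native writer that only reproduces the length check described above; the in-memory input arrays replace the temp-file read, and the small batch size is for demonstration.

```java
import java.util.Arrays;

final class PartialBatchDemo {

    /** Hypothetical stand-in for the native writer; it only mimics the length check. */
    static final class FakeWriter {
        final int dimension;
        int added;

        FakeWriter(int dimension) {
            this.dimension = dimension;
        }

        void addVectors(long[] ids, float[] vectors, int count) {
            if (vectors.length != count * dimension) {
                throw new IllegalArgumentException(
                        "vector data length " + vectors.length
                                + " does not match vector count * dimension "
                                + (count * dimension));
            }
            added += count;
        }
    }

    /** Adds all vectors in batches, trimming the reusable buffers for the last batch. */
    static int addAll(FakeWriter writer, long[] allIds, float[] allVectors, int batchSize) {
        int dim = writer.dimension;
        long[] batchIds = new long[batchSize];
        float[] batchVectors = new float[batchSize * dim];
        int calls = 0;
        for (int start = 0; start < allIds.length; start += batchSize) {
            int thisBatch = Math.min(batchSize, allIds.length - start);
            System.arraycopy(allIds, start, batchIds, 0, thisBatch);
            System.arraycopy(allVectors, start * dim, batchVectors, 0, thisBatch * dim);
            if (thisBatch == batchSize) {
                // Full batch: the reusable buffers already have the exact length.
                writer.addVectors(batchIds, batchVectors, thisBatch);
            } else {
                // Partial batch: copy to exact-length arrays so the length check passes.
                writer.addVectors(
                        Arrays.copyOf(batchIds, thisBatch),
                        Arrays.copyOf(batchVectors, thisBatch * dim),
                        thisBatch);
            }
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        FakeWriter writer = new FakeWriter(2);
        long[] ids = {0, 1, 2, 3, 4, 5, 6};
        float[] vectors = new float[ids.length * 2];
        int calls = addAll(writer, ids, vectors, 3);
        System.out.println(calls + " calls, " + writer.added + " vectors added");
    }
}
```

The copy happens at most once per file, on the final batch, so the reusable buffers still avoid per-batch allocation in the common case.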
leaves12138
left a comment
Thanks for the update. I rechecked the incremental change and the previous partial-batch issue in VectorGlobalIndexWriter is fixed by trimming the id/vector arrays before calling native addVectors when the last batch is smaller than ADD_BATCH_SIZE. The dedicated vector_index_test check is green now. LGTM.
Summary
Integrate apache/paimon-vector-index with Paimon GlobalIndex SPI as the new paimon-vector module. The PR follows the latest upstream unified Java/JNI vector index API and supports multiple ANN algorithms instead of IVF-PQ only.
Changes
- Add paimon-vector with two submodules:
  - paimon-vector-jni: Java JNI bindings and native library loading.
  - paimon-vector-index: Paimon GlobalIndex reader/writer integration.
- ivf-flat, ivf-pq, ivf-hnsw-flat, ivf-hnsw-sq
- Make VectorGlobalIndexerFactory the abstract base class and let each concrete factory provide the native vector index type.
- index_type; no Paimon-facing vector.index.type option is introduced.
- paimon-vector-index API:
  - org.apache.paimon.index.vector, matching current native JNI symbols.
  - VectorIndexWriter accepts upstream-style Map<String, String> options.
  - VectorIndexReader opens directly from VectorIndexInput and exposes string metadata for index type and metric.
- VectorIndexOptions, VectorMetric, and JNI-owned enum wrappers. Vector build/search parameters are documented as vector.* dynamic options and are read directly from Options where needed.
- nprobe and ef_search; dimension and metric are read from the native vector index metadata.
- vector.* options, and update multimodal/Flink procedure docs to use the new vector index names.
- Add utcase-vector-index CI workflow and focused tests for SPI registration, metadata, validation, and search behavior.
Documented Vector Options
| Option | Default | Description |
|---|---|---|
| vector.index.dimension | 128 | ARRAY<FLOAT> columns. VECTOR<FLOAT> columns use the type dimension. |
| vector.distance.metric | inner_product | l2, cosine, or inner_product. |
| vector.nlist | 256 | |
| vector.pq.m | 16 | ivf-pq; the vector dimension must be divisible by this value. |
| vector.pq.use-opq | false | ivf-pq. |
| vector.hnsw.m | 20 | ivf-hnsw-flat and ivf-hnsw-sq. |
| vector.hnsw.ef-construction | 150 | |
| vector.hnsw.max-level | 7 | |
| vector.nprobe | 16 | |
| vector.hnsw.ef-search | 0 | 0 uses the native library default. |
| vector.train.sample-ratio | 1.0 | |
| vector.add.batch-size | 10000 | |
Notes
- ivfpq.* options, old module names, old JNI package names, or the previous JNI config-object API because this integration has not been released.
- vector naming.
Testing
- mvn -pl paimon-vector/paimon-vector-index spotless:apply -DfailIfNoTests=false
- mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest test
- mvn -pl paimon-vector/paimon-vector-index -am -DskipTests -DfailIfNoTests=false compile
- mvn -pl paimon-vector/paimon-vector-index -am -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest test
- mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest clean test
- git diff --check
Native-library-dependent test cases are skipped locally when the native library is unavailable. The dedicated vector CI workflow builds the latest native library and should exercise those tests on Linux.