feat: integrate OMEGA adaptive early termination into zvec #301
Open
driPyf wants to merge 128 commits into alibaba:main from
Integrate OMEGALib repository as a submodule to provide OMEGA adaptive search functionality. The submodule includes GBDT inference, feature extraction, model management, and search context components.
Add OMEGA index components that wrap HNSW with adaptive search capability:
- OmegaSearcher: Wraps HnswSearcher with OMEGA model integration and automatic fallback
- OmegaBuilder: Wraps HnswBuilder for index construction
- OmegaStreamer: Wraps HnswStreamer for streaming operations
- Factory registration for all components
- CMakeLists.txt integration with omega library dependency

OMEGA mode activates when vector count >= threshold and model is loaded, otherwise falls back to standard HNSW transparently.
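The activation rule in this commit message can be sketched as a small predicate. This is only an illustration, not the code from the PR: the `IndexState` class and `search_mode` helper are hypothetical, and the names `should_use_omega` and `min_vector_threshold` are borrowed from the review summary further down the page.

```python
from dataclasses import dataclass


@dataclass
class IndexState:
    """Hypothetical summary of what a searcher knows at query time."""
    vector_count: int
    model_loaded: bool


def should_use_omega(state: IndexState, min_vector_threshold: int) -> bool:
    # Both conditions from the commit message must hold:
    # enough vectors, and a trained model available.
    return state.model_loaded and state.vector_count >= min_vector_threshold


def search_mode(state: IndexState, min_vector_threshold: int) -> str:
    # Anything that fails the check silently takes the plain HNSW path.
    return "omega" if should_use_omega(state, min_vector_threshold) else "hnsw"
```

Because the fallback is decided per call, a caller should not need to know which path served a given query.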
- Add OMEGA index type to zvec type system
- Implement OmegaIndexParams class for index configuration
- Add Python bindings for OmegaIndexParam
- Integrate OMEGA searcher with HNSW fallback mechanism
- Add comprehensive Python unit tests for OMEGA functionality
- Update schema validation to support OMEGA index type

Tests verify that the OMEGA index correctly falls back to HNSW behavior when OMEGA-specific features are not enabled, ensuring full compatibility.
…y recall
- Implement OmegaIndex with ITrainingCapable interface for training support
- Create OmegaStreamer with training mode for feature collection during search
- Add OmegaSearcher adaptive search with OMEGA early stopping prediction
- Implement training data export and collection APIs
- Add OmegaQueryParams and OmegaContext for per-query target_recall specification
- Create omega_params.h and omega_context.h for parameter management
- Update engine_helper to convert and extract OMEGA query parameters
- Integrate training mode with Collection API (enable/disable/export methods)
- Add training data collector, query generator, and model trainer components
- Add Python training API with OmegaTrainer class
- Add debug logging for OMEGA index creation and merge operations
- Adjust HnswSearcher member access modifiers for OMEGA inheritance
- Remove test_omega_fallback.py (replaced by test_collection.py tests)
- Fix memory explosion in training data collection by clearing records after copy
- Add omega_model directory creation before training to fix CSV write failure
- Remove all debug fprintf/fflush statements and empty code blocks
… params
- Parallelize ground truth computation and training searches with std::thread
- Add training_query_id support for thread-safe parallel training
- Add num_training_queries param to OmegaIndexParams (default: 1000)
- Use ef_construction as training search ef instead of hardcoded 1000
Build System Changes:
- Add ZVEC_ENABLE_OMEGA option for conditional OMEGA compilation (default: OFF)
- Add -DZVEC_ENABLE_OMEGA definition when enabled
- Update thirdparty/CMakeLists.txt to conditionally build omega library
- Update src/core/CMakeLists.txt to conditionally compile omega sources
- Update omega submodule to version with LightGBM C API support

Training System Refactor:
- Replace Python subprocess training with native LightGBM C API
  * Remove CSV export and Python _omega_training.py invocation
  * Add direct omega::OmegaTrainer integration via C++ API
  * Remove ExportToCSV, ExportGtCmpsToCSV, InvokePythonTrainer methods
- Add configurable training parameters to OmegaModelTrainerOptions:
  * num_iterations (default: 100)
  * num_leaves (default: 31)
  * learning_rate (default: 0.1)
  * num_threads (default: 8)
- Add type conversion helpers (ConvertRecord, ConvertGtCmpsData)
- Improve training performance

Training Data Collection Improvements:
- Move training record storage from OmegaStreamer to OmegaContext
  * Remove shared collected_records_ vector and training_mutex_ from OmegaStreamer
  * Store records per-query in OmegaContext via add_training_record()
  * Eliminate lock contention during parallel training searches
- Remove legacy GetTrainingRecords/ClearTrainingRecords from OmegaStreamer
- Simplify OmegaIndex training interface (return empty vectors)
- Update omega_streamer.cc to use context-based record collection

Code Cleanup:
- Wrap all OMEGA-dependent code with #ifdef ZVEC_ENABLE_OMEGA guards
- Update OmegaModelTrainerOptions documentation
- Add detailed logging for training record collection
- Improve error handling for missing OmegaContext
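The "Training Data Collection Improvements" above replace one shared, mutex-guarded record vector with per-query storage. A minimal Python sketch of that pattern follows, assuming each query owns its context for the duration of its search. `QueryContext`, `run_training_search`, and `collect_records` are hypothetical stand-ins, and the real change is in C++; only `add_training_record` and `training_query_id` are names taken from the commit messages.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class QueryContext:
    """Hypothetical per-query context that owns its own training records."""
    training_query_id: int
    records: list = field(default_factory=list)

    def add_training_record(self, record):
        # Only the worker running this query touches this list,
        # so no lock is needed here.
        self.records.append(record)


def run_training_search(ctx: QueryContext, num_steps: int) -> QueryContext:
    # Placeholder for a real search: emit one record per simulated step.
    for step in range(num_steps):
        ctx.add_training_record((ctx.training_query_id, step))
    return ctx


def collect_records(num_queries: int, num_steps: int, num_threads: int = 8):
    contexts = [QueryContext(training_query_id=i) for i in range(num_queries)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        done = list(pool.map(lambda c: run_training_search(c, num_steps), contexts))
    # Merge once, on a single thread, after every search has finished.
    return [rec for ctx in done for rec in ctx.records]
```

The merge step is the only place where records from different queries meet, which is why the searches themselves need no synchronization.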
- Expose target_recall parameter for OMEGA adaptive early stopping
- Update OMEGA tests with 100k docs and recall validation
- Remove deprecated _omega_training.py
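For readers unfamiliar with the metric behind target_recall and "recall validation", here is a sketch of the usual recall@k definition. It is an assumption that the tests in this PR measure recall this way; the function names are illustrative and not part of zvec.

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of the true top-k neighbors found in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    # Set intersection avoids double-counting duplicate ids.
    hits = len(set(retrieved_ids[:k]) & truth)
    return hits / len(truth)


def mean_recall(all_retrieved, all_truth, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)
```

A validation test would then compare the mean recall of the adaptive search against the requested target, with the ground truth coming from an exact or high-ef search.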
Major optimization:
- Move training data collection before Flush() to use in-memory graph
- Eliminate ~2 minute disk reload delay for 1M vectors
- Fix GT computation to use correct indexers (was using empty flushed ones)

Training improvements:
- Add ef_groundtruth parameter for faster GT computation using HNSW
- Support parallel training searches with per-query ground truth
- Add window_size parameter for early stopping control
- Expose all OMEGA params through Python API (OmegaIndexParam, OmegaQueryParam)

Code quality:
- Add TIMING logs for performance debugging
- Refactor TrainingDataCollector to use passed indexers instead of segment's
- Clean up training flow in merge_vector_indexer()
…query-side search path

OMEGA integration updates:
- wire the updated omega training and search behavior into zvec index build, load and query execution paths
- expose and propagate OMEGA training/query parameters through the Python API, index params and engine helper conversions
- update omega builder, searcher, streamer and context handling to match the reference behavior more closely

Training and validation updates:
- update training data collection and model training integration for the reference-aligned OMEGA workflow

Performance and debugging updates:
- add an OMEGA prediction microbenchmark for query-side inference analysis
- improve storage/index plumbing needed by the OMEGA workflow
- add query-side diagnostics to investigate early-stop calibration and repeated prediction overhead
…g hooks, and add query-side profiling
Closes #300
Greptile Summary
This PR introduces a new OMEGA index type to zvec, integrating adaptive early termination on top of HNSW. Instead of adding a separate search engine, the implementation keeps HNSW as the underlying graph traversal path and adds a learned query-time stopping policy that decides whether the current search state is already sufficient for a target recall. The implementation spans a new omega algorithm directory, OMEGA-aware searcher/streamer/index support, DB-layer training orchestration, Python bindings, benchmark integration, and Python workflow tests.

Key changes:
- OMEGA is now exposed as a first-class index type with dedicated index params and query params in both the core interfaces and Python bindings.
- OmegaSearcher and OmegaStreamer integrate OMEGA with the existing HNSW search loop through hook callbacks, so online search remains HNSW-based while query-time stop decisions come from OMEGALib.
- … omega_model/ artifacts such as model.txt, threshold_table.txt, interval_table.txt, gt_collected_table.txt, and gt_cmps_all_table.txt.
- … min_vector_threshold.

Issues found:
Confidence Score: 4/5
src/core/algorithm/omega/omega_searcher.cc, src/core/algorithm/omega/omega_streamer.cc, and src/db/training/omega_training_coordinator.cc: these files define the runtime activation/fallback logic, the HNSW hook integration, and the offline training lifecycle that make the feature work end to end.

Important Files Changed
- src/core/algorithm/omega/omega_searcher.cc
- src/core/algorithm/omega/omega_streamer.cc
- src/core/algorithm/omega/omega_context.h: … HnswContext with OMEGA-specific per-query state such as target_recall, training_query_id, and collected training outputs.
- src/core/interface/indexes/omega_index.cc
- src/db/training/omega_training_coordinator.cc: … omega_model/.
- src/db/training/omega_model_trainer.cc
- src/db/index/segment/segment.cc
- src/db/index/column/vector_column/engine_helper.hpp
- src/binding/python/model/param/python_param.cc: … OmegaIndexParam, OmegaQueryParam, and optimize-time retraining options.
- python/tests/test_collection.py
- thirdparty/CMakeLists.txt
- CMakeLists.txt

Sequence Diagram
sequenceDiagram
    participant User
    participant Collection
    participant Segment
    participant OmegaTrainingCoordinator
    participant OMEGALib
    participant OmegaSearcher
    participant HNSW

    Note over User,Collection: Offline build / optimize
    User->>Collection: insert(docs)
    Collection->>Segment: build / persist vector index
    User->>Collection: optimize()
    Collection->>Segment: collect held-out queries and traces
    Segment->>OmegaTrainingCoordinator: training records + gt data
    OmegaTrainingCoordinator->>OMEGALib: train model
    OMEGALib-->>OmegaTrainingCoordinator: model + auxiliary tables
    OmegaTrainingCoordinator-->>Segment: persist omega_model/

    Note over User,Collection: Online query
    User->>Collection: query(VectorQuery, OmegaQueryParam)
    Collection->>OmegaSearcher: search(...)
    OmegaSearcher->>OmegaSearcher: should_use_omega()
    alt model available and threshold satisfied
        OmegaSearcher->>HNSW: search with OMEGA hooks
        HNSW->>OMEGALib: update SearchContext during traversal
        OMEGALib-->>HNSW: stop / continue
        HNSW-->>OmegaSearcher: results
    else fallback
        OmegaSearcher->>HNSW: plain HNSW search
        HNSW-->>OmegaSearcher: results
    end
    OmegaSearcher-->>Collection: results
    Collection-->>User: query results
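The "search with OMEGA hooks" branch of the diagram can be illustrated with a simplified single-layer best-first search that consults an optional stop hook after each node expansion. This is a sketch under strong assumptions: it is not the HNSW implementation in zvec, the hook signature `should_stop(cmps, worst_dist)` is hypothetical, and in the PR the stop/continue decision comes from the OMEGALib model rather than from a hand-written rule.

```python
import heapq


def best_first_search(graph, dist, entry, ef, should_stop=None):
    """Single-layer best-first search; `should_stop` is an optional hook.

    graph: dict mapping node id -> list of neighbor ids
    dist:  callable returning the query distance of a node id
    """
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of nodes still to expand
    results = [(-dist(entry), entry)]     # max-heap holding the best `ef` nodes
    cmps = 1                              # distance computations so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break                         # standard termination condition
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(nb)
            cmps += 1
            if len(results) < ef or nd < -results[0][0]:
                heapq.heappush(candidates, (nd, nb))
                heapq.heappush(results, (-nd, nb))
                if len(results) > ef:
                    heapq.heappop(results)
        # Hook-driven early termination: a stand-in for the learned policy.
        if should_stop is not None and should_stop(cmps, -results[0][0]):
            break
    return sorted((-d, n) for d, n in results), cmps
```

Passing `should_stop=None` gives the plain fallback path, so the same loop serves both branches of the `alt` block. The early-stopped search returns whatever it has found so far, which is why the quality of the stop policy determines whether the target recall is actually met.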