feat: integrate OMEGA adaptive early termination into zvec #301

Open
driPyf wants to merge 128 commits into alibaba:main from driPyf:feat/omega-integration

driPyf commented Apr 1, 2026

Closes #300

Greptile Summary

This PR introduces a new OMEGA index type to zvec, integrating adaptive early termination on top of HNSW. Instead of adding a separate search engine, the implementation keeps HNSW as the underlying graph traversal path and adds a learned query-time stopping policy that decides whether the current search state is already sufficient for a target recall. The implementation spans a new omega algorithm directory, OMEGA-aware searcher/streamer/index support, DB-layer training orchestration, Python bindings, benchmark integration, and Python workflow tests.

Key changes:

  • OMEGA is now exposed as a first-class index type with dedicated index params and query params in both the core interfaces and Python bindings.
  • OmegaSearcher and OmegaStreamer integrate OMEGA with the existing HNSW search loop through hook callbacks, so online search remains HNSW-based while query-time stop decisions come from OMEGALib.
  • The offline pipeline now includes held-out query generation, ground-truth collection, search-trace collection, model training, and persistence of omega_model/ artifacts such as model.txt, threshold_table.txt, interval_table.txt, gt_collected_table.txt, and gt_cmps_all_table.txt.
  • Runtime behavior still falls back to plain HNSW when no model is available or the dataset is below min_vector_threshold.
  • Python workflow tests now cover insert → optimize/train → online OMEGA query, as well as the fallback-to-HNSW path when OMEGA is inactive.
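The activation rule described in the list above (OMEGA only when a model is available and the dataset has reached min_vector_threshold, plain HNSW otherwise) can be sketched as follows. This is a minimal illustration, not the code in omega_searcher.cc: `SearcherState` and `search_mode` are hypothetical names, and only `should_use_omega` and `min_vector_threshold` come from the PR text.

```python
from dataclasses import dataclass


@dataclass
class SearcherState:
    """Hypothetical stand-in for the state the searcher consults per query."""
    model_loaded: bool
    vector_count: int
    min_vector_threshold: int


def should_use_omega(state: SearcherState) -> bool:
    # OMEGA is active only when a trained model is loaded AND the dataset
    # has at least min_vector_threshold vectors.
    return state.model_loaded and state.vector_count >= state.min_vector_threshold


def search_mode(state: SearcherState) -> str:
    # Any failed precondition falls back to the plain HNSW path.
    return "omega" if should_use_omega(state) else "hnsw"
```

Under these assumptions, `search_mode(SearcherState(True, 5_000, 10_000))` selects the HNSW path because the dataset is below the threshold, even though a model is loaded.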

Issues found:

  • The current integration makes OMEGALib a required build dependency, so environments without working OpenMP or LightGBM toolchains will fail at configure time rather than transparently building without OMEGA.
  • The Python workflow tests validate that the OMEGA query path runs and that fallback results match HNSW, but they do not directly assert that adaptive early termination actually triggered inside OMEGALib.
  • The current query-time validation is intentionally behavioral rather than log- or metric-driven, so regressions in “OMEGA active but not stopping early” may not be caught by Python tests alone.

Confidence Score: 4/5

  • Likely safe to merge after normal build validation. The overall design is coherent: HNSW remains the underlying search engine, OMEGA is layered as query-time control, fallback behavior is preserved, and the offline training loop is integrated into the DB lifecycle rather than living in an external script.
  • The remaining risks are mainly around build environment requirements and the fact that Python tests validate end-to-end behavior more strongly than internal early-stop activation.
  • Pay close attention to src/core/algorithm/omega/omega_searcher.cc, src/core/algorithm/omega/omega_streamer.cc, and src/db/training/omega_training_coordinator.cc — these files define the runtime activation/fallback logic, the HNSW hook integration, and the offline training lifecycle that make the feature work end to end.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/core/algorithm/omega/omega_searcher.cc | OMEGA-aware searcher that loads persisted model artifacts, decides whether OMEGA should be active for a query, and falls back to plain HNSW when the model is unavailable or the dataset is below threshold. |
| src/core/algorithm/omega/omega_streamer.cc | Streaming/search-time integration layer that wires OMEGA into the HNSW loop through hooks, supports both inference and training modes, and persists the searcher-side OMEGA params alongside the index. |
| src/core/algorithm/omega/omega_context.h | Extends HnswContext with OMEGA-specific per-query state such as target_recall, training_query_id, and collected training outputs. |
| src/core/interface/indexes/omega_index.cc | Framework-facing OMEGA index wrapper that routes build/load/search lifecycle through the correct OMEGA-aware streamer/searcher path. |
| src/db/training/omega_training_coordinator.cc | Coordinates held-out query reuse, training-data collection, retraining flow, and model output management under omega_model/. |
| src/db/training/omega_model_trainer.cc | Bridges zvec-collected training records into OMEGALib’s C++ training API and writes the full set of OMEGA model artifacts. |
| src/db/index/segment/segment.cc | Adds OMEGA auto-training and retraining hooks into the segment optimize/build lifecycle. |
| src/db/index/column/vector_column/engine_helper.hpp | DB-to-core translation layer for OMEGA index params and runtime searcher config. |
| src/binding/python/model/param/python_param.cc | Python bindings for OmegaIndexParam, OmegaQueryParam, and optimize-time retraining options. |
| python/tests/test_collection.py | Python workflow tests covering OMEGA training, OMEGA query execution, and fallback-to-HNSW behavior. |
| thirdparty/CMakeLists.txt | Build integration for OMEGALib and its dependency wiring into zvec. |
| CMakeLists.txt | Top-level build wiring for OMEGA-related core libraries and tools. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Collection
    participant Segment
    participant OmegaTrainingCoordinator
    participant OMEGALib
    participant OmegaSearcher
    participant HNSW

    Note over User,Collection: Offline build / optimize
    User->>Collection: insert(docs)
    Collection->>Segment: build / persist vector index
    User->>Collection: optimize()
    Collection->>Segment: collect held-out queries and traces
    Segment->>OmegaTrainingCoordinator: training records + gt data
    OmegaTrainingCoordinator->>OMEGALib: train model
    OMEGALib-->>OmegaTrainingCoordinator: model + auxiliary tables
    OmegaTrainingCoordinator-->>Segment: persist omega_model/

    Note over User,Collection: Online query
    User->>Collection: query(VectorQuery, OmegaQueryParam)
    Collection->>OmegaSearcher: search(...)
    OmegaSearcher->>OmegaSearcher: should_use_omega()

    alt model available and threshold satisfied
        OmegaSearcher->>HNSW: search with OMEGA hooks
        HNSW->>OMEGALib: update SearchContext during traversal
        OMEGALib-->>HNSW: stop / continue
        HNSW-->>OmegaSearcher: results
    else fallback
        OmegaSearcher->>HNSW: plain HNSW search
        HNSW-->>OmegaSearcher: results
    end

    OmegaSearcher-->>Collection: results
    Collection-->>User: query results
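The "search with OMEGA hooks" branch of the diagram can be illustrated with a toy best-first search over a single graph layer that consults a stop callback after each node expansion. This is a sketch under stated assumptions: the hook signature, `layer_search`, and the comparison-budget hook in the usage note are hypothetical, and the learned stopping model in OMEGALib is replaced here by an arbitrary callable.

```python
import heapq
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical hook signature: (comparisons so far, best distance so far) -> True to stop.
StopHook = Callable[[int, float], bool]


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def layer_search(
    graph: Dict[int, List[int]],
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    entry: int,
    ef: int,
    stop_hook: Optional[StopHook] = None,
) -> Tuple[List[Tuple[float, int]], int]:
    """Best-first search on one layer; returns (sorted results, distance comparisons)."""
    best = l2_squared(vectors[entry], query)
    comparisons = 1
    visited = {entry}
    candidates = [(best, entry)]   # min-heap of nodes still to expand
    results = [(-best, entry)]     # max-heap (negated distances) of the best ef nodes
    while candidates:
        dist, node = heapq.heappop(candidates)
        # Standard termination: the closest candidate is worse than the worst result.
        if len(results) >= ef and dist > -results[0][0]:
            break
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d = l2_squared(vectors[neighbor], query)
            comparisons += 1
            best = min(best, d)
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, neighbor))
                heapq.heappush(results, (-d, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
        # Adaptive termination: the hook may end the search before the
        # standard condition fires.
        if stop_hook is not None and stop_hook(comparisons, best):
            break
    return sorted((-neg, node) for neg, node in results), comparisons
```

Passing `stop_hook=None` gives the plain path from the fallback branch. Passing something like `lambda cmps, best: cmps >= 5` ends the traversal early. Because the function also returns the comparison count, a test can compare the two runs to check that early termination actually fired, rather than only that the query succeeded.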

driPyf and others added 30 commits January 29, 2026 02:27
Integrate OMEGALib repository as a submodule to provide OMEGA adaptive search functionality. The submodule includes GBDT inference, feature extraction, model management, and search context components.

Add OMEGA index components that wrap HNSW with adaptive search capability:
- OmegaSearcher: Wraps HnswSearcher with OMEGA model integration and automatic fallback
- OmegaBuilder: Wraps HnswBuilder for index construction
- OmegaStreamer: Wraps HnswStreamer for streaming operations
- Factory registration for all components
- CMakeLists.txt integration with omega library dependency

OMEGA mode activates when vector count >= threshold and model is loaded, otherwise falls back to standard HNSW transparently.

- Add OMEGA index type to zvec type system
- Implement OmegaIndexParams class for index configuration
- Add Python bindings for OmegaIndexParam
- Integrate OMEGA searcher with HNSW fallback mechanism
- Add comprehensive Python unit tests for OMEGA functionality
- Update schema validation to support OMEGA index type

Tests verify that OMEGA index correctly falls back to HNSW behavior
when OMEGA-specific features are not enabled, ensuring full compatibility.

…y recall

- Implement OmegaIndex with ITrainingCapable interface for training support
- Create OmegaStreamer with training mode for feature collection during search
- Add OmegaSearcher adaptive search with OMEGA early stopping prediction
- Implement training data export and collection APIs
- Add OmegaQueryParams and OmegaContext for per-query target_recall specification
- Create omega_params.h and omega_context.h for parameter management
- Update engine_helper to convert and extract OMEGA query parameters
- Integrate training mode with Collection API (enable/disable/export methods)
- Add training data collector, query generator, and model trainer components
- Add Python training API with OmegaTrainer class
- Add debug logging for OMEGA index creation and merge operations
- Adjust HnswSearcher member access modifiers for OMEGA inheritance
- Remove test_omega_fallback.py (replaced by test_collection.py tests)
- Fix memory explosion in training data collection by clearing records after copy
- Add omega_model directory creation before training to fix CSV write failure
- Remove all debug fprintf/fflush statements and empty code blocks

… params

- Parallelize ground truth computation and training searches with std::thread
- Add training_query_id support for thread-safe parallel training
- Add num_training_queries param to OmegaIndexParams (default: 1000)
- Use ef_construction as training search ef instead of hardcoded 1000
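The parallel ground-truth step in this commit can be sketched roughly as below. The commit uses std::thread in C++; this illustration substitutes Python's `ThreadPoolExecutor` and an exact brute-force top-k, both of which are assumptions. The function names are hypothetical. The point is that each task carries its own query id, so results land in a slot owned by that query and no shared mutable list is needed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ground_truth_for(
    query_id: int, query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> Tuple[int, List[int]]:
    # Exact top-k by brute force; ties are broken by vector id for determinism.
    order = sorted(range(len(vectors)), key=lambda i: (l2_squared(vectors[i], query), i))
    return query_id, order[:k]


def compute_ground_truth(
    queries: Sequence[Sequence[float]],
    vectors: Sequence[Sequence[float]],
    k: int,
    num_threads: int = 8,
) -> List[List[int]]:
    ground_truth: List[List[int]] = [[] for _ in queries]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(ground_truth_for, qid, query, vectors, k)
            for qid, query in enumerate(queries)
        ]
        for future in futures:
            qid, ids = future.result()
            ground_truth[qid] = ids  # each query id owns exactly one slot
    return ground_truth
```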
Build System Changes:
- Add ZVEC_ENABLE_OMEGA option for conditional OMEGA compilation (default: OFF)
- Add -DZVEC_ENABLE_OMEGA definition when enabled
- Update thirdparty/CMakeLists.txt to conditionally build omega library
- Update src/core/CMakeLists.txt to conditionally compile omega sources
- Update omega submodule to version with LightGBM C API support

Training System Refactor:
- Replace Python subprocess training with native LightGBM C API
  * Remove CSV export and Python _omega_training.py invocation
  * Add direct omega::OmegaTrainer integration via C++ API
  * Remove ExportToCSV, ExportGtCmpsToCSV, InvokePythonTrainer methods
- Add configurable training parameters to OmegaModelTrainerOptions:
  * num_iterations (default: 100)
  * num_leaves (default: 31)
  * learning_rate (default: 0.1)
  * num_threads (default: 8)
- Add type conversion helpers (ConvertRecord, ConvertGtCmpsData)
- Improve training performance
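For reference, the four configurable training parameters and their stated defaults can be mirrored in a small options object. This is a hypothetical Python analogue of OmegaModelTrainerOptions, not the C++ struct itself. The keys in `to_lightgbm_params` are standard LightGBM parameter names, but whether the PR passes them in exactly this form is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class TrainerOptions:
    """Hypothetical mirror of the defaults listed in the commit message."""
    num_iterations: int = 100
    num_leaves: int = 31
    learning_rate: float = 0.1
    num_threads: int = 8

    def __post_init__(self) -> None:
        # Reject values that no GBDT trainer could use.
        if min(self.num_iterations, self.num_leaves, self.num_threads) <= 0:
            raise ValueError("iteration, leaf, and thread counts must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")

    def to_lightgbm_params(self) -> Dict[str, Union[int, float]]:
        # Parameter names follow LightGBM's documented configuration keys.
        return {
            "num_iterations": self.num_iterations,
            "num_leaves": self.num_leaves,
            "learning_rate": self.learning_rate,
            "num_threads": self.num_threads,
        }
```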

Training Data Collection Improvements:
- Move training record storage from OmegaStreamer to OmegaContext
  * Remove shared collected_records_ vector and training_mutex_ from OmegaStreamer
  * Store records per-query in OmegaContext via add_training_record()
  * Eliminate lock contention during parallel training searches
- Remove legacy GetTrainingRecords/ClearTrainingRecords from OmegaStreamer
- Simplify OmegaIndex training interface (return empty vectors)
- Update omega_streamer.cc to use context-based record collection
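The move from a shared, mutex-guarded record vector to per-query storage can be sketched as follows. `QueryContext`, `run_training_search`, and `collect_records` are hypothetical names; only `add_training_record` and `training_query_id` appear in the PR text, and the record contents here are placeholders. Each worker appends only to its own context, so no lock is needed during the searches, and the records are merged once at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueryContext:
    """Hypothetical per-query context that owns its training records."""
    training_query_id: int
    records: List[Dict[str, int]] = field(default_factory=list)

    def add_training_record(self, record: Dict[str, int]) -> None:
        # Only the thread running this query touches this list.
        self.records.append(record)


def run_training_search(ctx: QueryContext, steps: int) -> QueryContext:
    # Placeholder for a training-mode search that emits one record per step.
    for step in range(steps):
        ctx.add_training_record({"query_id": ctx.training_query_id, "step": step})
    return ctx


def collect_records(num_queries: int, steps: int, num_threads: int = 4) -> List[Dict[str, int]]:
    contexts = [QueryContext(qid) for qid in range(num_queries)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        finished = list(pool.map(lambda ctx: run_training_search(ctx, steps), contexts))
    merged: List[Dict[str, int]] = []
    for ctx in finished:
        merged.extend(ctx.records)
        ctx.records.clear()  # release per-query memory once the records are copied
    return merged
```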

Code Cleanup:
- Wrap all OMEGA-dependent code with #ifdef ZVEC_ENABLE_OMEGA guards
- Update OmegaModelTrainerOptions documentation
- Add detailed logging for training record collection
- Improve error handling for missing OmegaContext
- Expose target_recall parameter for OMEGA adaptive early stopping
- Update OMEGA tests with 100k docs and recall validation
- Remove deprecated _omega_training.py
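The recall validation mentioned for the updated tests presumably compares retrieved ids against ground truth. A minimal recall@k helper, assuming the usual definition (fraction of the true top-k that appears in the retrieved top-k), might look like this. It is not taken from test_collection.py, and both function names are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = len(set(retrieved[:k]) & truth)
    return hits / len(truth)


def mean_recall_at_k(
    all_retrieved: List[Sequence[int]], all_truth: List[Sequence[int]], k: int
) -> float:
    # Average recall over all queries, as a test would compare against target_recall.
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_truth)]
    return sum(scores) / len(scores)
```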
Major optimization:
- Move training data collection before Flush() to use in-memory graph
- Eliminate ~2 minute disk reload delay for 1M vectors
- Fix GT computation to use correct indexers (was using empty flushed ones)

Training improvements:
- Add ef_groundtruth parameter for faster GT computation using HNSW
- Support parallel training searches with per-query ground truth
- Add window_size parameter for early stopping control
- Expose all OMEGA params through Python API (OmegaIndexParam, OmegaQueryParam)

Code quality:
- Add TIMING logs for performance debugging
- Refactor TrainingDataCollector to use passed indexers instead of segment's
- Clean up training flow in merge_vector_indexer()

…query-side search path

OMEGA integration updates:
- wire the updated omega training and search behavior into zvec index build, load and query execution paths
- expose and propagate OMEGA training/query parameters through the Python API, index params and engine helper conversions
- update omega builder, searcher, streamer and context handling to match the reference behavior more closely

Training and validation updates:
- update training data collection and model training integration for the reference-aligned OMEGA workflow

Performance and debugging updates:
- add an OMEGA prediction microbenchmark for query-side inference analysis
- improve storage/index plumbing needed by the OMEGA workflow
- add query-side diagnostics to investigate early-stop calibration and repeated prediction overhead
driPyf requested a review from iaojnh as a code owner on April 2, 2026 08:15
Development

Successfully merging this pull request may close these issues.

[Feature]: Integrate OMEGA adaptive early termination into zvec

7 participants