feat: integrate OMEGA adaptive early termination into zvec by driPyf · Pull Request #301 · alibaba/zvec

driPyf · 2026-04-01T12:54:09Z

Closes #300

Greptile Summary

This PR introduces a new OMEGA index type to zvec, integrating adaptive early termination on top of HNSW. Instead of adding a separate search engine, the implementation keeps HNSW as the underlying graph traversal path and adds a learned query-time stopping policy that decides whether the current search state is already sufficient for a target recall. The implementation spans a new omega algorithm directory, OMEGA-aware searcher/streamer/index support, DB-layer training orchestration, Python bindings, benchmark integration, and Python workflow tests.

Key changes:

OMEGA is now exposed as a first-class index type with dedicated index params and query params in both the core interfaces and Python bindings.
OmegaSearcher and OmegaStreamer integrate OMEGA with the existing HNSW search loop through hook callbacks, so online search remains HNSW-based while query-time stop decisions come from OMEGALib.
The offline pipeline now includes held-out query generation, ground-truth collection, search-trace collection, model training, and persistence of omega_model/ artifacts such as model.txt, threshold_table.txt, interval_table.txt, gt_collected_table.txt, and gt_cmps_all_table.txt.
Runtime behavior still falls back to plain HNSW when no model is available or the dataset is below min_vector_threshold.
Python workflow tests now cover insert → optimize/train → online OMEGA query, as well as the fallback-to-HNSW path when OMEGA is inactive.
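The activation rule described above (use OMEGA only when a model is loaded and the dataset reaches `min_vector_threshold`, otherwise plain HNSW) can be sketched as follows. This is a minimal illustration, not the code in `omega_searcher.cc`: the `OmegaState` class and `choose_search_path` helper are hypothetical names, and only `should_use_omega` and `min_vector_threshold` come from the PR text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OmegaState:
    """Hypothetical stand-in for what the searcher knows about the index."""
    model_path: Optional[str]   # None when no omega_model/ artifacts were loaded
    vector_count: int           # number of vectors in the index
    min_vector_threshold: int   # below this size OMEGA stays inactive


def should_use_omega(state: OmegaState) -> bool:
    """OMEGA is active only if a model is loaded and the index is large enough."""
    model_loaded = state.model_path is not None
    large_enough = state.vector_count >= state.min_vector_threshold
    return model_loaded and large_enough


def choose_search_path(state: OmegaState) -> str:
    """Return which search path a query would take under this rule."""
    return "omega" if should_use_omega(state) else "hnsw"
```

For example, `choose_search_path(OmegaState(None, 100_000, 10_000))` selects the plain HNSW path because no model is available.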

Issues found:

The current integration makes OMEGALib a required build dependency, so environments without working OpenMP or LightGBM toolchains will fail at configure time rather than transparently building without OMEGA.
The Python workflow tests validate that the OMEGA query path runs and that fallback results match HNSW, but they do not directly assert that adaptive early termination actually triggered inside OMEGALib.
The current query-time validation is intentionally behavioral rather than log- or metric-driven, so regressions in “OMEGA active but not stopping early” may not be caught by Python tests alone.

Confidence Score: 4/5

Likely safe to merge after normal build validation. The overall design is coherent: HNSW remains the underlying search engine, OMEGA is layered as query-time control, fallback behavior is preserved, and the offline training loop is integrated into the DB lifecycle rather than living in an external script.
The remaining risks are mainly around build environment requirements and the fact that Python tests validate end-to-end behavior more strongly than internal early-stop activation.
Pay close attention to src/core/algorithm/omega/omega_searcher.cc, src/core/algorithm/omega/omega_streamer.cc, and src/db/training/omega_training_coordinator.cc — these files define the runtime activation/fallback logic, the HNSW hook integration, and the offline training lifecycle that make the feature work end to end.

Important Files Changed

Filename	Overview
`src/core/algorithm/omega/omega_searcher.cc`	OMEGA-aware searcher that loads persisted model artifacts, decides whether OMEGA should be active for a query, and falls back to plain HNSW when the model is unavailable or the dataset is below threshold.
`src/core/algorithm/omega/omega_streamer.cc`	Streaming/search-time integration layer that wires OMEGA into the HNSW loop through hooks, supports both inference and training modes, and persists the searcher-side OMEGA params alongside the index.
`src/core/algorithm/omega/omega_context.h`	Extends `HnswContext` with OMEGA-specific per-query state such as `target_recall`, `training_query_id`, and collected training outputs.
`src/core/interface/indexes/omega_index.cc`	Framework-facing OMEGA index wrapper that routes build/load/search lifecycle through the correct OMEGA-aware streamer/searcher path.
`src/db/training/omega_training_coordinator.cc`	Coordinates held-out query reuse, training-data collection, retraining flow, and model output management under `omega_model/`.
`src/db/training/omega_model_trainer.cc`	Bridges zvec-collected training records into OMEGALib’s C++ training API and writes the full set of OMEGA model artifacts.
`src/db/index/segment/segment.cc`	Adds OMEGA auto-training and retraining hooks into the segment optimize/build lifecycle.
`src/db/index/column/vector_column/engine_helper.hpp`	DB-to-core translation layer for OMEGA index params and runtime searcher config.
`src/binding/python/model/param/python_param.cc`	Python bindings for `OmegaIndexParam`, `OmegaQueryParam`, and optimize-time retraining options.
`python/tests/test_collection.py`	Python workflow tests covering OMEGA training, OMEGA query execution, and fallback-to-HNSW behavior.
`thirdparty/CMakeLists.txt`	Build integration for OMEGALib and its dependency wiring into zvec.
`CMakeLists.txt`	Top-level build wiring for OMEGA-related core libraries and tools.

Sequence Diagram

sequenceDiagram
    participant User
    participant Collection
    participant Segment
    participant OmegaTrainingCoordinator
    participant OMEGALib
    participant OmegaSearcher
    participant HNSW

    Note over User,Collection: Offline build / optimize
    User->>Collection: insert(docs)
    Collection->>Segment: build / persist vector index
    User->>Collection: optimize()
    Collection->>Segment: collect held-out queries and traces
    Segment->>OmegaTrainingCoordinator: training records + gt data
    OmegaTrainingCoordinator->>OMEGALib: train model
    OMEGALib-->>OmegaTrainingCoordinator: model + auxiliary tables
    OmegaTrainingCoordinator-->>Segment: persist omega_model/

    Note over User,Collection: Online query
    User->>Collection: query(VectorQuery, OmegaQueryParam)
    Collection->>OmegaSearcher: search(...)
    OmegaSearcher->>OmegaSearcher: should_use_omega()

    alt model available and threshold satisfied
        OmegaSearcher->>HNSW: search with OMEGA hooks
        HNSW->>OMEGALib: update SearchContext during traversal
        OMEGALib-->>HNSW: stop / continue
        HNSW-->>OmegaSearcher: results
    else fallback
        OmegaSearcher->>HNSW: plain HNSW search
        HNSW-->>OmegaSearcher: results
    end

    OmegaSearcher-->>Collection: results
    Collection-->>User: query results
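The stop/continue exchange in the online half of the diagram can be illustrated with a toy search loop that consults a hook after every visited candidate. This is a sketch under loud assumptions: a linear scan over 1-D points stands in for HNSW graph traversal, a fixed distance threshold stands in for the learned OMEGALib policy, and all names (`scan_with_hook`, `StopHook`) are hypothetical. It also shows one way a test could assert that early termination actually triggered, by comparing visit counts.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

# Hook signature (assumed): (visited_count, current_kth_distance) -> stop?
StopHook = Callable[[int, float], bool]


def scan_with_hook(
    query: float, points: Sequence[float], k: int, should_stop: StopHook
) -> Tuple[List[Tuple[float, int]], int]:
    """Scan candidates in order, keeping the k nearest; ask the hook whether to stop."""
    best: List[Tuple[float, int]] = []  # max-heap on distance via negation
    visited = 0
    for idx, point in enumerate(points):
        dist = abs(point - query)
        visited += 1
        if len(best) < k:
            heapq.heappush(best, (-dist, idx))
        elif dist < -best[0][0]:
            heapq.heapreplace(best, (-dist, idx))
        # The search loop stays the same; only the stop decision is delegated.
        if len(best) == k and should_stop(visited, -best[0][0]):
            break
    results = sorted((-neg_dist, idx) for neg_dist, idx in best)
    return results, visited


def never_stop(visited: int, kth_distance: float) -> bool:
    """Fallback behaviour: run the full search."""
    return False


def stop_when_close(visited: int, kth_distance: float) -> bool:
    """Toy policy: stop once the k-th best distance is at most 1.0."""
    return kth_distance <= 1.0
```

Running both hooks on the same query and checking that the early-stopping run visits fewer candidates while returning the same ids is the kind of direct assertion the review notes are missing from the Python tests.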

Add OMEGA index components that wrap HNSW with adaptive search capability:
- OmegaSearcher: Wraps HnswSearcher with OMEGA model integration and automatic fallback
- OmegaBuilder: Wraps HnswBuilder for index construction
- OmegaStreamer: Wraps HnswStreamer for streaming operations
- Factory registration for all components
- CMakeLists.txt integration with omega library dependency

OMEGA mode activates when vector count >= threshold and model is loaded, otherwise falls back to standard HNSW transparently.

- Add OMEGA index type to zvec type system
- Implement OmegaIndexParams class for index configuration
- Add Python bindings for OmegaIndexParam
- Integrate OMEGA searcher with HNSW fallback mechanism
- Add comprehensive Python unit tests for OMEGA functionality
- Update schema validation to support OMEGA index type

Tests verify that OMEGA index correctly falls back to HNSW behavior when OMEGA-specific features are not enabled, ensuring full compatibility.

…y recall
- Implement OmegaIndex with ITrainingCapable interface for training support
- Create OmegaStreamer with training mode for feature collection during search
- Add OmegaSearcher adaptive search with OMEGA early stopping prediction
- Implement training data export and collection APIs
- Add OmegaQueryParams and OmegaContext for per-query target_recall specification
- Create omega_params.h and omega_context.h for parameter management
- Update engine_helper to convert and extract OMEGA query parameters
- Integrate training mode with Collection API (enable/disable/export methods)
- Add training data collector, query generator, and model trainer components
- Add Python training API with OmegaTrainer class
- Add debug logging for OMEGA index creation and merge operations
- Adjust HnswSearcher member access modifiers for OMEGA inheritance
- Remove test_omega_fallback.py (replaced by test_collection.py tests)

- Fix memory explosion in training data collection by clearing records after copy
- Add omega_model directory creation before training to fix CSV write failure
- Remove all debug fprintf/fflush statements and empty code blocks

… params
- Parallelize ground truth computation and training searches with std::thread
- Add training_query_id support for thread-safe parallel training
- Add num_training_queries param to OmegaIndexParams (default: 1000)
- Use ef_construction as training search ef instead of hardcoded 1000
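The idea of parallel ground-truth computation keyed by a per-query id can be sketched as below. The commit uses `std::thread` in C++; this illustration swaps in Python's `ThreadPoolExecutor` and an exact brute-force scan, and the function names are hypothetical. Each worker returns its result tagged with its query id, so no shared mutable state is written during the parallel phase.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence

Vector = Sequence[float]


def brute_force_top_k(query: Vector, vectors: Sequence[Vector], k: int) -> List[int]:
    """Exact k nearest neighbours by squared Euclidean distance (ties broken by id)."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()
    return [idx for _, idx in scored[:k]]


def collect_ground_truth(
    queries: Sequence[Vector], vectors: Sequence[Vector], k: int, num_threads: int = 4
) -> Dict[int, List[int]]:
    """Compute ground truth for every query in parallel, keyed by query id."""

    def work(query_id: int):
        # The query id travels with the result, so ordering across threads is irrelevant.
        return query_id, brute_force_top_k(queries[query_id], vectors, k)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(work, range(len(queries))))
```

Keying results by query id plays the role that `training_query_id` is described as playing in the commit: it lets concurrent searches be matched back to their query without relying on execution order.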

Build System Changes:
- Add ZVEC_ENABLE_OMEGA option for conditional OMEGA compilation (default: OFF)
- Add -DZVEC_ENABLE_OMEGA definition when enabled
- Update thirdparty/CMakeLists.txt to conditionally build omega library
- Update src/core/CMakeLists.txt to conditionally compile omega sources
- Update omega submodule to version with LightGBM C API support

Training System Refactor:
- Replace Python subprocess training with native LightGBM C API
  * Remove CSV export and Python _omega_training.py invocation
  * Add direct omega::OmegaTrainer integration via C++ API
  * Remove ExportToCSV, ExportGtCmpsToCSV, InvokePythonTrainer methods
- Add configurable training parameters to OmegaModelTrainerOptions:
  * num_iterations (default: 100)
  * num_leaves (default: 31)
  * learning_rate (default: 0.1)
  * num_threads (default: 8)
- Add type conversion helpers (ConvertRecord, ConvertGtCmpsData)
- Improve training performance

Training Data Collection Improvements:
- Move training record storage from OmegaStreamer to OmegaContext
  * Remove shared collected_records_ vector and training_mutex_ from OmegaStreamer
  * Store records per-query in OmegaContext via add_training_record()
  * Eliminate lock contention during parallel training searches
- Remove legacy GetTrainingRecords/ClearTrainingRecords from OmegaStreamer
- Simplify OmegaIndex training interface (return empty vectors)
- Update omega_streamer.cc to use context-based record collection

Code Cleanup:
- Wrap all OMEGA-dependent code with #ifdef ZVEC_ENABLE_OMEGA guards
- Update OmegaModelTrainerOptions documentation
- Add detailed logging for training record collection
- Improve error handling for missing OmegaContext
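The four training options and their defaults listed above can be mirrored in a small options object that renders a LightGBM-style `key=value` parameter string, which is the form the LightGBM C API accepts. This is a sketch only: `TrainerOptions` and `to_lightgbm_params` are hypothetical names, the `objective` value is an assumption that the commit does not state, and the real `OmegaModelTrainerOptions` is a C++ type whose layout is not shown in this PR page.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainerOptions:
    """Python mirror of the defaults listed for OmegaModelTrainerOptions."""
    num_iterations: int = 100
    num_leaves: int = 31
    learning_rate: float = 0.1
    num_threads: int = 8


def to_lightgbm_params(opts: TrainerOptions, objective: str = "binary") -> str:
    """Render options as a space-separated key=value string.

    The objective is an assumed placeholder, not taken from the PR.
    """
    params = {"objective": objective, **asdict(opts)}
    return " ".join(f"{key}={value}" for key, value in params.items())
```

Overriding a single field, for example `TrainerOptions(num_threads=2)`, keeps the remaining defaults, which is convenient for CI machines with few cores.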

Major optimization:
- Move training data collection before Flush() to use in-memory graph
- Eliminate ~2 minute disk reload delay for 1M vectors
- Fix GT computation to use correct indexers (was using empty flushed ones)

Training improvements:
- Add ef_groundtruth parameter for faster GT computation using HNSW
- Support parallel training searches with per-query ground truth
- Add window_size parameter for early stopping control
- Expose all OMEGA params through Python API (OmegaIndexParam, OmegaQueryParam)

Code quality:
- Add TIMING logs for performance debugging
- Refactor TrainingDataCollector to use passed indexers instead of segment's
- Clean up training flow in merge_vector_indexer()

…query-side search path

OMEGA integration updates:
- wire the updated omega training and search behavior into zvec index build, load and query execution paths
- expose and propagate OMEGA training/query parameters through the Python API, index params and engine helper conversions
- update omega builder, searcher, streamer and context handling to match the reference behavior more closely

Training and validation updates:
- update training data collection and model training integration for the reference-aligned OMEGA workflow

Performance and debugging updates:
- add an OMEGA prediction microbenchmark for query-side inference analysis
- improve storage/index plumbing needed by the OMEGA workflow
- add query-side diagnostics to investigate early-stop calibration and repeated prediction overhead

driPyf and others added 30 commits January 29, 2026 02:27

Add OMEGALib as git submodule in thirdparty/omega

ebd71b0

Integrate OMEGALib repository as a submodule to provide OMEGA adaptive search functionality. The submodule includes GBDT inference, feature extraction, model management, and search context components.

Change omega submodule URL from SSH to HTTPS

d0e6be3

Merge branch 'alibaba:main' into main

ccbb36b

chore: update OMEGA submodule to latest commit

65ee973

chore: update OMEGA submodule to latest commit

4a19996

feat(omega): add OmegaQueryParam to Python API

4f982d0

- Expose target_recall parameter for OMEGA adaptive early stopping
- Update OMEGA tests with 100k docs and recall validation
- Remove deprecated _omega_training.py
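A recall check of the kind this commit mentions could look like the sketch below, which compares retrieved ids with ground-truth ids and tests the mean against a `target_recall`. The helper names are hypothetical and the functions operate on plain id lists; how the actual tests in `test_collection.py` obtain results and ground truth is not shown in this PR page.

```python
from typing import List, Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k ids that appear in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(ground_truth[:k]))
    return hits / k


def mean_recall(
    all_retrieved: Sequence[Sequence[int]],
    all_ground_truth: Sequence[Sequence[int]],
    k: int,
) -> float:
    """Average recall@k over a batch of queries."""
    if len(all_retrieved) != len(all_ground_truth) or not all_retrieved:
        raise ValueError("need equally sized, non-empty result and ground-truth lists")
    scores: List[float] = [
        recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_ground_truth)
    ]
    return sum(scores) / len(scores)


def meets_target(measured_recall: float, target_recall: float) -> bool:
    """True when measured recall reaches the requested target."""
    return measured_recall >= target_recall
```

A test would typically allow a small tolerance below `target_recall`, since a learned stopping policy targets recall on average rather than per query.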

perf(omega): batch OMEGA distance evaluation, clean temporary trainin…

f770de3

…g hooks, and add query-side profiling

feat(omega): add retrain-only model refresh with cached held-out queries

347bcf4

chore(thirdparty): nest OMEGALib under thirdparty/omega

bb8c5a2

chore(thirdparty): add omega wrapper cmake layer

55875cd

feat(scripts): align benchmark presets and export profiling summaries

5a195f0

chore(scripts): update OMEGA training defaults for benchmark runs

53e0b55

fix(scripts): invoke VectorDBBench CLI instead of Streamlit entrypoint

420b4c4

fix(scripts): avoid piping main benchmark output during profiling runs

27fda47

chore(scripts): rename online profiling fields and persist summary be…

543d16d

…side index

fix(scripts): persist per-index profiling summaries and stabilize cap…

9d750ff

…ture

fix(scripts): enable zvec info logs during profiling passes

308d8ba

fix(omega): align search-core timing and reduce traversal overhead

e548017

perf(omega): add compile-time control timing profiling and align benc…

5300d35

…hmark summaries

refactor(omega): move control timing profiling to runtime flags

4871eae

refactor(omega): align omega search path with hnsw core loop

5df2d5a

feat(omega): add hooks-only perf switch and ab script

4078546

driPyf added 4 commits April 2, 2026 15:33

Update OMEGALib submodule

07980c7

Fix recursive submodule init in CI

90ffba5

Update OMEGALib for Windows UTF-8 fix

0b7309c

Update OMEGALib for Android network build fix

0c6f8c4

driPyf requested a review from iaojnh as a code owner April 2, 2026 08:15

driPyf added 25 commits April 2, 2026 16:21

Fix omega unit test link dependencies

43b3afd

Update OMEGALib for LightGBM stub linkage

be84830

Fix HNSW core utility linkage on macOS

6607743

Tighten core target link dependencies

0c728dd

Update OMEGALib for stub compile flags

77f56b8

Update OMEGALib for USE_SOCKET stub fix

527dc70

Update OMEGALib for Linkers stub emission

013795d

Revert OMEGALib LightGBM stub workaround

9258982

Fix MSVC UTF-8 and core interface link deps

fceebba

Fix omega build and optimize regressions

5f36f3c

Format omega integration fixes

7dfd196

Fix example linking and rename steady timer

63bc93b

Fix macOS examples and C++17 training init

c045298

Keep omega binding registration linked

02246e3

Gate omega build paths and tests by platform/config

346d8b6

Fix Android examples OpenMP and pytest 9 fixture skips

aec406c

chore: sync pytest omega test formatting

c173e8f

fix: link omega examples on windows

2464112

fix: stabilize omega example linking

04dedbd

fix: simplify example library paths

5fafa0e

ci: add windows examples workflow

c61bf58

ci: simplify windows examples triggers

1bcae21

ci: remove windows examples workflow

0c0d767

Add OMEGA integration test and examples

bd4697c

Fix lint formatting for OMEGA changes

88a1aa7

driPyf wants to merge 128 commits into alibaba:main from driPyf:feat/omega-integration

7 participants
