Skip to content

Adapt [vector_async_index] add collection.flush_index()#209

Merged
hnwyllmm merged 9 commits into
oceanbase:developfrom
xiebaoma:develop
May 12, 2026
Merged

Adapt [vector_async_index] add collection.flush_index()#209
hnwyllmm merged 9 commits into
oceanbase:developfrom
xiebaoma:develop

Conversation

@xiebaoma

@xiebaoma xiebaoma commented May 8, 2026

Copy link
Copy Markdown
Contributor

Summary

to adapt vector_async_index "call dbms_index_manager.refresh()" , add collection.refresh_index()

Solution Description

Summary by CodeRabbit

  • New Features

    • Manual collection index refresh exposed when supported by the connected database.
  • Behavior

    • Hybrid search results may include embeddings by default; embedding entries can be None and dimensions validated only when present.
  • Tests

    • Integration tests updated to call refresh when available and to relax expectations around returned result fields.

Review Change Stack

@CLAassistant

CLAassistant commented May 8, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@coderabbitai

coderabbitai Bot commented May 8, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds BaseClient feature detection for a refresh capability, hides the Collection refresh attribute when unsupported, implements Collection.refresh_index() to call the DB index manager refresh procedure, and updates integration tests to call refresh before queries.

Changes

Index Refresh Operation

Layer / File(s) Summary
Client capability detection
src/pyseekdb/client/client_base.py
Adds BaseClient._refresh_enabled() that detects seekdb type/version and returns True for versions >= 1.3.0.0.
Collection attribute gating
src/pyseekdb/client/collection.py
Overrides Collection.__getattribute__ and __dir__ and adds _is_refresh_available() to hide or raise for refresh_index when unavailable.
Collection refresh implementation
src/pyseekdb/client/collection.py
Adds Collection.refresh_index() which calls self._client._execute("CALL dbms_index_manager.refresh();") and raises AttributeError if refresh is unavailable.
Hybrid search test adjustments
tests/integration_tests/test_collection_hybrid_search_source_inference.py
Relaxes result-shape assertions to subset checks, includes embeddings by default, and allows None embeddings while validating dimensions only when present.
Integration tests invoke refresh
tests/integration_tests/test_collection_query.py, tests/integration_tests/test_collection_v1_compatibility.py, tests/integration_tests/test_offical_case.py, tests/integration_tests/test_sparse_vector_index.py
Insert conditional collection.refresh_index() calls before queries to ensure index refresh is applied in tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I twitch my whiskers at the break of day,
I tap the key, the index wakes to play,
Vectors hum and line up row by row,
Fresh paths of search begin to softly grow,
A tiny hop — the queries brightly flow.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The PR title mentions 'flush_index()' but the actual implementation adds 'refresh_index()' instead; the changes primarily focus on conditionally hiding/exposing refresh_index based on backend capability. Update the PR title to accurately reflect the main change: 'Add collection.refresh_index() with capability detection' or similar wording that mentions refresh_index, not flush_index.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/collection.py`:
- Around line 596-604: Collection.flush currently calls a non-existent
procedure; update the implementation so it invokes OceanBase's
DBMS_VECTOR.REFRESH_INDEX with the required parameters instead of "CALL
dbms_index_manager.refresh();": build the CALL statement using the collection's
index name and table name (and optional vector column if available) and execute
it via self._client._execute (e.g., "CALL DBMS_VECTOR.REFRESH_INDEX(idx_name,
table_name, idx_vector_col, refresh_threshold, refresh_type);"), or if the
intent was to defer parameters, raise/validate and document required args in
Collection.flush and accept them as parameters to the method before calling
self._client._execute.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03433250-9858-4b86-874e-547d26a2cc43

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1d921 and c561754.

📒 Files selected for processing (1)
  • src/pyseekdb/client/collection.py

Comment thread src/pyseekdb/client/collection.py Outdated
Comment on lines +596 to +604
def flush(self) -> None:
"""
Flush async vector index build tasks.

This executes ``CALL dbms_index_manager.refresh();`` and returns only
after the database completes the refresh procedure.
"""
self._client._execute("CALL dbms_index_manager.refresh();")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

OceanBase dbms_index_manager.refresh procedure signature and scope: does it refresh all async vector indexes globally, or can it be scoped to a specific table/index? Please provide official documentation links and examples.

💡 Result:

No evidence found for a dbms_index_manager.refresh procedure in OceanBase documentation across official OceanBase Database (en.oceanbase.com) and seekdb (oceanbase.ai) sites [1][2][3][4][5][6]. Searches for the exact package and procedure returned no matching results [1-30]. OceanBase (seekdb) uses DBMS_VECTOR.REFRESH_INDEX for incremental refresh of async vector indexes. It is scoped to a specific index, table, and vector column, not global [1]. Procedure signature [1]: PROCEDURE refresh_index( IN idx_name VARCHAR(65535), IN table_name VARCHAR(65535), IN idx_vector_col VARCHAR(65535) DEFAULT NULL, IN refresh_threshold INT DEFAULT 10000, IN refresh_type VARCHAR(65535) DEFAULT NULL); Example [1]: CALL DBMS_VECTOR.REFRESH_INDEX('idx1', 'vector_index_test', 'c2', 1, 'FAST'); Official docs: https://www.oceanbase.ai/docs/zh-CN/refresh-index [1], https://www.oceanbase.ai/docs/vector-index-monitor-and-maintain [2]. dbms_index_manager may be a misremembered or non-existent package; standard index management uses DBMS_STATS or ALTER INDEX [7]. Vector indexes are async by design with automatic background refresh, manual via DBMS_VECTOR [2].

Citations:


🏁 Script executed:

# Search for any usage of dbms_index_manager in the codebase
rg "dbms_index_manager" --type py --type sql

# Search for any DBMS_VECTOR usages that might indicate the correct pattern
rg "DBMS_VECTOR" --type py --type sql

# Check for test files related to flush or index refresh
fd "test.*collection" --type f | head -20

# Check the client._execute calls to understand the pattern
rg "_execute\(" src/pyseekdb/client/collection.py -A 2 -B 2

Repository: oceanbase/pyseekdb

Length of output: 1030


🏁 Script executed:

# Search for any tests of flush method
rg "\.flush\(" tests/ --type py -B 3 -A 3

# Check git log for recent changes to this file
cd src/pyseekdb/client && git log --oneline collection.py | head -20

# Check if there are any comments or docstrings that mention refresh or flush
rg "refresh|flush" src/pyseekdb/client/collection.py --context 5

Repository: oceanbase/pyseekdb

Length of output: 185


🏁 Script executed:

# Search for flush tests anywhere in the repo
rg "flush" tests/ --type py -B 2 -A 2

# Get commit details
git show c561754 --stat

# Also check if there's any OceanBase-specific documentation in the repo
find . -type f -name "*.md" -o -name "*.rst" | xargs rg "refresh|flush|dbms_index_manager" 2>/dev/null | head -20

Repository: oceanbase/pyseekdb

Length of output: 8781


🏁 Script executed:

# Search more broadly in tests for flush
rg "flush" . --type py

# Check the collection.py file around the flush method for context
sed -n '590,620p' src/pyseekdb/client/collection.py

# Also check if there's any comment or documentation about refresh behavior
rg "refresh" . --type py --type md -B 2 -A 2 | head -50

Repository: oceanbase/pyseekdb

Length of output: 1683


Collection.flush() calls a non-existent database procedure and will fail at runtime

The procedure dbms_index_manager.refresh() does not exist in OceanBase. The correct procedure for refreshing async vector indexes is DBMS_VECTOR.REFRESH_INDEX(), which requires specific parameters: index name, table name, and optional vector column name.

Per OceanBase documentation (https://www.oceanbase.ai/docs/zh-CN/refresh-index), the signature is:

CALL DBMS_VECTOR.REFRESH_INDEX(idx_name, table_name, idx_vector_col DEFAULT NULL, refresh_threshold INT DEFAULT 10000, refresh_type VARCHAR DEFAULT NULL);

Update the implementation to use the correct procedure with appropriate parameters, or clarify the intent if this is meant to call a custom stored procedure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/collection.py` around lines 596 - 604, Collection.flush
currently calls a non-existent procedure; update the implementation so it
invokes OceanBase's DBMS_VECTOR.REFRESH_INDEX with the required parameters
instead of "CALL dbms_index_manager.refresh();": build the CALL statement using
the collection's index name and table name (and optional vector column if
available) and execute it via self._client._execute (e.g., "CALL
DBMS_VECTOR.REFRESH_INDEX(idx_name, table_name, idx_vector_col,
refresh_threshold, refresh_type);"), or if the intent was to defer parameters,
raise/validate and document required args in Collection.flush and accept them as
parameters to the method before calling self._client._execute.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/pyseekdb/client/collection.py (1)

620-634: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

refresh_index() is bound to a likely invalid refresh procedure/signature

Line 633 executes CALL dbms_index_manager.refresh(); with no index/table context. This is a high-risk runtime failure if the backend expects vector-index-scoped refresh parameters.

What is the official manual refresh procedure/signature for async vector indexes in OceanBase/SeekDB (e.g., package name, required arguments, and examples)? Is `CALL dbms_index_manager.refresh();` valid?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/collection.py` around lines 620 - 634, The call in
refresh_index() uses a bare "CALL dbms_index_manager.refresh();" which may be
invalid if the backend requires an index or table context; update refresh_index
(and related helper _is_refresh_available) to construct and execute the correct
procedure signature per the backend docs: either call the refresh with the
index/table identifier (e.g., "CALL
dbms_index_manager.refresh('<index_name>');") or iterate over known async
indexes and call refresh for each, and validate the response; if the exact
signature is unknown, replace the direct _client._execute call with a helper
that reads the documented procedure name/parameters from configuration or raises
a clear NotImplementedError indicating the required index parameter so callers
supply it. Ensure you reference refresh_index, _is_refresh_available, and
_client._execute when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/collection.py`:
- Around line 118-132: The attribute gating in Collection.__getattribute__ and
Collection.__dir__ checks for "refresh" but the real public API is
"refresh_index", causing wrong introspection and AttributeError messages; update
both checks to look for "refresh_index" (and the error message raised in
__getattribute__ to mention 'refresh_index') and keep using the existing
_is_refresh_available() helper to decide removal/raising so introspection,
exception text, and runtime behavior are consistent with the actual method name.

In `@tests/integration_tests/test_collection_hybrid_search_source_inference.py`:
- Around line 123-125: The assertion assumes embeddings_only["embeddings"][0] is
iterable but a backend may return row-level None; update the checks in
test_collection_hybrid_search_source_inference (the assertions that iterate
embeddings_only["embeddings"][0] around the current lines) to first guard
against None per row (e.g., iterate over embeddings_only["embeddings"] and for
each row check if the row is None then accept it, otherwise assert it is a list
of length dimension), and apply the same guard/approach to the other similar
block around lines 131-134 so TypeError is avoided when a row is None.

In `@tests/integration_tests/test_collection_query.py`:
- Line 113: The unconditional call to collection.refresh_index() should be
guarded because refresh support varies by DB type/version; modify the test to
check the collection or client capability before invoking refresh_index (e.g.,
use a boolean capability check like collection.supports_refresh() or
client.supports_feature("refresh") or a runtime gate such as hasattr(collection,
"refresh_index") and a version check) and only call collection.refresh_index()
when the capability is present so the parameterized test won't fail for
unsupported modes.

---

Duplicate comments:
In `@src/pyseekdb/client/collection.py`:
- Around line 620-634: The call in refresh_index() uses a bare "CALL
dbms_index_manager.refresh();" which may be invalid if the backend requires an
index or table context; update refresh_index (and related helper
_is_refresh_available) to construct and execute the correct procedure signature
per the backend docs: either call the refresh with the index/table identifier
(e.g., "CALL dbms_index_manager.refresh('<index_name>');") or iterate over known
async indexes and call refresh for each, and validate the response; if the exact
signature is unknown, replace the direct _client._execute call with a helper
that reads the documented procedure name/parameters from configuration or raises
a clear NotImplementedError indicating the required index parameter so callers
supply it. Ensure you reference refresh_index, _is_refresh_available, and
_client._execute when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 054e3cc6-e23b-4b19-8dcb-ba188f892197

📥 Commits

Reviewing files that changed from the base of the PR and between e9f62c9 and 89ee609.

📒 Files selected for processing (7)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/collection.py
  • tests/integration_tests/test_collection_hybrid_search_source_inference.py
  • tests/integration_tests/test_collection_query.py
  • tests/integration_tests/test_collection_v1_compatibility.py
  • tests/integration_tests/test_offical_case.py
  • tests/integration_tests/test_sparse_vector_index.py

Comment thread src/pyseekdb/client/collection.py Outdated
Comment thread tests/integration_tests/test_collection_hybrid_search_source_inference.py Outdated
Comment thread tests/integration_tests/test_collection_query.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/collection.py`:
- Around line 133-140: The capability check in _is_refresh_available currently
returns True when the _refresh_enabled attribute is missing, exposing
refresh_index on clients that don't support it; change the behavior so that if
getattr(self._client, "_refresh_enabled", None) yields None the method returns
False (i.e., unknown => unavailable), and preserve the existing callable
handling (call and bool cast inside try/except) to return False on exceptions;
update _is_refresh_available to only return True when a callable
_refresh_enabled exists and evaluates truthy.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7cdbbcea-b978-4a44-9ce5-860802d622bf

📥 Commits

Reviewing files that changed from the base of the PR and between 89ee609 and 45279ca.

📒 Files selected for processing (2)
  • src/pyseekdb/client/collection.py
  • tests/integration_tests/test_collection_hybrid_search_source_inference.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration_tests/test_collection_hybrid_search_source_inference.py

Comment thread src/pyseekdb/client/collection.py Outdated
Comment thread src/pyseekdb/client/collection.py Outdated
def __repr__(self) -> str:
return f"Collection(name='{self._name}', dimension={self._dimension}, client={self._client.mode})"

def __getattribute__(self, name: str):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

拐的弯太大了吧。实在不行你把refresh_index函数代理给client来实现好了,比如:

class Collection:
    def refresh_index(self):
        self._client.refresh_index()

class Client:
    def _refresh_index_enabled(self) -> bool:
        return version >= Version("1.3.0")
    def refresh_index():
        if not _refresh_index_enabled():
            return
        try:
             self._execute('call dbms_index_manager.refresh()')
        except:
             # do something.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/pyseekdb/client/collection.py (1)

621-634: ⚠️ Potential issue | 🔴 Critical

refresh_index() still calls a non-existent database procedure.

This issue was previously flagged: Line 634 calls dbms_index_manager.refresh(), which does not exist in OceanBase. According to the earlier review with web search verification, the correct procedure is DBMS_VECTOR.REFRESH_INDEX(idx_name, table_name, idx_vector_col, refresh_threshold, refresh_type).

The method will fail at runtime when invoked on databases that support refresh.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/collection.py` around lines 621 - 634, The refresh_index
implementation currently invokes a non-existent procedure; update refresh_index
(and keep the availability guard _is_refresh_available) to call the correct
OceanBase procedure via self._client._execute using the DBMS_VECTOR signature:
"CALL DBMS_VECTOR.REFRESH_INDEX(idx_name, table_name, idx_vector_col,
refresh_threshold, refresh_type)". Use the collection's actual
index/table/column identifiers (fetch from this collection's metadata or
attributes) to populate idx_name, table_name and idx_vector_col, and choose
sensible defaults or parameters for refresh_threshold and refresh_type (or
expose them as method args) before executing the CALL.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/pyseekdb/client/collection.py`:
- Around line 621-634: The refresh_index implementation currently invokes a
non-existent procedure; update refresh_index (and keep the availability guard
_is_refresh_available) to call the correct OceanBase procedure via
self._client._execute using the DBMS_VECTOR signature: "CALL
DBMS_VECTOR.REFRESH_INDEX(idx_name, table_name, idx_vector_col,
refresh_threshold, refresh_type)". Use the collection's actual
index/table/column identifiers (fetch from this collection's metadata or
attributes) to populate idx_name, table_name and idx_vector_col, and choose
sensible defaults or parameters for refresh_threshold and refresh_type (or
expose them as method args) before executing the CALL.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2984fea4-3fc2-4b4d-8779-48851c6801b4

📥 Commits

Reviewing files that changed from the base of the PR and between 45279ca and 49fad21.

📒 Files selected for processing (5)
  • src/pyseekdb/client/collection.py
  • tests/integration_tests/test_collection_query.py
  • tests/integration_tests/test_collection_v1_compatibility.py
  • tests/integration_tests/test_offical_case.py
  • tests/integration_tests/test_sparse_vector_index.py
✅ Files skipped from review due to trivial changes (1)
  • tests/integration_tests/test_sparse_vector_index.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration_tests/test_collection_v1_compatibility.py
  • tests/integration_tests/test_offical_case.py

@hnwyllmm hnwyllmm changed the title Adapt [vector_async_index] add collection.flush() Adapt [vector_async_index] add collection.flush_index() May 12, 2026
@hnwyllmm hnwyllmm merged commit 50cdd99 into oceanbase:develop May 12, 2026
9 checks passed
JingYuChe pushed a commit to JingYuChe/pyseekdb that referenced this pull request Jun 22, 2026
## Summary
<!-- 
Please clearly and concisely describe the purpose of this pull request.
If this pull request resolves an issue, please link it via "close #xxx"
or "fix #xxx".
-->
to adapt vector_async_index "call dbms_index_manager.refresh()" , add
collection.refresh_index()


## Solution Description
<!-- Please clearly and concisely describe your solution. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Manual collection index refresh exposed when supported by the
connected database.

* **Behavior**
* Hybrid search results may include embeddings by default; embedding
entries can be None and dimensions validated only when present.

* **Tests**
* Integration tests updated to call refresh when available and to relax
expectations around returned result fields.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/oceanbase/pyseekdb/pull/209)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants