Skip to content

Add namespace-scoped collections with IVF vector indexing for agent database#240

Merged
hnwyllmm merged 95 commits into
oceanbase:developfrom
JingYuChe:develop
Jun 30, 2026
Merged

Add namespace-scoped collections with IVF vector indexing for agent database#240
hnwyllmm merged 95 commits into
oceanbase:developfrom
JingYuChe:develop

Conversation

@JingYuChe

@JingYuChe JingYuChe commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Solution Description

Summary by CodeRabbit

  • New Features
    • Added namespace-enabled collections with namespace-scoped CRUD, querying, and prewarm, including stricter use_namespace routing and validation.
    • Introduced IVF dense vector indexing via IVFConfiguration (IVFIndexType/IVFIndexLib) and added partition_count + has_vector_index support.
    • Expanded schema/index configuration to accept IVF in addition to HNSW.
  • Bug Fixes
    • Improved remote connection robustness and added more user-friendly namespace error messages.
  • Documentation
    • Updated query() include to support distances and extended where_document docs with $regex (and updated logical operator guidance).
  • Tests
    • Significantly expanded namespace/IVF/hybrid-search and lifecycle concurrency coverage, including prewarm and optional-index scenarios.

JingYuChe and others added 30 commits April 28, 2026 21:11
namespace调用oceanbase的内核DBMS_LOGIC_TABLE包,内核自行管理元数据表
…must

OceanBase rejects a bool with only must_not when it is nested under must.
Combine $not_contains with metadata filters via outer filter + must_not instead.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/unit_tests/test_document_query_builder.py (2)

50-56: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Also assert that $not_contains survives wrapping.

document_expr_as_knn_filter() is supposed to preserve the original must_not clause and add the permissive exists filter. This test only checks the added filter, so an implementation that drops must_not would still pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit_tests/test_document_query_builder.py` around lines 50 - 56, The
test for document KNN wrapping only verifies the added exists filter and can
miss a regression where the original $not_contains / must_not clause is dropped.
Update test_pure_must_not_adds_exists_filter in document_expr_as_knn_filter /
build_document_hybrid_expression to also assert that the wrapped bool query
still contains the original must_not from $not_contains while keeping the
permissive exists filter.

29-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Strengthen the nested $and/$or assertion.

This test only proves that some bool.must wrapper exists. It would still pass if the nested $or branch or the "c" term were dropped during translation, which is the main behavior this case is meant to protect. Please assert the concrete nested structure here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit_tests/test_document_query_builder.py` around lines 29 - 40, The
nested `$and`/`$or` test in `test_and_or_nested` is too weak because it only
checks for a generic `bool.must` wrapper and could miss dropped branches during
translation. Update the assertions to verify the concrete structure returned by
`build_document_hybrid_expression`, including the nested `$or` branch and the
`"c"` term, so the test specifically protects the full nested boolean
translation.
src/pyseekdb/client/client_base.py (1)

149-154: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Parenthesize the mixed and/or for clarity. The final clause "1061" in message and "duplicate" in message binds tighter than the surrounding or, which is the intended grouping but is easy to misread.

♻️ Suggested tweak
     if (
         "already exists" in message
         or "duplicate key name" in message
         or "code=1061" in message
-        or "1061" in message and "duplicate" in message
+        or ("1061" in message and "duplicate" in message)
     ):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/client_base.py` around lines 149 - 154, The conditional
in client_base.py mixes or and and in a way that is correct but hard to read;
update the message-checking logic in the relevant exception handler to
explicitly parenthesize the final "1061" in message and "duplicate" in message
clause alongside the other or conditions. Keep the behavior the same, but make
the grouping obvious by clarifying the boolean expression used in that
duplicate-detection check.

Source: Linters/SAST tools

tests/integration_tests/namespace_hybrid_search_helpers.py (1)

137-139: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Use strict=True in zip for the ground-truth distance. Since l2_squared computes the expected-KNN baseline, a query/corpus dimension mismatch would silently truncate and yield a wrong expected ordering, masking a real regression instead of failing loudly.

♻️ Suggested tweak
-    return sum((x - y) ** 2 for x, y in zip(a, b))
+    return sum((x - y) ** 2 for x, y in zip(a, b, strict=True))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/namespace_hybrid_search_helpers.py` around lines 137
- 139, The l2_squared helper used for the expected KNN baseline should fail on
vector length mismatches instead of silently truncating. Update the zip call
inside l2_squared in namespace_hybrid_search_helpers.py to use strict=True so
any query/corpus dimension mismatch raises immediately, and keep the behavior
localized to this ground-truth distance helper.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/document_query_builder.py`:
- Around line 254-275: The unused merge_into_knn_filter helper should be removed
because knn filter merging is already handled elsewhere and client_base does not
use it. Delete the merge_into_knn_filter function from document_query_builder
and remove its import from client_base, then verify _build_knn_expression and
_inject_filter_into_knn continue to cover all merge behavior.
- Around line 14-22: The query-string escaping in _query_string_contains is
using SQL-style escaping instead of search-syntax escaping, so
Lucene/Elasticsearch reserved characters can be misread. Update
_query_string_contains to use a query_string-aware escaper for the query
payload, and apply the same escaping approach anywhere the document query
builder combines terms with $and/$or so those joins preserve search syntax
correctly.

In `@src/pyseekdb/client/kernel_errors.py`:
- Line 64: The boolean expression in the kernel error check mixes or and and
without parentheses, which Ruff flags in the kernel_errors logic. Update the
condition in the relevant text-matching branch to group the and subexpression
explicitly, keeping the existing behavior while making the intent clear and
lint-safe. Use the surrounding kernel_errors.py check that references "being
dropped" and "namespace" to locate the expression.

In `@tests/integration_tests/test_namespace_query_where_document.py`:
- Around line 98-119: The document filter tests can pass vacuously when the
query returns no rows, so add a non-empty result assertion before checking for
violations. Update test_where_document_not_contains, test_where_document_and,
test_where_document_or, test_where_document_regex, and the two nested tests to
assert that docs is not empty (like the baseline test) after ns.query, then keep
the existing exclusion checks so regressions that filter everything out are
caught.

---

Nitpick comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 149-154: The conditional in client_base.py mixes or and and in a
way that is correct but hard to read; update the message-checking logic in the
relevant exception handler to explicitly parenthesize the final "1061" in
message and "duplicate" in message clause alongside the other or conditions.
Keep the behavior the same, but make the grouping obvious by clarifying the
boolean expression used in that duplicate-detection check.

In `@tests/integration_tests/namespace_hybrid_search_helpers.py`:
- Around line 137-139: The l2_squared helper used for the expected KNN baseline
should fail on vector length mismatches instead of silently truncating. Update
the zip call inside l2_squared in namespace_hybrid_search_helpers.py to use
strict=True so any query/corpus dimension mismatch raises immediately, and keep
the behavior localized to this ground-truth distance helper.

In `@tests/unit_tests/test_document_query_builder.py`:
- Around line 50-56: The test for document KNN wrapping only verifies the added
exists filter and can miss a regression where the original $not_contains /
must_not clause is dropped. Update test_pure_must_not_adds_exists_filter in
document_expr_as_knn_filter / build_document_hybrid_expression to also assert
that the wrapped bool query still contains the original must_not from
$not_contains while keeping the permissive exists filter.
- Around line 29-40: The nested `$and`/`$or` test in `test_and_or_nested` is too
weak because it only checks for a generic `bool.must` wrapper and could miss
dropped branches during translation. Update the assertions to verify the
concrete structure returned by `build_document_hybrid_expression`, including the
nested `$or` branch and the `"c"` term, so the test specifically protects the
full nested boolean translation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d4fbc95-a5a6-43f4-aae7-2525641835cf

📥 Commits

Reviewing files that changed from the base of the PR and between c688423 and 908186a.

📒 Files selected for processing (50)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/client_seekdb_server.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/document_query_builder.py
  • src/pyseekdb/client/filters.py
  • src/pyseekdb/client/kernel_errors.py
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/schema.py
  • src/pyseekdb/client/validators.py
  • tests/integration_tests/namespace_dml_helpers.py
  • tests/integration_tests/namespace_fts_helpers.py
  • tests/integration_tests/namespace_hybrid_search_helpers.py
  • tests/integration_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_get_or_create_collection_multiprocess.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/integration_tests/test_namespace_dml.py
  • tests/integration_tests/test_namespace_dml_multi_coll.py
  • tests/integration_tests/test_namespace_dml_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_dml_multi_ns.py
  • tests/integration_tests/test_namespace_drop_validation.py
  • tests/integration_tests/test_namespace_get_delete_filters.py
  • tests/integration_tests/test_namespace_hybrid_search_combined.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_search_index.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_vector.py
  • tests/integration_tests/test_namespace_lifecycle.py
  • tests/integration_tests/test_namespace_lifecycle_concurrency.py
  • tests/integration_tests/test_namespace_optional_indexes.py
  • tests/integration_tests/test_namespace_prewarm.py
  • tests/integration_tests/test_namespace_prewarm_lob.py
  • tests/integration_tests/test_namespace_query.py
  • tests/integration_tests/test_namespace_query_where_document.py
  • tests/integration_tests/test_namespace_reconnect.py
  • tests/integration_tests/test_namespace_session_vars.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/unit_tests/test_build_document_query.py
  • tests/unit_tests/test_build_knn_filter.py
  • tests/unit_tests/test_build_query_expression.py
  • tests/unit_tests/test_document_query_builder.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/unit_tests/test_get_or_create_namespace_concurrency.py
  • tests/unit_tests/test_namespace.py
  • tests/unit_tests/test_namespace_kernel_errors.py
  • tests/unit_tests/test_namespace_lifecycle_lock.py
  • tests/unit_tests/test_namespace_upsert_reconcile.py
🚧 Files skipped from review as they are similar to previous changes (36)
  • tests/unit_tests/test_build_knn_filter.py
  • tests/integration_tests/test_namespace_hybrid_search_search_index.py
  • tests/integration_tests/test_namespace_hybrid_search_combined.py
  • tests/integration_tests/test_namespace_lifecycle_concurrency.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/unit_tests/test_namespace_lifecycle_lock.py
  • src/pyseekdb/client/client_seekdb_server.py
  • tests/integration_tests/test_namespace_dml_multi_coll_multi_ns.py
  • tests/unit_tests/test_build_document_query.py
  • tests/unit_tests/test_get_or_create_namespace_concurrency.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/integration_tests/test_namespace_session_vars.py
  • tests/integration_tests/test_namespace_dml_multi_ns.py
  • tests/integration_tests/test_get_or_create_collection_concurrency.py
  • tests/unit_tests/test_build_query_expression.py
  • src/pyseekdb/client/schema.py
  • tests/integration_tests/test_namespace_hybrid_search_vector.py
  • tests/integration_tests/test_namespace_dml_multi_coll.py
  • tests/integration_tests/test_namespace_reconnect.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/integration_tests/test_namespace_query.py
  • tests/integration_tests/test_namespace_prewarm.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/namespace.py
  • tests/integration_tests/test_namespace_get_delete_filters.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext.py
  • tests/integration_tests/test_namespace_prewarm_lob.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_namespace_hybrid_search_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch_multi_coll_multi_ns.py
  • tests/integration_tests/namespace_dml_helpers.py
  • tests/integration_tests/test_namespace_drop_validation.py
  • tests/unit_tests/test_namespace.py
  • tests/integration_tests/namespace_fts_helpers.py

Comment thread src/pyseekdb/client/document_query_builder.py
Comment thread src/pyseekdb/client/document_query_builder.py Outdated
Comment thread src/pyseekdb/client/kernel_errors.py Outdated
Comment thread tests/integration_tests/test_namespace_query_where_document.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pyseekdb/client/document_query_builder.py (1)

149-159: 🎯 Functional Correctness | 🔴 Critical

The fast path for $and/$or combining $contains clauses alters boolean semantics for multi-word terms.

The optimization at lines 149-159 (and 173-183) joins all $contains values with spaces and forces a default_operator.

  • Non-optimized path: Generates separate query_string clauses for each condition. In Lucene/Elasticsearch, a raw string like "foo bar" without an explicit operator defaults to foo OR bar. Thus {"$and": [...]} results in (foo OR bar) AND baz.
  • Optimized path: Joins to "foo bar baz" and sets default_operator: "and". Lucene parses this as foo AND bar AND baz.

This changes the semantics for any multi-word term. The optimization should only apply if every $contains value is a single atomic term (no spaces or operators), or it should be removed to fall back to the correct bool composition.

Example discrepancy

Input: {"$and": [{"$contains": "foo bar"}, {"$contains": "baz"}]}

  • Expected (current fallback): (foo OR bar) AND baz
  • Current Fast Path: foo AND bar AND baz
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/document_query_builder.py` around lines 149 - 159, The
`$and`/$`or` fast-path in `document_query_builder` changes semantics for
multi-word `$contains` values by collapsing them into one `query_string` and
forcing `default_operator`; update the optimization in the `$and` and `$or`
branches to only trigger for single-token atomic terms, or remove it and use the
existing bool composition fallback. Keep the logic aligned with the surrounding
query-building path in `document_query_builder` so multi-word phrases still
behave the same as the non-optimized `query_string` clauses.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/document_query_builder.py`:
- Around line 14-17: The backslash handling in _escape_query_string_term is
duplicated because _QUERY_STRING_RESERVED_RE already matches backslashes, so
remove the manual text.replace("\\", "\\\\") and rely on the regex substitution
alone. Update the logic in _escape_query_string_term so every reserved
character, including backslash, is escaped exactly once and consistently with
the other Lucene query_string characters.

---

Outside diff comments:
In `@src/pyseekdb/client/document_query_builder.py`:
- Around line 149-159: The `$and`/$`or` fast-path in `document_query_builder`
changes semantics for multi-word `$contains` values by collapsing them into one
`query_string` and forcing `default_operator`; update the optimization in the
`$and` and `$or` branches to only trigger for single-token atomic terms, or
remove it and use the existing bool composition fallback. Keep the logic aligned
with the surrounding query-building path in `document_query_builder` so
multi-word phrases still behave the same as the non-optimized `query_string`
clauses.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78a0309a-8d7d-49fa-8794-0debfa57241b

📥 Commits

Reviewing files that changed from the base of the PR and between 908186a and 1ac0899.

📒 Files selected for processing (10)
  • docs/guide/dql.md
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/document_query_builder.py
  • src/pyseekdb/client/kernel_errors.py
  • tests/integration_tests/namespace_hybrid_search_helpers.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/integration_tests/test_namespace_query_where_document.py
  • tests/unit_tests/test_document_query_builder.py
  • tests/unit_tests/test_namespace.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/pyseekdb/client/kernel_errors.py
  • tests/integration_tests/test_namespace_query_where_document.py
  • tests/unit_tests/test_document_query_builder.py
  • src/pyseekdb/client/configuration.py
  • tests/unit_tests/test_namespace.py

Comment thread src/pyseekdb/client/document_query_builder.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unit_tests/test_namespace_no_index_embedding_dim.py (1)

11-47: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add coverage for None embeddings.

_validate_namespace_explicit_embedding_dimensions() now has a dedicated if vec is None: continue branch, but these tests only cover full-vector match/mismatch paths. A small regression test here would lock in the intended mixed-payload behavior for both indexed and no-index namespaces.

Suggested test
 class TestNoIndexExplicitEmbeddingDimension:
+    def test_ignores_none_embeddings(self):
+        _validate_namespace_no_index_explicit_embeddings(
+            [None, [0.0] * 384],
+            expected_dimension=384,
+        )
+
     def test_accepts_matching_dimension(self):
         _validate_namespace_no_index_explicit_embeddings(
             [[0.0] * 384, [1.0] * 384],
             expected_dimension=384,
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit_tests/test_namespace_no_index_embedding_dim.py` around lines 11 -
47, Add regression coverage for None embeddings in the existing namespace
dimension tests. Extend TestNoIndexExplicitEmbeddingDimension and
TestIvfExplicitEmbeddingDimension to call
_validate_namespace_no_index_explicit_embeddings and
_validate_namespace_explicit_embedding_dimensions with a payload that mixes None
and valid vectors, and assert that None entries are skipped while the remaining
embeddings still validate correctly. Use the existing helper names and
expected_dimension/has_vector_index parameters so the new test locks in the vec
is None branch behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit_tests/test_namespace_no_index_embedding_dim.py`:
- Around line 11-47: Add regression coverage for None embeddings in the existing
namespace dimension tests. Extend TestNoIndexExplicitEmbeddingDimension and
TestIvfExplicitEmbeddingDimension to call
_validate_namespace_no_index_explicit_embeddings and
_validate_namespace_explicit_embedding_dimensions with a payload that mixes None
and valid vectors, and assert that None entries are skipped while the remaining
embeddings still validate correctly. Use the existing helper names and
expected_dimension/has_vector_index parameters so the new test locks in the vec
is None branch behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 941c6c59-88ef-4cbc-b37c-81fdf3d6b2e8

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac0899 and b39abbd.

📒 Files selected for processing (9)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/validators.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/integration_tests/test_namespace_optional_indexes.py
  • tests/unit_tests/test_namespace.py
  • tests/unit_tests/test_namespace_no_index_embedding_dim.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/integration_tests/test_namespace_optional_indexes.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/namespace.py
  • tests/unit_tests/test_namespace.py

@hnwyllmm hnwyllmm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One API ergonomics issue: create_collection(name, use_namespace=True) currently appears to fail on the default path. When schema is omitted, create_collection() first builds the legacy default HNSW schema via _prepare_schema_parameters(), and then _create_namespace_collection() rejects it with ValueError("use_namespace=True only supports IVF index type, HNSW is not allowed").

That error is technically accurate, but confusing for users because they did not explicitly choose HNSW; the SDK chose it as the default. For a namespace-enabled collection, the default path should either construct the namespace-compatible default schema, for example IVF plus the default fulltext config, or fail earlier with an actionable message such as: use_namespace=True requires an IVF schema. The default non-namespace schema is HNSW. Please pass Schema(vector_index=VectorIndexConfig(ivf=IVFConfiguration(...))) or remove use_namespace=True.

I would prefer making the default namespace path work if this is meant to match the Agent Database design shorthand, but at minimum the current error should explain why the default path became HNSW and how users should fix it.

@hnwyllmm hnwyllmm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another compatibility concern: namespace record IDs are currently validated with the same identifier-like pattern used for names ([A-Za-z0-9_]). That is stricter than the existing collection path, where _id supports arbitrary string formats and is safely escaped/cast before being written.

For record IDs, real applications may reasonably use UUIDs, file paths, URLs, external document IDs, or other opaque business keys, for example repo/src/foo.py, https://example.com/doc/123, or 550e8400-e29b-41d4-a716-446655440000. These would work in a normal collection today but fail after moving to a namespace collection.

Could we relax namespace record-id validation to match the existing collection semantics, e.g. validate only type/non-empty/length and rely on parameterization/escaping plus JSON storage for safety? The stricter [A-Za-z0-9_] rule seems appropriate for resource names like collection/namespace/ltable names, but too restrictive for per-record business IDs.

@hnwyllmm hnwyllmm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more API-safety issue: SparseVectorIndexConfig appears to be silently ignored when creating a namespace-enabled collection.

The design doc says sparse vector support is out of scope for this phase, which is fine. But the public Schema still allows sparse_vector_index=..., and the namespace creation path does not reject it, persist it, or create a sparse index. That means a user can pass a sparse config, receive a successfully created collection, and only later discover that sparse retrieval is not available.

Could we explicitly reject schema.sparse_vector_index is not None when use_namespace=True, with a clear error such as use_namespace=True does not support SparseVectorIndexConfig yet? Silent feature drop is risky here, especially because the same schema works for normal collections.

@hnwyllmm hnwyllmm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage is broad, but I think we still need a few public API compatibility tests for the namespace feature. Many current tests exercise validators, DDL fragments, include/n_results handling, or internal helpers such as _create_namespace_collection(). Those are useful, but they do not fully protect the real user-facing behavior.

Could we add tests around the public APIs, for example:

  • client.create_collection(name, use_namespace=True) with no explicit schema: either succeeds with the intended namespace default schema, or fails with a clear actionable error.
  • client.get_or_create_collection(name, use_namespace=True) when a normal collection with the same name already exists: should not silently return a non-namespace collection; it should raise a type/mode mismatch or otherwise make the behavior explicit.
  • client.create_collection(..., use_namespace=True, schema=...) with unsupported configs such as HNSW or SparseVectorIndexConfig: should fail explicitly rather than silently ignoring unsupported pieces.
  • Namespace record IDs should preserve ordinary collection compatibility, including IDs like UUIDs, paths, and URLs if we relax the validation.

These tests should go through Client/BaseClient.create_collection() and get_or_create_collection() rather than only calling internal methods, because the current regressions are mostly about public API composition and defaults.

@hnwyllmm hnwyllmm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration coverage is impressive across lifecycle, DML, query, hybrid search, prewarm, drop, concurrency, and session variables. One gap I still see is compatibility and misuse coverage at the integration level.

Most namespace integration helpers appear to create collections with an explicit IVF schema, which is the happy path. Could we also add a small set of integration tests for edge/user-migration behavior, such as:

  • Creating a namespace collection through the default public API path, or verifying the error is clear if default schema creation is intentionally unsupported.
  • Attempting to reuse a normal collection name with get_or_create_collection(..., use_namespace=True) and asserting an explicit mode/type mismatch instead of returning the wrong collection type.
  • Passing unsupported namespace schema options, such as HNSW or sparse vector index, and asserting they fail explicitly.
  • Verifying namespace record IDs accept the same practical business-key shapes as normal collections, such as UUIDs, paths, and URLs, if validation is relaxed.
  • If multi-LTable is part of the intended design scope, covering more than one logical table under the same namespace; if not, documenting/test-asserting the current single-default-LTable limitation.

These would complement the current happy-path namespace tests and protect backward compatibility for users migrating from normal collections.

@hnwyllmm hnwyllmm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add user-facing documentation and example code for the namespace feature before merging?

This PR introduces a fairly large new API surface and some important constraints, so users will need a dedicated guide rather than only tests. I suggest adding a namespace guide that covers:

  • Creating a namespace-enabled collection, including the required/recommended schema.
  • Creating/getting/listing/deleting namespaces.
  • Writing data with add/update/upsert/delete through the namespace API.
  • Querying and hybrid_search, including where, where_document, include, and distances.
  • prewarm behavior and deployment limitations.
  • Deleting namespaces/collections and the async cleanup semantics if applicable.
  • Compatibility and limitations: OceanBase/version requirement, unsupported schema options such as HNSW/sparse vector if they remain unsupported, record ID compatibility, and the current LTable behavior/scope.

A small runnable example under examples or docs would also help a lot, ideally showing the full flow from create_collection(..., use_namespace=True) to namespace CRUD/query and cleanup.

Comment thread .github/workflows/ci.yml
@JingYuChe JingYuChe requested a review from hnwyllmm June 30, 2026 03:18
@hnwyllmm hnwyllmm merged commit 44d74c3 into oceanbase:develop Jun 30, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants