feat: add Oracle embedders and hybrid search support#3426

Open
fileames wants to merge 11 commits into
deepset-ai:mainfrom
fileames:oracle-upstream-compat-features

Conversation

@fileames


Proposed Changes:

Adds missing Oracle integration capabilities from the downstream implementation into the upstream Haystack Oracle integration:

  • Adds OracleTextEmbedder and OracleDocumentEmbedder using OracleConnectionConfig.
  • Adds OracleHybridRetriever for DBMS hybrid vector search.
  • Adds hybrid vector index support through OracleVectorizerPreference, create_hybrid_vector_index, and
    create_hybrid_vector_index_async.
  • Extends vector index configuration with opt-in IVF support while keeping HNSW as the default.
  • Adds direct async paths where Oracle async APIs are available, with fallback behavior preserved.
  • Adds explicit close() and close_async() lifecycle APIs for sync/async pools.
  • Extends filter support with contains, not contains, and hybrid filter conversion.
  • Improves DBMS_SEARCH keyword index cleanup by using DBMS_SEARCH.DROP_INDEX.

How did you test it?

  • Unit tests
  • Live Oracle feature integration tests

Notes for the reviewer

Checklist

fileames requested a review from a team as a code owner June 10, 2026 18:27
fileames requested review from julian-risch and removed request for a team June 10, 2026 18:27
@CLAassistant

CLAassistant commented Jun 10, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ davidsbatista
❌ fileames
You have signed the CLA already but the status is still pending? Let us recheck it.

github-actions bot added the type:documentation label Jun 10, 2026
fileames force-pushed the oracle-upstream-compat-features branch from 6425d45 to 8e107ba June 10, 2026 18:33
@julian-risch

Member

@fileames Thank you for opening this PR! We need you to agree to the CLA please, before we can merge this PR. #3426 (comment)

@fede-kamel fede-kamel left a comment

Contributor

Thanks for upstreaming this — the feature set is really useful and the overall structure is solid (validated identifiers, bound parameters, opt-in IVF, backwards-compatible to_dict). I tested the branch against a live Oracle Autonomous Database 26ai (23.26.2.2.0) with an in-database ONNX embedding model, since the CI container can't exercise most of the new functionality (see below). Results:

  • fmt-check and the configured test:types pass; 50 mocked unit tests pass
  • Full integration suite against the live 26ai DB: 115 passed, 2 failed, 3 skipped — both failures are reproducible bugs (inline comments)
  • Two extra live tests for hybrid retrieval with filters (not covered by the PR's tests) found a blocking bug (inline comment in filters.py)

Why CI is red (and what it means): the CI container is gvenzl/oracle-free:23-slim, and the slim flavor removes Oracle Text entirely, which DBMS_SEARCH is built on. The CI logs show every fixture teardown failing with PLS-00201: identifier 'DBMS_SEARCH.DROP_INDEX' must be declared. This was invisible before this PR because _ensure_keyword_index() swallows all DatabaseErrors on create (so the keyword index has silently never existed in CI), while the new _drop_keyword_index() only forgives ORA-942/ORA-1418/DRG-10502 — the "package does not exist" case (ORA-06550/PLS-00201) falls through and raises. Suggestion: either treat ORA-06550/PLS-00201 as benign in _is_missing_object_error, or (better) switch CI to the non-slim gvenzl/oracle-free:23 image, which would give the keyword and hybrid paths real CI coverage for the first time.
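The first suggestion (treating the "package does not exist" case as benign during cleanup) could look roughly like the sketch below. It is not the PR's `_is_missing_object_error`; the function name, the code sets, and the message-parsing approach are assumptions for illustration. Because ORA-06550 is only a generic PL/SQL compilation wrapper, the sketch keys on PLS-00201 rather than on ORA-06550 alone.

```python
from __future__ import annotations

import re

# Codes the review says the drop path already forgives (zero-padded as they
# appear in Oracle error text).
MISSING_INDEX_CODES = {"ORA-00942", "ORA-01418", "DRG-10502"}
# Raised when DBMS_SEARCH itself is not installed, e.g. on the slim image.
MISSING_PACKAGE_CODE = "PLS-00201"

_CODE_RE = re.compile(r"\b(?:ORA|PLS|DRG)-\d{5}\b")


def is_missing_object_error(message: str) -> bool:
    """Return True if the error text contains a missing-index or missing-package code."""
    codes = set(_CODE_RE.findall(message))
    return bool(codes & MISSING_INDEX_CODES) or MISSING_PACKAGE_CODE in codes
```

An ORA-06531 raised from the maintenance trigger would still propagate under this check, which matches the intent of only forgiving "nothing to drop" situations.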

Type-checking gap (not visible in this diff): tool.hatch.envs.test.scripts.types in pyproject.toml only checks document_stores.oracle and components.retrievers.oracle, so the new embedders package is never type-checked. Running mypy -p haystack_integrations.components.embedders.oracle yields 4 errors: the oracledb.defaults.fetch_lobs assignment, and two Liskov violations because OracleDocumentEmbedder.run(documents: list[Document]) overrides OracleTextEmbedder.run(text: str). Please add the package to the types script and consider making the two embedders independent components (per Haystack convention) instead of subclassing.
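To illustrate the "independent components" suggestion, here is a minimal sketch in which the two embedders share a helper function instead of an inheritance relationship, so neither `run` signature overrides the other. `Doc`, `embed_texts`, and the class names are stand-ins invented for this example; they are not the Haystack or Oracle integration APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Stand-in for the database embedding call; returns a toy vector per text."""
    return [[float(len(t))] for t in texts]


@dataclass
class Doc:
    """Minimal stand-in for a Haystack Document."""

    content: str
    embedding: list[float] | None = None


class TextEmbedder:
    def run(self, text: str) -> dict[str, list[float]]:
        return {"embedding": embed_texts([text])[0]}


class DocumentEmbedder:  # deliberately not a subclass of TextEmbedder
    def run(self, documents: list[Doc]) -> dict[str, list[Doc]]:
        vectors = embed_texts([d.content for d in documents])
        # replace() returns copies, so the caller's documents are not mutated.
        embedded = [replace(d, embedding=v) for d, v in zip(documents, vectors)]
        return {"documents": embedded}
```

With no base class in common, mypy has no overridden method to compare against, and the copy-on-write step also avoids mutating the input documents in place.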

Pre-existing latent bug surfaced by live testing (not introduced here, but related): on a database where DBMS_SEARCH actually works, test_write_documents_duplicate_overwrite fails with ORA-06531: Reference to uninitialized collection raised from the DBMS_SEARCH-generated maintenance trigger (DR$...$TRIG) during the overwrite. So DuplicatePolicy.OVERWRITE is currently broken on any database with a functioning keyword index — worth a separate issue, and it will surface in CI as soon as the image is switched to non-slim.

Smaller points: pydoc/config_docusaurus.yml wasn't updated with the new modules (see oracle.md comment); the VECDB_* env fallbacks in tests/conftest.py look downstream-specific, and conftest doesn't read ORACLE_WALLET_LOCATION/ORACLE_WALLET_PASSWORD while test_oracle_features_integration.py does — adding them would let the non-mocked tests run against wallet-based databases (I had to patch this locally to test against ADB).
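A rough sketch of the wallet point: a test helper that turns the two environment variables into connection keyword arguments. The helper name is hypothetical, and how these values would be passed into the fixtures in `tests/conftest.py` is an assumption; `config_dir`, `wallet_location`, and `wallet_password` are the python-oracledb connection parameter names.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def wallet_connect_kwargs(env: Mapping[str, str] | None = None) -> dict[str, str]:
    """Build optional wallet arguments from ORACLE_WALLET_* variables."""
    env = os.environ if env is None else env
    kwargs: dict[str, str] = {}
    location = env.get("ORACLE_WALLET_LOCATION")
    if location:
        # Assumes tnsnames.ora and the wallet files live in the same directory.
        kwargs["config_dir"] = location
        kwargs["wallet_location"] = location
        password = env.get("ORACLE_WALLET_PASSWORD")
        if password:
            kwargs["wallet_password"] = password
    return kwargs
```

When neither variable is set the helper returns an empty dict, so non-wallet databases would connect exactly as before.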

Comment thread integrations/oracle/src/haystack_integrations/document_stores/oracle/filters.py Outdated
Comment thread integrations/oracle/oracle.md Outdated
Comment thread integrations/oracle/tests/conftest.py Outdated
@socket-security

socket-security Bot commented Jun 15, 2026


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Updated | pypi/oracledb@2.5.1 ⏵ 4.0.1 | 91 | 100 | 100 | 100 | 70 -30 |

View full report

julian-risch requested review from davidsbatista and removed request for julian-risch June 16, 2026 07:03
@fede-kamel

Contributor

Review — validated against a live Oracle 26ai database

Thanks @fileames! I went through the embedders + hybrid-search work and ran it against a live Oracle 26ai database with Oracle Text / DBMS_SEARCH, DBMS_VECTOR_CHAIN, DBMS_HYBRID_VECTOR and an in-database ONNX embedding model. Unit tests, ruff and mypy are all clean. Two issues reproduce against a real DB but are invisible in CI, because the gvenzl/oracle-free image ships without Oracle Text — so the keyword index (and its trigger) is never created in CI.

🔴 1. Hybrid retrieval + any metadata filter → ORA-20000 / ORA-00904

to_hybrid_filter() turns a Haystack filter such as {"field": "meta.lang", "operator": "==", "value": "en"} into filter_by={"path": "meta.lang", ...}, but DBMS_HYBRID_VECTOR.SEARCH's filter_by.path resolves to a column on the base table, not a JSON path into the metadata column:

ORA-00904: "THEBASE"."META"."LANG": invalid identifier   -- path "meta.lang"
ORA-00904: "THEBASE"."LANG": invalid identifier          -- even with the meta. prefix stripped

So hybrid retrieval raises for any metadata filter. test_hybrid_retriever_live never passes filters, so this wasn't caught (the unit test only asserts the JSON shape against a mock).

Fix: apply Haystack metadata filters as a SQL predicate while fetching the ranked hits (post-filter); native filter_by over columns declared with FILTER BY at index-creation time is still available via params. This also removes a positional score-misalignment that occurred whenever a ranked hit was dropped.
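The score-misalignment part of this fix can be shown with a small sketch. It assumes the ranked hits arrive as `(rowid, score)` pairs and that the filtered fetch returns rows keyed by rowid; both shapes, and the `align_hits` name, are illustrative rather than taken from the PR.

```python
def align_hits(ranked, rows_by_rowid):
    """Pair each surviving row with its own score, skipping filtered-out hits."""
    aligned = []
    for rowid, score in ranked:
        row = rows_by_rowid.get(rowid)
        if row is None:  # dropped by the SQL metadata predicate
            continue
        aligned.append((row, score))
    return aligned


ranked = [("r1", 0.9), ("r2", 0.7), ("r3", 0.5)]
rows = {"r1": "doc-a", "r3": "doc-c"}  # r2 did not match the metadata filter

# Positional pairing shifts scores once a hit is missing: doc-c gets r2's score.
naive = list(zip(rows.values(), [score for _, score in ranked]))
aligned = align_hits(ranked, rows)
```

Here `naive` pairs `doc-c` with 0.7, while `aligned` keeps it at 0.5. Note that post-filtering can return fewer than `top_k` results, which is a trade-off of this approach.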

🔴 2. write_documents(policy=OVERWRITE) → ORA-06531

The OVERWRITE MERGE (combining WHEN MATCHED UPDATE with WHEN NOT MATCHED INSERT) fails inside the DBMS_SEARCH keyword-index trigger — even on an empty table where every row is an insert:

ORA-06531: Reference to uninitialized collection
ORA-06512: at "...DR$...$TRIG"

NONE (plain INSERT) and SKIP (MERGE ... WHEN NOT MATCHED) both work — only the combined update+insert MERGE trips the trigger.

Fix: delete-then-insert for OVERWRITE (rows de-duplicated by id, last wins).
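A minimal sketch of delete-then-insert with last-wins de-duplication follows. It uses an in-memory `sqlite3` database purely as a runnable stand-in for Oracle; the table layout and helper names are assumptions, and the real statements in the integration will differ.

```python
import sqlite3


def dedupe_last_wins(docs):
    """Keep one document per id; a later duplicate replaces an earlier one."""
    by_id = {}
    for doc in docs:
        by_id.pop(doc["id"], None)  # drop the earlier copy, if any
        by_id[doc["id"]] = doc
    return list(by_id.values())


def overwrite(conn, docs):
    rows = dedupe_last_wins(docs)
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany("DELETE FROM docs WHERE id = ?", [(d["id"],) for d in rows])
        conn.executemany(
            "INSERT INTO docs (id, content) VALUES (?, ?)",
            [(d["id"], d["content"]) for d in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO docs VALUES ('a', 'old')")
overwrite(conn, [
    {"id": "a", "content": "v1"},
    {"id": "b", "content": "x"},
    {"id": "a", "content": "v2"},
])
result = conn.execute("SELECT id, content FROM docs ORDER BY id").fetchall()
```

Without the de-duplication step, the repeated id `a` would violate the primary key on the second insert, which is why the batch is collapsed before any statement runs.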

Fix branch + verification

I pushed both fixes plus live integration tests that cover them:

fede-kamel:fix/oracle-hybrid-filter-and-overwrite
https://github.com/fede-kamel/haystack-core-integrations/tree/fix/oracle-hybrid-filter-and-overwrite

After the fix, against the live DB: the new test_hybrid_retriever_with_metadata_filter_live and test_write_documents_overwrite_policy_live pass, the existing hybrid test still passes, and the full base/features suite is green (unit + ruff + mypy also clean). To apply it onto this branch:

git fetch https://github.com/fede-kamel/haystack-core-integrations fix/oracle-hybrid-filter-and-overwrite
git cherry-pick 08cd15f1

Minor (non-blocking)

  • oracle.md (786 lines) is a generated pydoc artifact — no other integration commits one, and it's already drifting from the code (module path ...components.document_stores..., docstrings that don't match the classes). Suggest deleting it and instead adding the new modules — text_embedder, document_embedder, hybrid_retriever, keyword_retriever — to pydoc/config_docusaurus.yml, which today lists only embedding_retriever + document_store, so the new components are missing from the generated docs.
  • OracleDocumentEmbedder mutates document.embedding in place → Haystack emits a warning; consider dataclasses.replace.
  • An empty in / not in list produces invalid SQL ... IN ().
  • _has_async_pool() actually checks whether the library supports async pools (always true on oracledb 2.x), so the async methods always open a second pool alongside the sync one; __del__ only closes the sync pool.
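For the empty `in` / `not in` point, one possible guard is sketched below. The helper name and the choice of constant predicates are assumptions: an empty `in` is treated as matching nothing and an empty `not in` as matching everything, which the maintainers may or may not want. The column name is assumed to be validated already.

```python
from __future__ import annotations


def in_clause(column: str, values: list, negate: bool = False) -> tuple[str, dict]:
    """Build a bound-parameter IN / NOT IN predicate that stays valid for empty lists."""
    if not values:
        # "x IN ()" is invalid SQL, so fall back to a constant predicate.
        return ("1 = 1" if negate else "1 = 0"), {}
    names = [f"p{i}" for i in range(len(values))]
    placeholders = ", ".join(f":{name}" for name in names)
    operator = "NOT IN" if negate else "IN"
    return f"{column} {operator} ({placeholders})", dict(zip(names, values))
```

Raising a filter error for an empty list would be an equally reasonable design; the key point is that the empty case never reaches the SQL string builder.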

Separate / pre-existing (not introduced here): _keyword_retrieval with filters builds WHERE t. (...) via where.replace("WHERE", "WHERE t.").
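The string-replace problem is easy to reproduce in isolation. The `where` value below is a made-up example of a generated clause, and `qualified_predicate` is a hypothetical alternative that applies the table alias while the predicate is built instead of patching the finished string.

```python
# A made-up WHERE clause of the shape a filter converter might emit.
where = "WHERE (lang = :p0)"

# Prefixing via string replacement attaches the alias to the parenthesis,
# not to the column name.
broken = where.replace("WHERE", "WHERE t.")


def qualified_predicate(column: str, alias: str = "t") -> str:
    """Qualify the column at construction time (column assumed validated)."""
    return f"WHERE ({alias}.{column} = :p0)"


fixed = qualified_predicate("lang")
```

`broken` ends up as `WHERE t. (lang = :p0)`, matching the malformed SQL described above, while `fixed` is `WHERE (t.lang = :p0)`.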

@fede-kamel

Contributor

@fileames I packaged the two fixes from my review above as a small delta PR against your branch so it's easy to pull in (I don't have push access to oracle-upstream-compat-features directly):

👉 fileames#2 fix(oracle): hybrid metadata filters + OVERWRITE upsert

It branches directly off oracle-upstream-compat-features, so it contains only the delta on top of your work: the hybrid-filter post-filtering fix, the OVERWRITE delete-then-insert fix, the two live integration tests, and wallet support in conftest. All verified against a live Oracle 26ai DB (details in my comment above).

To get it into this PR — whichever is easiest for you:

  1. Merge it: merging fileames#2 into oracle-upstream-compat-features updates this PR (#3426) automatically, since they share the same head branch.
  2. Cherry-pick the commit:
    git checkout oracle-upstream-compat-features
    git fetch https://github.com/fede-kamel/haystack-core-integrations fix/oracle-hybrid-filter-and-overwrite
    git cherry-pick 08cd15f1
    git push
  3. Pull the branch directly:
    git checkout oracle-upstream-compat-features
    git pull https://github.com/fede-kamel/haystack-core-integrations fix/oracle-hybrid-filter-and-overwrite
    git push

Happy to adjust the approach (e.g. the post-filter top_k semantics) if you'd prefer something different.

@davidsbatista

Contributor

@fileames you need to sign the CLA; otherwise, even if the PR is approved, we are not able to merge it.

@fede-kamel

Contributor

@fileames you're right — I had branched off an earlier commit ("Fix lint issues") and missed your "Address feedback" changes. I re-checked the latest HEAD and validated both items I'd flagged against a live Oracle Autonomous DB with Oracle Text/DBMS_SEARCH and DBMS_HYBRID_VECTOR enabled (the CI gvenzl image lacks these, so neither path is exercised in CI). Both are fixed:

1. OVERWRITE → ORA-06531 in the keyword-index trigger — fixed.
Switching from the combined MERGE to the PL/SQL UPDATE-then-INSERT block avoids the DBMS_SEARCH trigger fault. Differential check on the same DB: restoring the old MERGE still raises ORA-06531: Reference to uninitialized collection / error during execution of trigger 'DR$...$TRIG', while your current code passes — verified for update-existing, insert-new, and repeated-id-within-a-batch.

2. Hybrid retrieval + metadata filters → ORA-00904 — fixed.
Correcting the filter_by path prefix from meta. to metadata. (the real JSON column) resolves it. Verified live with ==, >= (range), in, not in, and a nested AND — all return correctly filtered results. The per-row fetch + skip-on-missing-rowid also keeps scores aligned when a hit drops out.
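The prefix correction described here amounts to mapping the Haystack field prefix onto the real JSON column name. A minimal sketch, assuming only the `meta.` to `metadata.` rename matters and leaving other fields untouched; `to_hybrid_path` is a hypothetical helper, not the PR's `to_hybrid_filter`.

```python
def to_hybrid_path(field: str, json_column: str = "metadata") -> str:
    """Rewrite a Haystack 'meta.<key>' field to '<json_column>.<key>'."""
    prefix = "meta."
    if field.startswith(prefix):
        return f"{json_column}.{field[len(prefix):]}"
    return field  # non-metadata fields are passed through unchanged


path = to_hybrid_path("meta.lang")
```

Checking for the full `meta.` prefix (including the dot) keeps a field such as `metadata.lang` from being rewritten twice.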

Full integration suite against that DB: 119 passed, 3 skipped. The one failure (test_update_by_filter) was ORA-04036 PGA_AGGREGATE_LIMIT — a memory cap on the small test instance, unrelated to this PR.

So nothing further needed from your side on these two. My separate fix branch is redundant and I'm dropping it. Thanks for the quick turnaround!

fileames force-pushed the oracle-upstream-compat-features branch from 49fa76a to 01bd499 June 22, 2026 12:33
@fileames

Author

@davidsbatista Thanks for all the help. I will sign the CLA very soon. In the meantime, is there anything else blocking in this branch? How can I resolve the Core / Add label on docstrings edit / label (pull_request_target) error?

@davidsbatista

Contributor

> @davidsbatista Thanks for all the help. I will sign the CLA very soon, in the meantime, is there anything else blocking in this branch? How can I resolve Core / Add label on docstrings edit / label (pull_request_target) error?

fix is here: #3484

@davidsbatista

Contributor

it's fixed now

Labels

type:documentation Improvements or additions to documentation

5 participants