feat: add Oracle embedders and hybrid search support by fileames · Pull Request #3426 · deepset-ai/haystack-core-integrations

fileames · 2026-06-10T18:27:25Z

Proposed Changes:

Adds missing Oracle integration capabilities from the downstream implementation into the upstream Haystack Oracle integration:

Adds OracleTextEmbedder and OracleDocumentEmbedder using OracleConnectionConfig.
Adds OracleHybridRetriever for DBMS hybrid vector search.
Adds hybrid vector index support through OracleVectorizerPreference, create_hybrid_vector_index, and create_hybrid_vector_index_async.
Extends vector index configuration with opt-in IVF support while keeping HNSW as the default.
Adds direct async paths where Oracle async APIs are available, with fallback behavior preserved.
Adds explicit close() and close_async() lifecycle APIs for sync/async pools.
Extends filter support with contains, not contains, and hybrid filter conversion.
Improves DBMS_SEARCH keyword index cleanup by using DBMS_SEARCH.DROP_INDEX.

How did you test it?

Unit tests
Live Oracle feature integration tests

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

CLAassistant · 2026-06-10T18:27:40Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ davidsbatista
❌ fileames
You have signed the CLA already but the status is still pending? Let us recheck it.

julian-risch · 2026-06-11T05:39:56Z

@fileames Thank you for opening this PR! Before we can merge it, we need you to agree to the CLA, please: #3426 (comment)

fede-kamel

Thanks for upstreaming this — the feature set is really useful and the overall structure is solid (validated identifiers, bound parameters, opt-in IVF, backwards-compatible to_dict). I tested the branch against a live Oracle Autonomous Database 26ai (23.26.2.2.0) with an in-database ONNX embedding model, since the CI container can't exercise most of the new functionality (see below). Results:

fmt-check and the configured test:types pass; 50 mocked unit tests pass
Full integration suite against the live 26ai DB: 115 passed, 2 failed, 3 skipped — both failures are reproducible bugs (inline comments)
Two extra live tests for hybrid retrieval with filters (not covered by the PR's tests) found a blocking bug (inline comment in filters.py)

Why CI is red (and what it means): the CI container is gvenzl/oracle-free:23-slim, and the slim flavor removes Oracle Text entirely, which DBMS_SEARCH is built on. The CI logs show every fixture teardown failing with PLS-00201: identifier 'DBMS_SEARCH.DROP_INDEX' must be declared. This was invisible before this PR because _ensure_keyword_index() swallows all DatabaseErrors on create (so the keyword index has silently never existed in CI), while the new _drop_keyword_index() only forgives ORA-942/ORA-1418/DRG-10502 — the "package does not exist" case (ORA-06550/PLS-00201) falls through and raises. Suggestion: either treat ORA-06550/PLS-00201 as benign in _is_missing_object_error, or (better) switch CI to the non-slim gvenzl/oracle-free:23 image, which would give the keyword and hybrid paths real CI coverage for the first time.
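The first suggestion could look roughly like the sketch below. It is illustrative only: the helper name `is_benign_drop_error` and the message-matching approach are assumptions, and the real `_is_missing_object_error` may inspect the driver's error object in a different way.

```python
import re

# Codes the review says _drop_keyword_index() already forgives.
MISSING_OBJECT_CODES = {"ORA-00942", "ORA-01418", "DRG-10502"}

# Matches Oracle-style codes such as ORA-06550, PLS-00201, DRG-10502.
_CODE_RE = re.compile(r"\b(?:ORA|PLS|DRG)-\d{5}\b")


def is_benign_drop_error(message: str) -> bool:
    """Return True if a failed keyword-index drop can be ignored (hypothetical helper)."""
    codes = set(_CODE_RE.findall(message))
    if codes & MISSING_OBJECT_CODES:
        return True
    # Package missing entirely (e.g. Oracle Text not installed): the PL/SQL
    # compile error ORA-06550 carries PLS-00201 for DBMS_SEARCH.
    return {"ORA-06550", "PLS-00201"} <= codes and "DBMS_SEARCH" in message
```

Restricting the ORA-06550/PLS-00201 case to messages that mention DBMS_SEARCH keeps unrelated PL/SQL compile errors from being swallowed.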

Type-checking gap (not visible in this diff): tool.hatch.envs.test.scripts.types in pyproject.toml only checks document_stores.oracle and components.retrievers.oracle, so the new embedders package is never type-checked. Running mypy -p haystack_integrations.components.embedders.oracle yields 4 errors: the oracledb.defaults.fetch_lobs assignment, and two Liskov violations because OracleDocumentEmbedder.run(documents: list[Document]) overrides OracleTextEmbedder.run(text: str). Please add the package to the types script and consider making the two embedders independent components (per Haystack convention) instead of subclassing.
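To make the "independent components" suggestion concrete, here is a minimal, dependency-free sketch. The class names, the `Doc` stand-in for Haystack's `Document`, and the injected `embed_fn` callable are all assumptions for illustration, not the PR's actual classes. Because neither class inherits from the other, each `run` can have its own signature without an incompatible-override error from mypy.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

# Hypothetical embedding backend: takes texts, returns one vector per text.
EmbedFn = Callable[[list[str]], list[list[float]]]


@dataclass
class Doc:
    """Stand-in for haystack's Document, to keep the sketch self-contained."""
    content: str
    embedding: Optional[list[float]] = None


class TextEmbedder:
    def __init__(self, embed_fn: EmbedFn) -> None:
        self._embed_fn = embed_fn

    def run(self, text: str) -> dict[str, list[float]]:
        return {"embedding": self._embed_fn([text])[0]}


class DocumentEmbedder:
    # Deliberately not a subclass of TextEmbedder: run() takes documents.
    def __init__(self, embed_fn: EmbedFn) -> None:
        self._embed_fn = embed_fn

    def run(self, documents: list[Doc]) -> dict[str, list[Doc]]:
        vectors = self._embed_fn([d.content for d in documents])
        # Return copies instead of mutating the caller's documents.
        return {"documents": [replace(d, embedding=v) for d, v in zip(documents, vectors)]}
```

Shared logic (connection handling, the embedding call) can live in a helper both classes hold, rather than in a base class whose `run` signature one of them has to break.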

Pre-existing latent bug surfaced by live testing (not introduced here, but related): on a database where DBMS_SEARCH actually works, test_write_documents_duplicate_overwrite fails with ORA-06531: Reference to uninitialized collection raised from the DBMS_SEARCH-generated maintenance trigger (DR$...$TRIG) during the overwrite. So DuplicatePolicy.OVERWRITE is currently broken on any database with a functioning keyword index — worth a separate issue, and it will surface in CI as soon as the image is switched to non-slim.

Smaller points: pydoc/config_docusaurus.yml wasn't updated with the new modules (see oracle.md comment); the VECDB_* env fallbacks in tests/conftest.py look downstream-specific, and conftest doesn't read ORACLE_WALLET_LOCATION/ORACLE_WALLET_PASSWORD while test_oracle_features_integration.py does — adding them would let the non-mocked tests run against wallet-based databases (I had to patch this locally to test against ADB).

socket-security · 2026-06-15T13:15:28Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	pypi/oracledb@2.5.1 ⏵ 4.0.1					^-30

View full report

fede-kamel · 2026-06-18T21:37:23Z

Review — validated against a live Oracle 26ai database

Thanks @fileames! I went through the embedders + hybrid-search work and ran it against a live Oracle 26ai database with Oracle Text / DBMS_SEARCH, DBMS_VECTOR_CHAIN, DBMS_HYBRID_VECTOR and an in-database ONNX embedding model. Unit tests, ruff and mypy are all clean. Two issues reproduce against a real DB but are invisible in CI, because the gvenzl/oracle-free image ships without Oracle Text — so the keyword index (and its trigger) is never created in CI.

🔴 1. Hybrid retrieval + any metadata filter → `ORA-20000` / `ORA-00904`

to_hybrid_filter() turns a Haystack filter such as {"field": "meta.lang", "operator": "==", "value": "en"} into filter_by={"path": "meta.lang", ...}, but DBMS_HYBRID_VECTOR.SEARCH's filter_by.path resolves to a column on the base table, not a JSON path into the metadata column:

ORA-00904: "THEBASE"."META"."LANG": invalid identifier   -- path "meta.lang"
ORA-00904: "THEBASE"."LANG": invalid identifier          -- even with the meta. prefix stripped

So hybrid retrieval raises for any metadata filter. test_hybrid_retriever_live never passes filters, so this wasn't caught (the unit test only asserts the JSON shape against a mock).

Fix: apply Haystack metadata filters as a SQL predicate while fetching the ranked hits (post-filter); native filter_by over columns declared with FILTER BY at index-creation time is still available via params. This also removes a positional score-misalignment that occurred whenever a ranked hit was dropped.
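The score-misalignment part of this fix can be sketched as follows. This is an illustration under assumptions, not the code on the fix branch: `attach_scores`, the `(rowid, score)` tuple shape, and the row dictionaries are hypothetical. The point is that scores are looked up by rowid rather than zipped by position, so a hit removed by the metadata predicate cannot shift the scores of the hits after it.

```python
from typing import Any


def attach_scores(
    ranked_hits: list[tuple[str, float]],
    fetched_rows: dict[str, dict[str, Any]],
) -> list[dict[str, Any]]:
    """Pair each surviving row with its own score, keyed by rowid.

    ranked_hits: (rowid, score) pairs in rank order from the hybrid search.
    fetched_rows: rows that passed the metadata predicate, keyed by rowid.
    """
    results = []
    for rowid, score in ranked_hits:
        row = fetched_rows.get(rowid)
        if row is None:
            # Filtered out (or gone since ranking): skip it, do not shift scores.
            continue
        results.append({**row, "score": score})
    return results
```

One consequence of post-filtering is that fewer than top_k results may come back when the filter is selective, which is why the top_k semantics are worth agreeing on.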

🔴 2. `write_documents(policy=OVERWRITE)` → `ORA-06531`

The OVERWRITE MERGE (combining WHEN MATCHED UPDATE with WHEN NOT MATCHED INSERT) fails inside the DBMS_SEARCH keyword-index trigger — even on an empty table where every row is an insert:

ORA-06531: Reference to uninitialized collection
ORA-06512: at "...DR$...$TRIG"

NONE (plain INSERT) and SKIP (MERGE ... WHEN NOT MATCHED) both work — only the combined update+insert MERGE trips the trigger.

Fix: delete-then-insert for OVERWRITE (rows de-duplicated by id, last wins).
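A minimal sketch of the delete-then-insert idea, using the standard-library `sqlite3` module purely as a stand-in so the example runs anywhere. The `docs` table, its two columns, and both function names are assumptions; the actual fix targets Oracle and the integration's own table layout.

```python
import sqlite3


def dedupe_last_wins(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep one (id, content) row per id; a later duplicate replaces an earlier one."""
    by_id: dict[str, tuple[str, str]] = {}
    for row in rows:
        by_id[row[0]] = row
    return list(by_id.values())


def overwrite(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> int:
    """Delete any existing rows with the same ids, then insert, in one transaction."""
    unique = dedupe_last_wins(rows)
    with conn:  # commits on success, rolls back if either statement fails
        conn.executemany("DELETE FROM docs WHERE id = ?", [(r[0],) for r in unique])
        conn.executemany("INSERT INTO docs (id, content) VALUES (?, ?)", unique)
    return len(unique)
```

De-duplicating before the insert matters because a batch that repeats an id would otherwise hit the primary-key constraint on the second insert.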

Fix branch + verification

I pushed both fixes plus live integration tests that cover them:

fede-kamel:fix/oracle-hybrid-filter-and-overwrite
https://github.com/fede-kamel/haystack-core-integrations/tree/fix/oracle-hybrid-filter-and-overwrite

After the fix, against the live DB: the new test_hybrid_retriever_with_metadata_filter_live and test_write_documents_overwrite_policy_live pass, the existing hybrid test still passes, and the full base/features suite is green (unit + ruff + mypy also clean). To apply it onto this branch:

git fetch https://github.com/fede-kamel/haystack-core-integrations fix/oracle-hybrid-filter-and-overwrite
git cherry-pick 08cd15f1

Minor (non-blocking)

oracle.md (786 lines) is a generated pydoc artifact — no other integration commits one, and it's already drifting from the code (module path ...components.document_stores..., docstrings that don't match the classes). Suggest deleting it and instead adding the new modules — text_embedder, document_embedder, hybrid_retriever, keyword_retriever — to pydoc/config_docusaurus.yml, which today lists only embedding_retriever + document_store, so the new components are missing from the generated docs.
OracleDocumentEmbedder mutates document.embedding in place → Haystack emits a warning; consider dataclasses.replace.
An empty in / not in list produces invalid SQL ... IN ().
_has_async_pool() actually checks whether the library supports async pools (always true on oracledb 2.x), so the async methods always open a second pool alongside the sync one; __del__ only closes the sync pool.
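For the empty `in` / `not in` case, one possible guard is sketched below. The helper name `in_clause` and the `:p0`-style bind names are assumptions, and `column` is assumed to be an identifier that has already been validated elsewhere.

```python
from typing import Any


def in_clause(column: str, values: list[Any], negate: bool = False) -> tuple[str, dict[str, Any]]:
    """Build a bound-parameter IN / NOT IN predicate, handling the empty list."""
    if not values:
        # "x IN ()" can match nothing; "x NOT IN ()" excludes nothing.
        return ("1 = 1" if negate else "1 = 0"), {}
    binds = {f"p{i}": v for i, v in enumerate(values)}
    placeholders = ", ".join(f":{name}" for name in binds)
    op = "NOT IN" if negate else "IN"
    return f"{column} {op} ({placeholders})", binds
```

Raising a filter error for an empty list would be an equally reasonable choice; the constant predicates just keep the query valid.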

Separate / pre-existing (not introduced here): _keyword_retrieval with filters builds WHERE t. (...) via where.replace("WHERE", "WHERE t.").
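The effect of that string replacement is easy to reproduce in isolation. In the sketch below, `qualify_naive` mirrors the quoted pattern, while `build_where` is a hypothetical alternative that qualifies each column as the predicate is built; the column names and the equality-only conditions are assumptions for illustration.

```python
def qualify_naive(where: str) -> str:
    """The string-replace pattern quoted in the review."""
    return where.replace("WHERE", "WHERE t.")


def build_where(columns: list[str], alias: str = "t") -> str:
    """Qualify each column while building the predicate (hypothetical alternative)."""
    parts = [f"{alias}.{col} = :p{i}" for i, col in enumerate(columns)]
    return "WHERE (" + " AND ".join(parts) + ")"


# A parenthesised predicate ends up with a dangling alias:
print(qualify_naive("WHERE (lang = :p0)"))
# Qualifying per column avoids it:
print(build_where(["lang", "year"]))
```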

fede-kamel · 2026-06-18T21:43:08Z

@fileames I packaged the two fixes from my review above as a small delta PR against your branch so it's easy to pull in (I don't have push access to oracle-upstream-compat-features directly):

👉 fileames#2 — fix(oracle): hybrid metadata filters + OVERWRITE upsert

It branches directly off oracle-upstream-compat-features, so it contains only the delta on top of your work: the hybrid-filter post-filtering fix, the OVERWRITE delete-then-insert fix, the two live integration tests, and wallet support in conftest. All verified against a live Oracle 26ai DB (details in my comment above).

To get it into this PR — whichever is easiest for you:

Merge it: merging fileames#2 into oracle-upstream-compat-features updates this PR (#3426) automatically, since they share the same head branch.

Cherry-pick the commit:

git checkout oracle-upstream-compat-features
git fetch https://github.com/fede-kamel/haystack-core-integrations fix/oracle-hybrid-filter-and-overwrite
git cherry-pick 08cd15f1
git push

Pull the branch directly:

git checkout oracle-upstream-compat-features
git pull https://github.com/fede-kamel/haystack-core-integrations fix/oracle-hybrid-filter-and-overwrite
git push

Happy to adjust the approach (e.g. the post-filter top_k semantics) if you'd prefer something different.

davidsbatista · 2026-06-19T07:48:40Z

@fileames you need to sign the CLA; otherwise, even if the PR is approved, we are not able to merge it.

fede-kamel · 2026-06-19T16:24:30Z

@fileames you're right — I had branched off an earlier commit ("Fix lint issues") and missed your "Address feedback" changes. I re-checked the latest HEAD and validated both items I'd flagged against a live Oracle Autonomous DB with Oracle Text/DBMS_SEARCH and DBMS_HYBRID_VECTOR enabled (the CI gvenzl image lacks these, so neither path is exercised in CI). Both are fixed:

1. OVERWRITE → ORA-06531 in the keyword-index trigger — fixed.
Switching from the combined MERGE to the PL/SQL UPDATE-then-INSERT block avoids the DBMS_SEARCH trigger fault. Differential check on the same DB: restoring the old MERGE still raises ORA-06531: Reference to uninitialized collection / error during execution of trigger 'DR$...$TRIG', while your current code passes — verified for update-existing, insert-new, and repeated-id-within-a-batch.

2. Hybrid retrieval + metadata filters → ORA-00904 — fixed.
Correcting the filter_by path prefix from meta. to metadata. (the real JSON column) resolves it. Verified live with ==, >= (range), in, not in, and a nested AND — all return correctly filtered results. The per-row fetch + skip-on-missing-rowid also keeps scores aligned when a hit drops out.
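The prefix correction in item 2 amounts to a small field rewrite applied across the filter tree. The sketch below only rewrites field names on a Haystack-style filter dictionary and does not attempt to reproduce the `filter_by` JSON that `to_hybrid_filter()` emits; both function names are hypothetical.

```python
from typing import Any


def to_filter_path(field: str) -> str:
    """Map a Haystack filter field to a path rooted at the metadata JSON column."""
    if field.startswith("meta."):
        return "metadata." + field[len("meta."):]
    return field


def rewrite_fields(flt: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the filter with every field rewritten, recursing into groups."""
    if "conditions" in flt:
        # Logical group such as {"operator": "AND", "conditions": [...]}.
        return {**flt, "conditions": [rewrite_fields(c) for c in flt["conditions"]]}
    return {**flt, "field": to_filter_path(flt["field"])}
```

Returning a copy rather than editing in place keeps the caller's filter usable for the non-hybrid retrievers.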

Full integration suite against that DB: 119 passed, 3 skipped. The one failure (test_update_by_filter) was ORA-04036 PGA_AGGREGATE_LIMIT — a memory cap on the small test instance, unrelated to this PR.

So nothing further needed from your side on these two. My separate fix branch is redundant and I'm dropping it. Thanks for the quick turnaround!

fileames · 2026-06-22T12:34:23Z

@davidsbatista Thanks for all the help. I will sign the CLA very soon. In the meantime, is there anything else blocking this branch? How can I resolve the Core / Add label on docstrings edit / label (pull_request_target) error?

davidsbatista · 2026-06-22T13:29:46Z

> @davidsbatista Thanks for all the help. I will sign the CLA very soon. In the meantime, is there anything else blocking this branch? How can I resolve the Core / Add label on docstrings edit / label (pull_request_target) error?

fix is here: #3484

davidsbatista · 2026-06-22T13:56:54Z

it's fixed now

fileames requested a review from a team as a code owner June 10, 2026 18:27

fileames requested review from julian-risch and removed request for a team June 10, 2026 18:27

github-actions Bot added the type:documentation Improvements or additions to documentation label Jun 10, 2026

fileames added 2 commits June 10, 2026 18:33

Add more Oracle functionality

63ce6ad

Fix lint issues

8e107ba

fileames force-pushed the oracle-upstream-compat-features branch from 6425d45 to 8e107ba June 10, 2026 18:33

fede-kamel reviewed Jun 11, 2026

Address feedback

2a91c7e

fileames added 2 commits June 15, 2026 14:14

Increase vector memory

4923d9d

Increase vector memory

8e2373e

julian-risch requested review from davidsbatista and removed request for julian-risch June 16, 2026 07:03

fede-kamel mentioned this pull request Jun 18, 2026

fix(oracle): hybrid metadata filters + OVERWRITE upsert (delta for #3426) fileames/haystack-core-integrations#2

Closed

Merge branch 'main' into oracle-upstream-compat-features

7040abe

Merge branch 'main' into oracle-upstream-compat-features

526a4ab

davidsbatista and others added 2 commits June 22, 2026 10:59

Merge branch 'main' into oracle-upstream-compat-features

4264a9a

Skip failing CI checks

01bd499

fileames force-pushed the oracle-upstream-compat-features branch from 49fa76a to 01bd499 June 22, 2026 12:33

Merge branch 'main' into oracle-upstream-compat-features

783af18

davidsbatista mentioned this pull request Jun 22, 2026

adding allow-unsafe-pr-checkout: true to CI_docstring_labeler.yml #3484

Merged

Merge branch 'main' into oracle-upstream-compat-features

03d8173

5 participants
