feat: add Oracle embedders and hybrid search support#3426
Conversation
|
|
6425d45 to
8e107ba
Compare
|
@fileames Thank you for opening this PR! We need you to agree to the CLA please, before we can merge this PR. #3426 (comment) |
fede-kamel
left a comment
There was a problem hiding this comment.
Thanks for upstreaming this — the feature set is really useful and the overall structure is solid (validated identifiers, bound parameters, opt-in IVF, backwards-compatible to_dict). I tested the branch against a live Oracle Autonomous Database 26ai (23.26.2.2.0) with an in-database ONNX embedding model, since the CI container can't exercise most of the new functionality (see below). Results:
fmt-checkand the configuredtest:typespass; 50 mocked unit tests pass- Full integration suite against the live 26ai DB: 115 passed, 2 failed, 3 skipped — both failures are reproducible bugs (inline comments)
- Two extra live tests for hybrid retrieval with filters (not covered by the PR's tests) found a blocking bug (inline comment in
filters.py)
Why CI is red (and what it means): the CI container is gvenzl/oracle-free:23-slim, and the slim flavor removes Oracle Text entirely, which DBMS_SEARCH is built on. The CI logs show every fixture teardown failing with PLS-00201: identifier 'DBMS_SEARCH.DROP_INDEX' must be declared. This was invisible before this PR because _ensure_keyword_index() swallows all DatabaseErrors on create (so the keyword index has silently never existed in CI), while the new _drop_keyword_index() only forgives ORA-942/ORA-1418/DRG-10502 — the "package does not exist" case (ORA-06550/PLS-00201) falls through and raises. Suggestion: either treat ORA-06550/PLS-00201 as benign in _is_missing_object_error, or (better) switch CI to the non-slim gvenzl/oracle-free:23 image, which would give the keyword and hybrid paths real CI coverage for the first time.
Type-checking gap (not visible in this diff): tool.hatch.envs.test.scripts.types in pyproject.toml only checks document_stores.oracle and components.retrievers.oracle, so the new embedders package is never type-checked. Running mypy -p haystack_integrations.components.embedders.oracle yields 4 errors: the oracledb.defaults.fetch_lobs assignment, and two Liskov violations because OracleDocumentEmbedder.run(documents: list[Document]) overrides OracleTextEmbedder.run(text: str). Please add the package to the types script and consider making the two embedders independent components (per Haystack convention) instead of subclassing.
Pre-existing latent bug surfaced by live testing (not introduced here, but related): on a database where DBMS_SEARCH actually works, test_write_documents_duplicate_overwrite fails with ORA-06531: Reference to uninitialized collection raised from the DBMS_SEARCH-generated maintenance trigger (DR$...$TRIG) during the overwrite. So DuplicatePolicy.OVERWRITE is currently broken on any database with a functioning keyword index — worth a separate issue, and it will surface in CI as soon as the image is switched to non-slim.
Smaller points: pydoc/config_docusaurus.yml wasn't updated with the new modules (see oracle.md comment); the VECDB_* env fallbacks in tests/conftest.py look downstream-specific, and conftest doesn't read ORACLE_WALLET_LOCATION/ORACLE_WALLET_PASSWORD while test_oracle_features_integration.py does — adding them would let the non-mocked tests run against wallet-based databases (I had to patch this locally to test against ADB).
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
Review — validated against a live Oracle 26ai databaseThanks @fileames! I went through the embedders + hybrid-search work and ran it against a live Oracle 26ai database with Oracle Text / 🔴 1. Hybrid retrieval + any metadata filter →
|
|
@fileames I packaged the two fixes from my review above as a small delta PR against your branch so it's easy to pull in (I don't have push access to 👉 fileames#2 — It branches directly off To get it into this PR — whichever is easiest for you:
Happy to adjust the approach (e.g. the post-filter top_k semantics) if you'd prefer something different. |
|
@fileames you need to sign CLA otherwise even if it's approved we can are not able to merge the PR. |
|
@fileames you're right — I had branched off an earlier commit ("Fix lint issues") and missed your "Address feedback" changes. I re-checked the latest HEAD and validated both items I'd flagged against a live Oracle Autonomous DB with Oracle Text/ 1. 2. Hybrid retrieval + metadata filters → Full integration suite against that DB: 119 passed, 3 skipped. The one failure ( So nothing further needed from your side on these two. My separate fix branch is redundant and I'm dropping it. Thanks for the quick turnaround! |
49fa76a to
01bd499
Compare
|
@davidsbatista Thanks for all the help. I will sign the CLA very soon, in the meantime, is there anything else blocking in this branch? How can I resolve Core / Add label on docstrings edit / label (pull_request_target) error? |
fix is here: #3484 |
|
it's fixed now |
Proposed Changes:
Adds missing Oracle integration capabilities from the downstream implementation into the upstream Haystack Oracle integration:
create_hybrid_vector_index_async.
How did you test it?
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.