feat: native Elasticsearch vector search support #27111
joaopamaral wants to merge 44 commits into open-metadata:main
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Initial results look good, but I've run a test only with ES 9.x and version 1.12.4 (not the one from main). I also need to double-check if OpenSearch is affected by this change. Also need to review some AI-resolved conflicts from version 1.12.4 with main. |
|
Thanks @joaopamaral, this is great! Can you make it ready for review and also address the comments here: #27111 (comment)
|
Sure @harshach! I'll work on the bot review first before making it ready for review! 👍 |
Hi @harshach, I’ve addressed the bot review, but I still need to re-review the code after rebasing/merging with main and rerun the tests against a real server. So far, I’ve tested this PR with version 1.12.4 and ES 9.3.1. I still need to validate that everything continues to work correctly with OpenSearch and ES 8.x. I won’t be able to run tests for the next couple of days, but feel free to proceed with any testing on your side in the meantime.
Pull request overview
This PR adds native Elasticsearch (8.x/9.x) vector search support to OpenMetadata, aiming to provide semantic/vector search capabilities on Elasticsearch deployments comparable to the existing OpenSearch implementation.
Changes:
- Added a new `ElasticSearchVectorService` plus wiring in `SearchRepository`/`ElasticSearchBulkSink` to initialize and use it when Elasticsearch is the configured backend.
- Introduced ES-native vector index mapping templates (`vector_search_index_es_native.json`) and extended query-building to emit Elasticsearch’s top-level `knn` query format.
- Added/updated tests around the ES-native query format and Elasticsearch vector service behavior.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/search/searchRequest.json | Adds semanticSearch flag to the search request schema. |
| openmetadata-spec/src/main/resources/elasticsearch/en/vector_search_index_es_native.json | New ES-native vector index template (en). |
| openmetadata-spec/src/main/resources/elasticsearch/jp/vector_search_index_es_native.json | New ES-native vector index template (jp). |
| openmetadata-spec/src/main/resources/elasticsearch/ru/vector_search_index_es_native.json | New ES-native vector index template (ru). |
| openmetadata-spec/src/main/resources/elasticsearch/zh/vector_search_index_es_native.json | New ES-native vector index template (zh). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilder.java | Adds buildNativeESQuery and refactors filter emission for vector search queries. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorSearchQueryBuilderTest.java | Adds coverage for ES-native top-level knn query structure and filter behavior. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorIndexService.java | Extends vector service interface and adds an alias helper. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java | Adjusts to use the new interface default alias method and annotates overrides. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/ElasticSearchVectorService.java | New Elasticsearch vector service implementation using Rest5Client for generic requests. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/ElasticSearchVectorServiceTest.java | New tests for ES vector service result parsing, grouping, and dimension patching. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Initializes ES vector service when Elasticsearch backend is configured; mapping selection tweaks for ES-native template. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/RecreateWithEmbeddings.java | Attempts to include a vector “entity” key in recreate flow when vector search is enabled. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/SemanticSearchQueryBuilder.java | New builder for semantic/hybrid query composition on Elasticsearch. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java | Extracts mappings sub-object before calling putMapping. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManagerTest.java | Adds a test asserting updateIndex handles full index JSON by extracting mappings. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/search/VectorSearchResource.java | Switches to repository-provided VectorIndexService and adds a fingerprint endpoint. |
| openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSink.java | Adds async vector-embedding task execution + migration path for ES indexing jobs. |
| openmetadata-service/src/test/java/org/openmetadata/service/apps/bundles/searchIndex/ElasticSearchBulkSinkSimpleTest.java | Adds minimal coverage for vector-embedding helpers on the ES sink. |
| openmetadata-mcp/src/main/java/org/openmetadata/mcp/tools/SemanticSearchTool.java | Uses repository VectorIndexService rather than OpenSearch-only implementation. |
|
Also need to review all after this refactor #26000 😢 |
|
@joaopamaral thanks for your work on this. Can you check the Copilot comments and address the merge conflict here, please?
|
The Java checkstyle failed. Please run You can install the pre-commit hooks with |
|
… interface constant

EsUtils.addDenseVectorSettings:
- Also check existing properties.embedding.dims, not just _meta.embedding_dimension. Either source can carry an immutable dimension; rewriting either via putMapping is rejected by ES.
- Both checks delegate to a shared assertDimensionMatches helper that throws IllegalStateException with a clear "drop+reindex" message before any mapping rewrite.
- Preserve existing _meta fields when upserting embedding_model / embedding_dimension; previous putObject call clobbered siblings.

ElasticSearchVectorService:
- Drop the language field and its constructor / init parameters. It was stored and normalized but never read anywhere; OpenSearchVectorService has no equivalent.

VectorIndexService:
- Remove unused VECTOR_INDEX_KEY constant (never referenced).

Tests updated for new ctor / init signatures + spotless formatting.

Addresses PR review comments r3170730416, r3170730465, r3182117105, r3182117165, r3182241460, r3182241528.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
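A minimal sketch of the two ideas in this commit message, assuming plain `Map` objects stand in for the mapping JSON. The class name `EmbeddingMetaSketch`, the method signatures, and the exact error wording are hypothetical and are not the code in `EsUtils`; only the helper name `assertDimensionMatches` and the `_meta` keys come from the message above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not the actual EsUtils implementation. */
final class EmbeddingMetaSketch {
  private EmbeddingMetaSketch() {}

  /**
   * Fails before any mapping rewrite when an existing dimension differs from the
   * dimension of the active embedding client. A null existing value means the
   * index has no recorded dimension yet, so there is nothing to compare.
   */
  static void assertDimensionMatches(Integer existing, int expected, String sourceName) {
    if (existing != null && existing != expected) {
      throw new IllegalStateException(
          "Embedding dimension mismatch in "
              + sourceName
              + ": index has "
              + existing
              + ", client produces "
              + expected
              + "; drop+reindex required");
    }
  }

  /** Upserts the two embedding keys while keeping every sibling key of _meta. */
  static Map<String, Object> upsertMeta(
      Map<String, Object> existingMeta, String model, int dimension) {
    Map<String, Object> merged = new LinkedHashMap<>();
    if (existingMeta != null) {
      merged.putAll(existingMeta); // preserve siblings instead of replacing the object
    }
    merged.put("embedding_model", model);
    merged.put("embedding_dimension", dimension);
    return merged;
  }
}
```

Copying the existing `_meta` entries first and then overwriting only the two embedding keys is what keeps unrelated metadata from being clobbered.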
Pull request overview
Copilot reviewed 22 out of 24 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:393
`createOrUpdateIndexTemplates()` catches all exceptions from mapping enrichment (including `IllegalStateException` from `EsUtils.enrichIndexMappingForElasticsearch` on embedding dimension drift) and only logs `e.getMessage()` at WARN. This effectively downgrades a critical misconfiguration to a soft failure and also drops the stack trace, making the root cause harder to diagnose. Consider either (a) logging the full exception (pass `e` to the logger) and/or (b) rethrowing/propagating dimension-mismatch failures so startup/reindex clearly fails and operators are forced to reindex as intended.
  public void createOrUpdateIndexTemplates() {
    LOG.info("Creating/updating index templates for all entities...");
    int success = 0;
    int failed = 0;
    for (Map.Entry<String, IndexMapping> entry : entityIndexMap.entrySet()) {
      try {
        IndexMapping indexMapping = entry.getValue();
        String indexName = indexMapping.getIndexName(clusterAlias);
        String templateName = "om_" + indexName;
        String indexPattern = indexName + "*";
        String mappingContent = enrichForElasticsearch(readIndexMapping(indexMapping));
        if (mappingContent != null) {
          searchClient.createOrUpdateIndexTemplate(templateName, indexPattern, mappingContent);
          success++;
        } else {
          failed++;
          LOG.warn("No mapping content found for entity type: {}", entry.getKey());
        }
      } catch (Exception e) {
        failed++;
        LOG.warn("Failed to create index template for {}: {}", entry.getKey(), e.getMessage());
      }
EsUtils.enrichIndexMappingForElasticsearch:
- Document the two failure modes in Javadoc:
  * IllegalArgumentException on null/empty input
  * IllegalStateException on embedding dimension drift
  Callers/operators were previously left to discover these the hard way.

SearchRepository.createOrUpdateIndexTemplates:
- Rethrow IllegalStateException (dimension mismatch) instead of swallowing it as a per-entity WARN. A broken vector setup must fail loudly so the operator runs an explicit reindex.
- For other Exceptions, pass the throwable to the logger so the stack trace is preserved (was logging only e.getMessage()).

Addresses PR review comments r3182610888 and reviewer follow-up on createOrUpdateIndexTemplates exception handling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
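The exception-handling shape described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in `SearchRepository`: the class `TemplateLoopSketch`, the `TemplateAction` interface, and the use of `java.util.logging` (so the example has no external dependencies) are all hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration of "rethrow fatal, log and count the rest". */
final class TemplateLoopSketch {
  private static final Logger LOG = Logger.getLogger(TemplateLoopSketch.class.getName());

  private TemplateLoopSketch() {}

  /** Stand-in for the per-entity template update; may throw any exception. */
  interface TemplateAction {
    void apply(String entityType) throws Exception;
  }

  /** Returns the number of entities that failed with a non-fatal exception. */
  static int applyAll(Iterable<String> entityTypes, TemplateAction action) {
    int failed = 0;
    for (String entityType : entityTypes) {
      try {
        action.apply(entityType);
      } catch (IllegalStateException e) {
        // Assumed to signal a dimension mismatch: propagate so the caller fails loudly.
        throw e;
      } catch (Exception e) {
        failed++;
        // Passing the throwable keeps the stack trace in the log output.
        LOG.log(Level.WARNING, "Failed to create index template for " + entityType, e);
      }
    }
    return failed;
  }
}
```

The more specific `catch` has to come first; otherwise the broad `catch (Exception e)` would absorb the `IllegalStateException` again.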
@lautel I think Copilot is happy now 😄 |
There's just one cleanup left and the PR should be good to go 🚀 Please check my comment about removing the dead code in ElasticSearchVectorService. Also, please resolve the merge conflict, and don't forget to run mvn spotless:apply before committing to apply the right formatting! Thanks!
…with OpenSearch

Reviewer flagged the manual `if (statusCode >= 400)` branch as unreachable, asserting Rest5Client throws ResponseException on any non-2xx. Empirically verified with a probe test that this is only true for 5xx: Rest5Client's internal isCorrectServerResponse returns `code < 500`, so 4xx responses are returned normally and would silently slip through if the manual check is removed.

Final shape:
- 4xx: manual check throws IOException with same message format used by OpenSearchVectorService.
- 5xx: catch ResponseException, extract status + body, rethrow RuntimeException with the same "Elasticsearch request failed with status N: body" wording, symmetric to the OS path.

Comment in code documents the asymmetric Rest5Client behavior so future readers don't repeat the same misread.

Addresses PR review comment in pullrequestreview-4226084412.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
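A minimal sketch of the manual status check this commit keeps, assuming the status code and body have already been read from the response. `StatusCheckSketch` and `ensureSuccess` are hypothetical names; only the message wording is taken from the commit message, and no Rest5Client types are used here.

```java
import java.io.IOException;

/** Hypothetical illustration of the manual check for statuses the client returns normally. */
final class StatusCheckSketch {
  private StatusCheckSketch() {}

  /**
   * Returns the body for successful responses and throws for any status >= 400.
   * Per the commit message, 4xx responses come back as ordinary responses, so
   * without this check an error body would be treated as a result.
   */
  static String ensureSuccess(int statusCode, String body) throws IOException {
    if (statusCode >= 400) {
      throw new IOException(
          "Elasticsearch request failed with status " + statusCode + ": " + body);
    }
    return body;
  }
}
```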
Resolve conflict in ElasticSearchBulkSinkSimpleTest.java: keep the new testIsVectorEmbeddingEnabledForEntity test from this branch alongside the new semaphoreTimeoutRecordsPermanentFailureWithoutIncrementingActiveRequests test + helper methods landed on main.

Also include num_candidates int-overflow fix from copilot review:
- VectorSearchQueryBuilder.buildNativeESQuery: compute k * multiplier in long, clamp to Integer.MAX_VALUE so a configurable multiplier can never produce a negative or wrapped num_candidates.
- VectorSearchQueryBuilderTest: pin the clamp behavior.

Addresses copilot review pullrequestreview-4228238013 (overflow item).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
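The overflow fix described above can be sketched like this, assuming both `k` and the multiplier are positive. `NumCandidatesSketch.compute` is a hypothetical helper written for illustration; it is not the body of `buildNativeESQuery`.

```java
/** Hypothetical illustration of computing num_candidates without int overflow. */
final class NumCandidatesSketch {
  private NumCandidatesSketch() {}

  /**
   * Multiplies in 64-bit arithmetic, then clamps to the int range.
   * Assumes k > 0 and multiplier > 0. With plain int arithmetic a large
   * product would wrap around and could become negative.
   */
  static int compute(int k, int multiplier) {
    long product = (long) k * (long) multiplier;
    return (int) Math.min(product, (long) Integer.MAX_VALUE);
  }
}
```

Widening one operand before the multiplication matters: `(long) (k * multiplier)` would overflow first and widen the already wrapped value.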
…tity

ElasticSearchVectorService.collectSearchHits:
- When _source has no parentId we fall back to the document _id for grouping, but the fallback was never written into the result map. SemanticSearchTool.cleanHit pulls parentId via copyIfPresent, so a missing key silently drops it from MCP responses. Now persist the fallback into hitMap so consumers always see a populated parentId.

ElasticSearchVectorService.executeGenericRequest:
- response.getEntity() can be null (some ES endpoints return no body on 4xx). The previous unconditional getEntity().getContent() NPE'd and masked the real HTTP status. Extract a readEntityBody helper that returns "" for a null/unreadable entity and use it on both the 4xx success-but-error path and the 5xx ResponseException path.
- Throw RuntimeException directly on 4xx (was IOException then wrapped) so the status + body message survives the outer catch.
- Add a "rethrow RuntimeException as-is" branch so a meaningful status-bearing message isn't double-wrapped into the generic "Elasticsearch generic request failed" string.

Tests:
- testParentIdFallbackToDocIdIsWrittenIntoResultMap: pins fallback write-back.
- testExecuteGenericRequestHandlesNullEntityWithout4xxNPE: pins null entity tolerance + status surfacing.

Addresses copilot review pullrequestreview-4228238013.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
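A rough sketch of the two defensive helpers this commit describes, assuming a hit's `_source` is available as a `Map` and the response body as an `InputStream`. The class `HitSketch` and both method signatures are hypothetical; the real `collectSearchHits` and `readEntityBody` work against the Elasticsearch client types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not the ElasticSearchVectorService code. */
final class HitSketch {
  private HitSketch() {}

  /** Copies _source and writes the document id back as parentId when it is missing. */
  static Map<String, Object> toHitMap(String docId, Map<String, Object> source) {
    Map<String, Object> hitMap = new HashMap<>(source);
    if (hitMap.get("parentId") == null) {
      hitMap.put("parentId", docId); // persist the fallback so consumers see it
    }
    return hitMap;
  }

  /** Returns the body as UTF-8 text, or "" when the stream is absent or unreadable. */
  static String readBody(InputStream content) {
    if (content == null) {
      return "";
    }
    try (InputStream in = content) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      return "";
    }
  }
}
```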
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review 👍 Approved with suggestions (6 resolved / 7 findings)

Native Elasticsearch vector search integration implemented with robust handling for index mappings and client transport safety. Address the minor fanout logic discrepancy where null targets currently return only staged indices.

💡 Edge Case: getWriteFanoutTargets(null) returns only staged indices
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:630-632

✅ 6 resolved
✅ Bug: Test calls build() with 4 args but method requires 6, so it won't compile
✅ Edge Case: loadIndexMapping dimension replacement is brittle (exact string match)
✅ Edge Case: init() assigns instance before registerVectorEmbeddingHandler completes
✅ Bug: ES search pagination is broken vs OpenSearch implementation
✅ Quality: Unsafe downcast defeats purpose of VectorIndexService interface
...and 1 more resolved from earlier reviews
hey @lautel, the failing tests in CI don't seem related to this PR |



Summary
Adds native Elasticsearch 9.x vector search support, mirroring the existing OpenSearch implementation. OpenMetadata deployments backed by Elasticsearch 9.x can now use the same semantic/vector search features as OpenSearch deployments.
Architecture: ES vs OpenSearch Vector Search
Query format difference
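As a rough illustration of the Elasticsearch side, the sketch below assembles a request body that uses the top-level `knn` search option (`field`, `query_vector`, `k`, `num_candidates`). The class `NativeKnnQuerySketch` and its `build` method are hypothetical and are not `VectorSearchQueryBuilder.buildNativeESQuery`; the example assumes finite vector values and a field name that needs no JSON escaping, and it omits filters.

```java
import java.util.StringJoiner;

/** Hypothetical illustration of the top-level knn request shape. */
final class NativeKnnQuerySketch {
  private NativeKnnQuerySketch() {}

  static String build(String field, float[] queryVector, int k, int numCandidates) {
    // Render the vector as a JSON array; assumes no NaN or infinite components.
    StringJoiner vector = new StringJoiner(",", "[", "]");
    for (float component : queryVector) {
      vector.add(Float.toString(component));
    }
    // Two objects are opened ("{" and "knn":{), so exactly two braces close them.
    return "{\"knn\":{"
        + "\"field\":\"" + field + "\","
        + "\"query_vector\":" + vector + ","
        + "\"k\":" + k + ","
        + "\"num_candidates\":" + numCandidates
        + "}}";
  }
}
```

In real code a JSON library would be a safer choice than string concatenation, since it rules out unbalanced braces and unescaped values by construction.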
Ingestion flow
Changes
New
- `ElasticSearchVectorService`: ES implementation of `VectorIndexService`. Uses `Rest5Client` for all HTTP calls (same approach as the rest of the ES client layer). Mirrors `OpenSearchVectorService` for pagination, fingerprint deduplication, and embedding lifecycle.
- `vector_search_index_es_native.json` (en/jp/ru/zh): ES-native index mappings using `dense_vector` / `dims` / `cosine` similarity (ES 9.x format).
- `SemanticSearchQueryBuilder` (ES package): mirrors the OpenSearch equivalent.
- `VectorSearchQueryBuilder.buildNativeESQuery()`: emits the ES 9.x top-level `knn` format with configurable `num_candidates`.

Modified
- `ElasticSearchIndexManager.extractMappingsJson()`: extracts the `mappings` sub-object before `putMapping`; ES rejects the full index JSON (with `settings`/`aliases`) at the mappings endpoint.
- `EsUtils.enrichIndexMappingForElasticsearch()`: injects a `dense_vector` field into ES index mappings when vector search is enabled.
- `SearchRepository`: wires `ElasticSearchVectorService` for the `ELASTICSEARCH` search type; applies `enrichIndexMappingForElasticsearch` to index templates (OpenSearch path unchanged).
- `ElasticSearchBulkSink`: calls `vectorIndexService.updateEntityEmbedding()` on write, same as `OpenSearchBulkSink`.
- `VectorSearchQueryBuilder`: `reformatVectorIndexWithDimension()` handles both `"dims"` (ES) and `"dimension"` (OpenSearch).

Bug fixes (post-review)
- `appendKnnQuery` closing braces: `}}}}` → `}}}`. The extra brace produced malformed JSON, causing OpenSearch to return 400 on every KNN request (silently yielding empty results).
- `executeGenericRequest` HTTP status check: now throws on 4xx/5xx instead of silently returning `{}`.
- `extractRestClient` type guard: replaces unchecked cast with `instanceof` pattern match; throws `IllegalArgumentException` with the actual transport class name on mismatch.
- `knnNumCandidatesMultiplier`: wired from `naturalLanguageSearch.knnNumCandidatesMultiplier` config; defaults to 2×.
- `getFingerprint` endpoint: gated to admin users.
- `createOrUpdateIndexTemplates()` and `createOrUpdateIndexTemplate()` now call `enrichIndexMappingForElasticsearch()` so templates include `dense_vector` when vector search is enabled.

Review-feedback fixes (commit 9854029a)
- `SystemRepository.getEmbeddingsValidation()`: removed the ES short-circuit. Elasticsearch now flows through the same checks (embedding generation + hybrid pipeline) as OpenSearch instead of always failing with the legacy "not supported" message.
- `EsUtils.addDenseVectorSettings()`: detects `_meta.embedding_dimension` drift vs. the active embedding client and emits a `WARN` log, mirroring `OsUtils.addKnnVectorSettings`. Helps surface stale indexes after a model switch (e.g. DJL 384 → OpenAI 1536).
- `ElasticSearchBulkSink.isVectorEmbeddingEnabledForEntity`: also gates on `searchRepository.getIndexMapping(entityType) != null` to skip entities whose mapping isn't loaded.
- `ElasticSearchBulkSink.fetchExistingFingerprints`: takes a `ReindexContext` and routes to the staged index when one exists. Without this the fingerprint dedup reads the canonical (old) index during a staged reindex and re-processes everything it was supposed to skip.
- … (`EsUtilsTest`, `ElasticSearchBulkSinkBehaviorTest`, `SystemRepositoryEmbeddingsValidationTest`).

Test plan
- `mvn test -pl openmetadata-service -Dtest=VectorSearchQueryBuilderTest,ElasticSearchIndexManagerTest,ElasticSearchVectorServiceTest,SearchRepositoryBehaviorTest,EsUtilsTest,ElasticSearchBulkSinkBehaviorTest,SystemRepositoryEmbeddingsValidationTest`
- `mvn test -pl openmetadata-integration-tests -Dtest=VectorEmbeddingIntegrationIT` (uses Testcontainers OpenSearch 3.4.0)
- … `revenue from sales` → `customer_purchases` 0.76; `login authentication` → `user_logins` 0.71; `temperature humidity` → `weather_data` 0.69). Index has `dense_vector` {dims:384, similarity:cosine, BBQ-HNSW}. See docs/elasticsearch-semantic-search-local-test.md.
- … `SEARCH_TYPE=opensearch`. `OpenSearchVectorService` initializes, `hybrid-rrf` pipeline registered, `knn_vector` mapping created, semantic queries return identical scores to ES (same embedding model → deterministic cosine ranking).

References
🤖 Generated with Claude Code
Summary by Gitar
- … `VectorSearchQueryBuilder` to clean up clamping logic for `num_candidates` to ensure robustness against integer overflow.
- … `ElasticSearchVectorService` to streamline the service implementation.

This will update automatically on new commits.