fix(search): enforce ContextMemory visibility at index time, not per-query#29541
fix(search): enforce ContextMemory visibility at index time, not per-query#29541harshach wants to merge 5 commits into
Conversation
…query Private/Shared ContextMemory could leak through search paths the per-query applyContextMemoryVisibility filter (#29384) did not cover. Instead of filtering on every search path, only org-wide (ENTITY) memories are written to the index: ContextMemoryRepository.isSearchIndexable gates the live create/update/embedding paths and getReindexFilter applies the same rule to the bulk reindex reader and counts (so partition totals stay balanced and the staged index promotes). Removes the applyContextMemoryVisibility apparatus from both search managers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
❌ PR checklist incompleteThis PR cannot be merged until the following are addressed on its linked issue:
The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically. Maintainers can bypass this check by adding the |
| if (!Entity.getEntityRepository(entityType).isSearchIndexable(entity)) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Silent early-return on update gives no vector-side cleanup — When a memory is flipped ENTITY→PRIVATE,
SearchRepository.updateEntityIndex calls deleteEntityIndex to remove the entity document. Because vector embeddings are written via partialUpdateEntity as fields on that same document, the entity-doc deletion implicitly cleans up the embedding too. The VectorEmbeddingHandler returning early here is therefore safe today. However, the asymmetry vs. SearchRepository (which explicitly deletes) is easy to miss: if the embedding is ever moved to its own child documents (the chunkIndex/chunkCount/parentId fields already hint at chunking), this silent return would become a real leak with no compiler or test warning. A brief inline comment explaining the implicit-cleanup assumption would protect future readers.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
createEntitiesIndex and updateEntitiesIndex (the bulk live paths reached via SearchIndexHandler.onEntitiesCreated/onEntitiesUpdated) bypassed the isSearchIndexable gate, so a PRIVATE/SHARED ContextMemory created or updated through a bulk lifecycle dispatch (import/batch extraction) could still be indexed. Apply the same gate: skip non-indexable entities on bulk create; on bulk update drop them and delete their stale docs, mirroring updateEntityIndex's ENTITY->PRIVATE flip handling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🔴 Playwright Results — 2 failure(s), 54 flaky✅ 4420 passed · ❌ 2 failed · 🟡 54 flaky · ⏭️ 38 skipped
Genuine Failures (failed on all attempts)❌
|
checkIfIndexingIsSupported gates on the index mapping, not repository existence, so an indexable but repository-less type (e.g. an index-only/time-series sub-entity) reaching createEntityIndex/updateEntityIndex would throw EntityNotFoundException at the getEntityRepository().isSearchIndexable call added in the prior commits. Route all live index-path gates through a new Entity.isSearchIndexable facade that defaults to indexable when no repository is registered, so the rule still lives in ContextMemoryRepository but repository-less types index exactly as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_PARAM Warn that the filter param is set only by ContextMemoryRepository.getReindexFilter and must not be applied to the REST /contextCenter/memories listing (owners must still read their own PRIVATE/SHARED memories there). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review 👍 Approved with suggestions 1 resolved / 2 findingsEnforces ContextMemory visibility at index time by gating search indexing to ENTITY-only records, resolving the risk of data leakage via uncovered search paths. Consider refining integration test assertions and adding documentation to the visibility parameter to prevent future misuse. 💡 Edge Case: Deterministic tiebreak now sorts on name.keyword/id.keyword for all queries📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSearchManager.java:1114-1122 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSearchManager.java:1254-1268 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSearchManager.java:1073-1081 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchSearchManager.java:1211-1225 In both OpenSearchSearchManager.buildSearchRequestBuilder (~lines 1254-1278) and ElasticSearchSearchManager.buildSearchRequestBuilder (~lines 1211-1235), the deterministic tiebreaker ( This is a deliberate determinism improvement, but it broadens the set of queries that sort on ✅ 1 resolved✅ Security: Bulk live-index paths bypass isSearchIndexable visibility gate
🤖 Prompt for agentsOptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



Describe your changes:
This addresses a security finding from the review of #29384: Private/Shared
ContextMemorycould still leak through search paths that the per-queryapplyContextMemoryVisibilityfilter didn't cover. Instead of filtering on every search path (fragile — the review found three uncovered ones), the rule is enforced once at index time: only org-wide (ENTITY) memories are written to the search index.ContextMemoryRepository.isSearchIndexablegates the live create/update/embedding paths andContextMemoryRepository.getReindexFilter()applies the samevisibility == ENTITYrule (a cross-DB JSONListFiltercondition) to the bulk-reindex reader and its entity counts, so partition totals stay balanced and the staged index promotes. TheapplyContextMemoryVisibilityapparatus is removed from both the Elasticsearch and OpenSearch managers; the REST/contextCenter/memoriesvisibility checks are unchanged.Type of change:
High-level design:
Two indexing pipelines need the rule, so it lives entirely in
ContextMemoryRepository(overriding trivialEntityRepositorybase hooks):isSearchIndexable(entity)for the live single-entity path (SearchRepository.createEntityIndex/updateEntityIndex,VectorEmbeddingHandler), andgetReindexFilter()(ashareConfig.visibility = 'Entity'JSON condition onListFilter) for the bulk reindex — applied at the reader (PaginatedEntitiesSource) and every entity-count site (PartitionCalculator/ReindexingOrchestrator/DistributedIndexingStrategy). Count and read must share the same filter because reindex promotion compares reader-total vs sink-success (StatsReconciler/EntityPromotionContext); excluding from the read but not the count would collapse the success ratio and block index promotion. No data migration is needed — ContextMemory search indexing is unreleased (added post-1.13, only on2.0.0-SNAPSHOT), so the feature ships with this fix.Tests:
Use cases covered
Unit tests
ContextMemoryRepositoryTest—isSearchIndexableacross ENTITY/PRIVATE/SHARED/missing-shareConfig +getReindexFilter.ListFilterTest— the generatedmemorySearchVisibilitySQL condition (MySQL + Postgres forms).Backend integration tests
openmetadata-integration-tests/.ContextMemoryIT.privateAndSharedMemoriesAreExcludedFromSearchIndexandContextMemoryIT.memoryVisibilityFlipAddsAndRemovesFromSearchIndex.Ingestion integration tests
Playwright (UI) tests
Manual testing performed
mvn compile+spotless:checkonopenmetadata-serviceandopenmetadata-integration-tests: pass.ContextMemoryRepositoryTest,ListFilterTest,ReindexingUtilTest. The two ContextMemory ITs require the Docker stack and were not run locally.UI screen recording / screenshots:
Not applicable.
Checklist:
🤖 Generated with Claude Code
Greptile Summary
This PR moves
ContextMemoryvisibility enforcement from per-query search filters to index time: only org-wide (ENTITY) memories are written to the search index, whilePRIVATEandSHAREDmemories are excluded at create/update and during bulk reindex. The fragile per-queryapplyContextMemoryVisibilityapparatus in both ES and OS managers is removed, replaced by two repository overrides (isSearchIndexableandgetReindexFilter) and a sharedListFiltercondition that keeps reader counts and partition totals aligned so staged-index promotion succeeds.ContextMemoryRepositoryoverridesisSearchIndexable(live single-entity gate) andgetReindexFilter(DB-level filter for bulk reindex), withSearchRepository.updateEntityIndexcallingdeleteEntityIndexon a visibility flip to PRIVATE/SHARED.PartitionCalculator,ReindexingOrchestrator,DistributedIndexingStrategy) and the bulk reader (PaginatedEntitiesSource) now userepository.getReindexFilter()so counts match what is actually indexed and partition promotion ratios stay correct.ContextMemorySearchVisibilityand its ES/OS wiring are deleted; the REST/contextCenter/memoriesvisibility checks are unchanged.Confidence Score: 5/5
Safe to merge; the privacy boundary is correctly enforced at index time across all live and bulk paths, and the old per-query filter is cleanly removed.
All four live indexing entry points (createEntityIndex, updateEntityIndex, the streaming-buffer batch path, and VectorEmbeddingHandler) now gate on isSearchIndexable, and every entity-count site shares the same getReindexFilter() so promotion ratios stay correct. The one gap found is PartitionWorker.initializeKeysetCursor fallback using new ListFilter(Include.ALL) rather than getReindexFilter(); this can cause ENTITY memories near a failure boundary to be skipped in that specific recovery scenario, but since entity reads in that same path do use getReindexFilter(), no PRIVATE/SHARED memory will ever reach the index. ContextMemory search indexing is also unreleased at the time of this change, limiting blast radius.
PartitionWorker.java line 655 — the fallback cursor in initializeKeysetCursor should use getReindexFilter() to stay consistent with the primary path.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[ContextMemory create/update] --> B{isSearchIndexable?} B -- ENTITY visibility --> C[SearchRepository.createEntityIndex / updateEntityIndex] B -- PRIVATE or SHARED --> D[SearchRepository.deleteEntityIndex\n remove stale doc] B -- PRIVATE or SHARED --> E[VectorEmbeddingHandler\n early return] F[Bulk Reindex trigger] --> G[PaginatedEntitiesSource\n filter = getReindexFilter\n visibility = ENTITY only] G --> H[PartitionCalculator / ReindexingOrchestrator\n DistributedIndexingStrategy counts\n same getReindexFilter] H --> I{Count matches reader} I -- yes --> J[StatsReconciler ratio OK\n staged index promotes] G --> L[BulkSink writes ENTITY memories] L --> M[Search Index\n only ENTITY memories]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[ContextMemory create/update] --> B{isSearchIndexable?} B -- ENTITY visibility --> C[SearchRepository.createEntityIndex / updateEntityIndex] B -- PRIVATE or SHARED --> D[SearchRepository.deleteEntityIndex\n remove stale doc] B -- PRIVATE or SHARED --> E[VectorEmbeddingHandler\n early return] F[Bulk Reindex trigger] --> G[PaginatedEntitiesSource\n filter = getReindexFilter\n visibility = ENTITY only] G --> H[PartitionCalculator / ReindexingOrchestrator\n DistributedIndexingStrategy counts\n same getReindexFilter] H --> I{Count matches reader} I -- yes --> J[StatsReconciler ratio OK\n staged index promotes] G --> L[BulkSink writes ENTITY memories] L --> M[Search Index\n only ENTITY memories]Comments Outside Diff (2)
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ContextMemoryIT.java, line 77-84 (link)assertFalsewould pass even if PRIVATE/SHARED are later indexed (and the test would not re-run). SinceisSearchIndexableblocks them synchronously atcreateEntityIndex, they can never actually land in the index, so the test will never catch a real regression on that code path either way. Consider wrapping the absence checks inawaitSearchPresence(id, false, …)to make the test symmetric and resilient to any future changes in event ordering.openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ListFilter.java, line 303-313 (link)getCondition()call —getMemorySearchVisibilityCondition()is wired into the global condition chain and thequeryParams.get(...)lookup plusnullOrEmptycheck runs on everyListFilterinstance, not just those created for ContextMemory. The method returns""for all other entities (safe, consistent with the existing pattern), but a typo that accidentally setsMEMORY_SEARCH_VISIBILITY_PARAMon a non-ContextMemory filter would silently filter out all rows for that entity type because most entity tables don't have a JSONshareConfig.visibilityfield. The risk is low given onlyContextMemoryRepository.getReindexFilter()sets this param today, but a short guard comment on the constant noting it is exclusively for that repository would help prevent misuse.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (5): Last reviewed commit: "docs(search): document reindex-only scop..." | Re-trigger Greptile