Commit de950c6

pmbrull and claude authored
feat(search): new embedding entity + body text extension hook (#27233)

* feat(search): enroll ContextMemory in vector embedding + body text extension hook

  Add ContextMemory to AvailableEntityTypes.LIST so the existing vector embedding pipeline (VectorEmbeddingHandler live path and the admin reembed/reindex CLIs) iterates Collate memories alongside the built-in data assets.

  Memories store their semantic payload in title/question/answer/summary, not description, so the default buildBodyText would feed an empty string to the embedder. To fix that without pulling Collate schema classes into OSS, expose a BodyTextExtractor functional interface plus a registerBodyTextExtractor(entityType, extractor) hook on VectorDocBuilder. buildBodyText consults the registry first and falls back to the existing description-based logic when no extractor is registered. Collate registers a typed extractor from its repository static initializer, so both server and CLI paths go through the same code.

  No behavior change for the default entity types: LIST gains one entry at the end, and the registry starts empty so every existing call falls through to the unchanged default branch.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(search): add VectorBodyTextContributor SPI marker interface

  Document the contract for plugging in entity-specific body text extractors into the vector embedding pipeline. Implementations declare their entity type, return a typed BodyTextExtractor, and call the default register() hook from a stable initialization site (typically the owning EntityRepository static initializer).

  Pure documentation of the shape — the backing mechanism is still VectorDocBuilder.registerBodyTextExtractor(), so callers that use the raw registration hook keep working unchanged. The interface exists so new contributors get an IDE nudge ("implement this") instead of having to grep an existing extension class for the pattern.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix NPE in buildBodyText when entityType is null

  ConcurrentHashMap.get(null) throws NullPointerException. GlossaryTerm tests pass null entityType through buildEmbeddingFields. Guard with a null check and wrap custom extractor calls in try/catch so a faulty downstream extractor degrades gracefully instead of crashing the embedding pipeline.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix

* add tests

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 512771d commit de950c6

4 files changed

Lines changed: 194 additions & 1 deletion

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+/*
+ * Copyright 2025 Collate
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.openmetadata.service.search.vector;
+
+import org.openmetadata.service.search.vector.VectorDocBuilder.BodyTextExtractor;
+
+/**
+ * Explicit contract for an entity type that wants to contribute a typed body text extractor to
+ * the vector embedding pipeline. Implementations are the single place where "this entity type
+ * has a custom body text extractor" is declared; downstream distributions implement this
+ * interface once per entity type and invoke {@link #register()} from a stable initialization
+ * site (typically the owning {@code EntityRepository} static initializer so both server and CLI
+ * lifecycles register the extractor exactly once).
+ *
+ * <p>This interface is intentionally tiny — it exists to document the shape of the contract and
+ * let IDEs guide contributors, not to impose an inheritance hierarchy. The backing mechanism is
+ * still {@link VectorDocBuilder#registerBodyTextExtractor(String, BodyTextExtractor)}; callers
+ * that already use the raw registration hook keep working unchanged.
+ */
+public interface VectorBodyTextContributor {
+
+  /**
+   * Entity type key as used in the entity reference (for example {@code "table"},
+   * {@code "contextMemory"}). Must match the value returned by {@code getEntityReference().getType()}
+   * on instances of the contributed entity class and the key used in
+   * {@code AvailableEntityTypes.LIST} — mismatches silently disable the custom extractor.
+   */
+  String entityType();
+
+  /**
+   * Typed body text extractor for this entity type. Runs on the hot path of every create / update
+   * and every reembed iteration; implementations should be fast and side-effect free.
+   */
+  BodyTextExtractor extractor();
+
+  /**
+   * Register this contributor's extractor with the shared {@link VectorDocBuilder} registry.
+   * Registration is idempotent, so calling it from a static initializer is safe even if the
+   * owning class is loaded multiple times (for example, once in the server path and once in a
+   * CLI subcommand).
+   */
+  default void register() {
+    VectorDocBuilder.registerBodyTextExtractor(entityType(), extractor());
+  }
+}
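The contract above can be illustrated with a small, self-contained sketch. It does not use the real OpenMetadata classes: `SketchMemory`, `SketchRegistry`, `SketchContributor`, and `MemoryContributor` are hypothetical stand-ins invented for this example, and the field names (`title`, `question`, `answer`) are only borrowed from the commit message's description of ContextMemory. How Collate actually implements its contributor is not shown in this commit.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for an entity whose text lives outside description. */
final class SketchMemory {
  final String title;
  final String question;
  final String answer;

  SketchMemory(String title, String question, String answer) {
    this.title = title;
    this.question = question;
    this.answer = answer;
  }
}

/** Simplified stand-in for the extractor registry kept by VectorDocBuilder. */
final class SketchRegistry {
  static final Map<String, Function<SketchMemory, String>> EXTRACTORS =
      new ConcurrentHashMap<>();

  private SketchRegistry() {}

  static void register(String entityType, Function<SketchMemory, String> extractor) {
    if (entityType == null || entityType.isBlank() || extractor == null) {
      return; // invalid input is ignored, mirroring the guard in the diff
    }
    EXTRACTORS.put(entityType, extractor); // last writer wins
  }
}

/** Mirrors the shape of VectorBodyTextContributor against the stand-in types. */
interface SketchContributor {
  String entityType();

  Function<SketchMemory, String> extractor();

  default void register() {
    SketchRegistry.register(entityType(), extractor());
  }
}

/** Hypothetical contributor that joins the non-null text fields with newlines. */
final class MemoryContributor implements SketchContributor {
  @Override
  public String entityType() {
    return "contextMemory";
  }

  @Override
  public Function<SketchMemory, String> extractor() {
    return m ->
        Stream.of(m.title, m.question, m.answer)
            .filter(Objects::nonNull)
            .collect(Collectors.joining("\n"));
  }
}

final class ContributorDemo {
  public static void main(String[] args) {
    new MemoryContributor().register();
    new MemoryContributor().register(); // second call overwrites the same key
    SketchMemory m = new SketchMemory("Revenue", "How is revenue computed?", null);
    System.out.println(SketchRegistry.EXTRACTORS.get("contextMemory").apply(m));
  }
}
```

Registering twice leaves a single entry under the same key, which is the sense in which the Javadoc calls registration idempotent.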

openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorDocBuilder.java

Lines changed: 51 additions & 0 deletions
@@ -7,6 +7,7 @@
 import java.util.List;
 import java.util.Map;
 import java.util.Objects;
+import java.util.concurrent.ConcurrentHashMap;
 import java.util.stream.Collectors;
 import lombok.experimental.UtilityClass;
 import lombok.extern.slf4j.Slf4j;
@@ -28,6 +29,41 @@
 @UtilityClass
 public class VectorDocBuilder {
 
+  /**
+   * Strategy for producing the semantic "body text" of an entity that will be chunked and fed to
+   * the embedding model. The default implementation concatenates {@code description} and, for
+   * tables, the column names — which works for every entity type whose semantic payload lives in
+   * {@code description}. Entity types whose payload is spread across other fields (for example
+   * Collate's {@code ContextMemory}, with title/question/answer/summary) can provide a typed
+   * extractor via {@link #registerBodyTextExtractor(String, BodyTextExtractor)} so the embedding
+   * pipeline uses their fields instead of an empty description.
+   */
+  @FunctionalInterface
+  public interface BodyTextExtractor {
+    /**
+     * Returns the body text for the given entity, or {@code null} to fall back to the default
+     * behavior. Implementations should be fast and side-effect free; they run on the hot path of
+     * every create/update and every reembed iteration.
+     */
+    String extract(EntityInterface entity);
+  }
+
+  private static final Map<String, BodyTextExtractor> BODY_TEXT_EXTRACTORS =
+      new ConcurrentHashMap<>();
+
+  /**
+   * Register a custom {@link BodyTextExtractor} for an entity type. The registry is consulted by
+   * {@link #buildBodyText(EntityInterface, String)} before the default description-based logic,
+   * so callers can cleanly override body text for their own entity types without patching this
+   * class. Registration is idempotent (last writer wins) and thread-safe.
+   */
+  public static void registerBodyTextExtractor(String entityType, BodyTextExtractor extractor) {
+    if (entityType == null || entityType.isBlank() || extractor == null) {
+      return;
+    }
+    BODY_TEXT_EXTRACTORS.put(entityType, extractor);
+  }
+
   public static List<Map<String, Object>> fromEntity(
       EntityInterface entity, EmbeddingClient embeddingClient) {
     Map<String, Object> doc = new HashMap<>(buildEmbeddingFields(entity, embeddingClient));
@@ -219,6 +255,21 @@ static String buildMetaLightText(EntityInterface entity, String entityType) {
   }
 
   static String buildBodyText(EntityInterface entity, String entityType) {
+    if (entityType != null) {
+      BodyTextExtractor customExtractor = BODY_TEXT_EXTRACTORS.get(entityType);
+      if (customExtractor != null) {
+        try {
+          String custom = customExtractor.extract(entity);
+          if (custom != null) {
+            return custom;
+          }
+        } catch (Exception e) {
+          LOG.warn(
+              "Custom BodyTextExtractor failed for [{}], falling back to default", entityType, e);
+        }
+      }
+    }
+
     List<String> bodyParts = new ArrayList<>();
     bodyParts.add("description: " + removeHtml(orEmpty(entity.getDescription())));
 
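The lookup order in `buildBodyText` (registry first, then the description-based default) can be sketched in isolation. This is a simplified illustration, not the project's code: `BodyTextFallbackSketch` is a hypothetical class, and a plain `String` description stands in for `EntityInterface`. The null guard matters because `ConcurrentHashMap.get(null)` throws `NullPointerException`, unlike `HashMap.get(null)`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical, simplified model of the registry-then-default lookup. */
final class BodyTextFallbackSketch {
  private static final Map<String, Function<String, String>> EXTRACTORS =
      new ConcurrentHashMap<>();

  private BodyTextFallbackSketch() {}

  static void register(String entityType, Function<String, String> extractor) {
    if (entityType == null || entityType.isBlank() || extractor == null) {
      return; // silently ignore invalid registrations
    }
    EXTRACTORS.put(entityType, extractor);
  }

  static String bodyText(String description, String entityType) {
    // ConcurrentHashMap rejects null keys, so skip the lookup for a null type.
    if (entityType != null) {
      Function<String, String> custom = EXTRACTORS.get(entityType);
      if (custom != null) {
        try {
          String text = custom.apply(description);
          if (text != null) {
            return text; // the custom extractor wins
          }
          // a null result means "use the default"
        } catch (RuntimeException e) {
          // a faulty extractor must not break the pipeline
          System.err.println("extractor failed for [" + entityType + "], using default");
        }
      }
    }
    return "description: " + (description == null ? "" : description);
  }

  public static void main(String[] args) {
    register("memory", d -> "custom: " + d);
    register("broken", d -> {
      throw new IllegalStateException("boom");
    });
    System.out.println(bodyText("hello", "memory"));
    System.out.println(bodyText("hello", "broken"));
    System.out.println(bodyText("hello", null));
  }
}
```

The three fallback cases (no extractor, extractor returns null, extractor throws) all land on the same default branch, which is why existing entity types see no behavior change.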

openmetadata-service/src/main/java/org/openmetadata/service/search/vector/utils/AvailableEntityTypes.java

Lines changed: 2 additions & 1 deletion
@@ -27,7 +27,8 @@ private AvailableEntityTypes() {}
           "page",
           "storedProcedure",
           "searchIndex",
-          "topic");
+          "topic",
+          "contextMemory");
 
   public static final Set<String> SET =
       LIST.stream().map(s -> s.toLowerCase(Locale.ROOT)).collect(Collectors.toUnmodifiableSet());
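One detail worth noting: `SET` holds lower-cased keys, while the extractor registry in `VectorDocBuilder` is keyed by the exact string passed to `registerBodyTextExtractor`. The sketch below shows the difference under stated assumptions: `EntityTypeSketch` and `isEmbeddable` are hypothetical, the list is shortened to the entries visible in this hunk, and how callers actually consult `SET` is not shown in this diff.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical, shortened copy of the LIST/SET pair for illustration only. */
final class EntityTypeSketch {
  static final List<String> LIST = List.of("searchIndex", "topic", "contextMemory");

  // Same derivation as in the diff: lower-case every entry with a fixed locale.
  static final Set<String> SET =
      LIST.stream().map(s -> s.toLowerCase(Locale.ROOT)).collect(Collectors.toUnmodifiableSet());

  private EntityTypeSketch() {}

  /** Assumed helper: a case-insensitive membership check against SET. */
  static boolean isEmbeddable(String entityType) {
    return entityType != null && SET.contains(entityType.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(isEmbeddable("ContextMemory")); // matches after lower-casing
    System.out.println(SET.contains("contextMemory")); // camelCase is not in SET as-is
  }
}
```

Because the registry lookup is case-sensitive, an extractor should be registered under the same camelCase key that the entity reference reports (for example `"contextMemory"`), as the `VectorBodyTextContributor` Javadoc warns.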

openmetadata-service/src/test/java/org/openmetadata/service/search/vector/VectorDocBuilderTest.java

Lines changed: 86 additions & 0 deletions
@@ -312,6 +312,92 @@ private GlossaryTerm createTestGlossaryTerm(String name, String displayName, Str
     return term;
   }
 
+  @Test
+  void testRegisterCustomExtractorIsUsed() {
+    String type = "customExtractorTest_" + UUID.randomUUID();
+    VectorDocBuilder.registerBodyTextExtractor(
+        type, entity -> "custom body for " + entity.getName());
+
+    Table table = createTestTable("ext_table", null, "original desc");
+    String body = VectorDocBuilder.buildBodyText(table, type);
+
+    assertEquals("custom body for ext_table", body);
+  }
+
+  @Test
+  void testCustomExtractorReturningNullFallsBackToDefault() {
+    String type = "nullExtractorTest_" + UUID.randomUUID();
+    VectorDocBuilder.registerBodyTextExtractor(type, entity -> null);
+
+    Table table = createTestTable("fallback_table", null, "fallback desc");
+    String body = VectorDocBuilder.buildBodyText(table, type);
+
+    assertTrue(body.contains("fallback desc"));
+  }
+
+  @Test
+  void testCustomExtractorThrowingFallsBackToDefault() {
+    String type = "throwExtractorTest_" + UUID.randomUUID();
+    VectorDocBuilder.registerBodyTextExtractor(
+        type,
+        entity -> {
+          throw new RuntimeException("boom");
+        });
+
+    Table table = createTestTable("err_table", null, "safe desc");
+    String body = VectorDocBuilder.buildBodyText(table, type);
+
+    assertTrue(body.contains("safe desc"));
+  }
+
+  @Test
+  void testRegisterExtractorIgnoresNullAndBlank() {
+    String type = "ignoreTest_" + UUID.randomUUID();
+
+    VectorDocBuilder.registerBodyTextExtractor(null, entity -> "nope");
+    VectorDocBuilder.registerBodyTextExtractor("", entity -> "nope");
+    VectorDocBuilder.registerBodyTextExtractor(" ", entity -> "nope");
+    VectorDocBuilder.registerBodyTextExtractor(type, null);
+
+    Table table = createTestTable("guard_table", null, "default desc");
+    String body = VectorDocBuilder.buildBodyText(table, type);
+
+    assertTrue(body.contains("default desc"));
+  }
+
+  @Test
+  void testVectorBodyTextContributorRegister() {
+    String type = "contributorTest_" + UUID.randomUUID();
+
+    VectorBodyTextContributor contributor =
+        new VectorBodyTextContributor() {
+          @Override
+          public String entityType() {
+            return type;
+          }
+
+          @Override
+          public VectorDocBuilder.BodyTextExtractor extractor() {
+            return entity -> "contributed: " + entity.getName();
+          }
+        };
+
+    contributor.register();
+
+    Table table = createTestTable("contrib_table", null, "ignored");
+    String body = VectorDocBuilder.buildBodyText(table, type);
+
+    assertEquals("contributed: contrib_table", body);
+  }
+
+  @Test
+  void testBuildBodyTextWithNullEntityType() {
+    Table table = createTestTable("null_type_table", null, "some desc");
+    String body = VectorDocBuilder.buildBodyText(table, null);
+
+    assertTrue(body.contains("some desc"));
+  }
+
   private Table createTestTable(String name, String displayName, String description) {
     Table table = new Table();
     table.setId(UUID.randomUUID());
