
Add new "textToLLMContext" field to improve embeddings #27485

Merged
lautel merged 8 commits into main from update-text-to-embed
Apr 23, 2026

@lautel lautel commented Apr 17, 2026

Closes https://github.com/open-metadata/ai-platform/issues/303


Summary by Gitar

  • Embedding architecture:
    • Split textToEmbed into textToLLMContext (legacy format for agents) and a new, clean textToEmbed for semantic vector search.
    • Implemented a buildSemanticMetaLightText generator that excludes structural noise (FQN, system fields) to improve search relevance.
  • Refactored builder logic:
    • Replaced reflection-based child lookup with type-safe method references in SEMANTIC_CHILDREN_SPECS.
    • Introduced SEMANTIC_ENRICHERS map to replace instanceof branching for entity-specific metadata.
  • Schema updates:
    • Added textToLLMContext to all Elasticsearch index mappings to ensure backward compatibility for tooling.
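
The split between the two text fields can be illustrated with a minimal sketch. The `Asset` record, the class name, and both builder methods are hypothetical stand-ins and are not taken from VectorDocBuilder; the exact fields kept or dropped in the real implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from VectorDocBuilder. */
final class EmbeddingTextSketch {

  /** Minimal stand-in for an entity; real entities carry many more fields. */
  record Asset(String entityType, String fqn, String name, String description,
      List<String> columns) {}

  /** Legacy-style text: keeps structural fields (type, FQN) that agents may rely on. */
  static String buildTextToLLMContext(Asset a) {
    return "type: " + a.entityType()
        + "; fqn: " + a.fqn()
        + "; name: " + a.name()
        + "; description: " + a.description()
        + "; columns: " + String.join(", ", a.columns());
  }

  /** Clean text for vector search: descriptive content only, no FQN or type. */
  static String buildTextToEmbed(Asset a) {
    List<String> parts = new ArrayList<>();
    parts.add(a.name());
    if (a.description() != null && !a.description().isBlank()) {
      parts.add(a.description());
    }
    if (!a.columns().isEmpty()) {
      parts.add("columns: " + String.join(", ", a.columns()));
    }
    return String.join(". ", parts);
  }

  public static void main(String[] args) {
    Asset orders = new Asset("table", "mysql.shop.public.orders", "orders",
        "Customer orders", List.of("id", "total"));
    System.out.println(buildTextToLLMContext(orders));
    System.out.println(buildTextToEmbed(orders));
  }
}
```

Keeping both strings lets existing tooling read the structural text unchanged while the vector index only sees descriptive content.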

This will update automatically on new commits.

@lautel lautel self-assigned this Apr 17, 2026
@lautel lautel added the "safe to test" label (Add this label to run secure Github workflows on PRs) Apr 17, 2026

github-actions Bot commented Apr 17, 2026

🟡 Playwright Results — all passed (20 flaky)

✅ 3691 passed · ❌ 0 failed · 🟡 20 flaky · ⏭️ 89 skipped

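| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| ✅ Shard 1 | 481 | 0 | 0 | 4 |
| 🟡 Shard 2 | 652 | 0 | 4 | 7 |
| 🟡 Shard 3 | 661 | 0 | 5 | 1 |
| 🟡 Shard 4 | 646 | 0 | 2 | 27 |
| 🟡 Shard 5 | 609 | 0 | 2 | 42 |
| 🟡 Shard 6 | 642 | 0 | 7 | 8 |
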
🟡 20 flaky test(s) (passed on retry)
  • Features/ActivityFeed.spec.ts › Mention notification shows correct user details in Notification box (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Database service (shard 2, 1 retry)
  • Features/DataQuality/DataQuality.spec.ts › TestCase filters (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/SampleDataTableOperations.spec.ts › should display sample data tab with rows and columns (shard 3, 1 retry)
  • Features/Workflows/WorkflowOssRestrictions.spec.ts › graph canvas contains workflow nodes (shard 3, 1 retry)
  • Features/Workflows/WorkflowOssRestrictions.spec.ts › exclude-fields-select is enabled in OSS (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for File (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Table (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Create team with domain and verify visibility of inherited domain in user profile after team removal (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@pmbrull pmbrull changed the title from Add new "textToEmbedSemantic" field to improve embeddings to Add new "textToLLMContext" field to improve embeddings Apr 20, 2026

gitar-bot Bot commented Apr 21, 2026

Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Adds the textToLLMContext field and a cleaner textToEmbed for embedding generation. Consider restoring a try-catch around the child-getter and enricher lambda invocations, since their casts are no longer covered by the fail-safe that wrapped the old reflection call.

💡 Edge Case: Removed try-catch makes unchecked casts in lambdas unguarded

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorDocBuilder.java:554 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/vector/VectorDocBuilder.java:352-354

The old readChildNames wrapped the reflection call in a try-catch that returned Collections.emptyList() on any exception. The new code at line 554 calls spec.childGetter().apply(entity) without any exception handling. The cast is logically guarded by the entityType map key (derived from entity.getEntityReference().getType() at line 151), but a ClassCastException would now propagate uncaught if the two ever disagree (e.g., a subclass returning an unexpected type string). The same applies to SEMANTIC_ENRICHERS at lines 352-354.

The risk is low since entityType is derived from the entity itself, but the behavioral change from silent failure to exception propagation is worth noting, especially since this runs during search reindexing, where a single failure could interrupt batch processing.

Suggested fix
Consider wrapping the lambda invocations in a
try-catch for ClassCastException, logging a warning
and returning null/empty to preserve the previous
fail-safe behavior:

  // Declared outside the try so it stays in scope for the code that follows.
  List<String> childNames;
  try {
    childNames =
        readChildNames(spec.childGetter().apply(entity));
  } catch (ClassCastException e) {
    // Entity did not match the type its spec expects; skip it.
    LOG.warn("Type mismatch for {}: {}",
        entityType, e.getMessage());
    return null;
  }
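
For readers without the diff at hand, here is a self-contained sketch of the pattern under discussion: a map keyed by entity type whose values are getters containing a cast, plus the guard suggested above. Every name here (`GuardedChildLookupSketch`, `Entity`, `Table`, `Impostor`, `ChildSpec`, `SPECS`) is hypothetical, the getter is written as a lambda with an explicit cast as an assumption about the general shape, and the sketch returns an empty list where the PR code returns null.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a type-keyed getter registry with a fail-safe guard. */
final class GuardedChildLookupSketch {

  interface Entity {
    String entityType();
  }

  record Column(String name) {}

  record Table(List<Column> columns) implements Entity {
    @Override
    public String entityType() {
      return "table";
    }
  }

  /** Deliberately inconsistent: reports "table" but is not a Table. */
  record Impostor() implements Entity {
    @Override
    public String entityType() {
      return "table";
    }
  }

  record ChildSpec(Function<Entity, List<String>> childGetter) {}

  /** The cast inside the lambda is only safe while the map key matches the runtime type. */
  static final Map<String, ChildSpec> SPECS = Map.of(
      "table",
      new ChildSpec(e -> ((Table) e).columns().stream().map(Column::name).toList()));

  /** Returns child names, or an empty list when no spec exists or the cast fails. */
  static List<String> childNames(Entity entity) {
    ChildSpec spec = SPECS.get(entity.entityType());
    if (spec == null) {
      return Collections.emptyList();
    }
    try {
      return spec.childGetter().apply(entity);
    } catch (ClassCastException e) {
      // Log and continue so one inconsistent entity does not abort a batch.
      System.err.println("Type mismatch for " + entity.entityType() + ": " + e.getMessage());
      return Collections.emptyList();
    }
  }

  public static void main(String[] args) {
    System.out.println(childNames(new Table(List.of(new Column("id"), new Column("total")))));
    System.out.println(childNames(new Impostor()));
  }
}
```

Catching only ClassCastException keeps the guard narrow: other runtime failures inside a getter still surface instead of being silently swallowed, which the old catch-all did not distinguish.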

if (spec == null) {
  return null;
}
List<String> childNames = readChildNames(spec.childGetter().apply(entity));



@lautel lautel merged commit 4cf116f into main Apr 23, 2026
55 checks passed
@lautel lautel deleted the update-text-to-embed branch April 23, 2026 07:47
3 participants