
feat(ingestion): incremental metadata extraction for Unity Catalog #28380

Open: ulixius9 wants to merge 1 commit into main from unitycatalog-incremental-metadata

@ulixius9
Member

Description

Adds incremental metadata extraction to the Unity Catalog connector, matching the existing Snowflake/BigQuery pattern. After a first full run, only tables changed since the last successful run are processed, and dropped tables are detected explicitly.

  • Changed tables: server-side filter on information_schema.tables.last_altered, then each changed table is fetched individually via the Databricks SDK (tables.get) — avoids enumerating the whole catalog every run.
  • Deleted tables: detected from system.access.audit deleteTable events (request_params.full_name_arg). Degrades gracefully (warns and skips delete detection) when the system.access schema is not available.
  • A table present in both the changed and deleted sets (dropped and recreated within the window) is kept, not deleted — information_schema only lists tables that currently exist.
  • Reuses the shared IncrementalConfig framework; no JSON schema change (the incremental config already applies to every database connector via sourceConfig.config.incremental).
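
The dropped-and-recreated rule in the list above can be sketched as a set difference. This is a minimal illustration, not the connector's code: the `reconcile` helper and the `schema.table` string keys are hypothetical names chosen for the example.

```python
from typing import Set, Tuple


def reconcile(changed: Set[str], deleted: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (tables_to_process, tables_to_delete) for one catalog.

    Hypothetical helper: a table that appears in both sets was dropped and
    recreated inside the window. information_schema only lists tables that
    exist now, so such a table is kept and reprocessed rather than deleted.
    """
    to_delete = deleted - changed
    return set(changed), to_delete


changed = {"sales.orders", "sales.customers"}
deleted = {"sales.orders", "sales.legacy"}
to_process, to_delete = reconcile(changed, deleted)
print(sorted(to_process), sorted(to_delete))
```

Here `sales.orders` shows up in both sets, so it is processed as a changed table, and only `sales.legacy` is marked as deleted.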

Type of change

  • New feature

How was this tested

  • Unit: new test_unitycatalog_incremental.py (17 cases) covering the processor's parsing and graceful degradation, the incremental vs. full discovery path, delete handling, the dropped-and-recreated exclusion, the mark_tables_as_deleted incremental/full branches, and create() wiring. Full Unity Catalog + incremental suite is green (42 tests).
  • Live: validated end-to-end against a real Unity Catalog workspace + OpenMetadata server — change detection, audit-log delete detection, the recreate edge case, and an incremental workflow run that processed only the changed table while leaving unchanged tables intact (none wrongly deleted).

🤖 Generated with Claude Code

Detect changed tables via information_schema.tables.last_altered and deleted tables via system.access.audit deleteTable events, mirroring the Snowflake/BigQuery incremental pattern. Degrades to no-op delete detection when the audit schema is unavailable. Tables present in both the changed and deleted sets (dropped and recreated within the window) are kept rather than deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ulixius9 requested a review from a team as a code owner May 22, 2026 14:34
Copilot AI review requested due to automatic review settings May 22, 2026 14:34
github-actions Bot added the Ingestion and safe to test labels May 22, 2026
AND action_name = 'deleteTable'
AND event_date >= date(timestamp_millis({start_timestamp}))
AND event_time >= timestamp_millis({start_timestamp})
AND request_params.full_name_arg LIKE '{catalog}.%%'

⚠️ Bug: SQL LIKE treats _ in catalog name as single-char wildcard

In UNITY_CATALOG_GET_DELETED_TABLES, the filter request_params.full_name_arg LIKE '{catalog}.%%' uses the catalog name directly in a LIKE pattern. Since _ is a single-character wildcard in SQL, a catalog named e.g. my_catalog will also match myXcatalog.schema.table or any other catalog where the underscore position has an arbitrary character. This could cause tables from other catalogs to be incorrectly marked as deleted.

Unity Catalog catalog names commonly contain underscores, so this is a realistic scenario.

Fix 1: Use ESCAPE clause -- but this only helps if you also escape _ and % in the catalog name at format-time.
AND request_params.full_name_arg LIKE '{catalog}.%%' ESCAPE '\'
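Fix 2: Use an exact match on the first segment of the dot-delimited name instead of LIKE. This avoids wildcard issues entirely and is clearer in intent. Note that split takes a regular expression as its delimiter in Databricks SQL, so the dot is written as a character class here; a bare '.' would match every character.
AND split(request_params.full_name_arg, '[.]')[0] = '{catalog}'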
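
The wildcard problem and the format-time escaping that Fix 1 depends on can be reproduced locally. The sketch below uses SQLite as a stand-in for Databricks SQL, which is an assumption made only so the example runs anywhere: string-literal and ESCAPE handling may differ on Databricks. The `escape_like` and `matching_names` helpers and the `audit` table are hypothetical.

```python
import sqlite3


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape the escape character first, then the LIKE wildcards % and _."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def matching_names(conn: sqlite3.Connection, pattern: str, use_escape: bool) -> list:
    """Return the names matching a LIKE pattern, optionally with ESCAPE '\\'."""
    sql = "SELECT name FROM audit WHERE name LIKE ?"
    if use_escape:
        sql += " ESCAPE '\\'"
    return sorted(row[0] for row in conn.execute(sql, (pattern,)))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (name TEXT)")
conn.executemany(
    "INSERT INTO audit VALUES (?)",
    [("my_catalog.sales.orders",), ("myXcatalog.sales.orders",)],
)

# Unescaped: the underscore acts as a single-character wildcard.
naive = matching_names(conn, "my_catalog.%", use_escape=False)
# Escaped: only the literal catalog name matches.
escaped = matching_names(conn, escape_like("my_catalog") + ".%", use_escape=True)
print(naive)
print(escaped)
```

The unescaped pattern returns rows from both catalogs, while the escaped pattern returns only the `my_catalog` row. Adding the ESCAPE clause without escaping the catalog name would leave the first behavior unchanged.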

@gitar-bot

gitar-bot Bot commented May 22, 2026

Code Review ⚠️ Changes requested 0 resolved / 1 findings

Implements incremental metadata extraction for the Unity Catalog connector, but SQL LIKE in UNITY_CATALOG_GET_DELETED_TABLES incorrectly treats underscores in catalog names as single-character wildcards.

⚠️ Bug: SQL LIKE treats _ in catalog name as single-char wildcard

📄 ingestion/src/metadata/ingestion/source/database/unitycatalog/queries.py:124


Contributor

Copilot AI left a comment

Pull request overview

This PR adds incremental metadata extraction support to the Unity Catalog ingestion connector, aligning it with the existing incremental framework used by other database connectors. After an initial full run, the source can fetch only tables changed since the last successful workflow run and explicitly mark dropped tables as deleted.

Changes:

  • Add Unity Catalog incremental discovery path that fetches only changed tables (via information_schema.tables.last_altered) and tracks deleted tables (via system.access.audit deleteTable events).
  • Introduce a dedicated UnityCatalogIncrementalTableProcessor to populate changed/deleted table maps per catalog.
  • Add unit tests covering changed/deleted detection, graceful degradation when system schemas aren’t available, and incremental vs full-path behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File: Description
  • ingestion/src/metadata/ingestion/source/database/unitycatalog/metadata.py: Wires IncrementalConfig into the Unity Catalog source, adds incremental table listing + explicit delete marking.
  • ingestion/src/metadata/ingestion/source/database/unitycatalog/queries.py: Adds SQL templates to detect changed tables and deleted tables since the incremental watermark.
  • ingestion/src/metadata/ingestion/source/database/unitycatalog/incremental_table_processor.py: New helper to execute the incremental queries and bucket results into per-schema changed/deleted sets.
  • ingestion/tests/unit/topology/database/test_unitycatalog_incremental.py: New unit tests validating incremental processor parsing, degradation, and source incremental flow.

Comment on lines +407 to +410
self.status.failed(
    StackTraceError(
        name=table.name,
        error=f"Unexpected exception to get table [{table.name}]: {exc}",
Comment on lines +83 to +86
table_map: SchemaToTables = {}
try:
    rows = self.connection.execute(text(query.format(catalog=catalog, start_timestamp=start_timestamp)))
    for row in rows or []:
Comment on lines +106 to +126
UNITY_CATALOG_GET_CHANGED_TABLES = textwrap.dedent(
    """
    SELECT
        table_schema,
        table_name
    FROM `{catalog}`.information_schema.tables
    WHERE last_altered >= timestamp_millis({start_timestamp})
    """
)

UNITY_CATALOG_GET_DELETED_TABLES = textwrap.dedent(
    """
    SELECT DISTINCT request_params.full_name_arg AS table_full_name
    FROM system.access.audit
    WHERE service_name = 'unityCatalog'
        AND action_name = 'deleteTable'
        AND event_date >= date(timestamp_millis({start_timestamp}))
        AND event_time >= timestamp_millis({start_timestamp})
        AND request_params.full_name_arg LIKE '{catalog}.%%'
    """
)
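
One way to bucket the `table_full_name` values returned by the deleted-tables query into per-schema sets is sketched below. It is an illustration under stated assumptions rather than the PR's processor: the `bucket_deleted_tables` helper is hypothetical, the `SchemaToTables` alias is assumed to be a mapping from schema name to a set of table names, and identifiers are assumed not to contain dots. Comparing the catalog segment exactly on the client side also guards against rows that a loose LIKE pattern lets through.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

# Assumed shape of the alias used in the snippet above.
SchemaToTables = Dict[str, Set[str]]


def bucket_deleted_tables(
    full_names: Iterable[Optional[str]], catalog: str
) -> SchemaToTables:
    """Group 'catalog.schema.table' names by schema for a single catalog."""
    table_map = defaultdict(set)
    for full_name in full_names:
        parts = (full_name or "").split(".")
        if len(parts) != 3:
            # Skip NULLs and names that are not three-part identifiers.
            continue
        row_catalog, schema, table = parts
        if row_catalog != catalog:
            # Exact comparison: no wildcard semantics, unlike LIKE.
            continue
        table_map[schema].add(table)
    return dict(table_map)


rows = [
    "my_catalog.sales.orders",
    "my_catalog.sales.legacy",
    "myXcatalog.sales.orders",  # would match LIKE 'my_catalog.%'
    None,
    "not_a_full_name",
]
deleted_by_schema = bucket_deleted_tables(rows, "my_catalog")
print(deleted_by_schema)
```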
@sonarqubecloud

Quality Gate failed for 'open-metadata-ingestion'

Failed conditions
0.0% Coverage on New Code (required ≥ 20%)

See analysis details on SonarQube Cloud

@github-actions
Contributor

🟡 Playwright Results — all passed (16 flaky)

✅ 4239 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 87 skipped

Shard       Passed  Failed  Flaky  Skipped
🟡 Shard 1  298     0       1      4
🟡 Shard 2  797     0       6      8
🟡 Shard 3  795     0       5      8
🟡 Shard 4  838     0       3      12
✅ Shard 5  719     0       0      47
🟡 Shard 6  792     0       1      8
🟡 16 flaky test(s) (passed on retry)
  • Features/TeamsDragAndDrop.spec.ts › Should drag and drop on Division team type (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Table (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Keyboard Delete selection (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/OntologyExplorerCardinality.spec.ts › edges for cardinality-typed relations appear in the graph edge data (shard 3, 2 retries)
  • Features/OntologyExplorerE2E.spec.ts › toggling edge labels off and back on leaves the graph and cardinality map intact (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Should display custom properties for apiCollection in right panel (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Domain owner should able to edit description of domain (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace
