fix: resolve path-based lineage for Databricks external tables (#27561) #27648

Open
ShivamChavan01 wants to merge 6 commits into open-metadata:main from ShivamChavan01:fix/databricks-external-table-path-lineage-27561

Conversation

@ShivamChavan01

Describe your changes:

Fixes #27561

External tables in Databricks are referenced using cloud storage paths (e.g. delta.`abfss://...`) instead of table names. In this case, Databricks system tables populate source_path/target_path and leave source_table_full_name/target_table_full_name as null. The lineage processor was filtering out these rows entirely, resulting in missing lineage for all external tables.

Changes:

  • databricks/queries.py + unitycatalog/queries.py: Added source_path and target_path to SELECT; relaxed WHERE filter from hard IS NOT NULL on name columns to (name IS NOT NULL OR path IS NOT NULL)
  • databricks/client.py: Pass source_path and target_path through the lineage cache dict
  • unitycatalog/lineage.py: Build a reverse path → table_fqn map from the external locations cache; fall back to path resolution when full_name is null; ensure _cache_external_locations() runs before _cache_lineage() so the reverse map is available
  • test_unity_catalog_lineage.py: Updated mock row definitions to include path fields; added tests for path resolution, unresolvable path skipping, and reverse map construction
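
The path fallback described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the shape of the external locations cache (a dict of table FQN to storage path), the trailing-slash normalization, and the names `normalize_path`, `build_path_to_fqn`, and `resolve_table` are all assumptions made for the example.

```python
from typing import Dict, Optional


def normalize_path(path: str) -> str:
    """Strip whitespace and trailing slashes (an assumed normalization)."""
    return path.strip().rstrip("/")


def build_path_to_fqn(external_locations: Dict[str, str]) -> Dict[str, str]:
    """Invert a hypothetical {table_fqn: storage_path} cache into {storage_path: table_fqn}."""
    return {
        normalize_path(path): fqn
        for fqn, path in external_locations.items()
        if path
    }


def resolve_table(
    full_name: Optional[str],
    path: Optional[str],
    path_to_fqn: Dict[str, str],
) -> Optional[str]:
    """Prefer the table name; fall back to the path; return None if neither resolves."""
    if full_name:
        return full_name
    if path:
        return path_to_fqn.get(normalize_path(path))
    return None


# Hypothetical cache entry and lineage rows, for illustration only.
locations = {"cat.schema.orders": "abfss://c@acct.dfs.core.windows.net/orders/"}
path_map = build_path_to_fqn(locations)

resolved = resolve_table(None, "abfss://c@acct.dfs.core.windows.net/orders", path_map)
unresolved = resolve_table(None, "abfss://c@acct.dfs.core.windows.net/unknown", path_map)
```

A row whose path is not in the map resolves to None and would be skipped rather than cached, which matches the "unresolvable path skipping" case mentioned in the test changes.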

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes #27561: resolve path-based lineage for Databricks external tables
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added a test that covers the exact scenario we are fixing.

@ShivamChavan01 ShivamChavan01 requested a review from a team as a code owner April 23, 2026 02:40
@github-actions

Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/databricks/queries.py Outdated
…-FQN resolution

Reverts the path-based fallback in DATABRICKS_GET_TABLE_LINEAGE and
DATABRICKS_GET_COLUMN_LINEAGE queries since DatabricksClient lacks
the external_path_to_fqn map needed to resolve paths to FQNs.

Without this map, relaxing the IS NOT NULL constraints creates dict keys
containing None values that never match downstream lookups.
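
To illustrate why the relaxed filter is a problem without a path map, here is a small sketch using an in-memory SQLite table. The table name `table_lineage`, the sample rows, and the placeholder path are hypothetical stand-ins, and the two queries are simplified versions of the strict and relaxed filters described in this PR, not the actual Databricks queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_lineage ("
    "source_table_full_name TEXT, target_table_full_name TEXT, "
    "source_path TEXT, target_path TEXT)"
)
conn.executemany(
    "INSERT INTO table_lineage VALUES (?, ?, ?, ?)",
    [
        # Managed table: both names are populated.
        ("cat.schema.src", "cat.schema.tgt", None, None),
        # External table read by path: the source name is NULL.
        (None, "cat.schema.tgt", "abfss://example/path", None),
    ],
)

STRICT = (
    "SELECT source_table_full_name, target_table_full_name FROM table_lineage "
    "WHERE source_table_full_name IS NOT NULL "
    "AND target_table_full_name IS NOT NULL"
)
RELAXED = (
    "SELECT source_table_full_name, target_table_full_name FROM table_lineage "
    "WHERE (source_table_full_name IS NOT NULL OR source_path IS NOT NULL) "
    "AND (target_table_full_name IS NOT NULL OR target_path IS NOT NULL)"
)


def cache_keys(query: str) -> set:
    """Build (source, target) keys the way a name-only cache would."""
    return {(src, tgt) for src, tgt in conn.execute(query)}


strict_keys = cache_keys(STRICT)
relaxed_keys = cache_keys(RELAXED)
```

With the relaxed filter, the key set gains `(None, "cat.schema.tgt")`. A lookup by real table names never produces that key, so the entry is dead weight unless the path is resolved to an FQN first.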

@ulixius9 ulixius9 added the safe to test label Apr 25, 2026
@github-actions

Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@ulixius9

Member

@ShivamChavan01

#27561 (comment)
can you attach screenshot of how lineage was looking before your fix and how after your fix this is resolved

@ShivamChavan01

Author

@ShivamChavan01

#27561 (comment)
can you attach screenshot of how lineage was looking before your fix and how after your fix this is resolved

Sure, I'll attach it.

@github-actions

github-actions Bot commented Apr 25, 2026

Contributor

🟡 Playwright Results — all passed (25 flaky)

✅ 4230 passed · ❌ 0 failed · 🟡 25 flaky · ⏭️ 87 skipped

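| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 796 | 0 | 7 | 8 |
| 🟡 Shard 3 | 795 | 0 | 5 | 8 |
| 🟡 Shard 4 | 836 | 0 | 5 | 12 |
| 🟡 Shard 5 | 718 | 0 | 1 | 47 |
| 🟡 Shard 6 | 786 | 0 | 7 | 8 |
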
🟡 25 flaky test(s) (passed on retry)
  • Features/BulkEditEntity.spec.ts › Table (shard 2, 1 retry)
  • Features/BulkImport.spec.ts › Keyboard Delete selection (shard 2, 1 retry)
  • Features/ColumnBulkOperations.spec.ts › should show disabled edit button when no columns are selected (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Complete Incident lifecycle with table owner (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test bookmark functionality (shard 3, 1 retry)
  • Features/OntologyExplorerCardinality.spec.ts › stats reflect the cardinality-typed edges in the relation count (shard 3, 1 retry)
  • Features/OntologyExplorerE2E.spec.ts › toggling edge labels off and back on leaves the graph and cardinality map intact (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Tags term should be consistent for search (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Table (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Time (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Integer (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Glossary & terms creation for reviewer as user (shard 6, 1 retry)
  • Pages/Glossary.spec.ts › Add, Update and Verify Data Glossary Term (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for dashboard -> mlModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/Users.spec.ts › Create and Delete user (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@ayush-shah

Member

Thanks for the PR. This needs to be updated against the latest main before we can move it forward.

Could you please rebase on main, resolve any conflicts if present, push the updated branch, and let CI rerun? Once the checks are green, we can re-check merge readiness.

@ShivamChavan01

Author

Thanks for the PR. This needs to be updated against the latest main before we can move it forward.

Could you please rebase on main, resolve any conflicts if present, push the updated branch, and let CI rerun? Once the checks are green, we can re-check merge readiness.

Sure, sorry, I got busy with some other things. I'll do it ASAP.

Comment thread ingestion/src/metadata/ingestion/source/database/unitycatalog/lineage.py Outdated

@gitar-bot

gitar-bot Bot commented May 22, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Resolves Databricks lineage gaps by enabling path-based resolution for external tables and updating query filters. Addresses issues with path fallback in lineage caching and removes residual merge conflict markers.

✅ 2 resolved
Bug: DatabricksClient column lineage caching ignores path fallback

📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:370-379
📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:107-121
📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:348-355
📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:90-104

The DATABRICKS_GET_COLUMN_LINEAGE query was relaxed to allow rows where source_table_full_name or target_table_full_name is NULL (as long as the corresponding path is not null). However, the cache_lineage() method in client.py (lines 370-379) still directly uses row.source_table_full_name and row.target_table_full_name without any path-based fallback. This means:

  1. Column lineage rows for external tables will create dict keys containing None (e.g., (None, 'cat.schema.target')), which won't match any downstream lookup.
  2. These phantom entries silently pollute entity_column_lineage and will never produce useful lineage.

The same path-resolution logic added to unitycatalog/lineage.py should be applied here, or the column lineage query's WHERE clause should retain the IS NOT NULL filter on table name columns (as done before this PR) since there's no external_path_to_fqn map available in DatabricksClient.

Bug: Unresolved merge conflict markers in production code

📄 ingestion/src/metadata/ingestion/source/database/unitycatalog/lineage.py:115-129
📄 ingestion/src/metadata/ingestion/source/database/unitycatalog/lineage.py:156-170
📄 ingestion/src/metadata/ingestion/source/database/unitycatalog/lineage.py:190-199
📄 ingestion/tests/unit/topology/database/test_unity_catalog_lineage.py:93-105
📄 ingestion/tests/unit/topology/database/test_unity_catalog_lineage.py:135-147

The file lineage.py contains unresolved git merge conflict markers (<<<<<<<, =======, >>>>>>>) at lines 115-138, 156-170, and 190-199. Similarly, test_unity_catalog_lineage.py has conflict markers at lines 93-105 and 135-147.

This will cause a SyntaxError at import time, making the entire Unity Catalog lineage module non-functional. The merge of main into the feature branch was not resolved before pushing.

Fix: Resolve all merge conflicts by choosing the appropriate code from each side (likely the feature branch side that includes path-based fallback logic) and remove all conflict markers.
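
As a rough illustration of this finding, the sketch below shows that Python refuses to compile source containing conflict markers, along with one way to locate them before pushing. The helper names `find_conflict_markers` and `imports_cleanly` and the sample source string are hypothetical and not part of the repository.

```python
import re
from typing import List

# A conflict marker is seven '<', '=' or '>' characters at the start of a line,
# followed by a space or the end of the line.
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7})( |$)")


def find_conflict_markers(source: str) -> List[int]:
    """Return the 1-based line numbers that start with a git conflict marker."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if CONFLICT_MARKER.match(line)
    ]


def imports_cleanly(source: str) -> bool:
    """Report whether the source compiles, which importing the module requires."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Hypothetical file content left behind by an unresolved merge.
conflicted = "<<<<<<< HEAD\nvalue = 1\n=======\nvalue = 2\n>>>>>>> main\n"

marker_lines = find_conflict_markers(conflicted)
compiles = imports_cleanly(conflicted)
```

A check like this can produce false positives on a line of exactly seven `=` characters inside a docstring, so it is a quick pre-push guard rather than a substitute for resolving the conflicts by hand.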

@sonarqubecloud

Quality Gate failed for 'open-metadata-ingestion'

Failed conditions
0.0% Coverage on New Code (required ≥ 20%)

See analysis details on SonarQube Cloud

Labels

safe to test Add this label to run secure Github workflows on PRs

Development

Successfully merging this pull request may close these issues.

[BUG] Lineage Databricks is not performed for external tables using path-based queries.

3 participants