Skip to content

feat(hive): Add partition key detection via DESCRIBE FORMATTED #26712#27278

Open
mohitjeswani01 wants to merge 6 commits intoopen-metadata:mainfrom
mohitjeswani01:feat/26712-hive-partition-keys
Open

feat(hive): Add partition key detection via DESCRIBE FORMATTED #26712#27278
mohitjeswani01 wants to merge 6 commits intoopen-metadata:mainfrom
mohitjeswani01:feat/26712-hive-partition-keys

Conversation

@mohitjeswani01
Copy link
Copy Markdown

Fixes #26712

Describe your changes:

Added get_table_partition_details() to HiveSource to identify Hive partition key columns and expose them as TablePartition metadata.

  • Runs DESCRIBE FORMATTED via HiveServer2 engine to fetch table description
  • Parses the # Partition Information section to extract partition key column names
  • Stops collection when the next section header is encountered — prevents metadata rows like Database: and Owner: from being included as partition keys
  • Guards against metastoreConnection — skips engine-based detection when a metastore is configured
  • Returns TablePartition with PartitionColumnDetails using COLUMN_VALUE interval type

Type of change:

  • New feature

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is feat(hive): Add partition key detection via DESCRIBE FORMATTED #26712
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Tests:

  • I have added tests around the new logic.
Test Covers
test_partition_keys_extracted_correctly Correct extraction, metadata rows excluded
test_no_partition_section_returns_false Non-partitioned table returns (False, None)
test_metastore_connection_skips_detection Engine never called when metastore configured
test_engine_exception_returns_false Graceful failure on engine error

Copilot AI review requested due to automatic review settings April 11, 2026 12:05
@mohitjeswani01 mohitjeswani01 requested a review from a team as a code owner April 11, 2026 12:05
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Hive partition key detection by parsing DESCRIBE FORMATTED output so partition columns can be exposed as TablePartition metadata during ingestion.

Changes:

  • Added HiveSource.get_table_partition_details() that runs DESCRIBE FORMATTED and parses the # Partition Information section into TablePartition.
  • Added unit tests covering successful extraction, missing partition section, metastore-skip behavior, and engine errors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
ingestion/src/metadata/ingestion/source/database/hive/metadata.py Implements partition-key extraction via DESCRIBE FORMATTED and maps results to TablePartition.
ingestion/tests/unit/topology/database/test_hive.py Adds unit tests for the new Hive partition parsing logic and skip/error behaviors.

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py
Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Comment thread ingestion/tests/unit/topology/database/test_hive.py
Comment thread ingestion/tests/unit/topology/database/test_hive.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@mohitjeswani01
Copy link
Copy Markdown
Author

All bot review comments from Gitar Bot and Copilot have been addressed in commit b2d7645. Summary of fixes:

  1. self.connection_configself.service_connection (AttributeError fix)
  2. Added identifier_preparer.quote() for safe identifier escaping (SQL injection fix)
  3. Added # pylint: disable=unused-argument for unused inspector parameter
  4. Mock()MagicMock() with __getitem__.side_effect in test helper
  5. Test fixtures updated to use service_connection consistently

@harshach @pmbrull could you please add the safe-to-test label? Thank you!

@PubChimps PubChimps added the safe to test Add this label to run secure Github workflows on PRs label Apr 16, 2026
@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 16, 2026

🟡 Playwright Results — all passed (10 flaky)

✅ 3987 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 295 0 4 4
🟡 Shard 2 752 0 2 8
🟡 Shard 3 746 0 1 7
🟡 Shard 4 774 0 1 18
✅ Shard 5 687 0 0 41
🟡 Shard 6 733 0 2 8
🟡 10 flaky test(s) (passed on retry)
  • Flow/Tour.spec.ts › Tour should work from help section (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from welcome screen (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from URL directly (shard 1, 1 retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event is created when description is updated (shard 2, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Description Rule Is_Not_Set (shard 4, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

- Fix self.connection_config → self.service_connection
- Add identifier_preparer quoting to prevent SQL injection
- Add pylint disable for unused inspector argument
- Fix Mock → MagicMock in tests for __getitem__ support
- Replace raw getattr truthiness check with _get_validated_metastore_connection()
- Use MagicMock for engine to support context manager protocol
- Patch _get_validated_metastore_connection in skip detection test
- All 40 tests pass locally (pytest -v)
@mohitjeswani01 mohitjeswani01 force-pushed the feat/26712-hive-partition-keys branch from 61b08af to de33001 Compare April 30, 2026 18:11
Copilot AI review requested due to automatic review settings April 30, 2026 18:11
Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py
return None. Without the guard, dialect.identifier_preparer raises a
misleading AttributeError silently caught by the broad except block.
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 30, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Implements partition key detection using DESCRIBE FORMATTED with sanitized query inputs and robust attribute access. Resolved potential SQL injection risks, attribute errors, and parsing inconsistencies.

✅ 4 resolved
Bug: self.connection_config does not exist — will raise AttributeError

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:163
self.connection_config is not defined anywhere in HiveSource or its parent class hierarchy (CommonDbSourceService, DatabaseServiceSource, etc.). The correct attribute is self.service_connection, as used elsewhere in the same file (e.g., line 97: self.service_connection.metastoreConnection).

Because line 163 is outside the try/except block (which starts at line 174), accessing self.connection_config will raise an uncaught AttributeError on every call to get_table_partition_details, crashing partition detection for every table.

Note: the unit tests don't catch this because they use Mock() as self.source, which auto-creates any attribute on access.

Security: SQL injection via f-string in DESCRIBE FORMATTED query

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:176-180
The table_name and schema_name parameters are interpolated directly into the SQL string via f-string (line 178). While backtick-quoting provides some protection, a table or schema name containing a backtick (e.g., `; DROP TABLE --) would break out of the quoting.

In practice, these values originate from Hive metadata introspection so the risk is limited, but other sources in the codebase (e.g., Postgres) use parameterized queries for the same pattern. Since DESCRIBE FORMATTED is a HiveQL statement that doesn't support bind parameters, consider sanitizing the inputs by rejecting or escaping backtick characters.

Edge Case: Parsing skips # col_name header only by coincidence

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:186-196
After entering the partition section, the parser skips the # col_name sub-header row (e.g., ('# col_name', 'data_type', 'comment')) because it starts with # and partition_keys is still empty. This works correctly but is fragile — if # col_name were ever absent (unlikely but possible with different Hive versions), the logic would still function. However, if there were two consecutive # headers before any partition key, the second would also be silently skipped. The logic is functional but a comment explaining this dependency would improve maintainability.

Bug: AttributeError when dialect is None from getattr fallback

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:170-171
On line 170, getattr(self.engine, "dialect", None) can return None (e.g., if self.engine was released via _release_engine setting it to None, or in an unexpected state). Line 171 then dereferences dialect.identifier_preparer unconditionally, which will raise AttributeError: 'NoneType' object has no attribute 'identifier_preparer'.

While this is inside a broad except Exception block (line 192) so it won't crash the pipeline, it will silently skip partition detection with a misleading warning message about "Failed to get partition details" rather than indicating the actual root cause. The getattr with a None default suggests the author anticipated dialect might be absent, but didn't follow through with the guard.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@mohitjeswani01
Copy link
Copy Markdown
Author

hi @PubChimps , @ulixius9 , @harshach sir i have solved all the merge conflicts here and also recently checked the bots comments and passed the test cases locally could you please give this pr a re-review ? thanks 🙏

image

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add metadata flag to identify Hive partition keys

3 participants