Skip to content

feat(hive): add metadata flag to identify partition keys (#26712)#27029

Open
SaaiAravindhRaja wants to merge 17 commits intoopen-metadata:mainfrom
SaaiAravindhRaja:feat/issue-26712-hive-partition-key-flag
Open

feat(hive): add metadata flag to identify partition keys (#26712)#27029
SaaiAravindhRaja wants to merge 17 commits intoopen-metadata:mainfrom
SaaiAravindhRaja:feat/issue-26712-hive-partition-key-flag

Conversation

@SaaiAravindhRaja
Copy link
Copy Markdown
Contributor

@SaaiAravindhRaja SaaiAravindhRaja commented Apr 4, 2026

Fixes #26712

Marks Hive partition columns with a metadata flag so users can distinguish partitioning columns from regular data columns in the schema view.


Summary by Gitar

  • Enhanced partition discovery:
    • Added fallback to SHOW PARTITIONS in _get_partition_keys_from_describe when standard metadata extraction fails.
  • Improved SQL reliability:
    • Encapsulated identifier quoting into a reusable quote_identifier helper in metadata.py.
  • Test coverage:
    • Added unit tests for SHOW PARTITIONS parsing and updated mock configurations for DESCRIBE FORMATTED tests.

This will update automatically on new commits.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
@SaaiAravindhRaja SaaiAravindhRaja changed the title feat(hive): add metadata flag to identify partition keys (#26712) feat(hive): add metadata flag to identify partition keys (#26712) [WIP] Apr 4, 2026
@SaaiAravindhRaja SaaiAravindhRaja marked this pull request as ready for review April 4, 2026 11:16
@SaaiAravindhRaja SaaiAravindhRaja requested a review from a team as a code owner April 4, 2026 11:16
Copilot AI review requested due to automatic review settings April 4, 2026 11:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Hive-specific partition key extraction during ingestion so Hive tables can be marked as partitioned and expose tablePartition details, enabling the UI to distinguish partition columns from regular data columns.

Changes:

  • Import TablePartition / PartitionColumnDetails schema types for partition metadata.
  • Add HiveSource.get_table_partition_details() that runs DESCRIBE FORMATTED and builds a TablePartition from detected partition keys.

Comment on lines +188 to +205
with self.engine.connect() as conn:
rows = conn.execute(
text(f"DESCRIBE FORMATTED `{schema_name}`.`{table_name}`")
)
for row in rows:
col_name = row[0].strip() if row[0] else ""
if col_name == "# Partition Information":
in_partition_section = True
continue
if in_partition_section:
if col_name.startswith("#") or not col_name:
continue
partition_keys.append(col_name)
except Exception as exc:
logger.debug(traceback.format_exc())
logger.warning(
f"Failed to get partition details for {schema_name}.{table_name}: {exc}"
)
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When metastoreConnection is configured, prepare() replaces self.engine with a MySQL/Postgres metastore engine. Executing DESCRIBE FORMATTED against that engine will fail (and will likely log a warning per table), meaning partition details won’t be ingested for metastore-based setups. This method should either (a) query partition keys from the metastore tables (e.g., PARTITION_KEYS joined with TBLS/DBS), or (b) keep a separate HiveServer2 engine for HiveQL statements and use that here.

Copilot uses AI. Check for mistakes.
if col_name == "# Partition Information":
in_partition_section = True
continue
if in_partition_section:
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parsing logic never exits the “Partition Information” section, so it can accidentally treat later non-column rows (e.g., table metadata like Database:/Owner: that appear after the partition block in DESCRIBE FORMATTED) as partition keys. Consider stopping collection when you hit the first blank row after reading at least one partition key, or when a new section header like # Detailed Table Information is encountered.

Suggested change
if in_partition_section:
if in_partition_section:
if partition_keys and (col_name.startswith("#") or not col_name):
break

Copilot uses AI. Check for mistakes.
Comment on lines +174 to +220
def get_table_partition_details(
self,
table_name: str,
schema_name: str,
inspector: Inspector,
) -> Tuple[bool, Optional[TablePartition]]:
"""
Extract partition key columns from DESCRIBE FORMATTED output.
Returns (is_partitioned, TablePartition) where TablePartition lists all
partition key columns with COLUMN_VALUE interval type.
"""
partition_keys: List[str] = []
in_partition_section = False
try:
with self.engine.connect() as conn:
rows = conn.execute(
text(f"DESCRIBE FORMATTED `{schema_name}`.`{table_name}`")
)
for row in rows:
col_name = row[0].strip() if row[0] else ""
if col_name == "# Partition Information":
in_partition_section = True
continue
if in_partition_section:
if col_name.startswith("#") or not col_name:
continue
partition_keys.append(col_name)
except Exception as exc:
logger.debug(traceback.format_exc())
logger.warning(
f"Failed to get partition details for {schema_name}.{table_name}: {exc}"
)
return False, None

if not partition_keys:
return False, None

return True, TablePartition(
columns=[
PartitionColumnDetails(
columnName=key,
intervalType=PartitionIntervalTypes.COLUMN_VALUE,
interval=None,
)
for key in partition_keys
]
)
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new behavior (Hive partition key detection + tablePartition population) isn’t covered by unit tests. There are existing Hive unit tests in ingestion/tests/unit/topology/database/test_hive.py; adding a test that stubs the DESCRIBE FORMATTED result (and a metastoreConnection case) would prevent regressions and verify that only actual partition key rows are returned.

Copilot generated this review using guidance from repository custom instructions.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SaaiAravindhRaja valid comment

SaaiAravindhRaja added a commit to SaaiAravindhRaja/OpenMetadata that referenced this pull request Apr 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@SaaiAravindhRaja SaaiAravindhRaja changed the title feat(hive): add metadata flag to identify partition keys (#26712) [WIP] feat(hive): add metadata flag to identify partition keys (#26712) Apr 5, 2026
@ulixius9
Copy link
Copy Markdown
Member

ulixius9 commented Apr 6, 2026

@SaaiAravindhRaja can you check the copilot comments

@SaaiAravindhRaja
Copy link
Copy Markdown
Contributor Author

@ulixius9 Noted! Will make the necessary updates asap

@Esvanth
Copy link
Copy Markdown

Esvanth commented Apr 8, 2026

@m0hitbansal I have pushed a pr please review I would like to work on this issue please assign me

Copilot AI review requested due to automatic review settings April 12, 2026 13:01
@SaaiAravindhRaja SaaiAravindhRaja force-pushed the feat/issue-26712-hive-partition-key-flag branch from a6fd952 to 5398ec0 Compare April 12, 2026 13:01
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Comment on lines +214 to +229
else:
with self.engine.connect() as conn:
rows = conn.execute(
text(f"DESCRIBE FORMATTED `{schema_name}`.`{table_name}`")
)
for row in rows:
col_name = row[0].strip() if row[0] else ""
if col_name == "# Partition Information":
in_partition_section = True
continue
if in_partition_section:
if not col_name or col_name.startswith("# Detailed"):
break
if col_name.startswith("#"):
continue
partition_keys.append(col_name)
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests added here cover the metastore-backed path, but the non-metastore path (parsing DESCRIBE FORMATTED output) is currently untested. Since that branch is likely the default when metastoreConnection is not configured, add a test that mocks the DESCRIBE output rows and asserts the extracted partition keys (and a test for a non-partitioned table returning (False, None)).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SaaiAravindhRaja valid comment

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment on lines +189 to +228
if drivername in {"hive+mysql", "hive+postgres"}:
query = (
"""
SELECT pk.PKEY_NAME
FROM PARTITION_KEYS pk
JOIN TBLS tbl ON pk.TBL_ID = tbl.TBL_ID
JOIN DBS db ON tbl.DB_ID = db.DB_ID
WHERE db.NAME = :schema AND tbl.TBL_NAME = :table_name
ORDER BY pk.INTEGER_IDX
"""
if drivername == "hive+mysql"
else """
SELECT pk."PKEY_NAME"
FROM "PARTITION_KEYS" pk
JOIN "TBLS" tbl ON pk."TBL_ID" = tbl."TBL_ID"
JOIN "DBS" db ON tbl."DB_ID" = db."DB_ID"
WHERE db."NAME" = :schema AND tbl."TBL_NAME" = :table_name
ORDER BY pk."INTEGER_IDX"
"""
)
rows = self.connection.execute(
text(query),
{"table_name": table_name, "schema": schema_name},
).fetchall()
partition_keys = [row[0] for row in rows if row and row[0]]
else:
with self.engine.connect() as conn:
rows = conn.execute(
text(f"DESCRIBE FORMATTED `{schema_name}`.`{table_name}`")
)
for row in rows:
col_name = row[0].strip() if row[0] else ""
if col_name == "# Partition Information":
in_partition_section = True
continue
if in_partition_section:
if not col_name or col_name.startswith("# Detailed"):
break
if col_name.startswith("#"):
continue
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can split that into 2 distinct methods and have the calling handle in a 3rd method. Let's also not use fetchall()

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Copilot AI review requested due to automatic review settings April 25, 2026 15:39
@SaaiAravindhRaja SaaiAravindhRaja force-pushed the feat/issue-26712-hive-partition-key-flag branch from 8077416 to 2a23bc1 Compare April 25, 2026 15:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment on lines +568 to +572
def test_get_table_partition_details_non_partitioned_describe(self):
self.hive.engine = types.SimpleNamespace(
url=types.SimpleNamespace(drivername="hive")
)
mock_connection = Mock()
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DESCRIBE-based tests override self.hive.engine with a SimpleNamespace that only contains url.drivername. get_table_partition_details() now uses self.engine.dialect.identifier_preparer.quote_identifier(...) when drivername is not metastore, so this test setup will raise AttributeError (and the test may pass/fail for the wrong reason). Update the mocked engine to include a dialect.identifier_preparer with quote_identifier, or avoid replacing the whole engine and just patch self.hive.engine.url.drivername.

Copilot uses AI. Check for mistakes.
Comment on lines +586 to +590
def test_get_table_partition_details_from_describe_formatted(self):
self.hive.engine = types.SimpleNamespace(
url=types.SimpleNamespace(drivername="hive")
)
mock_connection = Mock()
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above: this test replaces self.hive.engine with a SimpleNamespace that lacks dialect.identifier_preparer, but the DESCRIBE FORMATTED path in get_table_partition_details() requires it for quoting. Without adding a mock dialect.identifier_preparer.quote_identifier, the code will error and get_table_partition_details() will return (False, None) from the exception handler, causing this test to fail (or not actually exercise the parsing logic).

Copilot uses AI. Check for mistakes.
Comment on lines +175 to +184
q = (
lambda name: f'"{name}"' if drivername == "hive+postgres" else name
) # noqa: E731
return (
f"SELECT pk.{q('PKEY_NAME')}"
f" FROM {q('PARTITION_KEYS')} pk"
f" JOIN {q('TBLS')} tbl ON pk.{q('TBL_ID')} = tbl.{q('TBL_ID')}"
f" JOIN {q('DBS')} db ON tbl.{q('DB_ID')} = db.{q('DB_ID')}"
f" WHERE db.{q('NAME')} = :schema AND tbl.{q('TBL_NAME')} = :table_name"
f" ORDER BY pk.{q('INTEGER_IDX')}"
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_build_metastore_partition_query assigns a lambda to q and then suppresses the linter with # noqa: E731. Prefer a small local function (or conditional helper) instead so we don’t need to disable lint rules, which improves readability and keeps lint behavior consistent across the codebase.

Suggested change
q = (
lambda name: f'"{name}"' if drivername == "hive+postgres" else name
) # noqa: E731
return (
f"SELECT pk.{q('PKEY_NAME')}"
f" FROM {q('PARTITION_KEYS')} pk"
f" JOIN {q('TBLS')} tbl ON pk.{q('TBL_ID')} = tbl.{q('TBL_ID')}"
f" JOIN {q('DBS')} db ON tbl.{q('DB_ID')} = db.{q('DB_ID')}"
f" WHERE db.{q('NAME')} = :schema AND tbl.{q('TBL_NAME')} = :table_name"
f" ORDER BY pk.{q('INTEGER_IDX')}"
def quote_identifier(name: str) -> str:
return f'"{name}"' if drivername == "hive+postgres" else name
return (
f"SELECT pk.{quote_identifier('PKEY_NAME')}"
f" FROM {quote_identifier('PARTITION_KEYS')} pk"
f" JOIN {quote_identifier('TBLS')} tbl ON pk.{quote_identifier('TBL_ID')} = tbl.{quote_identifier('TBL_ID')}"
f" JOIN {quote_identifier('DBS')} db ON tbl.{quote_identifier('DB_ID')} = db.{quote_identifier('DB_ID')}"
f" WHERE db.{quote_identifier('NAME')} = :schema AND tbl.{quote_identifier('TBL_NAME')} = :table_name"
f" ORDER BY pk.{quote_identifier('INTEGER_IDX')}"

Copilot uses AI. Check for mistakes.
Comment on lines +208 to +218
for row in rows:
col_name = row[0].strip() if row[0] else ""
if col_name == "# Partition Information":
in_partition_section = True
continue
if in_partition_section:
if not col_name or col_name.startswith("# Detailed"):
break
if col_name.startswith("#"):
continue
partition_keys.append(col_name)
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_partition_keys_from_describe() only detects partition keys when the DESCRIBE FORMATTED output contains a # Partition Information section. In this test suite we already account for a Hive describe variant where partition columns appear as duplicate rows without that sentinel (see the existing test_get_columns_deduplicates_partition_column_no_sentinel). In that scenario, get_table_partition_details() will incorrectly report the table as non-partitioned and won’t populate tablePartition. Consider adding a fallback (e.g., SHOW PARTITIONS ... parsing, or another heuristic) and a unit test to cover the no-sentinel format.

Copilot uses AI. Check for mistakes.
Comment thread ingestion/src/metadata/ingestion/source/database/hive/metadata.py Outdated
Copilot AI review requested due to automatic review settings April 25, 2026 15:52
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 25, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Implements a metadata flag for partition keys, resolving the recursive parsing issue and adding missing test coverage for DESCRIBE FORMATTED paths. Redundant driver lookups and exceptions during SHOW PARTITIONS on non-partitioned tables are also addressed.

✅ 4 resolved
Bug: Partition parser never exits section, collects metadata as keys

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:186-200
The DESCRIBE FORMATTED parser sets in_partition_section = True when it encounters # Partition Information but never resets it to False. After the actual partition key rows, Hive outputs additional sections like # Detailed Table Information followed by metadata lines (Database:, Owner:, CreateTime:, Location:, etc.). The #-prefixed headers are skipped, but the metadata key-value lines don't start with # and aren't empty, so they get appended to partition_keys.

This means every Hive table with partitions will have corrupted partition metadata containing entries like Database:, Owner:, CreateTime:, etc., in addition to the real partition columns.

The existing get_columns parser in hive/utils.py:48-49 correctly handles this by using break when it hits a section boundary. The Databricks metadata parser similarly breaks on section markers.

Quality: No test coverage for DESCRIBE FORMATTED parsing path

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:214-228 📄 ingestion/tests/unit/topology/database/test_hive.py:521-535
The get_table_partition_details method has three code paths: MySQL metastore, Postgres metastore, and the default DESCRIBE FORMATTED parsing branch (lines 215-229). Only the MySQL and Postgres paths have unit tests. The DESCRIBE FORMATTED parser contains non-trivial logic (section detection, sentinel line handling, skipping header lines) that should be tested—especially since the previous review found a bug in the parser's exit condition that was already fixed. A test would prevent regressions.

Quality: Redundant drivername lookup in metastore partition path

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:188 📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:236
get_table_partition_details computes drivername at line 236 to choose the metastore branch, then _get_partition_keys_from_metastore re-derives it independently at line 188. If someone later changes how the engine URL is accessed in one place but not the other, the two could silently diverge. Consider passing drivername as a parameter to the helper method instead.

Edge Case: SHOW PARTITIONS on non-partitioned table raises exception in Hive

📄 ingestion/src/metadata/ingestion/source/database/hive/metadata.py:223-225
In Hive, SHOW PARTITIONS on a non-partitioned table throws an AnalysisException (e.g. "Table is not partitioned") rather than returning an empty result set. When DESCRIBE FORMATTED doesn't contain a partition section (common for non-partitioned tables), the code always falls through to SHOW PARTITIONS, which will raise an exception. This exception is caught by the outer try/except in get_table_partition_details (line 266), so it won't crash — but it will emit a spurious warning log for every non-partitioned table, creating significant log noise in environments with many tables.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +253 to +254
For other Hive drivers, it falls back to parsing
``DESCRIBE FORMATTED`` output.
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for this method says the non-metastore path “falls back to parsing DESCRIBE FORMATTED output”, but the implementation also falls back to SHOW PARTITIONS when no partition keys are found. Please update the docstring to reflect the actual behavior so callers/debugging aren’t misled.

Suggested change
For other Hive drivers, it falls back to parsing
``DESCRIBE FORMATTED`` output.
For other Hive drivers, it first parses ``DESCRIBE FORMATTED``
output and, if no partition keys are found there, falls back to
``SHOW PARTITIONS`` and infers keys from the partition specification.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown
Contributor

Jest test Coverage

UI tests summary

Lines Statements Branches Functions
Coverage: 61%
61.94% (61764/99709) 42.06% (33022/78508) 45.09% (9763/21648)

@sonarqubecloud
Copy link
Copy Markdown

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add metadata flag to identify Hive partition keys

6 participants