Skip to content

[Fix][Connector-V2][JDBC] Detect Hive partition keys in metadata#10706

Merged
davidzollo merged 7 commits into
apache:devfrom
zhangqs0205:fix/jdbc-hive-partition-keys
May 26, 2026
Merged

[Fix][Connector-V2][JDBC] Detect Hive partition keys in metadata#10706
davidzollo merged 7 commits into
apache:devfrom
zhangqs0205:fix/jdbc-hive-partition-keys

Conversation

@zhangqs0205
Copy link
Copy Markdown
Contributor

@zhangqs0205 zhangqs0205 commented Apr 3, 2026

Fixes: #10597

Purpose of this pull request

This PR fixes missing Hive partitionKeys in JDBC catalog metadata.

It adds a JDBC dialect extension point for partition key detection, detects Hive partition keys from DESCRIBE <table>, and propagates them into the relevant CatalogTable metadata paths.

Does this PR introduce any user-facing change?

It fixes JDBC metadata for Hive tables so partition keys can be auto-detected instead of always being empty.

How was this patch tested?

Added HiveDialectTest and extended CatalogUtilsTest.

Check list

DanielLeens

This comment was marked as duplicate.

DanielLeens

This comment was marked as duplicate.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that ran CatalogUtilsTest + HiveDialectTest, and walked the JDBC source path from JdbcCatalogUtils.getTables().

Blocking issue: the new Hive partition-key extraction does not hit the normal catalog path.

  • when a catalog is available, JdbcCatalogUtils.getTables() goes through getCatalogTable(tableConfig, jdbcCatalog, jdbcDialect) and then jdbcCatalog.getTable(tablePath) (seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/utils/JdbcCatalogUtils.java:78-94, 174-216)
  • AbstractJdbcCatalog.getTable() calls the protected getPartitionKeys(connection, tablePath) hook (seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/AbstractJdbcCatalog.java:230-236)
  • that hook still defaults to Collections.emptyList() in AbstractJdbcCatalog.java:257-259
  • the new SQL logic lives in HiveDialect.getPartitionKeys(...) (seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:77-90), but the catalog path never calls it

So the fix only works in the fallback "direct JDBC without catalog" branch, not in the normal catalog-backed path. The new tests pass because they cover HiveDialect and helper merging in isolation, not the real JdbcCatalogUtils -> AbstractJdbcCatalog.getTable() chain.

Please wire the Hive partition-key lookup into the catalog implementation that the normal source path actually uses.

davidzollo
davidzollo previously approved these changes Apr 25, 2026
Copy link
Copy Markdown
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
Good job, thanks for the fix.

@DanielLeens
Copy link
Copy Markdown
Contributor

Hi @zhangqs0205, I rechecked the current head locally after the latest review activity.

The blocker from my earlier review is addressed on the current head:

  • the Hive partition keys are no longer only parsed in isolation
  • they are now carried into the real CatalogTable construction path
  • the current Build is green

Runtime chain I rechecked

Hive JDBC metadata
  -> HiveDialect.getPartitionKeys(connection, tablePath)
      -> parse DESCRIBE <table> partition section
  -> JdbcCatalogUtils.getCatalogTable(...)
      -> CatalogUtils.getCatalogTable(..., partitionKeys)
      -> CatalogTable.partitionKeys

The important difference from the older blocked version is that the partition keys are now passed through the reachable metadata path instead of only being tested at the parser/helper level.

Conclusion: can merge

Blocking items:

  • None from my side.

Non-blocking follow-up:

  • If you want even stronger confidence later, a more integration-style check against a real Hive-compatible driver output would still be nice hardening, but I would not block this PR on that.

Thanks for the fix. The current head looks good to me now.

@zhangqs0205 zhangqs0205 force-pushed the fix/jdbc-hive-partition-keys branch from 1e8b9eb to 4abb40b Compare April 28, 2026 11:10
Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zhangqs0205, I re-reviewed the current head locally from seatunnel-review-10706 against upstream/dev. I did not run Maven locally in this pass; this is a source-level review plus current GitHub check metadata.

What This PR Fixes

  • User pain: Hive partition columns are often missing from JDBC catalog metadata, so table-path-based reads lose partition-key information and downstream schema checks become incomplete.
  • Fix approach: teach HiveDialect to parse partition keys from DESCRIBE <table> output, thread those keys into CatalogTable, and add unit + E2E coverage.
  • One-line value: this makes the Hive JDBC catalog path preserve the real partition-key contract instead of dropping it.

Runtime Chain I Checked

table_path source / catalog loading
  -> AbstractJdbcCatalog.getTable(...)
      -> getPartitionKeys(conn, tablePath)
          -> HiveDialect.getPartitionKeys()
              -> DESCRIBE db.table
              -> parse "# Partition Information" section
      -> CatalogTable.of(..., partitionKeys, ...)
  -> JdbcCatalogUtils.getTables(...)
      -> JdbcSourceTable carries partition keys into source metadata

Findings

  • HiveDialect.getPartitionKeys() is on the real metadata-loading path used by AbstractJdbcCatalog.getTable(), so the new parsing logic is not dead code.
  • The parsed partition-key list is now threaded into CatalogTable, which is the right contract boundary for later source/schema logic.
  • The unit tests cover both the DESCRIBE parsing path and the catalog-table propagation path, and the Hive E2E now asserts the partition table plus dt/hr metadata path. That is good coverage for the user-facing behavior.
  • I did not find a reopened code blocker in the latest implementation.

Merge Decision

Conclusion: can merge after fixes

  1. Blocking items
  • No code blocker from my side.
  • The current Build check is still running, so merge should wait for CI to finish green.
  1. Suggested non-blocking follow-up
  • None from me in the current code path.

Overall, this is a practical metadata fix and the added tests now cover the exact Hive partition-key path that users depend on.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven/tests in this batch.

This is a practical metadata-path fix, and the important chain now looks right on the current head:

table_path source / catalog loading
  -> AbstractJdbcCatalog.getTable(...)
      -> getPartitionKeys(conn, tablePath)
          -> HiveDialect.getPartitionKeys()
              -> DESCRIBE db.table
              -> parse the "# Partition Information" section
      -> CatalogTable.of(..., partitionKeys, ...)
  -> JdbcCatalogUtils.getTables(...)
      -> JdbcSourceTable carries partition keys forward

Key points from the current head:

  • HiveDialect.getPartitionKeys() is on the real metadata-loading path used by AbstractJdbcCatalog.getTable(), so the new parsing logic is not dead code.
  • The parsed partition keys are now threaded into CatalogTable, which is the correct contract boundary for downstream schema/source logic.
  • The unit tests plus the Hive E2E cover the exact partition-key metadata path that users depend on.
  • I did not find a reopened code blocker in the latest implementation.

The only remaining gate from my side is CI: the current GitHub Build check is CANCELLED, so there is no green signal for the latest head yet.

Conclusion: can merge after fixes

Blocking items:

  1. No code blocker from my side.
  2. Re-run Build and make sure the latest head is green before merge.

Non-blocking follow-up:

  1. None from me on the current code path.

This is a useful metadata fix and the runtime path now looks correctly wired.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed the latest head locally against upstream/dev, focusing on the Hive metadata-loading path that feeds JDBC catalog/table-path reads.

Runtime path I checked:

table_path source / catalog loading
  -> AbstractJdbcCatalog.getTable(...)
      -> getPartitionKeys(conn, tablePath)
          -> HiveDialect.getPartitionKeys()
              -> DESCRIBE db.table
              -> parse "# Partition Information" section
      -> CatalogTable.of(..., partitionKeys, ...)
  -> JdbcCatalogUtils.getTables(...)
      -> JdbcSourceTable carries partition keys into source metadata

Key findings:

  • HiveDialect.getPartitionKeys() is on the real metadata-loading path used by AbstractJdbcCatalog.getTable(), so the new parsing logic is not dead code.
  • The parsed partition-key list is now threaded into CatalogTable, which is the right contract boundary for later source/schema logic.
  • The unit tests cover both the DESCRIBE parsing path and the catalog-table propagation path, and the Hive E2E now asserts the partition-table plus dt/hr metadata path.
  • I did not find a reopened code blocker in the latest implementation.

CI / verification:

  • I did not run Maven locally in this pass; this is a source-level review.
  • The current top-level Build status on the latest head is CANCELLED, so the branch still needs a clean rerun before merge.

Conclusion: merge after fixes

  1. Blocking items
  • No code blocker from my side.
  • Please rerun the latest Build and get a clean green result before merge.
  1. Suggested non-blocking follow-ups
  • None in the current code path.

Overall, this is a practical metadata fix, and the added tests cover the exact Hive partition-key path that users depend on.

Copy link
Copy Markdown
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I re-reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.

What This PR Fixes

  • User pain: Hive/Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially around partition keys and sometimes primary-key metadata.
  • Fix approach: the PR parses Hive partition keys from DESCRIBE output and also adds a fallback around DatabaseMetaData.getPrimaryKeys().
  • One-line summary: the Hive-specific direction is reasonable, but the shared primary-key fallback is currently too broad and changes behavior for every JDBC dialect, not just Hive.

1. Code Review

1.1 Core Logic Analysis

  • Exact change scope:

    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:130-163
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:78-112
  • Before / after:

// before
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
    while (rs.next()) {
        primaryKeyColumns.add(...);
    }
}
// after
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
    while (rs.next()) {
        primaryKeyColumns.add(...);
    }
} catch (SQLException e) {
    log.warn("Failed to get primary key info for table {}, returning empty primary key. Error: {}", tablePath, e.getMessage());
    return Optional.empty();
}
  • Key findings:

    1. The Hive DESCRIBE parsing path is on the real runtime path and looks directionally correct.
    2. The shared CatalogUtils.getPrimaryKey() helper is not Hive-only. Other JDBC catalogs also go through it.
    3. Catching every SQLException here means a real metadata failure now becomes "this table has no primary key".
    4. That is not just a logging choice. It changes downstream schema / constraint semantics.
  • Runtime chain I checked:

JDBC catalog introspection
  -> CatalogUtils.getPrimaryKey(DatabaseMetaData, TablePath)
      -> DatabaseMetaData.getPrimaryKeys(...)
      -> Optional<PrimaryKey>
  -> CatalogTable / TableSchema assembly
      -> downstream save-mode / upsert / constraint decisions use that PK signal

Hive-specific partition path
  -> HiveDialect.getPartitionKeys(Connection, TablePath)
      -> DESCRIBE table
      -> parse "# Partition Information" section

1.2 Compatibility Impact

  • Verdict: partially incompatible.
  • API/config/defaults/protocol/serialization are unchanged.
  • The incompatible part is behavior: a shared JDBC metadata failure is now downgraded to "no primary key", which is broader than the Hive problem this PR is trying to fix.

1.3 Performance / Side Effects

  • No meaningful CPU or memory regression from the Hive parser itself.
  • The main side effect is semantic, not performance-related: real metadata failures are now silently flattened into a different business meaning.

1.4 Error Handling and Logs

Issue 1: The shared primary-key fallback is too broad

  • Location: CatalogUtils.java:144
  • Why this is a problem:
    this code sits on the normal JDBC metadata path. Right now any SQLException from getPrimaryKeys() becomes Optional.empty(), even when the driver should have returned real PK metadata and something actually went wrong.
  • Risk:
    downstream logic can now treat a metadata failure as "this table genuinely has no primary key", which can mislead save-mode, upsert, or catalog-sync behavior.
  • Suggested fix:
    narrow the fallback to dialects or exception types that are explicitly known to mean "primary key metadata is unsupported" instead of swallowing every SQLException.
  • Severity: High

Issue 2: The current Build is still red

  • Location: GitHub Build
  • Why this matters:
    the latest Apache-side Build is still failing, and the linked fork run currently shows failures including unit-test (8, ubuntu-latest) and updated-modules-integration-test-part-4 (11, ubuntu-latest).
  • Severity: Medium

2. Code Quality

  • The new Hive partition parsing is readable and scoped correctly.
  • Test coverage is moving in the right direction, but I would still add one regression case proving that a non-Hive JDBC metadata failure is not silently downgraded to "empty PK".

3. Architecture

  • The Hive dialect part is a precise fix.
  • The shared fallback in CatalogUtils is not. It broadens the behavior boundary beyond the original problem.

4. Issue Summary

No. Issue Location Severity
1 Shared PK metadata failures are downgraded too broadly CatalogUtils.java:144 High
2 Current Build is still red GitHub Build Medium

5. Merge Conclusion

Conclusion: merge after fixes

  1. Blocking items
  • Issue 1 must be fixed. The fallback needs to be narrowed so it does not change behavior for every JDBC dialect.
  • Issue 2 also needs to be cleared by getting the latest Build green.
  1. Suggested non-blocking follow-up
  • Add one non-Hive regression test around shared PK metadata failures so this boundary does not widen again later.

The Hive-specific direction still looks good to me. The blocker is that the current shared fallback changes the semantics for all JDBC catalogs, not just Hive.

@davidzollo davidzollo dismissed their stale review May 7, 2026 14:35

Posted from the wrong account by mistake; superseded by the DanielLeens review.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I re-reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.

What This PR Fixes

  • User pain: Hive/Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially around partition keys and sometimes primary-key metadata.
  • Fix approach: the PR parses Hive partition keys from DESCRIBE output and also adds a fallback around DatabaseMetaData.getPrimaryKeys().
  • One-line summary: the Hive-specific direction is reasonable, but the shared primary-key fallback is currently too broad and changes behavior for every JDBC dialect, not just Hive.

1. Code Review

1.1 Core Logic Analysis

  • Exact change scope:

    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:130-163
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:78-112
  • Before / after:

// before
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
    while (rs.next()) {
        primaryKeyColumns.add(...);
    }
}
// after
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
    while (rs.next()) {
        primaryKeyColumns.add(...);
    }
} catch (SQLException e) {
    log.warn("Failed to get primary key info for table {}, returning empty primary key. Error: {}", tablePath, e.getMessage());
    return Optional.empty();
}
  • Key findings:

    1. The Hive DESCRIBE parsing path is on the real runtime path and looks directionally correct.
    2. The shared CatalogUtils.getPrimaryKey() helper is not Hive-only. Other JDBC catalogs also go through it.
    3. Catching every SQLException here means a real metadata failure now becomes "this table has no primary key".
    4. That is not just a logging choice. It changes downstream schema / constraint semantics.
  • Runtime chain I checked:

JDBC catalog introspection
  -> CatalogUtils.getPrimaryKey(DatabaseMetaData, TablePath)
      -> DatabaseMetaData.getPrimaryKeys(...)
      -> Optional<PrimaryKey>
  -> CatalogTable / TableSchema assembly
      -> downstream save-mode / upsert / constraint decisions use that PK signal

Hive-specific partition path
  -> HiveDialect.getPartitionKeys(Connection, TablePath)
      -> DESCRIBE table
      -> parse "# Partition Information" section

1.2 Compatibility Impact

  • Verdict: partially incompatible.
  • API/config/defaults/protocol/serialization are unchanged.
  • The incompatible part is behavior: a shared JDBC metadata failure is now downgraded to "no primary key", which is broader than the Hive problem this PR is trying to fix.

1.3 Performance / Side Effects

  • No meaningful CPU or memory regression from the Hive parser itself.
  • The main side effect is semantic, not performance-related: real metadata failures are now silently flattened into a different business meaning.

1.4 Error Handling and Logs

Issue 1: The shared primary-key fallback is too broad

  • Location: CatalogUtils.java:144
  • Why this is a problem:
    this code sits on the normal JDBC metadata path. Right now any SQLException from getPrimaryKeys() becomes Optional.empty(), even when the driver should have returned real PK metadata and something actually went wrong.
  • Risk:
    downstream logic can now treat a metadata failure as "this table genuinely has no primary key", which can mislead save-mode, upsert, or catalog-sync behavior.
  • Suggested fix:
    narrow the fallback to dialects or exception types that are explicitly known to mean "primary key metadata is unsupported" instead of swallowing every SQLException.
  • Severity: High

Issue 2: The current Build is still red

  • Location: GitHub Build
  • Why this matters:
    the latest Apache-side Build is still failing, and the linked fork run currently shows failures including unit-test (8, ubuntu-latest) and updated-modules-integration-test-part-4 (11, ubuntu-latest).
  • Severity: Medium

2. Code Quality

  • The new Hive partition parsing is readable and scoped correctly.
  • Test coverage is moving in the right direction, but I would still add one regression case proving that a non-Hive JDBC metadata failure is not silently downgraded to "empty PK".

3. Architecture

  • The Hive dialect part is a precise fix.
  • The shared fallback in CatalogUtils is not. It broadens the behavior boundary beyond the original problem.

4. Issue Summary

No. Issue Location Severity
1 Shared PK metadata failures are downgraded too broadly CatalogUtils.java:144 High
2 Current Build is still red GitHub Build Medium

5. Merge Conclusion

Conclusion: merge after fixes

  1. Blocking items
  • Issue 1 must be fixed. The fallback needs to be narrowed so it does not change behavior for every JDBC dialect.
  • Issue 2 also needs to be cleared by getting the latest Build green.
  1. Suggested non-blocking follow-up
  • Add one non-Hive regression test around shared PK metadata failures so this boundary does not widen again later.

The Hive-specific direction still looks good to me. The blocker is that the current shared fallback changes the semantics for all JDBC catalogs, not just Hive.

@zhangqs0205
Copy link
Copy Markdown
Contributor Author

Thanks for working on this. I re-reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.

What This PR Fixes

  • User pain: Hive/Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially around partition keys and sometimes primary-key metadata.
  • Fix approach: the PR parses Hive partition keys from DESCRIBE output and also adds a fallback around DatabaseMetaData.getPrimaryKeys().
  • One-line summary: the Hive-specific direction is reasonable, but the shared primary-key fallback is currently too broad and changes behavior for every JDBC dialect, not just Hive.

1. Code Review

1.1 Core Logic Analysis

  • Exact change scope:

    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:130-163
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:78-112
  • Before / after:

// before
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
    while (rs.next()) {
        primaryKeyColumns.add(...);
    }
}
// after
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
   while (rs.next()) {
       primaryKeyColumns.add(...);
   }
} catch (SQLException e) {
   log.warn("Failed to get primary key info for table {}, returning empty primary key. Error: {}", tablePath, e.getMessage());
   return Optional.empty();
}
  • Key findings:
  1. The Hive DESCRIBE parsing path is on the real runtime path and looks directionally correct.
  2. The shared CatalogUtils.getPrimaryKey() helper is not Hive-only. Other JDBC catalogs also go through it.
  3. Catching every SQLException here means a real metadata failure now becomes "this table has no primary key".
  4. That is not just a logging choice. It changes downstream schema / constraint semantics.
  • Runtime chain I checked:
JDBC catalog introspection
  -> CatalogUtils.getPrimaryKey(DatabaseMetaData, TablePath)
      -> DatabaseMetaData.getPrimaryKeys(...)
      -> Optional<PrimaryKey>
  -> CatalogTable / TableSchema assembly
      -> downstream save-mode / upsert / constraint decisions use that PK signal

Hive-specific partition path
  -> HiveDialect.getPartitionKeys(Connection, TablePath)
      -> DESCRIBE table
      -> parse "# Partition Information" section

1.2 Compatibility Impact

  • Verdict: partially incompatible.
  • API/config/defaults/protocol/serialization are unchanged.
  • The incompatible part is behavior: a shared JDBC metadata failure is now downgraded to "no primary key", which is broader than the Hive problem this PR is trying to fix.

1.3 Performance / Side Effects

  • No meaningful CPU or memory regression from the Hive parser itself.
  • The main side effect is semantic, not performance-related: real metadata failures are now silently flattened into a different business meaning.

1.4 Error Handling and Logs

Issue 1: The shared primary-key fallback is too broad

  • Location: CatalogUtils.java:144
  • Why this is a problem:
    this code sits on the normal JDBC metadata path. Right now any SQLException from getPrimaryKeys() becomes Optional.empty(), even when the driver should have returned real PK metadata and something actually went wrong.
  • Risk:
    downstream logic can now treat a metadata failure as "this table genuinely has no primary key", which can mislead save-mode, upsert, or catalog-sync behavior.
  • Suggested fix:
    narrow the fallback to dialects or exception types that are explicitly known to mean "primary key metadata is unsupported" instead of swallowing every SQLException.
  • Severity: High

Issue 2: The current Build is still red

  • Location: GitHub Build
  • Why this matters:
    the latest Apache-side Build is still failing, and the linked fork run currently shows failures including unit-test (8, ubuntu-latest) and updated-modules-integration-test-part-4 (11, ubuntu-latest).
  • Severity: Medium

2. Code Quality

  • The new Hive partition parsing is readable and scoped correctly.
  • Test coverage is moving in the right direction, but I would still add one regression case proving that a non-Hive JDBC metadata failure is not silently downgraded to "empty PK".

3. Architecture

  • The Hive dialect part is a precise fix.
  • The shared fallback in CatalogUtils is not. It broadens the behavior boundary beyond the original problem.

4. Issue Summary

No. Issue Location Severity
1 Shared PK metadata failures are downgraded too broadly CatalogUtils.java:144 High
2 Current Build is still red GitHub Build Medium

5. Merge Conclusion

Conclusion: merge after fixes

  1. Blocking items
  • Issue 1 must be fixed. The fallback needs to be narrowed so it does not change behavior for every JDBC dialect.
  • Issue 2 also needs to be cleared by getting the latest Build green.
  1. Suggested non-blocking follow-up
  • Add one non-Hive regression test around shared PK metadata failures so this boundary does not widen again later.

The Hive-specific direction still looks good to me. The blocker is that the current shared fallback changes the semantics for all JDBC catalogs, not just Hive.

@DanielLeens
Thanks for the review. I agree that catching SQLException in the shared CatalogUtils.getPrimaryKey() is too broad and may change behavior for non-Hive JDBC catalogs. The original scope of this PR is Hive partition key detection, so I will remove this fallback from CatalogUtils and keep the change scoped to the Hive partition-key metadata path.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed the latest head from scratch against upstream/dev, and I rechecked the real Hive metadata path rather than only the helper diff.

What This PR Fixes

  • User pain: Hive / Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially for partition keys on table_path reads.
  • Fix approach: the current head adds a Hive-specific DESCRIBE parser for partition keys and threads those keys into CatalogTable without broadening the old shared primary-key fallback.
  • One-line summary: the Hive-specific fix is now scoped to the actual Hive metadata path instead of changing shared JDBC semantics for every dialect.

Runtime Chain Rechecked

user config with table_path
  -> JdbcCatalogUtils.getTables(...) [JdbcCatalogUtils.java:384-416]
      -> jdbcDialect.getPartitionKeys(connection, tablePath)
      -> CatalogUtils.getCatalogTable(connection, tablePath, typeMapper, partitionKeys)
          -> getTableSchema(...) still keeps the normal metadata / PK behavior
          -> CatalogTable(..., partitionKeys)

catalog path
  -> AbstractJdbcCatalog.getTable(...) [AbstractJdbcCatalog.java:232-267]
      -> getPartitionKeys(conn, tablePath)
          -> HiveDialect.getPartitionKeys(...) [HiveDialect.java:76-114]
              -> DESCRIBE db.table
              -> parse "# Partition Information"
      -> CatalogTable.of(..., partitionKeys, ...)

Findings

  1. The previous Daniel blocker around the overly broad shared primary-key fallback is fixed on the current head. CatalogUtils.getTableSchema(...) still propagates SQLException, and the new regression test explicitly locks that behavior.
  2. The Hive partition-key extraction is on the real metadata path now. Both JdbcCatalogUtils.getTables(...) and AbstractJdbcCatalog.getTable(...) thread the dialect-specific partition key list into CatalogTable.
  3. The new tests cover the right contract boundaries for this patch: Hive DESCRIBE parsing, CatalogTable partition-key propagation, and the Hive E2E table-path path with dt/hr partitions.
  4. I do not see a reopened source-level blocker on the current head.

Compatibility / Side Effects

  • No user-facing config or protocol changed.
  • The new behavior is Hive-specific; other JDBC dialects do not inherit the old broad SQLException -> no PK fallback anymore.
  • The only user-visible change is that Hive partition keys now appear correctly in CatalogTable metadata for the table-path flow.

Tests / CI

  • The focused unit tests and Hive E2E additions are aligned with the real code path.
  • The current top-level Build is still red on the reviewed head. The visible failures span a wide set of jobs, including unit-test (11, ubuntu-latest), updated-modules-integration-test-part-1/2/4/8, and many follow-on cancelled jobs, so there is not yet a clean green signal for the latest revision.

Conclusion: can merge after fixes

Blocking items:

  1. No code blocker from my side on the current head.
  2. Please get the latest Build back to green before merge.

Suggested non-blocking follow-up:

  • None for the current code path.

The latest revision fixes the scope problem Daniel previously blocked, and the Hive metadata path now looks correctly wired. The remaining gate is CI, not a reopened source-level issue.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I went through the latest head from the full JDBC catalog path instead of only the Hive-specific diff, and the Hive partition-key fix itself makes sense. I did find one blocking compatibility issue before this can be merged.

What problem this PR solves

  • User pain point
    In Hive JDBC mode, DatabaseMetaData#getPrimaryKeys() is not reliable enough, so SeaTunnel can lose partition-key metadata while building CatalogTable.
  • Fix approach
    The PR adds a dialect-level supportsPrimaryKeyMetadata() switch and lets HiveDialect parse partition keys from DESCRIBE instead of depending on JDBC primary-key metadata.
  • One-line summary
    This makes Hive table discovery carry partition keys more reliably.

1. Code change review

1.1 Core logic analysis

The main runtime path is:

JdbcCatalogUtils.getTables() [JdbcCatalogUtils.java:143-147]
  -> CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect) [CatalogUtils.java:283-288]
    -> CatalogUtils.getTableSchema(metadata, tablePath, dialect) [CatalogUtils.java:230-237]
      -> dialect.supportsPrimaryKeyMetadata()
         -> HiveDialect returns false [HiveDialect.java:56-60]
         -> InceptorDialect extends HiveDialect [InceptorDialect.java:24-34]

So the PR is not only changing Hive behavior. It is also changing the generic JDBC fallback path for every dialect that inherits HiveDialect.

Before:

getTableSchema(metadata, tablePath, typeMapper, getPrimaryKey(metadata, tablePath))

After:

Optional<PrimaryKey> primaryKey =
    dialect.supportsPrimaryKeyMetadata()
        ? getPrimaryKey(metadata, tablePath)
        : Optional.empty();

1.2 Compatibility impact

Partially incompatible.

This changes how CatalogTable.getTableSchema().getPrimaryKey() is populated in the generic JDBC path. That is a compatibility-sensitive contract for downstream source/sink logic.

1.3 Performance / side effects

The Hive DESCRIBE path itself is fine, but the side effect is that Inceptor now loses primary-key metadata in the same fallback path unless it overrides the new behavior explicitly.

1.4 Error handling and logging

Issue 1: the new Hive-specific metadata switch also changes Inceptor primary-key behavior

  • Location:
    seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:230-237
    seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:56-60
    seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/inceptor/InceptorDialect.java:24-34
  • Problem:
    The PR intends to special-case Hive, but InceptorDialect inherits HiveDialect, so it also inherits supportsPrimaryKeyMetadata() == false. That means the generic JDBC fallback path stops reading Inceptor primary-key metadata as well.
  • Risk:
    This is a compatibility regression. Even if the Hive fix is correct, Inceptor users can silently lose CatalogTable primary-key information and affect downstream logic that depends on it.
  • Best fix:
    Option A, recommended: scope the “skip PK metadata” behavior to real Hive only, and do not let Inceptor inherit it implicitly.
    Option B: if Inceptor really must behave the same way, add explicit source evidence and regression coverage to show that this is intentional and safe.
  • Severity: High

2. Code quality evaluation

2.1 Code style

The code is readable, but the dialect inheritance boundary is too broad right now.

2.2 Test coverage and stability

  • Hive unit coverage was added for DESCRIBE parsing.
  • JdbcHiveIT now checks that partition keys are carried into CatalogTable.
  • I did not find a flaky-test pattern in the new tests.
  • However, there is still no Inceptor regression coverage for the inherited behavior change.

Test stability rating: Stable.

2.3 Documentation

No documentation blocker here; the main problem is the dialect compatibility boundary.

3. Architecture

3.1 Solution quality

This is a precise fix for Hive itself, but not yet a safe fix for the wider Hive-compatible dialect family.

3.2 Maintainability

Leaving this in the parent dialect will make future Hive-compatible dialects harder to reason about.

3.3 Extensibility

Future dialect exceptions should stay explicit at the concrete dialect level.

3.4 Historical compatibility

Not fully compatible in the current form because of the Inceptor regression risk.

4. Issue summary

  1. Issue 1: the new Hive-specific metadata switch also changes Inceptor primary-key behavior
    Location: CatalogUtils.java:230-237 / HiveDialect.java:56-60 / InceptorDialect.java:24-34
    Severity: High

5. Merge conclusion

Conclusion: can merge after fixes

  1. Blocking items
    Issue 1: please narrow the new supportsPrimaryKeyMetadata() behavior so Inceptor does not lose primary-key metadata implicitly.

  2. Suggested follow-ups
    None.

One more note on CI: the current Build check is red on updated-modules-integration-test-part-4 (11, ubuntu-latest), which is consistent with this PR touching the JDBC/Hive integration-test bucket, so that failure should also be checked again after the dialect boundary is fixed.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing this in the last review — that's on me. I re-reviewed the latest head from scratch and found one blocker that I should have caught earlier. This is a carryover from the previous version, not something introduced by the latest commit.

Thanks for the update. I walked the full current JDBC/Hive metadata path again instead of only looking at the helper diff.

What this PR fixes

  • User pain point: Hive table_path reads can lose partition-key metadata, so CatalogTable.partitionKeys stays empty even though the underlying table is partitioned.
  • Fix approach: the latest head routes the direct JDBC table_path path through CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect), lets HiveDialect.getPartitionKeys(...) parse the DESCRIBE <table> partition section, and keeps Inceptor on the normal primary-key metadata path via an explicit override.
  • One-line summary: this is the right direction for wiring Hive partition-key detection into the real JDBC metadata path that users actually hit.

Full runtime chain I rechecked

Jdbc source initialization
  -> JdbcSource.<init>() [JdbcSource.java:64-68]
      -> JdbcCatalogUtils.getTables(...) [JdbcCatalogUtils.java:78-146]
          -> findCatalog(...)
              -> no dedicated Hive / Inceptor catalog factory here, so the normal flow falls back to direct JDBC loading
          -> getCatalogTable(tableConfig, connection, jdbcDialect) [JdbcCatalogUtils.java:407-412]
              -> CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect) [CatalogUtils.java:283-288]
                  -> dialect.getPartitionKeys(connection, tablePath)
                      -> HiveDialect.getPartitionKeys(...) [HiveDialect.java:83-119]
                          -> DESCRIBE db.table
                          -> parse the "# Partition Information" section
                  -> getTableSchema(metadata, tablePath, dialect) [CatalogUtils.java:230-237]
                      -> HiveDialect.supportsPrimaryKeyMetadata() == false [HiveDialect.java:56-60]
                      -> InceptorDialect.supportsPrimaryKeyMetadata() == true [InceptorDialect.java:31-35]
                  -> CatalogTable.of(..., partitionKeys, ...)

Findings

  1. The main runtime path does hit this fix. This is not dead helper code.
  2. The Hive-specific metadata boundary now looks correct: Hive skips unreliable PK metadata, while Inceptor explicitly keeps it.
  3. I do not see a reopened source-level blocker in the metadata-loading logic itself.
  4. I did find one blocking regression in the newly added Hive E2E assertion.

Issue 1: the new Hive E2E assertion is internally inconsistent

  • Location: seatunnel-e2e/seatunnel-connector-v2-e2e/connector-jdbc-e2e/connector-jdbc-e2e-part-3/src/test/java/org/apache/seatunnel/connectors/seatunnel/jdbc/JdbcHiveIT.java:160-177, seatunnel-e2e/seatunnel-connector-v2-e2e/connector-jdbc-e2e/connector-jdbc-e2e-part-3/src/test/resources/jdbc_hive_partition_source_and_assert.conf:54-56
  • Problem:
    the new test inserts 6 partitioned rows and increments id from 1 to 6, but the assert-sink config still requires the id field to satisfy [{min = 1}, {max = 3}].
    With the real assert execution path, this is a deterministic failure rather than a theoretical edge case:
    AssertExecutor.fail(...) checks field rules row by row, so rows with id=4/5/6 will violate the configured max = 3.
  • Risk:
    the new E2E cannot serve as a reliable regression guard for this fix, and CI can fail even when the actual Hive metadata path is correct.
  • Recommended fix:
    Option A, recommended: keep the current 6-row partition coverage and update the asserted id max to 6.
    Option B: reduce the inserted sample size to 3 rows, but that weakens the partition-coverage value of the new E2E.
  • Severity: High

Compatibility / side effects

  • No API, config, default-value, protocol, or serialization-format change.
  • The runtime behavior change is the intended Hive metadata repair, while Inceptor keeps its previous PK-metadata behavior through the explicit override.

Tests

  • The new unit coverage is directionally good:
    • HiveDialectTest covers DESCRIBE partition parsing.
    • CatalogUtilsTest covers the Hive PK-metadata boundary and the identifier-case fallback boundary.
  • I do not see a new flaky-test pattern in the added tests.
  • The blocker is test correctness, not test stability.

Conclusion: can merge after fixes

Blocking items:

  1. Fix the new Hive E2E assertion so it matches the 6 inserted partitioned rows.

Non-blocking follow-up:

  1. The latest GitHub Build is still pending on the reviewed head. I do not see a new failure signature to attribute yet, but please wait for the updated Build result after fixing the E2E assertion.

Overall, the metadata-path repair itself looks right on the current head. The remaining blocker is that the newly added regression test is currently asserting an impossible contract.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up. I re-reviewed the latest head from scratch, with extra attention to the blocker from the previous Daniel review.

What problem this PR solves

  • User pain point
    Hive table_path reads can lose partition-key metadata, so CatalogTable.partitionKeys stays empty even though the table is partitioned.
  • Fix approach
    The PR routes the direct JDBC table_path path through the real CatalogTable construction flow and lets HiveDialect parse partition keys from DESCRIBE, while InceptorDialect keeps its own primary-key metadata contract.
  • One sentence
    The metadata-path fix still looks right, and the blocking E2E assertion mismatch from the previous round is fixed on the latest head.

Simple example: the new Hive E2E inserts 6 rows across 3 partition dates and 2 partition hours. The assert config now checks id <= 6 instead of the old impossible id <= 3, so the regression guard matches the data it actually inserts.

1. Code change review

1.1 Core logic analysis

Runtime chain I rechecked:

Jdbc source initialization
  -> JdbcCatalogUtils.getCatalogTable(...)
      -> CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect)
          -> HiveDialect.getPartitionKeys(...)
              -> DESCRIBE db.table
              -> parse the partition section
          -> getTableSchema(...)
              -> HiveDialect.supportsPrimaryKeyMetadata() == false
              -> InceptorDialect.supportsPrimaryKeyMetadata() == true
          -> CatalogTable.of(..., partitionKeys, ...)

The previous blocker was in the new E2E assertion path. On the latest head:

  • JdbcHiveIT still inserts 6 partitioned rows (id = 1..6) in the same way.
  • jdbc_hive_partition_source_and_assert.conf now raises the id max from 3 to 6.
  • InceptorDialect still keeps the explicit supportsPrimaryKeyMetadata() == true override.

Key findings:

  1. The real JDBC metadata path still hits this fix; it is not dead helper logic.
  2. The previous Daniel blocker is fixed on the latest head.
  3. I do not see a reopened source-level blocker in the Hive / Inceptor metadata boundary.
  4. The remaining gate is CI rather than code-path correctness.

1.2 Compatibility impact

Fully compatible from the current source-review perspective. The intended Hive metadata repair remains scoped correctly, while Inceptor keeps its prior primary-key metadata behavior.

1.3 Performance / side effects

No meaningful CPU, memory, GC, concurrency, retry, or resource-release regression surfaced in the current head.

1.4 Error handling and logging

No new source-level blocking issue found on the latest head.

2. Code quality evaluation

2.1 Code style

The current dialect boundary is much cleaner than the earlier versions.

2.2 Test coverage and stability

  • HiveDialectTest covers partition parsing.
  • CatalogUtilsTest covers the Hive/Inceptor PK-metadata boundary.
  • The Hive E2E assertion is now internally consistent with the inserted data.

Test stability rating: Stable.

2.3 Documentation

No new documentation blocker from this round.

3. Architecture

3.1 Solution quality

This is now a precise fix for the real Hive metadata path.

3.2 Maintainability

Good. The Hive-specific exception stays explicit, and Inceptor no longer inherits it accidentally.

3.3 Extensibility

The pattern is reusable for other JDBC dialects that need concrete metadata overrides.

3.4 Historical compatibility

No historical-compatibility blocker found on the latest head.

4. Issue summary

No new source-level blocking issue found.

5. Merge conclusion

Conclusion: can merge

  1. Blocking items
  • No code blocker from my side on the latest head.
  1. Suggested but non-blocking follow-ups
  • The latest GitHub Build check is cancelled, so please rerun / get a green Build before the actual merge.

From the source-review side, the current head looks good to me now. Thanks for closing the E2E mismatch cleanly.

@zhangqs0205
Copy link
Copy Markdown
Contributor Author

@DanielLeens

Thanks for the review and the follow-up.

While adding the JdbcHiveIT coverage for the JdbcCatalogUtils.getTables() path, I found a few Hive JDBC metadata gaps that were hidden before, so I fixed them together with the regression coverage:

  1. Hive partition keys are not exposed reliably through standard JDBC table/column metadata.
    I added a Hive-specific DESCRIBE <table> partition parser and wired dialect.getPartitionKeys(...) into the real getTables() / CatalogTable metadata path, so CatalogTable.partitionKeys can now carry dt/hr correctly.

  2. Hive primary-key metadata is not reliable through DatabaseMetaData#getPrimaryKeys().
    I added the dialect capability supportsPrimaryKeyMetadata() and made Hive skip PK metadata lookup explicitly in getTableSchema(). This avoids treating Hive’s unsupported PK metadata path as a real table failure, while preserving the normal SQLException behavior for dialects that do support PK metadata.

  3. Some Hive JDBC metadata capability methods for identifier case are unsupported.
    I narrowed the fallback in JdbcIdentifierUtils.identifierCaseStrategy(...) to only handle SQLFeatureNotSupportedException and fall back to case-insensitive matching. Other real SQLExceptions are still propagated.

I also fixed the new Hive E2E assert config to match the 6 inserted partition rows and kept explicit field_value = [] entries for type-only checks, because the assert executor expects a non-null rule list when field_rules are configured.

Copy link
Copy Markdown
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 if CI passes.
LGTM

@zhangqs0205
Copy link
Copy Markdown
Contributor Author

By the way, CI in this project usually needs to be rerun several times before all checks pass.

@davidzollo
Copy link
Copy Markdown
Contributor

By the way, CI in this project usually needs to be rerun several times before all checks pass.

Yes, you can only retry the failed CI tasks

@DanielLeens
Copy link
Copy Markdown
Contributor

Hi @zhangqs0205, thanks for the detailed write-up — that context helps. The current Daniel approval already covers head d848475555ff1976ba906561c46a1f05941781a6, and the points you listed match the paths I traced on this same head through JdbcCatalogUtils.getTables(), JdbcCatalogUtils.getTableSchema(), and JdbcIdentifierUtils.identifierCaseStrategy(...). Since there is no new commit after that approval, I am treating this as a reply-only follow-up rather than starting another full review on unchanged code. On CI, the current Build result looks like a cancelled run rather than a source-level regression signal, so rerunning that check on the same head is the right next step.

@zhangqs0205
Copy link
Copy Markdown
Contributor Author

Hi @davidzollo ,CI is passed

@DanielLeens
Copy link
Copy Markdown
Contributor

Thanks for the update, and good to hear CI is green now.

I checked the current PR head d848475555ff1976ba906561c46a1f05941781a6, and it is still the same revision I already reviewed earlier, so there is no new code delta for me to re-review in this round.

If you push another commit later, feel free to ping me again and I will take another pass.

@davidzollo davidzollo merged commit 7c7701d into apache:dev May 26, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [connector-jdbc] Hive JDBC CatalogTable.partitionKeys is empty in JdbcSource

3 participants