[Fix][Connector-V2][JDBC] Detect Hive partition keys in metadata#10706
Conversation
c3656ac to
eca75ed
Compare
There was a problem hiding this comment.
I found that ran CatalogUtilsTest + HiveDialectTest, and walked the JDBC source path from JdbcCatalogUtils.getTables().
Blocking issue: the new Hive partition-key extraction does not hit the normal catalog path.
- when a catalog is available,
JdbcCatalogUtils.getTables()goes throughgetCatalogTable(tableConfig, jdbcCatalog, jdbcDialect)and thenjdbcCatalog.getTable(tablePath)(seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/utils/JdbcCatalogUtils.java:78-94,174-216) AbstractJdbcCatalog.getTable()calls the protectedgetPartitionKeys(connection, tablePath)hook (seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/AbstractJdbcCatalog.java:230-236)- that hook still defaults to
Collections.emptyList()inAbstractJdbcCatalog.java:257-259 - the new SQL logic lives in
HiveDialect.getPartitionKeys(...)(seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:77-90), but the catalog path never calls it
So the fix only works in the fallback "direct JDBC without catalog" branch, not in the normal catalog-backed path. The new tests pass because they cover HiveDialect and helper merging in isolation, not the real JdbcCatalogUtils -> AbstractJdbcCatalog.getTable() chain.
Please wire the Hive partition-key lookup into the catalog implementation that the normal source path actually uses.
davidzollo
left a comment
There was a problem hiding this comment.
+1
Good job, thanks for the fix.
|
Hi @zhangqs0205, I rechecked the current head locally after the latest review activity. The blocker from my earlier review is addressed on the current head:
Runtime chain I rechecked The important difference from the older blocked version is that the partition keys are now passed through the reachable metadata path instead of only being tested at the parser/helper level. Conclusion: can mergeBlocking items:
Non-blocking follow-up:
Thanks for the fix. The current head looks good to me now. |
1e8b9eb to
4abb40b
Compare
DanielLeens
left a comment
There was a problem hiding this comment.
Hi @zhangqs0205, I re-reviewed the current head locally from seatunnel-review-10706 against upstream/dev. I did not run Maven locally in this pass; this is a source-level review plus current GitHub check metadata.
What This PR Fixes
- User pain: Hive partition columns are often missing from JDBC catalog metadata, so table-path-based reads lose partition-key information and downstream schema checks become incomplete.
- Fix approach: teach
HiveDialectto parse partition keys fromDESCRIBE <table>output, thread those keys intoCatalogTable, and add unit + E2E coverage. - One-line value: this makes the Hive JDBC catalog path preserve the real partition-key contract instead of dropping it.
Runtime Chain I Checked
table_path source / catalog loading
-> AbstractJdbcCatalog.getTable(...)
-> getPartitionKeys(conn, tablePath)
-> HiveDialect.getPartitionKeys()
-> DESCRIBE db.table
-> parse "# Partition Information" section
-> CatalogTable.of(..., partitionKeys, ...)
-> JdbcCatalogUtils.getTables(...)
-> JdbcSourceTable carries partition keys into source metadata
Findings
HiveDialect.getPartitionKeys()is on the real metadata-loading path used byAbstractJdbcCatalog.getTable(), so the new parsing logic is not dead code.- The parsed partition-key list is now threaded into
CatalogTable, which is the right contract boundary for later source/schema logic. - The unit tests cover both the
DESCRIBEparsing path and the catalog-table propagation path, and the Hive E2E now asserts the partition table plusdt/hrmetadata path. That is good coverage for the user-facing behavior. - I did not find a reopened code blocker in the latest implementation.
Merge Decision
Conclusion: can merge after fixes
- Blocking items
- No code blocker from my side.
- The current
Buildcheck is still running, so merge should wait for CI to finish green.
- Suggested non-blocking follow-up
- None from me in the current code path.
Overall, this is a practical metadata fix and the added tests now cover the exact Hive partition-key path that users depend on.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for working on this. I reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven/tests in this batch.
This is a practical metadata-path fix, and the important chain now looks right on the current head:
table_path source / catalog loading
-> AbstractJdbcCatalog.getTable(...)
-> getPartitionKeys(conn, tablePath)
-> HiveDialect.getPartitionKeys()
-> DESCRIBE db.table
-> parse the "# Partition Information" section
-> CatalogTable.of(..., partitionKeys, ...)
-> JdbcCatalogUtils.getTables(...)
-> JdbcSourceTable carries partition keys forward
Key points from the current head:
HiveDialect.getPartitionKeys()is on the real metadata-loading path used byAbstractJdbcCatalog.getTable(), so the new parsing logic is not dead code.- The parsed partition keys are now threaded into
CatalogTable, which is the correct contract boundary for downstream schema/source logic. - The unit tests plus the Hive E2E cover the exact partition-key metadata path that users depend on.
- I did not find a reopened code blocker in the latest implementation.
The only remaining gate from my side is CI: the current GitHub Build check is CANCELLED, so there is no green signal for the latest head yet.
Conclusion: can merge after fixes
Blocking items:
- No code blocker from my side.
- Re-run
Buildand make sure the latest head is green before merge.
Non-blocking follow-up:
- None from me on the current code path.
This is a useful metadata fix and the runtime path now looks correctly wired.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the update. I re-reviewed the latest head locally against upstream/dev, focusing on the Hive metadata-loading path that feeds JDBC catalog/table-path reads.
Runtime path I checked:
table_path source / catalog loading
-> AbstractJdbcCatalog.getTable(...)
-> getPartitionKeys(conn, tablePath)
-> HiveDialect.getPartitionKeys()
-> DESCRIBE db.table
-> parse "# Partition Information" section
-> CatalogTable.of(..., partitionKeys, ...)
-> JdbcCatalogUtils.getTables(...)
-> JdbcSourceTable carries partition keys into source metadata
Key findings:
HiveDialect.getPartitionKeys()is on the real metadata-loading path used byAbstractJdbcCatalog.getTable(), so the new parsing logic is not dead code.- The parsed partition-key list is now threaded into
CatalogTable, which is the right contract boundary for later source/schema logic. - The unit tests cover both the
DESCRIBEparsing path and the catalog-table propagation path, and the Hive E2E now asserts the partition-table plusdt/hrmetadata path. - I did not find a reopened code blocker in the latest implementation.
CI / verification:
- I did not run Maven locally in this pass; this is a source-level review.
- The current top-level
Buildstatus on the latest head isCANCELLED, so the branch still needs a clean rerun before merge.
Conclusion: merge after fixes
- Blocking items
- No code blocker from my side.
- Please rerun the latest
Buildand get a clean green result before merge.
- Suggested non-blocking follow-ups
- None in the current code path.
Overall, this is a practical metadata fix, and the added tests cover the exact Hive partition-key path that users depend on.
davidzollo
left a comment
There was a problem hiding this comment.
Thanks for working on this. I re-reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.
What This PR Fixes
- User pain: Hive/Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially around partition keys and sometimes primary-key metadata.
- Fix approach: the PR parses Hive partition keys from
DESCRIBEoutput and also adds a fallback aroundDatabaseMetaData.getPrimaryKeys(). - One-line summary: the Hive-specific direction is reasonable, but the shared primary-key fallback is currently too broad and changes behavior for every JDBC dialect, not just Hive.
1. Code Review
1.1 Core Logic Analysis
-
Exact change scope:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:130-163seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:78-112
-
Before / after:
// before
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
while (rs.next()) {
primaryKeyColumns.add(...);
}
}// after
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
while (rs.next()) {
primaryKeyColumns.add(...);
}
} catch (SQLException e) {
log.warn("Failed to get primary key info for table {}, returning empty primary key. Error: {}", tablePath, e.getMessage());
return Optional.empty();
}-
Key findings:
- The Hive
DESCRIBEparsing path is on the real runtime path and looks directionally correct. - The shared
CatalogUtils.getPrimaryKey()helper is not Hive-only. Other JDBC catalogs also go through it. - Catching every
SQLExceptionhere means a real metadata failure now becomes "this table has no primary key". - That is not just a logging choice. It changes downstream schema / constraint semantics.
- The Hive
-
Runtime chain I checked:
JDBC catalog introspection
-> CatalogUtils.getPrimaryKey(DatabaseMetaData, TablePath)
-> DatabaseMetaData.getPrimaryKeys(...)
-> Optional<PrimaryKey>
-> CatalogTable / TableSchema assembly
-> downstream save-mode / upsert / constraint decisions use that PK signal
Hive-specific partition path
-> HiveDialect.getPartitionKeys(Connection, TablePath)
-> DESCRIBE table
-> parse "# Partition Information" section
1.2 Compatibility Impact
- Verdict: partially incompatible.
- API/config/defaults/protocol/serialization are unchanged.
- The incompatible part is behavior: a shared JDBC metadata failure is now downgraded to "no primary key", which is broader than the Hive problem this PR is trying to fix.
1.3 Performance / Side Effects
- No meaningful CPU or memory regression from the Hive parser itself.
- The main side effect is semantic, not performance-related: real metadata failures are now silently flattened into a different business meaning.
1.4 Error Handling and Logs
Issue 1: The shared primary-key fallback is too broad
- Location:
CatalogUtils.java:144 - Why this is a problem:
this code sits on the normal JDBC metadata path. Right now anySQLExceptionfromgetPrimaryKeys()becomesOptional.empty(), even when the driver should have returned real PK metadata and something actually went wrong. - Risk:
downstream logic can now treat a metadata failure as "this table genuinely has no primary key", which can mislead save-mode, upsert, or catalog-sync behavior. - Suggested fix:
narrow the fallback to dialects or exception types that are explicitly known to mean "primary key metadata is unsupported" instead of swallowing everySQLException. - Severity: High
Issue 2: The current Build is still red
- Location: GitHub
Build - Why this matters:
the latest Apache-sideBuildis still failing, and the linked fork run currently shows failures includingunit-test (8, ubuntu-latest)andupdated-modules-integration-test-part-4 (11, ubuntu-latest). - Severity: Medium
2. Code Quality
- The new Hive partition parsing is readable and scoped correctly.
- Test coverage is moving in the right direction, but I would still add one regression case proving that a non-Hive JDBC metadata failure is not silently downgraded to "empty PK".
3. Architecture
- The Hive dialect part is a precise fix.
- The shared fallback in
CatalogUtilsis not. It broadens the behavior boundary beyond the original problem.
4. Issue Summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| 1 | Shared PK metadata failures are downgraded too broadly | CatalogUtils.java:144 |
High |
| 2 | Current Build is still red | GitHub Build |
Medium |
5. Merge Conclusion
Conclusion: merge after fixes
- Blocking items
- Issue 1 must be fixed. The fallback needs to be narrowed so it does not change behavior for every JDBC dialect.
- Issue 2 also needs to be cleared by getting the latest Build green.
- Suggested non-blocking follow-up
- Add one non-Hive regression test around shared PK metadata failures so this boundary does not widen again later.
The Hive-specific direction still looks good to me. The blocker is that the current shared fallback changes the semantics for all JDBC catalogs, not just Hive.
Posted from the wrong account by mistake; superseded by the DanielLeens review.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for working on this. I re-reviewed the full current head locally on seatunnel-review-10706 against upstream/dev. This was a source-level review only; I did not run local Maven in this round.
What This PR Fixes
- User pain: Hive/Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially around partition keys and sometimes primary-key metadata.
- Fix approach: the PR parses Hive partition keys from
DESCRIBEoutput and also adds a fallback aroundDatabaseMetaData.getPrimaryKeys(). - One-line summary: the Hive-specific direction is reasonable, but the shared primary-key fallback is currently too broad and changes behavior for every JDBC dialect, not just Hive.
1. Code Review
1.1 Core Logic Analysis
-
Exact change scope:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:130-163seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:78-112
-
Before / after:
// before
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
while (rs.next()) {
primaryKeyColumns.add(...);
}
}// after
try (ResultSet rs = metaData.getPrimaryKeys(...)) {
while (rs.next()) {
primaryKeyColumns.add(...);
}
} catch (SQLException e) {
log.warn("Failed to get primary key info for table {}, returning empty primary key. Error: {}", tablePath, e.getMessage());
return Optional.empty();
}-
Key findings:
- The Hive
DESCRIBEparsing path is on the real runtime path and looks directionally correct. - The shared
CatalogUtils.getPrimaryKey()helper is not Hive-only. Other JDBC catalogs also go through it. - Catching every
SQLExceptionhere means a real metadata failure now becomes "this table has no primary key". - That is not just a logging choice. It changes downstream schema / constraint semantics.
- The Hive
-
Runtime chain I checked:
JDBC catalog introspection
-> CatalogUtils.getPrimaryKey(DatabaseMetaData, TablePath)
-> DatabaseMetaData.getPrimaryKeys(...)
-> Optional<PrimaryKey>
-> CatalogTable / TableSchema assembly
-> downstream save-mode / upsert / constraint decisions use that PK signal
Hive-specific partition path
-> HiveDialect.getPartitionKeys(Connection, TablePath)
-> DESCRIBE table
-> parse "# Partition Information" section
1.2 Compatibility Impact
- Verdict: partially incompatible.
- API/config/defaults/protocol/serialization are unchanged.
- The incompatible part is behavior: a shared JDBC metadata failure is now downgraded to "no primary key", which is broader than the Hive problem this PR is trying to fix.
1.3 Performance / Side Effects
- No meaningful CPU or memory regression from the Hive parser itself.
- The main side effect is semantic, not performance-related: real metadata failures are now silently flattened into a different business meaning.
1.4 Error Handling and Logs
Issue 1: The shared primary-key fallback is too broad
- Location:
CatalogUtils.java:144 - Why this is a problem:
this code sits on the normal JDBC metadata path. Right now anySQLExceptionfromgetPrimaryKeys()becomesOptional.empty(), even when the driver should have returned real PK metadata and something actually went wrong. - Risk:
downstream logic can now treat a metadata failure as "this table genuinely has no primary key", which can mislead save-mode, upsert, or catalog-sync behavior. - Suggested fix:
narrow the fallback to dialects or exception types that are explicitly known to mean "primary key metadata is unsupported" instead of swallowing everySQLException. - Severity: High
Issue 2: The current Build is still red
- Location: GitHub
Build - Why this matters:
the latest Apache-sideBuildis still failing, and the linked fork run currently shows failures includingunit-test (8, ubuntu-latest)andupdated-modules-integration-test-part-4 (11, ubuntu-latest). - Severity: Medium
2. Code Quality
- The new Hive partition parsing is readable and scoped correctly.
- Test coverage is moving in the right direction, but I would still add one regression case proving that a non-Hive JDBC metadata failure is not silently downgraded to "empty PK".
3. Architecture
- The Hive dialect part is a precise fix.
- The shared fallback in
CatalogUtilsis not. It broadens the behavior boundary beyond the original problem.
4. Issue Summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| 1 | Shared PK metadata failures are downgraded too broadly | CatalogUtils.java:144 |
High |
| 2 | Current Build is still red | GitHub Build |
Medium |
5. Merge Conclusion
Conclusion: merge after fixes
- Blocking items
- Issue 1 must be fixed. The fallback needs to be narrowed so it does not change behavior for every JDBC dialect.
- Issue 2 also needs to be cleared by getting the latest Build green.
- Suggested non-blocking follow-up
- Add one non-Hive regression test around shared PK metadata failures so this boundary does not widen again later.
The Hive-specific direction still looks good to me. The blocker is that the current shared fallback changes the semantics for all JDBC catalogs, not just Hive.
@DanielLeens |
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the update. I re-reviewed the latest head from scratch against upstream/dev, and I rechecked the real Hive metadata path rather than only the helper diff.
What This PR Fixes
- User pain: Hive / Inceptor metadata discovery is inconsistent through standard JDBC APIs, especially for partition keys on
table_pathreads. - Fix approach: the current head adds a Hive-specific
DESCRIBEparser for partition keys and threads those keys intoCatalogTablewithout broadening the old shared primary-key fallback. - One-line summary: the Hive-specific fix is now scoped to the actual Hive metadata path instead of changing shared JDBC semantics for every dialect.
Runtime Chain Rechecked
user config with table_path
-> JdbcCatalogUtils.getTables(...) [JdbcCatalogUtils.java:384-416]
-> jdbcDialect.getPartitionKeys(connection, tablePath)
-> CatalogUtils.getCatalogTable(connection, tablePath, typeMapper, partitionKeys)
-> getTableSchema(...) still keeps the normal metadata / PK behavior
-> CatalogTable(..., partitionKeys)
catalog path
-> AbstractJdbcCatalog.getTable(...) [AbstractJdbcCatalog.java:232-267]
-> getPartitionKeys(conn, tablePath)
-> HiveDialect.getPartitionKeys(...) [HiveDialect.java:76-114]
-> DESCRIBE db.table
-> parse "# Partition Information"
-> CatalogTable.of(..., partitionKeys, ...)
Findings
- The previous Daniel blocker around the overly broad shared primary-key fallback is fixed on the current head.
CatalogUtils.getTableSchema(...)still propagatesSQLException, and the new regression test explicitly locks that behavior. - The Hive partition-key extraction is on the real metadata path now. Both
JdbcCatalogUtils.getTables(...)andAbstractJdbcCatalog.getTable(...)thread the dialect-specific partition key list intoCatalogTable. - The new tests cover the right contract boundaries for this patch: Hive
DESCRIBEparsing,CatalogTablepartition-key propagation, and the Hive E2E table-path path withdt/hrpartitions. - I do not see a reopened source-level blocker on the current head.
Compatibility / Side Effects
- No user-facing config or protocol changed.
- The new behavior is Hive-specific; other JDBC dialects do not inherit the old broad
SQLException -> no PKfallback anymore. - The only user-visible change is that Hive partition keys now appear correctly in
CatalogTablemetadata for the table-path flow.
Tests / CI
- The focused unit tests and Hive E2E additions are aligned with the real code path.
- The current top-level
Buildis still red on the reviewed head. The visible failures span a wide set of jobs, includingunit-test (11, ubuntu-latest),updated-modules-integration-test-part-1/2/4/8, and many follow-on cancelled jobs, so there is not yet a clean green signal for the latest revision.
Conclusion: can merge after fixes
Blocking items:
- No code blocker from my side on the current head.
- Please get the latest
Buildback to green before merge.
Suggested non-blocking follow-up:
- None for the current code path.
The latest revision fixes the scope problem Daniel previously blocked, and the Hive metadata path now looks correctly wired. The remaining gate is CI, not a reopened source-level issue.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for working on this. I went through the latest head from the full JDBC catalog path instead of only the Hive-specific diff, and the Hive partition-key fix itself makes sense. I did find one blocking compatibility issue before this can be merged.
What problem this PR solves
- User pain point
In Hive JDBC mode,DatabaseMetaData#getPrimaryKeys()is not reliable enough, so SeaTunnel can lose partition-key metadata while buildingCatalogTable. - Fix approach
The PR adds a dialect-levelsupportsPrimaryKeyMetadata()switch and letsHiveDialectparse partition keys fromDESCRIBEinstead of depending on JDBC primary-key metadata. - One-line summary
This makes Hive table discovery carry partition keys more reliably.
1. Code change review
1.1 Core logic analysis
The main runtime path is:
JdbcCatalogUtils.getTables() [JdbcCatalogUtils.java:143-147]
-> CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect) [CatalogUtils.java:283-288]
-> CatalogUtils.getTableSchema(metadata, tablePath, dialect) [CatalogUtils.java:230-237]
-> dialect.supportsPrimaryKeyMetadata()
-> HiveDialect returns false [HiveDialect.java:56-60]
-> InceptorDialect extends HiveDialect [InceptorDialect.java:24-34]
So the PR is not only changing Hive behavior. It is also changing the generic JDBC fallback path for every dialect that inherits HiveDialect.
Before:
getTableSchema(metadata, tablePath, typeMapper, getPrimaryKey(metadata, tablePath))After:
Optional<PrimaryKey> primaryKey =
dialect.supportsPrimaryKeyMetadata()
? getPrimaryKey(metadata, tablePath)
: Optional.empty();1.2 Compatibility impact
Partially incompatible.
This changes how CatalogTable.getTableSchema().getPrimaryKey() is populated in the generic JDBC path. That is a compatibility-sensitive contract for downstream source/sink logic.
1.3 Performance / side effects
The Hive DESCRIBE path itself is fine, but the side effect is that Inceptor now loses primary-key metadata in the same fallback path unless it overrides the new behavior explicitly.
1.4 Error handling and logging
Issue 1: the new Hive-specific metadata switch also changes Inceptor primary-key behavior
- Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/catalog/utils/CatalogUtils.java:230-237
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/hive/HiveDialect.java:56-60
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/inceptor/InceptorDialect.java:24-34 - Problem:
The PR intends to special-case Hive, butInceptorDialectinheritsHiveDialect, so it also inheritssupportsPrimaryKeyMetadata() == false. That means the generic JDBC fallback path stops reading Inceptor primary-key metadata as well. - Risk:
This is a compatibility regression. Even if the Hive fix is correct, Inceptor users can silently loseCatalogTableprimary-key information and affect downstream logic that depends on it. - Best fix:
Option A, recommended: scope the “skip PK metadata” behavior to real Hive only, and do not let Inceptor inherit it implicitly.
Option B: if Inceptor really must behave the same way, add explicit source evidence and regression coverage to show that this is intentional and safe. - Severity: High
2. Code quality evaluation
2.1 Code style
The code is readable, but the dialect inheritance boundary is too broad right now.
2.2 Test coverage and stability
- Hive unit coverage was added for
DESCRIBEparsing. JdbcHiveITnow checks that partition keys are carried intoCatalogTable.- I did not find a flaky-test pattern in the new tests.
- However, there is still no Inceptor regression coverage for the inherited behavior change.
Test stability rating: Stable.
2.3 Documentation
No documentation blocker here; the main problem is the dialect compatibility boundary.
3. Architecture
3.1 Solution quality
This is a precise fix for Hive itself, but not yet a safe fix for the wider Hive-compatible dialect family.
3.2 Maintainability
Leaving this in the parent dialect will make future Hive-compatible dialects harder to reason about.
3.3 Extensibility
Future dialect exceptions should stay explicit at the concrete dialect level.
3.4 Historical compatibility
Not fully compatible in the current form because of the Inceptor regression risk.
4. Issue summary
- Issue 1: the new Hive-specific metadata switch also changes Inceptor primary-key behavior
Location:CatalogUtils.java:230-237/HiveDialect.java:56-60/InceptorDialect.java:24-34
Severity: High
5. Merge conclusion
Conclusion: can merge after fixes
-
Blocking items
Issue 1: please narrow the newsupportsPrimaryKeyMetadata()behavior so Inceptor does not lose primary-key metadata implicitly. -
Suggested follow-ups
None.
One more note on CI: the current Build check is red on updated-modules-integration-test-part-4 (11, ubuntu-latest), which is consistent with this PR touching the JDBC/Hive integration-test bucket, so that failure should also be checked again after the dialect boundary is fixed.
DanielLeens
left a comment
There was a problem hiding this comment.
Sorry for missing this in the last review — that's on me. I re-reviewed the latest head from scratch and found one blocker that I should have caught earlier. This is a carryover from the previous version, not something introduced by the latest commit.
Thanks for the update. I walked the full current JDBC/Hive metadata path again instead of only looking at the helper diff.
What this PR fixes
- User pain point: Hive
table_pathreads can lose partition-key metadata, soCatalogTable.partitionKeysstays empty even though the underlying table is partitioned. - Fix approach: the latest head routes the direct JDBC
table_pathpath throughCatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect), letsHiveDialect.getPartitionKeys(...)parse theDESCRIBE <table>partition section, and keeps Inceptor on the normal primary-key metadata path via an explicit override. - One-line summary: this is the right direction for wiring Hive partition-key detection into the real JDBC metadata path that users actually hit.
Full runtime chain I rechecked
Jdbc source initialization
-> JdbcSource.<init>() [JdbcSource.java:64-68]
-> JdbcCatalogUtils.getTables(...) [JdbcCatalogUtils.java:78-146]
-> findCatalog(...)
-> no dedicated Hive / Inceptor catalog factory here, so the normal flow falls back to direct JDBC loading
-> getCatalogTable(tableConfig, connection, jdbcDialect) [JdbcCatalogUtils.java:407-412]
-> CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect) [CatalogUtils.java:283-288]
-> dialect.getPartitionKeys(connection, tablePath)
-> HiveDialect.getPartitionKeys(...) [HiveDialect.java:83-119]
-> DESCRIBE db.table
-> parse the "# Partition Information" section
-> getTableSchema(metadata, tablePath, dialect) [CatalogUtils.java:230-237]
-> HiveDialect.supportsPrimaryKeyMetadata() == false [HiveDialect.java:56-60]
-> InceptorDialect.supportsPrimaryKeyMetadata() == true [InceptorDialect.java:31-35]
-> CatalogTable.of(..., partitionKeys, ...)
Findings
- The main runtime path does hit this fix. This is not dead helper code.
- The Hive-specific metadata boundary now looks correct: Hive skips unreliable PK metadata, while Inceptor explicitly keeps it.
- I do not see a reopened source-level blocker in the metadata-loading logic itself.
- I did find one blocking regression in the newly added Hive E2E assertion.
Issue 1: the new Hive E2E assertion is internally inconsistent
- Location:
seatunnel-e2e/seatunnel-connector-v2-e2e/connector-jdbc-e2e/connector-jdbc-e2e-part-3/src/test/java/org/apache/seatunnel/connectors/seatunnel/jdbc/JdbcHiveIT.java:160-177,seatunnel-e2e/seatunnel-connector-v2-e2e/connector-jdbc-e2e/connector-jdbc-e2e-part-3/src/test/resources/jdbc_hive_partition_source_and_assert.conf:54-56 - Problem:
the new test inserts 6 partitioned rows and incrementsidfrom1to6, but the assert-sink config still requires theidfield to satisfy[{min = 1}, {max = 3}].
With the real assert execution path, this is a deterministic failure rather than a theoretical edge case:
AssertExecutor.fail(...)checks field rules row by row, so rows withid=4/5/6will violate the configuredmax = 3. - Risk:
the new E2E cannot serve as a reliable regression guard for this fix, and CI can fail even when the actual Hive metadata path is correct. - Recommended fix:
Option A, recommended: keep the current 6-row partition coverage and update the assertedidmax to6.
Option B: reduce the inserted sample size to 3 rows, but that weakens the partition-coverage value of the new E2E. - Severity: High
Compatibility / side effects
- No API, config, default-value, protocol, or serialization-format change.
- The runtime behavior change is the intended Hive metadata repair, while Inceptor keeps its previous PK-metadata behavior through the explicit override.
Tests
- The new unit coverage is directionally good:
HiveDialectTestcoversDESCRIBEpartition parsing.CatalogUtilsTestcovers the Hive PK-metadata boundary and the identifier-case fallback boundary.
- I do not see a new flaky-test pattern in the added tests.
- The blocker is test correctness, not test stability.
Conclusion: can merge after fixes
Blocking items:
- Fix the new Hive E2E assertion so it matches the 6 inserted partitioned rows.
Non-blocking follow-up:
- The latest GitHub
Buildis still pending on the reviewed head. I do not see a new failure signature to attribute yet, but please wait for the updated Build result after fixing the E2E assertion.
Overall, the metadata-path repair itself looks right on the current head. The remaining blocker is that the newly added regression test is currently asserting an impossible contract.
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the follow-up. I re-reviewed the latest head from scratch, with extra attention to the blocker from the previous Daniel review.
What problem this PR solves
- User pain point
Hivetable_pathreads can lose partition-key metadata, soCatalogTable.partitionKeysstays empty even though the table is partitioned. - Fix approach
The PR routes the direct JDBCtable_pathpath through the realCatalogTableconstruction flow and letsHiveDialectparse partition keys fromDESCRIBE, whileInceptorDialectkeeps its own primary-key metadata contract. - One sentence
The metadata-path fix still looks right, and the blocking E2E assertion mismatch from the previous round is fixed on the latest head.
Simple example: the new Hive E2E inserts 6 rows across 3 partition dates and 2 partition hours. The assert config now checks id <= 6 instead of the old impossible id <= 3, so the regression guard matches the data it actually inserts.
1. Code change review
1.1 Core logic analysis
Runtime chain I rechecked:
Jdbc source initialization
-> JdbcCatalogUtils.getCatalogTable(...)
-> CatalogUtils.getCatalogTable(connection, tablePath, jdbcDialect)
-> HiveDialect.getPartitionKeys(...)
-> DESCRIBE db.table
-> parse the partition section
-> getTableSchema(...)
-> HiveDialect.supportsPrimaryKeyMetadata() == false
-> InceptorDialect.supportsPrimaryKeyMetadata() == true
-> CatalogTable.of(..., partitionKeys, ...)
The previous blocker was in the new E2E assertion path. On the latest head:
JdbcHiveITstill inserts 6 partitioned rows (id = 1..6) in the same way.jdbc_hive_partition_source_and_assert.confnow raises theidmax from3to6.InceptorDialectstill keeps the explicitsupportsPrimaryKeyMetadata() == trueoverride.
Key findings:
- The real JDBC metadata path still hits this fix; it is not dead helper logic.
- The previous Daniel blocker is fixed on the latest head.
- I do not see a reopened source-level blocker in the Hive / Inceptor metadata boundary.
- The remaining gate is CI rather than code-path correctness.
1.2 Compatibility impact
Fully compatible from the current source-review perspective. The intended Hive metadata repair remains scoped correctly, while Inceptor keeps its prior primary-key metadata behavior.
1.3 Performance / side effects
No meaningful CPU, memory, GC, concurrency, retry, or resource-release regression surfaced in the current head.
1.4 Error handling and logging
No new source-level blocking issue found on the latest head.
2. Code quality evaluation
2.1 Code style
The current dialect boundary is much cleaner than the earlier versions.
2.2 Test coverage and stability
HiveDialectTestcovers partition parsing.CatalogUtilsTestcovers the Hive/Inceptor PK-metadata boundary.- The Hive E2E assertion is now internally consistent with the inserted data.
Test stability rating: Stable.
2.3 Documentation
No new documentation blocker from this round.
3. Architecture
3.1 Solution quality
This is now a precise fix for the real Hive metadata path.
3.2 Maintainability
Good. The Hive-specific exception stays explicit, and Inceptor no longer inherits it accidentally.
3.3 Extensibility
The pattern is reusable for other JDBC dialects that need concrete metadata overrides.
3.4 Historical compatibility
No historical-compatibility blocker found on the latest head.
4. Issue summary
No new source-level blocking issue found.
5. Merge conclusion
Conclusion: can merge
- Blocking items
- No code blocker from my side on the latest head.
- Suggested but non-blocking follow-ups
- The latest GitHub
Buildcheck iscancelled, so please rerun / get a green Build before the actual merge.
From the source-review side, the current head looks good to me now. Thanks for closing the E2E mismatch cleanly.
|
Thanks for the review and the follow-up. While adding the
I also fixed the new Hive E2E assert config to match the 6 inserted partition rows and kept explicit |
davidzollo
left a comment
There was a problem hiding this comment.
+1 if CI passes.
LGTM
|
By the way, CI in this project usually needs to be rerun several times before all checks pass. |
Yes, you can only retry the failed CI tasks |
|
Hi @zhangqs0205, thanks for the detailed write-up — that context helps. The current Daniel approval already covers head |
|
Hi @davidzollo ,CI is passed |
|
Thanks for the update, and good to hear CI is green now. I checked the current PR head If you push another commit later, feel free to ping me again and I will take another pass. |
Fixes: #10597
Purpose of this pull request
This PR fixes missing Hive
partitionKeysin JDBC catalog metadata.It adds a JDBC dialect extension point for partition key detection, detects Hive partition keys from
DESCRIBE <table>, and propagates them into the relevantCatalogTablemetadata paths.Does this PR introduce any user-facing change?
It fixes JDBC metadata for Hive tables so partition keys can be auto-detected instead of always being empty.
How was this patch tested?
Added
HiveDialectTestand extendedCatalogUtilsTest.Check list
New License Guide
incompatible-changes.mdto describe the incompatibility caused by this PR.