[834] Add Hudi table version 9 (Hudi 1.x) support to the Hudi target #835

Draft
vinishjail97 wants to merge 7 commits into apache:main from vinishjail97:hudi-table-version-9
@vinishjail97

What

Adds Hudi table version 9 (Hudi 1.x) support to the Hudi target, behind a new config and enabled by default, resolving #834.

A new target config xtable.hudi.target.table_version accepts 6 or 9 and defaults to 9. Version 6 keeps the legacy 0.x timeline layout and column-stats index V1; version 9 uses the Hudi 1.x timeline layout (V2) and column-stats index V2.
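
A minimal sketch of the parsing and validation described above, under stated assumptions: the class and method names are hypothetical and are not the actual `HudiTargetConfig` code, and the default of 9 follows this description (a later commit in this PR flips it to 6). Only the property key and the allowed values come from the PR text.

```java
import java.util.Map;

/** Hypothetical sketch; not the actual HudiTargetConfig implementation. */
final class TableVersionConfigSketch {
  // Property key named in the PR description.
  static final String TABLE_VERSION_KEY = "xtable.hudi.target.table_version";
  // Default as stated in this description (assumption for the sketch).
  static final int DEFAULT_TABLE_VERSION = 9;

  /** Reads the table version from the target's additional properties. */
  static int parseTableVersion(Map<String, String> properties) {
    String raw = properties.get(TABLE_VERSION_KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_TABLE_VERSION;
    }
    int version;
    try {
      version = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          TABLE_VERSION_KEY + " must be 6 or 9, got: " + raw, e);
    }
    // Only the two versions named in the description are accepted.
    if (version != 6 && version != 9) {
      throw new IllegalArgumentException(
          TABLE_VERSION_KEY + " must be 6 or 9, got: " + raw);
    }
    return version;
  }

  public static void main(String[] args) {
    System.out.println(parseTableVersion(Map.of())); // prints 9
    System.out.println(parseTableVersion(Map.of(TABLE_VERSION_KEY, "6"))); // prints 6
  }
}
```

Rejecting any value other than 6 or 9 up front keeps an unsupported version from reaching table initialization.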

Target changes

  • HudiTargetConfig parses the new config from the target's additional properties.
  • HudiTableManager.initializeHudiTable initializes the table at the selected version; the write config derives its write version from the table itself (autoUpgrade=false).
  • Column stats for partitioned tables: enable the column-stats index for all tables and disable the partition-stats index independently via hoodie.metadata.index.partition.stats.enable. The partition-stats generation path groups committed files by fileId, which fails for XTable's externally registered files (non-Hudi names, _hudiext marker); see #832 (Hudi 1.2.0: enable MDT column-stats index for partitioned tables). Disabling the partition-stats index independently requires apache/hudi#19111 (feat: Allow disabling the partition stats index independently of column stats).
  • Timeline archival: pick the archiver by layout version (TimelineArchivers.getInstance) so version 9 uses the V2/LSM archiver; the previous hardcoded TimelineArchiverV1 produced no archives on v9.

Source changes

  • HudiConversionSource selects and orders instants by completion time on version 9 (timeline layout V2) and by requested time on version 6. A commit that completes out of order relative to its requested time is no longer skipped during incremental sync.
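
The difference between the two selection modes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Instant` class, the fixed-width string timestamps, and the checkpoint values are hypothetical and do not mirror Hudi's or XTable's actual types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of requested-time vs completion-time instant selection. */
final class InstantSelectionSketch {
  static final class Instant {
    final String name;
    final String requestedTime;
    final String completionTime;

    Instant(String name, String requestedTime, String completionTime) {
      this.name = name;
      this.requestedTime = requestedTime;
      this.completionTime = completionTime;
    }
  }

  /** Returns names of instants whose chosen time is after the checkpoint, ordered by that time. */
  static List<String> selectAfter(
      List<Instant> completed, String checkpoint, Function<Instant, String> timeOf) {
    List<Instant> selected = new ArrayList<>();
    for (Instant instant : completed) {
      if (timeOf.apply(instant).compareTo(checkpoint) > 0) {
        selected.add(instant);
      }
    }
    selected.sort(Comparator.comparing(timeOf));
    List<String> names = new ArrayList<>();
    for (Instant instant : selected) {
      names.add(instant.name);
    }
    return names;
  }

  // A was requested first but completed after B (the straggler). The previous sync saw only B.
  static List<Instant> timeline() {
    return List.of(
        new Instant("A", "001", "005"),
        new Instant("B", "002", "003"),
        new Instant("C", "004", "006"));
  }

  /** Checkpoint is B's requested time: A (requested 001) falls before it and is skipped. */
  static List<String> byRequestedTime() {
    return selectAfter(timeline(), "002", i -> i.requestedTime);
  }

  /** Checkpoint is B's completion time: A (completed 005) is still picked up. */
  static List<String> byCompletionTime() {
    return selectAfter(timeline(), "003", i -> i.completionTime);
  }

  public static void main(String[] args) {
    System.out.println(byRequestedTime()); // prints [C]
    System.out.println(byCompletionTime()); // prints [A, C]
  }
}
```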

Tests

  • TestHudiFileStatsExtractor is parameterized over versions 6 (8 columns, decimal excluded under index V1 per HUDI-8585) and 9 (9 columns, decimal present under V2).
  • TestHudiTargetConfig covers the config defaulting/validation.
  • A deterministic out-of-order-completion incremental-sync test on a v9 source table verifies the straggler commit is still picked up (it is dropped under requested-time selection).
  • Full suite green at v9: xtable-core unit tests (479), all ITs (ITHudiConversionTarget; ITHudiConversionSource, 27; ITConversionController, 44; cross-format sources), and utilities/service/aws/hive-metastore.

⚠️ Dependency note — draft

This bumps hudi.version to 1.3.0-SNAPSHOT to consume apache/hudi#19111 (hoodie.metadata.index.partition.stats.enable). CI will not pass until that PR is merged and a Hudi release containing it is published, at which point this should be re-pinned to the released version. Kept as a draft until then.

🤖 Generated with Claude Code

vinishjail97 and others added 4 commits June 29, 2026 21:07
Adds an `xtable.hudi.target.table_version` config (values 6 or 9, default 9)
so the Hudi target can write the Hudi 1.x table format (timeline layout V2,
column-stats index V2) instead of being pinned to version 6.

Target changes:
- HudiTargetConfig parses the new config from the target's additional
  properties; HudiTableManager.initializeHudiTable and the write config now
  honour the selected version (write version derives from the table itself).
- Enable the column-stats index for all tables and disable the partition-stats
  index independently (hoodie.metadata.index.partition.stats.enable, added in
  apache/hudi#19111) so column stats work for partitioned external-file tables.
- Select the timeline archiver by layout version (TimelineArchivers.getInstance)
  so version 9 uses the V2/LSM archiver; switch to the HoodieCleanStat builder.

Source changes:
- HudiConversionSource selects and orders instants by completion time on
  version 9 (timeline layout V2) and by requested time on version 6, so a
  commit that completes out of order relative to its requested time is no
  longer skipped during incremental sync.

Tests:
- Bump hudi.version to 1.3.0-SNAPSHOT to pick up apache/hudi#19111.
- Parameterize TestHudiFileStatsExtractor over versions 6 (8 columns, decimal
  excluded) and 9 (9 columns, decimal present).
- Add TestHudiTargetConfig and an out-of-order-completion incremental sync test
  on a version 9 source table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WHcYnBcrWmcmrC1ac4g9B6
…arget

Column stats are now generated for partitioned Hudi targets (column-stats
index enabled, partition-stats index disabled independently), so drop the
`if (!partitioned)` guards that previously skipped the column-stats
assertions for partitioned tables. The partitioned/non-partitioned
parameterization is unchanged; the col-stats checks now run for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WHcYnBcrWmcmrC1ac4g9B6
Source test tables already enable the column-stats index unconditionally and
the suite passes with array/map schemas, so remove the leftover commented-out
schemaContainsArrayOrMap guard and its stale apache#773 note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WHcYnBcrWmcmrC1ac4g9B6
…ble version 9

Re-enables the three conversion cases apache#772 disabled for the Hudi 1.x target:

- Un-partitioned Paimon -> Hudi (ITConversionController and ITConversionService).
  BaseFileUpdatesExtractor now emits Hudi's external file-group-prefix format
  (Hudi PR #17788) for bucketed files instead of folding the "bucket-N" directory
  into the partition path: the file is registered under its true partition (empty
  for un-partitioned) with fileId "bucket-N/<file>" and the 3-arg marker
  "<file>_<commit>_fg%3Dbucket-N_hudiext". Applied consistently across the snapshot
  path, the diff path, and file-id derivation; non-bucketed sources are unaffected.

- HUDI (partitioned on the nested column "nested_record.level") -> ICEBERG. This
  depends on the Hudi reader fix in apache/hudi#19123, so it will only pass once
  that fix is available in the Hudi snapshot the build resolves.
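
The naming scheme in the first bullet above can be sketched as follows. This is an illustration only: the class, method names, and sample values are hypothetical, and building the `fg%3Dbucket-N` segment by URL-encoding `fg=bucket-N` is an assumption based on the marker string quoted in this message, not a confirmed description of the Hudi implementation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the external file-group-prefix naming quoted above. */
final class ExternalFileGroupNamingSketch {
  /** fileId of the form "bucket-N/<file>". */
  static String fileId(String bucketDir, String fileName) {
    return bucketDir + "/" + fileName;
  }

  /** Marker of the form "<file>_<commit>_fg%3Dbucket-N_hudiext". */
  static String markerName(String fileName, String commitTime, String bucketDir) {
    // Assumption: the prefix segment is "fg=<bucketDir>" URL-encoded, so '=' becomes "%3D".
    String encodedPrefix = URLEncoder.encode("fg=" + bucketDir, StandardCharsets.UTF_8);
    return fileName + "_" + commitTime + "_" + encodedPrefix + "_hudiext";
  }

  public static void main(String[] args) {
    // Sample values are made up for the example.
    System.out.println(fileId("bucket-3", "data-0.parquet"));
    // prints bucket-3/data-0.parquet
    System.out.println(markerName("data-0.parquet", "20260629210700000", "bucket-3"));
    // prints data-0.parquet_20260629210700000_fg%3Dbucket-3_hudiext
  }
}
```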

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vinishjail97 and others added 3 commits July 1, 2026 16:28
A savepoint instant reuses the requested time of the commit it pins, so
orderByCompletionTimeAndDedup (the table version 9 / completion-time
ordering path) dropped it when deduping the merged commit lists by
requested time alone: putIfAbsent kept the data commit and silently
discarded the savepoint. The version 6 path (mergeAndDedupLists) dedups
by full instant equality, which includes the action, so it never had
this problem.

Include the action in the dedup key. The intended dedup (the same
instant appearing in both the pending list and the newly-completed
list) still collapses, but distinct actions sharing a requested time
survive, and the version 9 backlog matches version 6:
commit, savepoint (no-op), restore, commit.
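
A minimal sketch of the dedup-key change, assuming a hypothetical `Instant` class and a merged list that already contains one duplicate (the same commit appearing in both the pending and newly-completed lists). It is not the actual `orderByCompletionTimeAndDedup` code; it only shows why the key matters with `putIfAbsent`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of deduping instants by requested time vs requested time plus action. */
final class InstantDedupSketch {
  static final class Instant {
    final String requestedTime;
    final String action;

    Instant(String requestedTime, String action) {
      this.requestedTime = requestedTime;
      this.action = action;
    }

    @Override
    public String toString() {
      return action + "@" + requestedTime;
    }
  }

  /** Keeps the first instant seen for each key, preserving encounter order. */
  static List<Instant> dedup(List<Instant> merged, Function<Instant, String> keyOf) {
    Map<String, Instant> seen = new LinkedHashMap<>();
    for (Instant instant : merged) {
      seen.putIfAbsent(keyOf.apply(instant), instant);
    }
    return new ArrayList<>(seen.values());
  }

  // The savepoint reuses the requested time of the commit it pins; the last commit is listed twice.
  static List<Instant> merged() {
    return List.of(
        new Instant("001", "commit"),
        new Instant("001", "savepoint"),
        new Instant("002", "restore"),
        new Instant("003", "commit"),
        new Instant("003", "commit"));
  }

  static String byRequestedTimeOnly() {
    return dedup(merged(), i -> i.requestedTime).toString();
  }

  static String byRequestedTimeAndAction() {
    return dedup(merged(), i -> i.requestedTime + "|" + i.action).toString();
  }

  public static void main(String[] args) {
    System.out.println(byRequestedTimeOnly());
    // prints [commit@001, restore@002, commit@003]
    System.out.println(byRequestedTimeAndAction());
    // prints [commit@001, savepoint@001, restore@002, commit@003]
  }
}
```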

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… versions 6 and 9

Flip HudiTargetConfig.DEFAULT_TABLE_VERSION from NINE to SIX so the
default output stays readable by released Hudi readers; version 9
remains fully supported via xtable.hudi.target.table_version=9.

Parameterize the Hudi test suites so every run exercises both table
versions instead of only the default:
- ITHudiConversionTarget: partitioned x {SIX, NINE} via a MethodSource
  cross-product; the target client sets the version through
  HudiTargetConfig.HUDI_TABLE_VERSION.
- ITHudiConversionSource: source tables are created at {SIX, NINE} via
  the table-type/partition MethodSource cross-products, and the
  parameterized tests write through TestSparkHudiTable (Spark writer)
  instead of the Java client.
- ITConversionController: combinations targeting HUDI are emitted once
  per version; getTableSyncConfig gained an overload that applies the
  version to the Hudi target properties.
- TestHudiTargetConfig/TestHudiConversionTarget assert against
  DEFAULT_TABLE_VERSION instead of a hard-coded version.

Version 9 source coverage in ITHudiConversionSource depends on two
Hudi fixes validated against a locally patched 1.3.0-SNAPSHOT:
apache/hudi#19126 (column stats on map/array-nested leaves during MOR
log-append) and the savepoint backlog fix in the previous commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>