[834] Add Hudi table version 9 (Hudi 1.x) support to the Hudi target #835
Draft
vinishjail97 wants to merge 7 commits into
Adds an `xtable.hudi.target.table_version` config (values 6 or 9, default 9)
so the Hudi target can write the Hudi 1.x table format (timeline layout V2,
column-stats index V2) instead of being pinned to version 6.
Target changes:
- HudiTargetConfig parses the new config from the target's additional
properties; HudiTableManager.initializeHudiTable and the write config now
honour the selected version (write version derives from the table itself).
- Enable the column-stats index for all tables and disable the
partition-stats index independently
(hoodie.metadata.index.partition.stats.enable, added in apache/hudi#19111)
so column stats work for partitioned external-file tables.
- Select the timeline archiver by layout version
(TimelineArchivers.getInstance) so version 9 uses the V2/LSM archiver;
switch to the HoodieCleanStat builder.
Source changes:
- HudiConversionSource selects and orders instants by completion time on
version 9 (timeline layout V2) and by requested time on version 6, so a
commit that completes out of order relative to its requested time is no
longer skipped during incremental sync.
Tests:
- Bump hudi.version to 1.3.0-SNAPSHOT to pick up apache/hudi#19111.
- Parameterize TestHudiFileStatsExtractor over versions 6 (8 columns,
decimal excluded) and 9 (9 columns, decimal present).
- Add TestHudiTargetConfig and an out-of-order-completion incremental sync
test on a version 9 source table.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WHcYnBcrWmcmrC1ac4g9B6
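To illustrate why the choice of ordering matters for incremental sync, here is a minimal sketch. It is not the `HudiConversionSource` implementation: the `Instant` class, the `instantsAfter` helper, and the string checkpoints are all hypothetical stand-ins, and timestamps are assumed to compare lexicographically.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class IncrementalSelectionSketch {
  // Hypothetical stand-in for a timeline instant; not Hudi's HoodieInstant.
  static final class Instant {
    final String requestedTime;
    final String completionTime;

    Instant(String requestedTime, String completionTime) {
      this.requestedTime = requestedTime;
      this.completionTime = completionTime;
    }
  }

  // Returns the instants strictly after the checkpoint, ordered by the chosen time.
  static List<Instant> instantsAfter(
      List<Instant> timeline, String checkpoint, boolean byCompletionTime) {
    Function<Instant, String> key =
        byCompletionTime ? i -> i.completionTime : i -> i.requestedTime;
    return timeline.stream()
        .filter(i -> key.apply(i).compareTo(checkpoint) > 0)
        .sorted(Comparator.comparing(key))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "001" is requested first but completes last; "002" completes first.
    List<Instant> timeline =
        Arrays.asList(new Instant("001", "030"), new Instant("002", "010"));
    // A requested-time checkpoint of "002" hides the late-completing "001".
    System.out.println(instantsAfter(timeline, "002", false).size());
    // A completion-time checkpoint of "010" still surfaces "001".
    System.out.println(instantsAfter(timeline, "010", true).size());
  }
}
```

In this sketch, a sync that checkpointed on requested time `002` never sees instant `001` once it finally completes, whereas a checkpoint on completion time `010` picks it up.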
…arget
Column stats are now generated for partitioned Hudi targets (column-stats
index enabled, partition-stats index disabled independently), so drop the
`if (!partitioned)` guards that previously skipped the column-stats
assertions for partitioned tables. The partitioned/non-partitioned
parameterization is unchanged; the col-stats checks now run for both.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WHcYnBcrWmcmrC1ac4g9B6
Source test tables already enable the column-stats index unconditionally
and the suite passes with array/map schemas, so remove the leftover
commented-out schemaContainsArrayOrMap guard and its stale apache#773 note.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WHcYnBcrWmcmrC1ac4g9B6
…ble version 9
Re-enables the three conversion cases apache#772 disabled for the Hudi 1.x
target:
- Un-partitioned Paimon -> Hudi (ITConversionController and
ITConversionService). BaseFileUpdatesExtractor now emits Hudi's external
file-group-prefix format (Hudi PR #17788) for bucketed files instead of
folding the "bucket-N" directory into the partition path: the file is
registered under its true partition (empty for un-partitioned) with fileId
"bucket-N/<file>" and the 3-arg marker
"<file>_<commit>_fg%3Dbucket-N_hudiext". Applied consistently across the
snapshot path, the diff path, and file-id derivation; non-bucketed sources
are unaffected.
- HUDI (partitioned on the nested column "nested_record.level") -> ICEBERG.
This depends on the Hudi reader fix in apache/hudi#19123, so it will only
pass once that fix is available in the Hudi snapshot the build resolves.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
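The naming scheme in the first bullet can be sketched as plain string construction. This is only an illustration of the two formats quoted in the commit message, not the `BaseFileUpdatesExtractor` code: the class and method names are hypothetical, and producing `fg%3Dbucket-N` by URL-encoding `fg=bucket-N` is an assumption based on `%3D` being the percent-encoding of `=`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ExternalFileNameSketch {
  // File id for a bucketed file: "bucket-N/<file>".
  static String fileId(String bucketDir, String fileName) {
    return bucketDir + "/" + fileName;
  }

  // 3-arg marker: "<file>_<commit>_fg%3Dbucket-N_hudiext".
  // Assumption: the prefix segment is the URL-encoded form of "fg=bucket-N".
  static String markedName(String fileName, String commitTime, String bucketDir) {
    String prefix = URLEncoder.encode("fg=" + bucketDir, StandardCharsets.UTF_8);
    return fileName + "_" + commitTime + "_" + prefix + "_hudiext";
  }

  public static void main(String[] args) {
    // Illustrative values only; the file name and commit time are made up.
    System.out.println(fileId("bucket-0", "data-0.parquet"));
    System.out.println(markedName("data-0.parquet", "20240101000000000", "bucket-0"));
  }
}
```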
A savepoint instant reuses the requested time of the commit it pins, so
orderByCompletionTimeAndDedup (the table version 9 / completion-time
ordering path) dropped it when deduping the merged commit lists by requested
time alone: putIfAbsent kept the data commit and silently discarded the
savepoint. The version 6 path (mergeAndDedupLists) dedups by full instant
equality, which includes the action, so it never had this problem.
Include the action in the dedup key. The intended dedup (the same instant
appearing in both the pending list and the newly-completed list) still
collapses, but distinct actions sharing a requested time survive, and the
version 9 backlog matches version 6: commit, savepoint (no-op), restore,
commit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
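The fix can be pictured with a small sketch of deduplicating on a composite key. This is not the actual `orderByCompletionTimeAndDedup` method: the `Instant` class, the `dedupAndOrder` helper, and the `requestedTime|action` string key are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InstantDedupSketch {
  // Hypothetical stand-in for a timeline instant; not Hudi's HoodieInstant.
  static final class Instant {
    final String requestedTime;
    final String completionTime;
    final String action;

    Instant(String requestedTime, String completionTime, String action) {
      this.requestedTime = requestedTime;
      this.completionTime = completionTime;
      this.action = action;
    }
  }

  // Merges both lists, drops true duplicates, and orders by completion time.
  static List<Instant> dedupAndOrder(List<Instant> pending, List<Instant> completed) {
    Map<String, Instant> unique = new LinkedHashMap<>();
    for (List<Instant> source : Arrays.asList(pending, completed)) {
      for (Instant i : source) {
        // Keying on requested time alone would drop the savepoint here,
        // because it shares its requested time with the commit it pins.
        unique.putIfAbsent(i.requestedTime + "|" + i.action, i);
      }
    }
    List<Instant> ordered = new ArrayList<>(unique.values());
    ordered.sort(Comparator.comparing((Instant i) -> i.completionTime));
    return ordered;
  }

  public static void main(String[] args) {
    Instant commit = new Instant("001", "010", "commit");
    Instant savepoint = new Instant("001", "020", "savepoint");
    List<Instant> result =
        dedupAndOrder(Arrays.asList(commit), Arrays.asList(commit, savepoint));
    for (Instant i : result) {
      System.out.println(i.requestedTime + " " + i.action);
    }
  }
}
```

The commit that appears in both lists collapses to one entry, while the savepoint with the same requested time is kept.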
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… versions 6 and 9
Flip HudiTargetConfig.DEFAULT_TABLE_VERSION from NINE to SIX so the
default output stays readable by released Hudi readers; version 9
remains fully supported via xtable.hudi.target.table_version=9.
Parameterize the Hudi test suites so every run exercises both table
versions instead of only the default:
- ITHudiConversionTarget: partitioned x {SIX, NINE} via a MethodSource
cross-product; the target client sets the version through
HudiTargetConfig.HUDI_TABLE_VERSION.
- ITHudiConversionSource: source tables are created at {SIX, NINE} via
the table-type/partition MethodSource cross-products, and the
parameterized tests write through TestSparkHudiTable (Spark writer)
instead of the Java client.
- ITConversionController: combinations targeting HUDI are emitted once
per version; getTableSyncConfig gained an overload that applies the
version to the Hudi target properties.
- TestHudiTargetConfig/TestHudiConversionTarget assert against
DEFAULT_TABLE_VERSION instead of a hard-coded version.
Version 9 source coverage in ITHudiConversionSource depends on two
Hudi fixes validated against a locally patched 1.3.0-SNAPSHOT:
apache/hudi#19126 (column stats on map/array-nested leaves during MOR
log-append) and the savepoint backlog fix in the previous commit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
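
The defaulting and validation behaviour described above can be sketched as follows. This is not the `HudiTargetConfig` class from the PR: the class name, the use of plain `int` values in place of a version enum, and the error messages are assumptions; only the property key, the accepted values (6 and 9), and the default of 6 come from the commit message.

```java
import java.util.Map;

final class TableVersionConfigSketch {
  static final String HUDI_TABLE_VERSION = "xtable.hudi.target.table_version";
  static final int DEFAULT_TABLE_VERSION = 6;

  // Reads the table version from the target's additional properties.
  static int parseTableVersion(Map<String, String> properties) {
    String raw = properties.get(HUDI_TABLE_VERSION);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_TABLE_VERSION; // unset: fall back to the default
    }
    int version;
    try {
      version = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Not a number: " + raw, e);
    }
    if (version != 6 && version != 9) {
      throw new IllegalArgumentException("Unsupported Hudi table version: " + version);
    }
    return version;
  }

  public static void main(String[] args) {
    System.out.println(parseTableVersion(Map.of()));
    System.out.println(parseTableVersion(Map.of(HUDI_TABLE_VERSION, "9")));
  }
}
```

Rejecting anything other than 6 or 9 up front keeps an unsupported version from reaching table initialization.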
What
Adds Hudi table version 9 (Hudi 1.x) support to the Hudi target behind a new config, resolving #834.
A new target config, `xtable.hudi.target.table_version`, accepts `6` or `9`. It was introduced with a default of `9`; the latest commit flips the default to `6` so the default output stays readable by released Hudi readers. Version 6 keeps the legacy 0.x timeline layout and column-stats index V1; version 9 uses the Hudi 1.x timeline layout (V2) and column-stats index V2.
Target changes
- `HudiTargetConfig` parses the new config from the target's additional properties.
- `HudiTableManager.initializeHudiTable` initializes the table at the selected version; the write config derives its write version from the table itself (`autoUpgrade=false`).
- The column-stats index is enabled for all tables and the partition-stats index is disabled independently via `hoodie.metadata.index.partition.stats.enable`. The partition-stats generation path groups committed files by fileId, which fails for XTable's externally registered files (non-Hudi names, `_hudiext` marker); see "Hudi 1.2.0: enable MDT column-stats index for partitioned tables" (#832). Disabling it independently requires apache/hudi#19111 ("feat: Allow disabling the partition stats index independently of column stats").
- The timeline archiver is selected by layout version (`TimelineArchivers.getInstance`) so version 9 uses the V2/LSM archiver; the previous hardcoded `TimelineArchiverV1` produced no archives on v9.
Source changes
- `HudiConversionSource` selects and orders instants by completion time on version 9 (timeline layout V2) and by requested time on version 6. A commit that completes out of order relative to its requested time is no longer skipped during incremental sync.
Tests
- `TestHudiFileStatsExtractor` is parameterized over versions 6 (8 columns, decimal excluded under index V1 per HUDI-8585) and 9 (9 columns, decimal present under V2).
- `TestHudiTargetConfig` covers the config defaulting/validation.
This bumps `hudi.version` to `1.3.0-SNAPSHOT` to consume apache/hudi#19111 (`hoodie.metadata.index.partition.stats.enable`). CI will not pass until that PR is merged and a Hudi release containing it is published, at which point this should be re-pinned to the released version. Kept as a draft until then.
🤖 Generated with Claude Code