Hudi 1.2.0: enable MDT column-stats index for partitioned tables #832

Description

@vinishjail97

Background

Part of the Hudi 1.x upgrade (#762).

In Hudi 1.2.0 the metadata-table (MDT) partition-stats index is coupled to the column-stats index. For partitioned tables, the partition-stats generation path rebuilds a file-system view over the committed external parquet files and groups them by fileId. XTable registers externally-written files whose names are not Hudi-native; once the _hudiext marker is stripped, the remaining file name cannot be parsed into a valid fileId, which causes column/partition-stats generation to fail.
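To illustrate the failure mode, here is a minimal sketch and not Hudi's actual parsing code. It assumes a simplified Hudi-native base file layout of `<fileId>_<writeToken>_<instantTime>.<extension>`; the class `FileIdParseSketch`, its regex, and the sample file names are hypothetical. It only shows that a name produced by an external writer has no segment that can be read as a fileId.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Hudi's parser.
final class FileIdParseSketch {

    // Assumed, simplified layout: <fileId>_<writeToken>_<instantTime>.<extension>
    private static final Pattern HUDI_NATIVE_NAME =
        Pattern.compile("^(.+?)_(\\d+-\\d+-\\d+)_(\\d+)\\.(\\w+)$");

    // Returns the fileId segment, or empty when the name does not follow the assumed layout.
    static Optional<String> parseFileId(String fileName) {
        Matcher m = HUDI_NATIVE_NAME.matcher(fileName);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Under these assumptions, a name such as `abc-0_1-2-3_20240101120000000.parquet` yields a fileId, while a name such as `part-00000-3f2a.c000.snappy.parquet` yields none, which is the situation the stats generation path runs into once the marker is stripped.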

Current workaround

In HudiConversionTarget#getWriteConfig, column stats are enabled only for un-partitioned tables:

.withMetadataIndexColumnStats(!metaClient.getTableConfig().isTablePartitioned())

So partitioned tables synced through XTable currently get no MDT column-stats index.

Ask

Enable the MDT column-stats (and partition-stats) index for partitioned tables, e.g. by making the partition-stats file grouping tolerant of externally-registered (non-Hudi) file names, so col-stats can be turned on regardless of partitioning.
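One possible shape for the tolerant grouping is sketched below. This is an assumption about how a fix could look, not the existing Hudi or XTable implementation: `TolerantGroupingSketch`, `fileIdOrFallback`, and the fallback of using the whole file name as its own group key are all hypothetical, and the name layout is the same simplified `<fileId>_<writeToken>_<instantTime>.<extension>` assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a grouping step that tolerates non-Hudi file names.
final class TolerantGroupingSketch {

    // Assumed, simplified layout: <fileId>_<writeToken>_<instantTime>.<extension>
    private static final Pattern HUDI_NATIVE_NAME =
        Pattern.compile("^(.+?)_(\\d+-\\d+-\\d+)_(\\d+)\\.(\\w+)$");

    // Uses the parsed fileId when the name is Hudi-native; otherwise falls back to
    // the full file name, so each externally-written file forms its own group.
    static String fileIdOrFallback(String fileName) {
        Matcher m = HUDI_NATIVE_NAME.matcher(fileName);
        return m.matches() ? m.group(1) : fileName;
    }

    // Groups file names by fileId, preserving first-seen order of the groups.
    static Map<String, List<String>> groupByFileId(List<String> fileNames) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : fileNames) {
            groups.computeIfAbsent(fileIdOrFallback(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }
}
```

With a fallback of this kind, grouping never throws on an unparseable name, so the column-stats flag would no longer need to depend on whether the table is partitioned. Whether the full file name is an acceptable group key for the real partition-stats path is an open question for the actual fix.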

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
