Core, Spark: RewriteTablePath support for multiple source and destination prefixes #14355
krisnaru wants to merge 2 commits into
return prefixMappings.entrySet().stream()
    .filter(entry -> path.startsWith(entry.getKey()))
    .max(java.util.Comparator.comparing(entry -> entry.getKey().length()))
    .orElse(null);
Hey @krisnaru, I didn't get why we are comparing by the length of the prefixes that were passed in?
If you have multiple prefix mappings like:
s3://bucket/warehouse/ → s3://new-bucket/warehouse/
s3://bucket/warehouse/db/tbl/ → s3://other-bucket/data/
and the file path is s3://bucket/warehouse/db/tbl/data.parquet, both prefixes match (both are valid startsWith matches). By picking the longest matching prefix, you ensure the most specific mapping wins: the file gets rewritten to s3://other-bucket/data/data.parquet rather than s3://new-bucket/warehouse/db/tbl/data.parquet.
Without the length-based comparison, which mapping applies would depend on map iteration order, so a less specific prefix could be chosen and the path rewritten to the wrong location.
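To make the selection rule concrete, here is a minimal, self-contained sketch of longest-prefix matching outside of Iceberg. The class name LongestPrefixRewrite and the rewrite helper are hypothetical names used only for illustration, and returning an unmatched path unchanged is an assumption rather than something this PR specifies.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical, not Iceberg APIs.
final class LongestPrefixRewrite {

  // Returns the mapping whose source prefix is the longest one that `path` starts with,
  // or null when no source prefix matches.
  static Map.Entry<String, String> findLongestMatch(
      Map<String, String> prefixMappings, String path) {
    return prefixMappings.entrySet().stream()
        .filter(entry -> path.startsWith(entry.getKey()))
        .max(Comparator.comparingInt(entry -> entry.getKey().length()))
        .orElse(null);
  }

  // Swaps the matched source prefix for its target prefix.
  // Assumption: a path that matches no prefix is returned unchanged.
  static String rewrite(Map<String, String> prefixMappings, String path) {
    Map.Entry<String, String> match = findLongestMatch(prefixMappings, path);
    if (match == null) {
      return path;
    }
    return match.getValue() + path.substring(match.getKey().length());
  }

  public static void main(String[] args) {
    Map<String, String> mappings = new LinkedHashMap<>();
    mappings.put("s3://bucket/warehouse/", "s3://new-bucket/warehouse/");
    mappings.put("s3://bucket/warehouse/db/tbl/", "s3://other-bucket/data/");

    // Both prefixes match; the longer (more specific) one wins.
    // prints "s3://other-bucket/data/data.parquet"
    System.out.println(rewrite(mappings, "s3://bucket/warehouse/db/tbl/data.parquet"));
  }
}
```

Because the winner is chosen by prefix length rather than by position in the map, the result is the same whether the mappings live in a HashMap or a LinkedHashMap, and regardless of the order in which they were added.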
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
@anuragmantri can you help get eyes on this PR?
I think this is a useful feature for users who have moved the data / metadata locations during the life of a table.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Force-pushed: 9a44459 to 414eb87
Overview
Enhanced Apache Iceberg's RewriteTablePathSparkAction to support multiple source-target prefix pairs through a fluent chaining API, enabling complex table migration scenarios with hierarchical path mappings.
Problem Statement
The original implementation only supported a single source-target prefix pair, limiting users to simple one-to-one path transformations. This was insufficient for:
Multi-cloud migrations with different storage systems
Complex data reorganization with multiple path hierarchies
Cross-environment moves requiring multiple prefix mappings
Tables whose files span multiple Hadoop clusters
Usage
// Before: Single prefix only
.rewriteLocationPrefix(sourcePrefix, targetPrefix)
// After: Chainable multiple prefixes
.rewriteLocationPrefix("s3://old-bucket/", "s3://new-bucket/")
.rewriteLocationPrefix("hdfs://cluster/", "s3://data-lake/")
.rewriteLocationPrefix("/tmp/", "s3://staging/")
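One way chained calls could accumulate mappings is sketched below. This is a self-contained illustration, not the code in this PR: PrefixMappingSketch is a hypothetical stand-in for the action, and letting a repeated source prefix overwrite its earlier target is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the action; not the Iceberg RewriteTablePath interface.
final class PrefixMappingSketch {
  private final Map<String, String> prefixMappings = new LinkedHashMap<>();

  // Records one source-to-target pair and returns this so calls can be chained.
  // Assumption: registering the same source prefix twice keeps the latest target.
  PrefixMappingSketch rewriteLocationPrefix(String sourcePrefix, String targetPrefix) {
    if (sourcePrefix == null || sourcePrefix.isEmpty()) {
      throw new IllegalArgumentException("Source prefix must not be null or empty");
    }
    if (targetPrefix == null) {
      throw new IllegalArgumentException("Target prefix must not be null");
    }
    prefixMappings.put(sourcePrefix, targetPrefix);
    return this;
  }

  // Read-only view of everything registered so far.
  Map<String, String> prefixMappings() {
    return Collections.unmodifiableMap(prefixMappings);
  }

  public static void main(String[] args) {
    PrefixMappingSketch action = new PrefixMappingSketch()
        .rewriteLocationPrefix("s3://old-bucket/", "s3://new-bucket/")
        .rewriteLocationPrefix("hdfs://cluster/", "s3://data-lake/")
        .rewriteLocationPrefix("/tmp/", "s3://staging/");

    // Print each registered mapping, one per line.
    action.prefixMappings()
        .forEach((source, target) -> System.out.println(source + " -> " + target));
  }
}
```

Keeping the method name and two-argument signature of the single-prefix call means existing callers that register one pair would look the same; only the internal storage changes from one pair to a map.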