Skip to content

feat: Add iceberg.schema to footer for compatibility#2249

Open
vovacf201 wants to merge 4 commits intoapache:mainfrom
risingwavelabs:pr/iceberg-schema-footer-v2
Open

feat: Add iceberg.schema to footer for compatibility#2249
vovacf201 wants to merge 4 commits intoapache:mainfrom
risingwavelabs:pr/iceberg-schema-footer-v2

Conversation

@vovacf201
Copy link
Copy Markdown

This PR adds the iceberg.schema key-value metadata to the Parquet file footer.

It embeds the full Iceberg schema JSON (including field IDs, types, and schema ID) into the Parquet file footer metadata, which is required for compatibility with downstream readers.

Changes:

  • Add ICEBERG_SCHEMA_KEY constant for the iceberg.schema metadata key
  • Add add_iceberg_schema_metadata method to ParquetWriterBuilder that injects the schema JSON into writer properties
  • Preserve any existing key-value metadata from the caller
  • Add comprehensive tests for metadata presence, roundtrip, nested schemas, and preservation of existing metadata

Cherry-picked from risingwavelabs/iceberg-rust commit 18a0e83

* mock change

* remove fmt

* add unit tests

* fix tests

* format

* commit
/// schema ID) into the Parquet file footer.
fn add_iceberg_schema_metadata(props: WriterProperties, schema: &Schema) -> WriterProperties {
let schema_json = serde_json::to_string(schema)
.expect("Iceberg schema serialization to JSON should not fail");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we propagate this error up rather than having it panic?


// Preserve any existing key-value metadata from the caller
let mut kv_metadata = props.key_value_metadata().cloned().unwrap_or_default();
kv_metadata.push(iceberg_kv);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a caller has already set ICEBERG_SCHEMA_KEY then we'll add it twice here. Java uses a map so I think it's not an issue there but I feel like we're better off reporting an error in this case?

let iceberg_kv = KeyValue::new(ICEBERG_SCHEMA_KEY.to_string(), schema_json);

// Preserve any existing key-value metadata from the caller
let mut kv_metadata = props.key_value_metadata().cloned().unwrap_or_default();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible we're better propagating this error upstream also rather than silently ignoring it?

Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, looks like a good change to me. Left a few comments.

Also confirmed this matches Java here https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L363

@xanderbailey
Copy link
Copy Markdown
Contributor

I realized we also don't write metadata for delete files either so I put this up #2316

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants