Skip to content

feat: add custom_metadata support to RecordBatch with IPC read/write#9445

Open
rustyconover wants to merge 6 commits into
apache:mainfrom
rustyconover:feat/recordbatch-custom-metadata
Open

feat: add custom_metadata support to RecordBatch with IPC read/write#9445
rustyconover wants to merge 6 commits into
apache:mainfrom
rustyconover:feat/recordbatch-custom-metadata

Conversation

@rustyconover
Copy link
Copy Markdown

Which issue does this PR close?

What changes are included in this PR?

Add per-batch custom_metadata to RecordBatch, matching the custom_metadata field on the IPC Message flatbuffer envelope. This allows attaching per-batch metadata separate from schema-level metadata, bringing parity with PyArrow's write_batch(custom_metadata=...) API (available since PyArrow v11.0.0).

Changes:

  • Add custom_metadata: HashMap<String, String> field to RecordBatch with custom_metadata(), custom_metadata_mut(), with_custom_metadata(), and into_parts_with_custom_metadata() accessors
  • IPC writer: serialize custom_metadata to Message flatbuffer
  • IPC reader: extract custom_metadata from Message at FileDecoder, StreamReader, and StreamDecoder call sites
  • arrow-flight: extract and propagate custom_metadata in flight_data_to_arrow_batch
  • arrow-select: preserve custom_metadata through filter_record_batch and take_record_batch
  • Metadata preserved through slice(), project(), normalize(), with_schema(), and remove_column()
  • PyArrow-generated test data for cross-language interop validation

Are these changes tested?

Yes there are tests in the PR.

Are there any user-facing changes?

There are no breaking changes.

Written with AI assistance; all changes reviewed by the author.

@github-actions github-actions Bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Feb 20, 2026
@rustyconover rustyconover force-pushed the feat/recordbatch-custom-metadata branch from 1ebc16a to 31227eb Compare February 25, 2026 17:51
@rustyconover rustyconover force-pushed the feat/recordbatch-custom-metadata branch from 31227eb to 9390ec4 Compare April 26, 2026 20:07
rustyconover added a commit to Query-farm/vgi-rpc-rust that referenced this pull request Apr 26, 2026
…ustom_metadata

Pin arrow-rs to rustyconover/arrow-rs#feat/recordbatch-custom-metadata
(apache/arrow-rs#9445) and rewrite vgi-rpc/src/wire.rs as a thin wrapper
around arrow_ipc::reader::StreamReader / writer::StreamWriter. Per-batch
metadata now travels on RecordBatch directly via with_custom_metadata()
/ custom_metadata(); the Metadata alias becomes HashMap<String, String>
and the ReadBatch wrapper is gone. relax_nullability flips
with_skip_validation(true) on the inner reader since upstream validates
before our schema rewrap.

Also bundles in-progress conformance worker, http, and arrow_type
changes that were already pending on the branch.

Conformance: 723/723 across pipe/subprocess/http/unix/externalize.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rustyconover -- I see the custom_metadata field on the IPC messages and that makes sense to expose somehow

I am less sure about adding a new field to RecordBatch -- mostly as I am not sure about the implications of doing so (though your point that an empty HashMap has no allocations is a good one)

I mostly am thinking about our experience in other libraries trying to handle custom metadata on Fields where many kernels / processing don't preserve the metadata and it has been quite tough

I fear the same thing would happen to this field -- basically that it would not be used by most libraries but they would all pay the size cost on every RecordBatch

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 28, 2026

I wonder if @tustvold @jhorstmann @viirya or @kylebarron have any thoughts on this matter (adding custom metadata to every RecordBatch)

@rustyconover
Copy link
Copy Markdown
Author

Hi @alamb I've patched arrow-go and arrow-js. And I have others with working patches into arrow-java. So its mostly about connectivity for me.

@rustyconover
Copy link
Copy Markdown
Author

rustyconover commented Apr 29, 2026

Hi @alamb, thanks for your review. I think the CI passed. Is there more you'd like me to do? Being a new contributor to arrow-rs I'm a bit unsure.

This is going to be the primary user to start: https://github.com/Query-farm/vgi-rpc-rust

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 29, 2026

Thanks @rustyconover -- there isn't anything I think you need to do

What i think is next needed is some buy in from other maintainers / stakeholders that changing RecordBatch is a reasonable thing to do given the tradeoffs

@jhorstmann
Copy link
Copy Markdown
Contributor

This makes RecordBatch quite a bit larger (40 bytes -> 88 bytes), which by itself is probably not an issue. I think it might be quite common to clone batches though, and both the schema and arrays inside are already reference counted. What do you think of using Option<Arc<HashMap<String, String>>> to both reduce size and make cloning cheaper?

Addresses @jhorstmann's review: the previous inline HashMap pushed
RecordBatch from 40 to 88 bytes and made every clone copy the map.
Using Option<Arc<HashMap<String, String>>> brings the struct to 48
bytes (one niche-optimized pointer) and clones become refcount bumps.

API surface reduced to four methods on RecordBatch:
  - custom_metadata() -> Option<&HashMap<String, String>>
  - custom_metadata_mut() -> &mut HashMap<String, String>
  - with_custom_metadata(HashMap<String, String>) -> Self
  - into_parts_with_custom_metadata() -> (..., Option<HashMap<...>>)

with_custom_metadata normalizes an empty map to None. Because
custom_metadata_mut cannot observe a subsequent .clear(), PartialEq
is now implemented by hand to treat Some(empty) and None as equal,
keeping the two paths to "no metadata" indistinguishable to callers.

message_custom_metadata in arrow-ipc now returns Option<HashMap>
directly; downstream callers in arrow-ipc, arrow-select and
arrow-flight clone the HashMap on transfer (typically tiny).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rustyconover
Copy link
Copy Markdown
Author

hi @jhorstmann, I just switched the field to Option<Arc<HashMap<String, String>>>. RecordBatch is now 48 bytes (one niche-optimized pointer added vs. the original 40), and clones bump an Arc refcount instead of copying the map.

Took the chance to keep the public API minimal — four methods on RecordBatch:

fn custom_metadata(&self) -> Option<&HashMap<String, String>>;
fn custom_metadata_mut(&mut self) -> &mut HashMap<String, String>;
fn with_custom_metadata(self, HashMap<String, String>) -> Self;  // empty -> None
fn into_parts_with_custom_metadata(self) -> (..., Option<HashMap<String, String>>);

with_custom_metadata normalizes an empty map to None. Since custom_metadata_mut().clear() can't observe the post-mutation state, PartialEq treats Some(empty) and None as equal — the two paths to "no metadata" stay indistinguishable to callers.

@rustyconover
Copy link
Copy Markdown
Author

@alamb can you start the CI again please.

Comment thread arrow-array/src/record_batch.rs Outdated
Comment on lines +253 to +255
if self.columns != other.columns {
return false;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move the comparison of arrays after metadata comparison which should be quicker?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends. custom_metadata is rare in practice. I can move the order if you think its worth it.

Comment thread arrow-array/src/record_batch.rs Outdated
///
/// An empty map is normalized to "no metadata", so a batch built with an
/// empty map compares equal to one built without calling this method.
pub fn with_custom_metadata(mut self, metadata: HashMap<String, String>) -> Self {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use Arc<HashMap<String, String>> as the custom metadata type? Otherwise, if you want to attach same metadata for all record batches, you need to copy the same metadata map and the Arc here is useless as it is only for one record batch?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me will update it.

@etseidl
Copy link
Copy Markdown
Contributor

etseidl commented May 11, 2026

I'm not very familiar with this part of the code, but with the above suggestions applied this seems a reasonable price to pay. I took a peek at how arrow-cpp handles this, and they have a separate record batch type. We could try having flight_data_to_arrow_batch return an enum encapsulating either a RecordBatch or a new struct that also contains the custom metadata, but I think that would require a lot of (probably breaking) changes.

Address review feedback from @viirya on apache#9445:

- `RecordBatch::with_custom_metadata` now accepts `Arc<HashMap<...>>` so the
  same metadata map can be shared across many batches without cloning.
- `into_parts_with_custom_metadata` returns the shared `Arc` instead of
  unwrap-cloning out of it.
- `PartialEq` now compares custom_metadata before columns: in the common
  case where neither batch has metadata it is a near-free `None == None`
  check that short-circuits real mismatches without scanning arrays.
- Introduce `pub type CustomMetadata = Arc<HashMap<String, String>>` and
  use it on the field, setter, and parts accessor. Avoids the
  type-complexity clippy lint and gives the type a name callers can refer
  to.
- Update IPC reader, arrow-select filter/take, and arrow-flight callsites
  to wrap their cloned maps in `Arc::new` when reattaching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 11, 2026

return an enum encapsulating either a RecordBatch or a new struct that also contains the custom metadata, but I think that would require a lot of (probably breaking) changes.

Yeah, and I think @rustyconover 's usecase is to use this per batch metadata in other operations (not just at the IPC boundary)

@rustyconover
Copy link
Copy Markdown
Author

return an enum encapsulating either a RecordBatch or a new struct that also contains the custom metadata, but I think that would require a lot of (probably breaking) changes.

Yeah, and I think @rustyconover 's usecase is to use this per batch metadata in other operations (not just at the IPC boundary)

Yes thats right.

///
/// This corresponds to the `custom_metadata` field on the IPC `Message`
/// flatbuffer, separate from schema-level metadata.
pub fn custom_metadata(&self) -> Option<&HashMap<String, String>> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should return Option<&Arc<HashMap<String, String>>> so that callers can clone the arc if they want

Suggested change
pub fn custom_metadata(&self) -> Option<&HashMap<String, String>> {
pub fn custom_metadata(&self) -> Option<&Arc<HashMap<String, String>>> {

/// empty map compares equal to one built without calling this method.
pub fn with_custom_metadata(mut self, metadata: CustomMetadata) -> Self {
self.custom_metadata = if metadata.is_empty() {
None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused -- are you trying to enforce the invariant that custom_metadata never is Some(empty_map)? If so

If so, then the custom PartialEq implementation is not needed -- as it seems you are now enforcing the invariant empty metadata is none.

I would personally suggest not trying to be fancy here, and instead just do

    pub fn with_custom_metadata(mut self, metadata: Option<CustomMetadata>) -> Self {

Which makes the caller have to decide how they want to treat None vs an Empty hash map. Also, it would give people an escape valvue to set the metadata exactly as they wanted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support per-batch custom_metadata on RecordBatch (IPC Message field)

5 participants