feat: add custom_metadata support to RecordBatch with IPC read/write#9445
feat: add custom_metadata support to RecordBatch with IPC read/write#9445rustyconover wants to merge 6 commits into
custom_metadata support to RecordBatch with IPC read/write#9445Conversation
1ebc16a to
31227eb
Compare
31227eb to
9390ec4
Compare
…ustom_metadata Pin arrow-rs to rustyconover/arrow-rs#feat/recordbatch-custom-metadata (apache/arrow-rs#9445) and rewrite vgi-rpc/src/wire.rs as a thin wrapper around arrow_ipc::reader::StreamReader / writer::StreamWriter. Per-batch metadata now travels on RecordBatch directly via with_custom_metadata() / custom_metadata(); the Metadata alias becomes HashMap<String, String> and the ReadBatch wrapper is gone. relax_nullability flips with_skip_validation(true) on the inner reader since upstream validates before our schema rewrap. Also bundles in-progress conformance worker, http, and arrow_type changes that were already pending on the branch. Conformance: 723/723 across pipe/subprocess/http/unix/externalize. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alamb
left a comment
There was a problem hiding this comment.
Thanks @rustyconover -- I see the custom_metadata field on the IPC messages and that makes sense to expose somehow
I am less sure about adding a new field to RecordBatch -- mostly as I am not sure about the implications of doing so (though your point that an empty HashMap has no allocations is a good one)
I mostly am thinking about our experience in other libraries trying to handle custom metadata on Fields where many kernels / processing don't preserve the metadata and it has been quite tough
I fear the same thing would happen to this field -- basically that it would not be used by most libraries but they would all pay the size cost on every RecordBatch
|
I wonder if @tustvold @jhorstmann @viirya or @kylebarron have any thoughts on this matter (adding custom metadata to every RecordBatch) |
|
Hi @alamb I've patched arrow-go and arrow-js. And I have others with working patches into arrow-java. So its mostly about connectivity for me. |
|
Hi @alamb, thanks for your review. I think the CI passed. Is there more you'd like me to do? Being a new contributor to This is going to be the primary user to start: https://github.com/Query-farm/vgi-rpc-rust |
|
Thanks @rustyconover -- there isn't anything I think you need to do What i think is next needed is some buy in from other maintainers / stakeholders that changing |
|
This makes |
Addresses @jhorstmann's review: the previous inline HashMap pushed RecordBatch from 40 to 88 bytes and made every clone copy the map. Using Option<Arc<HashMap<String, String>>> brings the struct to 48 bytes (one niche-optimized pointer) and clones become refcount bumps. API surface reduced to four methods on RecordBatch: - custom_metadata() -> Option<&HashMap<String, String>> - custom_metadata_mut() -> &mut HashMap<String, String> - with_custom_metadata(HashMap<String, String>) -> Self - into_parts_with_custom_metadata() -> (..., Option<HashMap<...>>) with_custom_metadata normalizes an empty map to None. Because custom_metadata_mut cannot observe a subsequent .clear(), PartialEq is now implemented by hand to treat Some(empty) and None as equal, keeping the two paths to "no metadata" indistinguishable to callers. message_custom_metadata in arrow-ipc now returns Option<HashMap> directly; downstream callers in arrow-ipc, arrow-select and arrow-flight clone the HashMap on transfer (typically tiny). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
hi @jhorstmann, I just switched the field to Took the chance to keep the public API minimal — four methods on fn custom_metadata(&self) -> Option<&HashMap<String, String>>;
fn custom_metadata_mut(&mut self) -> &mut HashMap<String, String>;
fn with_custom_metadata(self, HashMap<String, String>) -> Self; // empty -> None
fn into_parts_with_custom_metadata(self) -> (..., Option<HashMap<String, String>>);
|
|
@alamb can you start the CI again please. |
| if self.columns != other.columns { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Should we move the comparison of arrays after metadata comparison which should be quicker?
There was a problem hiding this comment.
I guess it depends. custom_metadata is rare in practice. I can move the order if you think its worth it.
| /// | ||
| /// An empty map is normalized to "no metadata", so a batch built with an | ||
| /// empty map compares equal to one built without calling this method. | ||
| pub fn with_custom_metadata(mut self, metadata: HashMap<String, String>) -> Self { |
There was a problem hiding this comment.
Should we use Arc<HashMap<String, String>> as the custom metadata type? Otherwise, if you want to attach same metadata for all record batches, you need to copy the same metadata map and the Arc here is useless as it is only for one record batch?
There was a problem hiding this comment.
Makes sense to me will update it.
|
I'm not very familiar with this part of the code, but with the above suggestions applied this seems a reasonable price to pay. I took a peek at how arrow-cpp handles this, and they have a separate record batch type. We could try having |
Address review feedback from @viirya on apache#9445: - `RecordBatch::with_custom_metadata` now accepts `Arc<HashMap<...>>` so the same metadata map can be shared across many batches without cloning. - `into_parts_with_custom_metadata` returns the shared `Arc` instead of unwrap-cloning out of it. - `PartialEq` now compares custom_metadata before columns: in the common case where neither batch has metadata it is a near-free `None == None` check that short-circuits real mismatches without scanning arrays. - Introduce `pub type CustomMetadata = Arc<HashMap<String, String>>` and use it on the field, setter, and parts accessor. Avoids the type-complexity clippy lint and gives the type a name callers can refer to. - Update IPC reader, arrow-select filter/take, and arrow-flight callsites to wrap their cloned maps in `Arc::new` when reattaching. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Yeah, and I think @rustyconover 's usecase is to use this per batch metadata in other operations (not just at the IPC boundary) |
Yes thats right. |
| /// | ||
| /// This corresponds to the `custom_metadata` field on the IPC `Message` | ||
| /// flatbuffer, separate from schema-level metadata. | ||
| pub fn custom_metadata(&self) -> Option<&HashMap<String, String>> { |
There was a problem hiding this comment.
I think this should return Option<&Arc<HashMap<String, String>>> so that callers can clone the arc if they want
| pub fn custom_metadata(&self) -> Option<&HashMap<String, String>> { | |
| pub fn custom_metadata(&self) -> Option<&Arc<HashMap<String, String>>> { |
| /// empty map compares equal to one built without calling this method. | ||
| pub fn with_custom_metadata(mut self, metadata: CustomMetadata) -> Self { | ||
| self.custom_metadata = if metadata.is_empty() { | ||
| None |
There was a problem hiding this comment.
I am confused -- are you trying to enforce the invariant that custom_metadata never is Some(empty_map)? If so
If so, then the custom PartialEq implementation is not needed -- as it seems you are now enforcing the invariant empty metadata is none.
I would personally suggest not trying to be fancy here, and instead just do
pub fn with_custom_metadata(mut self, metadata: Option<CustomMetadata>) -> Self {Which makes the caller have to decide how they want to treat None vs an Empty hash map. Also, it would give people an escape valvue to set the metadata exactly as they wanted
Which issue does this PR close?
custom_metadataonRecordBatch(IPC Message field) #9444.What changes are included in this PR?
Add per-batch
custom_metadatatoRecordBatch, matching thecustom_metadatafield on the IPCMessageflatbuffer envelope. This allows attaching per-batch metadata separate from schema-level metadata, bringing parity with PyArrow'swrite_batch(custom_metadata=...)API (available since PyArrow v11.0.0).Changes:
custom_metadata: HashMap<String, String>field toRecordBatchwithcustom_metadata(),custom_metadata_mut(),with_custom_metadata(), andinto_parts_with_custom_metadata()accessorsflight_data_to_arrow_batchfilter_record_batchandtake_record_batchslice(),project(),normalize(),with_schema(), andremove_column()Are these changes tested?
Yes there are tests in the PR.
Are there any user-facing changes?
There are no breaking changes.
Written with AI assistance; all changes reviewed by the author.