Skip to content

arrow-ipc: Write 0 offset buffer for length-0 variable-size arrays#9717

Open
atwam wants to merge 2 commits intoapache:mainfrom
atwam:fix-empty-utf8-buffers
Open

arrow-ipc: Write 0 offset buffer for length-0 variable-size arrays#9717
atwam wants to merge 2 commits intoapache:mainfrom
atwam:fix-empty-utf8-buffers

Conversation

@atwam
Copy link
Copy Markdown
Contributor

@atwam atwam commented Apr 14, 2026

Which issue does this PR close?

Rationale for this change

Current version serializes a length-0 offsets buffer, and relies on the array constructor to set offsets to [0] for empty arrays. This does not conform with spec, which specifies that the offsets buffer should have length + 1 elements.

This correctly serializes a length-1 offsets buffer containing [0]. I have left the current behavior of filling-in offsets with [0] for empty arrays, so that future versions can still read IPC files serialized by previous versions.

What changes are included in this PR?

Added tests for serialization of empty arrays, ensuring length-1 offsets buffer. Fixed serialization of empty arrays.
This fixes serialization of empty Binary/Utf8/List/LargeList (and Map) arrays.

Are these changes tested?

Yes

Are there any user-facing changes?

There should not be any breaking change for users. Serialized files for empty record batches will be slightly different. Older serialized files should still be read fine.

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Apr 14, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 14, 2026

Current version serializes a length-0 offsets buffer, and relies on the array constructor to set offsets to [0] for empty arrays. This does not conform with spec, which specifies that the offsets buffer should have length + 1 elements.

I don't understand this statement. The spec you reference in #9716 is for the in memory structure (not the ipc format)

It almost sounds like a bug in polars -- shouldn't it accept zero length IPC buffers ?

@alamb alamb changed the title arrow-ipc: Ensure writer conforms to spec on length-0 variable-size arrays arrow-ipc: Write 0 offset buffer for length-0 variable-size arrays Apr 20, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 20, 2026

For anyone else following this PR, please see discussion on the issue.

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @atwam -- after the discussion on this change makes sense to me

I think the only thing this PR should have is a test showing that the old style (zero length offset buffers) are still accepted.

Maybe we can make a small file with such a buffer and have a test that verifies it can be read successfully

Comment thread arrow-ipc/src/writer.rs
Comment thread arrow-ipc/src/writer.rs
Comment thread arrow-ipc/src/writer.rs
assert_eq!(child_data.len(), 0, "child data should remain empty");
}

#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please also add a test that shows IPC files without any offets are still accepted?

Copy link
Copy Markdown
Contributor Author

@atwam atwam Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two tests for legacy files.

Creating files with the legacy format means I could not rely on the writer (unless I added options to enable some legacy format, which was ugly), so I just went back to raw flatbuffers for the purpose of these tests.

Another alternative would be to generate such files once with the current version and embed some raw files, but that amounts to embedding small binary files, which I found equally repulsive.

I hope my approach will do.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you have is fine -- if it becomes too annoying to maintain we can make a checked in file

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @atwam

Comment thread arrow-ipc/src/writer.rs
assert_eq!(child_data.len(), 0, "child data should remain empty");
}

#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you have is fine -- if it becomes too annoying to maintain we can make a checked in file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

arrow-ipc writer does not comply with spec for empty variable-size arrays

2 participants