Describe the bug
Arrow's spec for variable-size binary layout (used for binary or Utf8 arrays) states:
The offsets buffer contains length + 1 signed integers (either 32-bit or 64-bit, depending on the data type), which encode the start position of each slot in the data buffer.
This means that an empty array should have a length-1 offsets buffer containing a single 0.
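To make the length + 1 rule concrete, here is a minimal sketch in plain Rust (standard library only, not the arrow-rs API). The `build_offsets` helper is hypothetical and assumes 32-bit offsets; it only illustrates how the offsets buffer relates to the data buffer.

```rust
/// Hypothetical helper (not part of arrow-rs): builds the offsets and data
/// buffers for a slice of strings, following the variable-size binary layout
/// with 32-bit offsets.
fn build_offsets(values: &[&str]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = Vec::with_capacity(values.len() + 1);
    let mut data = Vec::new();
    // The first offset is always 0, even when there are no values.
    offsets.push(0);
    for v in values {
        data.extend_from_slice(v.as_bytes());
        // Each further offset marks the end of the slot just written.
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}
```

With the values "ab", "" and "cde" this yields the offsets 0, 2, 2, 5 over the data bytes "abcde". With no values at all it still yields the single offset 0, which is the case this issue is about.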
Instead, serializing an empty Utf8/Binary array creates an empty offsets buffer (in get_byte_array_buffers).
This means that an empty (no-rows) IPC file created by arrow-rs will not be spec-compliant, and may not be readable by some other implementations (for example polars/arrow2).
Note that this behavior applies to Utf8/Binary arrays, but also List, LargeList, and by extension Map (anything that uses get_byte_array_buffers or get_list_array_buffers)
To Reproduce
- Create a RecordBatch with a single Utf8 (empty) array, serialize to IPC.
- Try to read that IPC file with a different arrow library. For example, polars will fail reading the serialized IPC file.
Expected behavior
For variable-size layouts, the IPC writer should output a length-1 offsets buffer containing a single 0, i.e. [0].
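The expected writer behavior could look roughly like the sketch below. It is plain Rust and not the actual arrow-rs code: `offsets_bytes_for_ipc` is a hypothetical name, and the sketch assumes 32-bit offsets written in little-endian byte order.

```rust
/// Hypothetical helper (not the arrow-rs implementation): returns the bytes
/// to write for a 32-bit offsets buffer, assuming little-endian encoding.
fn offsets_bytes_for_ipc(offsets: &[i32]) -> Vec<u8> {
    if offsets.is_empty() {
        // An empty array still needs length + 1 = 1 offsets: a single 0.
        return 0i32.to_le_bytes().to_vec();
    }
    // Non-empty case: serialize every offset as 4 little-endian bytes.
    offsets.iter().flat_map(|o| o.to_le_bytes()).collect()
}
```

The only change from the behavior described in this issue is the empty case, which emits four zero bytes instead of a zero-length buffer.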
Additional context
The current version works when round-tripping because get_offsets_from_buffer substitutes a [0] array for an empty offsets buffer.
Note that arrow-cpp seems to take the same approach, and similarly will fill in an empty offsets buffer with [0]. While this means we can round-trip with arrow-cpp and pyarrow, we don't comply with the spec and can cause issues with other implementations.
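The difference between a lenient and a strict reader can be sketched as follows. Both functions are hypothetical, use only the standard library, and are not taken from arrow-rs, arrow-cpp, or polars/arrow2; they only illustrate why the round-trip succeeds while a reader that enforces the spec may reject the file.

```rust
/// Hypothetical lenient reader: treats an empty offsets buffer as [0],
/// similar in spirit to what this issue describes for get_offsets_from_buffer.
fn read_offsets_lenient(raw: &[i32]) -> Vec<i32> {
    if raw.is_empty() {
        vec![0]
    } else {
        raw.to_vec()
    }
}

/// Hypothetical strict reader: requires exactly `len + 1` offsets, as the
/// spec states, and reports an error otherwise.
fn read_offsets_strict(raw: &[i32], len: usize) -> Result<Vec<i32>, String> {
    if raw.len() != len + 1 {
        return Err(format!(
            "expected {} offsets for {} slots, found {}",
            len + 1,
            len,
            raw.len()
        ));
    }
    Ok(raw.to_vec())
}
```

For an empty offsets buffer and a length of 0, the lenient reader returns [0] while the strict reader returns an error, which matches the symptom reported above.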