
arrow-ipc writer does not comply with spec for empty variable-size arrays #9716

@atwam


Describe the bug

Arrow's spec for variable-size binary layout (used for binary or Utf8 arrays) states:

The offsets buffer contains length + 1 signed integers (either 32-bit or 64-bit, depending on the data type), which encode the start position of each slot in the data buffer.

This means that an empty array should have a length-1 offsets buffer containing a single 0.
Instead, serializing an empty Utf8/Binary array creates an empty offsets buffer (in get_byte_array_buffers).
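
To make the length + 1 rule concrete, here is a minimal std-only sketch of how an offsets buffer is derived from a list of values. The name build_offsets is hypothetical and is not part of arrow-rs; it only illustrates why an empty array still ends up with one offset.

```rust
/// Build the offsets buffer for a variable-size binary array.
/// Hypothetical helper for illustration, not an arrow-rs function.
/// It always returns `values.len() + 1` entries, so an empty input yields `[0]`.
fn build_offsets(values: &[&str]) -> Vec<i32> {
    let mut offsets = Vec::with_capacity(values.len() + 1);
    let mut end = 0i32;
    // Start of slot 0. This entry is present even when there are no slots.
    offsets.push(end);
    for v in values {
        // Each further entry is the end of the current slot in the data buffer.
        end += v.len() as i32;
        offsets.push(end);
    }
    offsets
}
```

With the values "ab", "" and "cde" this gives the offsets 0, 2, 2, 5; with no values it gives the single offset 0 rather than an empty buffer.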

This means that an empty (no-rows) IPC file created by arrow-rs will not be spec-compliant, and may not be readable by some other implementations (for example polars/arrow2).

Note that this behavior applies not only to Utf8/Binary arrays but also to List, LargeList, and by extension Map (anything that uses get_byte_array_buffers or get_list_array_buffers).

To Reproduce

  • Create a RecordBatch with a single Utf8 (empty) array, serialize to IPC.
  • Try to read that IPC file with a different arrow library. For example, polars will fail reading the serialized IPC file.

Expected behavior

For variable-size layouts, the IPC writer should output a length-1 offsets buffer containing a single 0 when the array is empty.
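
As a rough illustration of the expected writer behavior, the sketch below serializes 32-bit offsets and substitutes a single 0 when it is handed an empty slice. The function encode_offsets is hypothetical; it is not the arrow-rs writer code, and it ignores the buffer padding and alignment a real IPC writer applies.

```rust
/// Hypothetical writer-side helper, not the arrow-rs implementation.
/// Serializes i32 offsets as little-endian bytes and emits a single 0
/// when the array has no slots, so the buffer always holds length + 1 entries.
/// Assumption: padding and alignment of IPC buffers are left out.
fn encode_offsets(offsets: &[i32]) -> Vec<u8> {
    let zero = [0i32];
    // An empty offsets slice stands for an empty array: write [0] instead.
    let src: &[i32] = if offsets.is_empty() { &zero } else { offsets };
    src.iter().flat_map(|o| o.to_le_bytes()).collect()
}
```

An empty input therefore produces four zero bytes instead of a zero-length buffer, while non-empty offsets are written unchanged.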

Additional context

The current version works on round-trip because get_offsets_from_buffer fills in [0] when it encounters an empty offsets buffer.

Note that arrow-cpp seems to take the same approach, and similarly fills in an empty offsets buffer with [0]. While this means we can round-trip with arrow-cpp and pyarrow, we don't comply with the spec, which can cause issues with other implementations.
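
The difference between a lenient reader and a strict one can be sketched as follows. Both functions are hypothetical and simplified: they are not the actual code of arrow-rs, arrow-cpp, or polars, and they assume the offsets buffer carries no padding.

```rust
/// Strict reader (hypothetical): rejects any buffer that does not hold
/// exactly `len + 1` offsets, as the spec wording suggests.
fn read_offsets_strict(offsets: &[i32], len: usize) -> Result<Vec<i32>, String> {
    if offsets.len() != len + 1 {
        return Err(format!(
            "expected {} offsets, found {}",
            len + 1,
            offsets.len()
        ));
    }
    Ok(offsets.to_vec())
}

/// Lenient reader (hypothetical): treats an empty buffer for a
/// zero-length array as `[0]`, then falls back to the strict check.
fn read_offsets_lenient(offsets: &[i32], len: usize) -> Result<Vec<i32>, String> {
    if len == 0 && offsets.is_empty() {
        return Ok(vec![0]);
    }
    read_offsets_strict(offsets, len)
}
```

Under these assumptions, an empty offsets buffer for a zero-length array is accepted by the lenient reader but rejected by the strict one, which is the kind of mismatch described above.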


Labels: arrow (Changes to the arrow crate), bug
