When writing data from a PyArrow DataFrame, how should we handle 'null' Fields? #2119

@ldsantos0911

Question

import pyarrow as pa

# `table` is an existing Iceberg table created with the PyArrow schema below
schema = pa.schema(
    [
        pa.field("col1", pa.string(), nullable=True),
    ]
)

# No schema is passed here, so PyArrow infers col1 as type null
df = pa.Table.from_pylist(
    [
        {"col1": None}
    ]
)

table.overwrite(df)

In the above example, we encounter an error like this: UnsupportedPyArrowTypeException: Column 'col1' has an unsupported type: null, with the underlying cause:

in _ConvertToIceberg.primitive(self, primitive)
   1211     return FixedType(primitive.byte_width)
-> 1213 raise TypeError(f"Unsupported type: {primitive}")

TypeError: Unsupported type: null

Is there any reason we wouldn't want to support the case where pyarrow has marked a Field as null? As a workaround/fix, I was thinking that we could exclude pa.null() Fields in visit_pyarrow(obj: pa.StructType, visitor: PyArrowSchemaVisitor[T]). This way, the column would effectively be missing and any required/nullable enforcement would be performed accordingly. Would this have any undesired consequences?
