docs: add type mapping tables between PyIceberg and PyArrow (#3098)
Closes #2226
# Rationale for this change
This PR adds documentation with tables describing the type mapping
between PyArrow and PyIceberg data types.
## Are these changes tested?
Yes.
The changes are tested locally as shown in the image below.
<img width="1563" height="792" alt="image"
src="https://github.com/user-attachments/assets/1d9fc6a6-a1ea-4feb-a4d7-71d9dd036813"
/>
## Are there any user-facing changes?
Yes.
This PR adds new user-facing documentation.
---------
Co-authored-by: Kevin Liu <kevin.jq.liu@gmail.com>
|`pa.null()`|`UnknownType` (format version 3 only) [[2]](#notes)|

---

#### Notes

[1] Only the `UTC` timezone and its aliases are supported for PyArrow-to-PyIceberg timestamp-with-timezone conversion.

[2] The PyArrow-to-PyIceberg mappings for `pa.timestamp("ns")`, `pa.timestamp("ns", tz="UTC")`, and `pa.null()` require Iceberg format version 3. By default, `pyarrow_to_schema()` uses format version 2. `TimestampNanoType`, `TimestamptzNanoType`, and `UnknownType` are likewise format-version-3-only Iceberg types.

[3] For nanosecond Iceberg timestamp types (`TimestampNanoType` and `TimestamptzNanoType`), writing in format version 3 is not yet implemented (see [GitHub issue #1551](https://github.com/apache/iceberg-python/issues/1551)).

[4] The mappings are not fully symmetric. On read, PyArrow normalizes some families of types into a single Iceberg type, and on write PyIceberg emits a canonical PyArrow type: for example, `pa.int8()` and `pa.int16()` read as `IntegerType` and write back as `pa.int32()`, `pa.string()` reads as `StringType` and writes back as `pa.large_string()`, `pa.binary()` reads as `BinaryType` and writes back as `pa.large_binary()`, `pa.list_(...)` writes back as `pa.large_list(...)`, and `pa.timestamp("s")` / `pa.timestamp("ms")` read as `TimestampType` and write back as `pa.timestamp("us")`.
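
The asymmetry described in note [4] can be sketched with two plain lookup tables. This is only an illustrative model: it uses type names as strings rather than real `pyarrow` or `pyiceberg` objects, and the names `ARROW_TO_ICEBERG`, `ICEBERG_TO_ARROW`, `round_trip`, and `is_lossless` are hypothetical helpers, not part of either library.

```python
# Illustrative model of note [4]; these tables are NOT PyIceberg's API.
# Keys and values are type names written as strings.
ARROW_TO_ICEBERG = {
    "pa.int8()": "IntegerType",
    "pa.int16()": "IntegerType",
    "pa.int32()": "IntegerType",
    "pa.string()": "StringType",
    "pa.large_string()": "StringType",
    "pa.binary()": "BinaryType",
    "pa.large_binary()": "BinaryType",
    'pa.timestamp("s")': "TimestampType",
    'pa.timestamp("ms")': "TimestampType",
    'pa.timestamp("us")': "TimestampType",
}

# Each Iceberg type maps back to exactly one canonical PyArrow type.
ICEBERG_TO_ARROW = {
    "IntegerType": "pa.int32()",
    "StringType": "pa.large_string()",
    "BinaryType": "pa.large_binary()",
    "TimestampType": 'pa.timestamp("us")',
}


def round_trip(arrow_type: str) -> tuple[str, str]:
    """Return (Iceberg type, PyArrow type after converting back)."""
    iceberg_type = ARROW_TO_ICEBERG[arrow_type]
    return iceberg_type, ICEBERG_TO_ARROW[iceberg_type]


def is_lossless(arrow_type: str) -> bool:
    """True when the round trip returns the original PyArrow type."""
    return round_trip(arrow_type)[1] == arrow_type


if __name__ == "__main__":
    for name in ARROW_TO_ICEBERG:
        iceberg_type, back = round_trip(name)
        marker = "same" if back == name else "changed"
        print(f"{name} -> {iceberg_type} -> {back} ({marker})")
```

Because several PyArrow types share one Iceberg type, only the canonical PyArrow type of each family survives a round trip unchanged in this model.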