Commit 4a8c84e

iamluan and kevinjqliu authored
docs: add type mapping tables between PyIceberg and PyArrow (#3098)
Closes #2226

# Rationale for this change

This PR adds documentation with tables describing the type mapping between PyArrow and PyIceberg data types.

## Are these changes tested?

Yes. The changes are tested locally as shown in the image below.

<img width="1563" height="792" alt="image" src="https://github.com/user-attachments/assets/1d9fc6a6-a1ea-4feb-a4d7-71d9dd036813" />

## Are there any user-facing changes?

Yes. This PR adds new user-facing documentation.

---------

Co-authored-by: Kevin Liu <kevin.jq.liu@gmail.com>
1 parent 44ce51a commit 4a8c84e

1 file changed, +84 -0 lines changed

mkdocs/docs/api.md

Lines changed: 84 additions & 0 deletions
@@ -2039,3 +2039,87 @@ DataFrame()
| 3 | 6 |
+---+---+
```

## Type mapping

### PyArrow

The Iceberg specification defines type mappings only for Avro, Parquet, and ORC:

- [Iceberg to Avro](https://iceberg.apache.org/spec/#avro)
- [Iceberg to Parquet](https://iceberg.apache.org/spec/#parquet)
- [Iceberg to ORC](https://iceberg.apache.org/spec/#orc)

The following tables describe the type mappings between PyIceberg and PyArrow. In the tables below, `pa` refers to the `pyarrow` module:

```python
import pyarrow as pa
```

#### PyIceberg to PyArrow type mapping

| PyIceberg type class            | PyArrow type                        |
|---------------------------------|-------------------------------------|
| `BooleanType`                   | `pa.bool_()`                        |
| `IntegerType`                   | `pa.int32()`                        |
| `LongType`                      | `pa.int64()`                        |
| `FloatType`                     | `pa.float32()`                      |
| `DoubleType`                    | `pa.float64()`                      |
| `DecimalType(p, s)`             | `pa.decimal128(p, s)`               |
| `DateType`                      | `pa.date32()`                       |
| `TimeType`                      | `pa.time64("us")`                   |
| `TimestampType`                 | `pa.timestamp("us")`                |
| `TimestampNanoType` (format version 3 only) | `pa.timestamp("ns")` [[2]](#notes) |
| `TimestamptzType`               | `pa.timestamp("us", tz="UTC")` [[1]](#notes) |
| `TimestamptzNanoType` (format version 3 only) | `pa.timestamp("ns", tz="UTC")` [[1]](#notes) [[2]](#notes) |
| `StringType`                    | `pa.large_string()`                 |
| `UUIDType`                      | `pa.uuid()`                         |
| `BinaryType`                    | `pa.large_binary()`                 |
| `FixedType(L)`                  | `pa.binary(L)`                      |
| `StructType`                    | `pa.struct([...])`                  |
| `ListType(e)`                   | `pa.large_list(e)`                  |
| `MapType(k, v)`                 | `pa.map_(k, v)`                     |
| `UnknownType` (format version 3 only) | `pa.null()` [[2]](#notes)     |

---

#### PyArrow to PyIceberg type mapping

| PyArrow type                       | PyIceberg type class        |
|------------------------------------|-----------------------------|
| `pa.bool_()`                       | `BooleanType`               |
| `pa.int8()` / `pa.int16()` / `pa.int32()` | `IntegerType`        |
| `pa.int64()`                       | `LongType`                  |
| `pa.float32()`                     | `FloatType`                 |
| `pa.float64()`                     | `DoubleType`                |
| `pa.decimal128(p, s)`              | `DecimalType(p, s)`         |
| `pa.decimal256(p, s)`              | Unsupported                 |
| `pa.date32()`                      | `DateType`                  |
| `pa.date64()`                      | Unsupported                 |
| `pa.time64("us")`                  | `TimeType`                  |
| `pa.timestamp("s")` / `pa.timestamp("ms")` / `pa.timestamp("us")` | `TimestampType` |
| `pa.timestamp("ns")`               | `TimestampNanoType` (format version 3 only) [[2]](#notes) |
| `pa.timestamp("s", tz="UTC")` / `pa.timestamp("ms", tz="UTC")` / `pa.timestamp("us", tz="UTC")` | `TimestamptzType` [[1]](#notes) |
| `pa.timestamp("ns", tz="UTC")`     | `TimestamptzNanoType` (format version 3 only) [[1]](#notes) [[2]](#notes) |
| `pa.string()` / `pa.large_string()` / `pa.string_view()` | `StringType` |
| `pa.uuid()`                        | `UUIDType`                  |
| `pa.binary()` / `pa.large_binary()` / `pa.binary_view()` | `BinaryType` |
| `pa.binary(L)`                     | `FixedType(L)`              |
| `pa.struct([...])`                 | `StructType`                |
| `pa.list_(e)` / `pa.large_list(e)` / `pa.list_(e, fixed_size)` | `ListType(e)` |
| `pa.map_(k, v)`                    | `MapType(k, v)`             |
| `pa.null()`                        | `UnknownType` (format version 3 only) [[2]](#notes) |

---

#### Notes

[1] Only the `UTC` timezone and its aliases are supported for PyArrow-to-PyIceberg timestamp-with-timezone conversion.

[2] The PyArrow-to-PyIceberg mappings for `pa.timestamp("ns")`, `pa.timestamp("ns", tz="UTC")`, and `pa.null()` require Iceberg format version 3. By default, `pyarrow_to_schema()` uses format version 2. `TimestampNanoType`, `TimestamptzNanoType`, and `UnknownType` are likewise format-version-3-only Iceberg types.

[3] For nanosecond Iceberg timestamp types (`TimestampNanoType` and `TimestamptzNanoType`), writing in format version 3 is not yet implemented (see [GitHub issue #1551](https://github.com/apache/iceberg-python/issues/1551)).

[4] The mappings are not fully symmetric. Several related PyArrow types map to a single Iceberg type, and PyIceberg converts each Iceberg type back to one canonical PyArrow type. For example:

- `pa.int8()` and `pa.int16()` map to `IntegerType`, which converts back to `pa.int32()`.
- `pa.string()` maps to `StringType`, which converts back to `pa.large_string()`.
- `pa.binary()` maps to `BinaryType`, which converts back to `pa.large_binary()`.
- `pa.list_(...)` maps to `ListType`, which converts back to `pa.large_list(...)`.
- `pa.timestamp("s")` and `pa.timestamp("ms")` map to `TimestampType`, which converts back to `pa.timestamp("us")`.
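
The asymmetry described in note [4] can be sketched with plain lookup tables. This is only a model of the two mapping tables above: the dictionaries hold string stand-ins for a handful of types, and the helper name `round_trip` is hypothetical. Nothing here calls PyArrow or PyIceberg.

```python
# String stand-ins for a few rows of the tables above; this sketch does not
# import pyarrow or pyiceberg and does not call either library.
ARROW_TO_ICEBERG = {
    "pa.int8()": "IntegerType",
    "pa.int16()": "IntegerType",
    "pa.int32()": "IntegerType",
    "pa.string()": "StringType",
    "pa.large_string()": "StringType",
    "pa.binary()": "BinaryType",
    "pa.large_binary()": "BinaryType",
    'pa.timestamp("s")': "TimestampType",
    'pa.timestamp("ms")': "TimestampType",
    'pa.timestamp("us")': "TimestampType",
}

ICEBERG_TO_ARROW = {
    "IntegerType": "pa.int32()",
    "StringType": "pa.large_string()",
    "BinaryType": "pa.large_binary()",
    "TimestampType": 'pa.timestamp("us")',
}


def round_trip(arrow_type: str) -> str:
    """Return the PyArrow type name after PyArrow -> Iceberg -> PyArrow."""
    return ICEBERG_TO_ARROW[ARROW_TO_ICEBERG[arrow_type]]


# Collect the types that do not survive the round trip unchanged.
changed = {t: round_trip(t) for t in ARROW_TO_ICEBERG if round_trip(t) != t}
for original, result in changed.items():
    print(f"{original} -> {result}")
```

Only the canonical types (`pa.int32()`, `pa.large_string()`, `pa.large_binary()`, and `pa.timestamp("us")` in this subset) come back unchanged; every other entry is widened to its canonical form.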
