Commit a572b1d
Fix: Handle bytes values in string column statistics from Parquet
Problem:
When using `add_files()` with Parquet files written by DuckDB, PyIceberg
fails with `AttributeError: 'bytes' object has no attribute 'encode'`.
Root Cause:
The Parquet format stores column statistics (min_value, max_value) as binary
data in the Statistics struct (see parquet.thrift). When PyArrow reads these
statistics from Parquet files, it may return them as Python `bytes` objects
rather than decoded `str` values. This is valid per the Parquet specification:
    struct Statistics {
      5: optional binary max_value;
      6: optional binary min_value;
    }
PyIceberg's StatsAggregator expected string statistics to always be `str`,
causing failures when processing Parquet files from writers like DuckDB that
expose this binary representation.
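The failure mode can be reproduced without PyIceberg or PyArrow. The snippet below is a minimal sketch, not the library's code: `encode_string_stat` is a hypothetical stand-in for any code path that assumes a string statistic is already a `str` and calls `.encode()` on it.

```python
def encode_string_stat(value):
    # Hypothetical stand-in: assumes the statistic is a str.
    return value.encode("utf-8")


# A str statistic serializes as expected.
ok = encode_string_stat("apple")

# A bytes statistic (the binary representation from the Parquet footer)
# has no .encode() method, so the same call raises AttributeError.
try:
    encode_string_stat(b"apple")
    message = None
except AttributeError as exc:
    message = str(exc)

print(message)
```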
Fix:
1. In `StatsAggregator.min_as_bytes()`: Handle bytes values by decoding them
as UTF-8 into `str` before truncation and serialization.
2. In `StatsAggregator.max_as_bytes()`: Update the existing string handling to
decode bytes values before processing (previously this raised ValueError).
3. In `to_bytes()` for StringType: Add a defensive isinstance check that
handles bytes values as a safety fallback.
4. Add unit tests for both StatsAggregator bytes handling and to_bytes.
1 parent c0e7c6d commit a572b1d
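The decode-before-serialize approach described in the fix can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: `normalize_string_stat` and `min_string_as_bytes` are hypothetical names, and the truncation shown is a plain prefix cut, which is only safe for a lower bound (an upper bound needs extra care when truncated).

```python
from typing import Optional, Union


def normalize_string_stat(value: Union[str, bytes]) -> str:
    """Return a str, decoding UTF-8 bytes if needed (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def min_string_as_bytes(
    value: Union[str, bytes], trunc_length: Optional[int] = None
) -> bytes:
    """Serialize a string lower bound, accepting str or bytes input.

    Assumption: truncation keeps the first trunc_length characters.
    A prefix of a string never sorts after the string itself, so the
    result is still a valid lower bound.
    """
    text = normalize_string_stat(value)
    if trunc_length is not None:
        text = text[:trunc_length]
    return text.encode("utf-8")


# Both representations of the same statistic serialize identically.
from_str = min_string_as_bytes("banana", trunc_length=3)
from_bytes = min_string_as_bytes(b"banana", trunc_length=3)
print(from_str, from_bytes)
```

Decoding first and truncating afterwards means truncation operates on characters rather than raw bytes, so a multi-byte UTF-8 sequence is never split in the middle.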
3 files changed, +64 -1 lines changed (pyiceberg/io, tests, tests/io)