Commit d12c5c9

docs: start Spark 4.1 known-limitations section, seeded with apache#4199 (apache#4202)
1 parent 1a6cd98 commit d12c5c9

2 files changed

Lines changed: 20 additions & 0 deletions

docs/source/user-guide/latest/compatibility/scans.md

Lines changed: 9 additions & 0 deletions
@@ -57,6 +57,15 @@ The following shared limitation may produce incorrect results without falling ba
 written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
 October 15, 1582.
 
+The following shared limitation raises an error at scan time rather than falling back to Spark:
+
+- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
+  column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
+  Arrow, whose string type is strictly UTF-8. Reading a Parquet file whose `STRING` column contains
+  non-UTF-8 bytes fails with `Parquet error: encountered non UTF-8 data`. Disable Comet for the
+  query, or cast the column to `BINARY` before persisting, if you need to preserve non-UTF-8 bytes.
+  See [#4121](https://github.com/apache/datafusion-comet/issues/4121).
+
 ## `native_datafusion` Limitations
 
 The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
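
To illustrate the invalid UTF-8 limitation added in the hunk above, here is a minimal sketch in plain Python. It is not Comet or Arrow code; `invalid_utf8_rows` is a hypothetical helper that only shows why a byte such as `0xC1` (the value from `CAST(X'C1' AS STRING)`) is rejected by a strict UTF-8 decoder, and how raw values could be checked before they are persisted as `STRING`.

```python
# Illustration only: plain Python, not Comet, Arrow, or Spark code.
# invalid_utf8_rows is a hypothetical helper name.

def invalid_utf8_rows(values):
    """Return the indexes of byte strings that are not valid UTF-8."""
    bad = []
    for i, raw in enumerate(values):
        try:
            raw.decode("utf-8")  # strict decoding raises on malformed input
        except UnicodeDecodeError:
            bad.append(i)
    return bad


# b"\xc1" mirrors the CAST(X'C1' AS STRING) example from the doc text.
samples = [b"hello", b"\xc1", "caf\u00e9".encode("utf-8")]
print(invalid_utf8_rows(samples))
```

Values flagged this way are the ones the doc text suggests keeping as `BINARY` rather than `STRING` if the bytes must be preserved.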

docs/source/user-guide/latest/compatibility/spark-versions.md

Lines changed: 11 additions & 0 deletions
@@ -51,6 +51,17 @@ Spark 4.1 support is experimental and intended for development and testing only.
 in production.
 ```
 
+### Known Limitations
+
+- **`NullType` columns in Parquet files**
+  ([#4199](https://github.com/apache/datafusion-comet/issues/4199)): Spark encodes a `NullType`
+  column as a Parquet `BOOLEAN` physical type annotated with `LogicalType::Unknown`. The Rust
+  `parquet` crate that Comet depends on accepts `Unknown` only when paired with `INT32` and rejects
+  any other physical type with `Parquet error: Cannot annotate Unknown from BOOLEAN for field '<name>'`.
+  Any attempt to read a Parquet file that contains a `NullType` column fails at decode time before
+  Comet's scan runs. Workaround: project the column away, cast it to a concrete type before
+  persisting, or read the file with Comet disabled for that query.
+
 ## Spark 4.2 (Experimental)
 
 Spark 4.2.0-preview4 is provided as experimental support with Java 17 and Scala 2.13.
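
For the `NullType` limitation added in the hunk above, the "project the column away" workaround could be sketched as follows. This is a hedged, Spark-free illustration: `columns_without_null_type` and `NULL_TYPE_NAMES` are hypothetical names, the schema is assumed to be a list of `(name, type string)` pairs in the shape PySpark's `DataFrame.dtypes` returns, and the assumption that `NullType` is rendered as `void` (or `null`) should be checked against the Spark version in use. Only top-level columns are considered; a `NullType` nested inside a struct, array, or map is not detected.

```python
# Hypothetical helper; not part of Comet or Spark.
# Assumption: NullType appears as one of these type strings in the schema.
NULL_TYPE_NAMES = {"void", "null"}


def columns_without_null_type(dtypes):
    """Return the top-level column names whose type is not NullType."""
    return [name for name, type_name in dtypes if type_name not in NULL_TYPE_NAMES]


# Stand-in for DataFrame.dtypes: a list of (column name, type string) pairs.
schema = [("id", "bigint"), ("placeholder", "void"), ("name", "string")]
print(columns_without_null_type(schema))
```

With a real DataFrame `df`, the result could be passed to `df.select(...)` before the file is written, so that no `NullType` column reaches Parquet in the first place.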
