Commit b718cda
[data] DataSourceV2: V2 ARROW-5030 nested-type fallback (#63175)
## Why
Parquet files with nested columns (e.g. `list<struct<..., string>>`)
whose row groups exceed Arrow's ~2 GB chunking threshold hit
`ArrowNotImplementedError` at decode time (ARROW-5030). V1 already has a
metadata-only fallback that detects this and switches to
`pq.ParquetFile.iter_batches`. This PR ports it to V2 and makes the
decision filter-aware.
## What
**Port V1's nested-type fallback to V2.** `FileReader` grows an
`_iter_fragment_tables` hook; `ParquetFileReader` overrides it with V1's
`_needs_nested_type_fallback` metadata check, falling back to
`pq.ParquetFile.iter_batches` (with safe batch sizing, row-group
pushdown via `fragment.subset`, and per-batch row-level filtering) when
the check fires.
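A minimal sketch of what a metadata-only check and "safe batch sizing" could look like, using only the standard library. Everything here is an assumption made for illustration: `ColumnChunkMeta`, `needs_fallback`, `safe_batch_size`, and the exact byte limit are hypothetical names and values, not the code in `ParquetFileReader` or V1's `_needs_nested_type_fallback`.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed limit: roughly the ~2 GB chunking threshold described above.
CHUNK_LIMIT_BYTES = 2**31 - 1


@dataclass(frozen=True)
class ColumnChunkMeta:
    """Hypothetical stand-in for one column's metadata within a row group."""

    name: str
    is_nested: bool
    uncompressed_bytes: int


def needs_fallback(
    row_groups: Iterable[List[ColumnChunkMeta]],
    columns: Iterable[str],
    limit: int = CHUNK_LIMIT_BYTES,
) -> bool:
    """Return True if any requested nested column has a chunk over the limit."""
    wanted = set(columns)
    return any(
        chunk.name in wanted
        and chunk.is_nested
        and chunk.uncompressed_bytes > limit
        for group in row_groups
        for chunk in group
    )


def safe_batch_size(
    num_rows: int, group_bytes: int, limit: int = CHUNK_LIMIT_BYTES
) -> int:
    """Pick a rows-per-batch count so an average-sized batch stays under the limit."""
    if num_rows <= 0 or group_bytes <= 0:
        return 1
    bytes_per_row = max(1, group_bytes // num_rows)
    return max(1, min(num_rows, limit // bytes_per_row))


# One row group: a small flat column and a 3 GiB nested column.
row_groups = [
    [
        ColumnChunkMeta("id", is_nested=False, uncompressed_bytes=8_000_000),
        ColumnChunkMeta("nested_col", is_nested=True, uncompressed_bytes=3 * 2**30),
    ]
]
```

With this toy metadata, `needs_fallback(row_groups, ["id"])` is `False` while `needs_fallback(row_groups, ["id", "nested_col"])` is `True`. The decision depends only on which columns the reader will actually decode, and no column data is read to make it.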
**Make the fallback decision filter-aware.** Previously the check looked
only at projected columns. A filter that touches a large nested column
*outside* the projection would still force the scanner to decode it for
row-level evaluation — and hit ARROW-5030. The check now sees the union
of projected + filter-referenced columns:
```python
(
    ds.read_parquet(path)
    .select_columns(["id"])  # projection excludes nested_col
    .filter(expr=col("nested_col").is_not_null())  # but the filter references it
)
# => fallback must trigger
```
**Carry the predicate as a Ray `Expr` instead of a pyarrow expression.**
`pyarrow.compute.Expression` is opaque (no public visitor), so we can't
extract filter columns from it after the fact. Keeping the Ray `Expr` as
the source of truth — and converting to pyarrow once, at the
scanner-kwargs boundary — lets the reader call `get_column_references`
for the union above. Touches `ArrowFileScanner.predicate`,
`FileReader.predicate`, and `push_filters` (now ANDs Ray `Expr`s).
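To illustrate why an introspectable expression tree matters, here is a toy model in plain Python. The `Col`, `Lit`, and `BinOp` classes and the helper functions are hypothetical stand-ins, not Ray's `Expr` classes or its `get_column_references`; the sketch only assumes that predicates form a tree whose leaves name columns.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional, Set, Union


@dataclass(frozen=True)
class Col:
    """Leaf node: a reference to a column by name."""

    name: str


@dataclass(frozen=True)
class Lit:
    """Leaf node: a literal value."""

    value: object


@dataclass(frozen=True)
class BinOp:
    """Interior node: a binary operation such as a comparison or AND."""

    op: str
    left: "Node"
    right: "Node"


Node = Union[Col, Lit, BinOp]


def column_references(expr: Node) -> Set[str]:
    """Walk the tree and collect every column name it mentions."""
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, BinOp):
        return column_references(expr.left) | column_references(expr.right)
    return set()  # literals reference no columns


def and_all(predicates: List[Node]) -> Optional[Node]:
    """Combine pushed-down predicates into a single AND-ed expression."""
    if not predicates:
        return None
    return reduce(lambda acc, p: BinOp("and", acc, p), predicates)


def columns_to_check(projection: List[str], predicate: Optional[Node]) -> Set[str]:
    """Union of projected columns and filter-referenced columns."""
    refs = column_references(predicate) if predicate is not None else set()
    return set(projection) | refs


pushed = and_all(
    [
        BinOp("!=", Col("nested_col"), Lit(None)),
        BinOp(">", Col("id"), Lit(5)),
    ]
)
print(sorted(columns_to_check(["id"], pushed)))  # ['id', 'nested_col']
```

Because the tree can be walked, the column set is available at any point before conversion. An opaque expression object offers no equivalent traversal, which is the motivation stated above for converting only at the scanner-kwargs boundary.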
**Drop the legacy `filter=` kwarg on V2.**
`read_parquet(filter=pc.field("x") > 5)` is already deprecated. Since it
carries a raw pyarrow expression that can't be introspected, it's
silently stripped on the V2 path. Callers should use
`read_parquet(path).filter(expr=...)`.
## Tests
- `test_read_parquet_nested_type_arrow_not_implemented_fallback`: V2
  skip removed (regression for [#61675](#61675)).
- `test_read_parquet_nested_fallback_triggered_when_filter_references_nested_column`:
  new, V2-only. Projects a flat column and filters on the large nested
  column; asserts the fallback is invoked.
Signed-off-by: Goutam <goutam@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with [ReviewStack](https://reviewstack.dev/ray-project/ray/pull/63175).
* #63326
* __->__ #63175
Co-authored-by: Goutam V. <>
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>

1 parent e92dc62 commit b718cda
8 files changed
Lines changed: 340 additions & 53 deletions
File tree
- python/ray/data
- _internal/datasource_v2
- readers
- scanners
- tests
- tests
- datasource
- unit/datasource_v2
Lines changed: 0 additions & 6 deletions
Lines changed: 52 additions & 28 deletions
Lines changed: 174 additions & 5 deletions