Commit f830ee3
Refactor parquet datasource into an explicit state machine (#21190)
## Which issue does this PR close?
- part of #20529
- Broken out of #20820
## Rationale for this change
1. I am trying to break #20820 into smaller chunks
2. I want to isolate the changes to the parquet opener to see if they
are causing regressions
At a high level, the parquet opener in DataFusion potentially does
several IOs as part of its reading pipeline. However, those IOs are all
somewhat implicit and hidden in a large 400-line `async` closure:
https://github.com/apache/datafusion/blob/37cd3de82fcfa7619b04cfb9f19607ff55d44bc4/datafusion/datasource-parquet/src/opener.rs#L232-L641
As part of morselizing the FileStream, we are trying to make the IO and
CPU split clearer (so they can be scheduled explicitly).
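To illustrate the idea of making the IO and CPU steps explicit, here is a minimal, synchronous sketch of a state machine in which each state declares which kind of work it needs. All names (`OpenerState`, `WorkKind`, `step`, `run_to_completion`) and the simulated row groups are hypothetical and are not the types introduced by this PR; the real opener is `async` and performs actual IO.

```rust
/// Kind of work a state needs next (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Clone, Copy)]
enum WorkKind {
    Io,
    Cpu,
    Finished,
}

/// Hypothetical opener states; not the actual DataFusion types.
#[derive(Debug, PartialEq)]
enum OpenerState {
    /// IO: file metadata has to be fetched.
    LoadMetadata,
    /// CPU: decide which row groups can be skipped.
    PruneRowGroups { row_groups: Vec<usize> },
    /// IO: read data for the surviving row groups.
    ReadData { row_groups: Vec<usize> },
    /// Terminal state.
    Done { rows_read: usize },
}

impl OpenerState {
    /// Each state says up front whether it is IO or CPU work,
    /// so a scheduler could route it explicitly.
    fn work_kind(&self) -> WorkKind {
        match self {
            OpenerState::LoadMetadata | OpenerState::ReadData { .. } => WorkKind::Io,
            OpenerState::PruneRowGroups { .. } => WorkKind::Cpu,
            OpenerState::Done { .. } => WorkKind::Finished,
        }
    }

    /// Perform one step and return the next state.
    /// IO is simulated here: "metadata" is four row groups,
    /// pruning keeps the even ones, and each read yields 100 rows.
    fn step(self) -> OpenerState {
        match self {
            OpenerState::LoadMetadata => OpenerState::PruneRowGroups {
                row_groups: vec![0, 1, 2, 3],
            },
            OpenerState::PruneRowGroups { row_groups } => OpenerState::ReadData {
                row_groups: row_groups.into_iter().filter(|rg| rg % 2 == 0).collect(),
            },
            OpenerState::ReadData { row_groups } => OpenerState::Done {
                rows_read: row_groups.len() * 100,
            },
            done @ OpenerState::Done { .. } => done,
        }
    }
}

/// Drive the machine to completion, recording the kind of each step taken.
fn run_to_completion(mut state: OpenerState) -> (Vec<WorkKind>, OpenerState) {
    let mut trace = Vec::new();
    while state.work_kind() != WorkKind::Finished {
        trace.push(state.work_kind());
        state = state.step();
    }
    (trace, state)
}
```

Because every transition is a named state rather than an `.await` point buried in a closure, a driver like `run_to_completion` could in principle hand IO states and CPU states to different schedulers.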
## What changes are included in this PR?
1. Extract the code in the parquet opener into an explicit state
machine, so
## Are these changes tested?
Functionally, by existing tests.
I also ran performance tests and they didn't show any substantial
performance change.
## Are there any user-facing changes?
No, this is an internal code reorganization.
## Follow on tasks
- [ ] Split BloomFilter into CPU and IO states
- [ ] Split decoder reading into CPU and IO states

1 parent 4a36675 commit f830ee3
3 files changed: +847 -417 lines changed