
Commit f830ee3

Refactor parquet datasource into an explicit state machine (#21190)
## Which issue does this PR close?

- part of #20529
- Broken out of #20820

## Rationale for this change

1. I am trying to break #20820 into smaller chunks
2. I want to isolate the changes to the parquet opener to see if they are causing regressions

At a high level, the parquet opener in DataFusion potentially does several IOs as part of its reading pipeline. However, those IOs are all somewhat implicit and hidden in a large 400-line `async` closure: https://github.com/apache/datafusion/blob/37cd3de82fcfa7619b04cfb9f19607ff55d44bc4/datafusion/datasource-parquet/src/opener.rs#L232-L641

As part of morselizing the FileStream we are trying to make the IO and CPU split clearer (so they can be scheduled explicitly).

## What changes are included in this PR?

1. Extract the code in the parquet opener into an explicit state machine, so

## Are these changes tested?

Functionally, by existing tests.

I also ran performance tests and they didn't show any substantial performance change.

## Are there any user-facing changes?

No, this is internal code reorganization.

## Follow on tasks

- [ ] Split BloomFilter into CPU and IO states
- [ ] Split decoder reading into CPU and IO states
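
To illustrate the idea of replacing one large `async` closure with an explicit state machine whose steps are labeled as IO or CPU work, here is a minimal sketch. It is not the code from this PR: the `OpenerState` and `WorkKind` names, the individual states, and the hard-coded row counts are all hypothetical stand-ins, and the "IO" steps are simulated synchronously.

```rust
// Illustrative sketch only: names and data are hypothetical,
// not the types used in DataFusion's parquet opener.

/// Kind of work a state performs, so a scheduler could route it.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WorkKind {
    Io,
    Cpu,
    Finished,
}

/// Each variant is one explicit step of the (simulated) open pipeline.
#[derive(Debug)]
enum OpenerState {
    /// IO: fetch file metadata.
    LoadMetadata { path: String },
    /// CPU: decide which row groups are worth reading.
    PruneRowGroups { row_counts: Vec<usize> },
    /// IO: read the surviving row groups, stored as (index, rows).
    ReadRowGroups { selected: Vec<(usize, usize)> },
    /// Terminal state.
    Done { rows_read: usize },
}

impl OpenerState {
    /// Report whether the current state is IO-bound or CPU-bound.
    fn kind(&self) -> WorkKind {
        match self {
            OpenerState::LoadMetadata { .. } | OpenerState::ReadRowGroups { .. } => WorkKind::Io,
            OpenerState::PruneRowGroups { .. } => WorkKind::Cpu,
            OpenerState::Done { .. } => WorkKind::Finished,
        }
    }

    /// Advance by exactly one step, consuming the current state.
    fn step(self) -> OpenerState {
        match self {
            OpenerState::LoadMetadata { path } => {
                if path.is_empty() {
                    // Nothing to open.
                    OpenerState::Done { rows_read: 0 }
                } else {
                    // Stand-in for metadata fetched from storage.
                    OpenerState::PruneRowGroups { row_counts: vec![100, 0, 250] }
                }
            }
            OpenerState::PruneRowGroups { row_counts } => {
                // Keep only non-empty row groups, remembering their index.
                let selected = row_counts
                    .into_iter()
                    .enumerate()
                    .filter(|(_, rows)| *rows > 0)
                    .collect();
                OpenerState::ReadRowGroups { selected }
            }
            OpenerState::ReadRowGroups { selected } => {
                let rows_read = selected.iter().map(|(_, rows)| rows).sum();
                OpenerState::Done { rows_read }
            }
            done @ OpenerState::Done { .. } => done,
        }
    }
}

/// Drive the machine to completion, recording the kind of each state visited.
fn run(path: &str) -> (Vec<WorkKind>, usize) {
    let mut state = OpenerState::LoadMetadata { path: path.to_string() };
    let mut trace = Vec::new();
    loop {
        trace.push(state.kind());
        if let OpenerState::Done { rows_read } = state {
            return (trace, rows_read);
        }
        state = state.step();
    }
}
```

Because every transition goes through `step` and every state reports its `kind`, a caller can decide where each step runs (for example, an IO runtime versus a CPU pool) instead of having that choice buried inside a single closure. Calling `run("example.parquet")` visits the states in the order IO, CPU, IO, finished.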
1 parent 4a36675 commit f830ee3


3 files changed: +847 -417 lines changed


datafusion/datasource-parquet/src/mod.rs

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@
 // specific language governing permissions and limitations
 // under the License.
 
+//! DataFusion Parquet Reader: [`ParquetSource`]
+//!
+//! [`ParquetSource`]: source::ParquetSource
+
 // Make sure fast / cheap clones on Arc are explicit:
 // https://github.com/apache/datafusion/issues/11143
 #![cfg_attr(not(test), deny(clippy::clone_on_ref_ptr))]
