[improvement](be) Add new parquet page-level skip and profile metrics#64214

Open: suxiaogang223 wants to merge 8 commits into apache:refact_reader_branch from suxiaogang223:codex/page-level-skip-filter

@suxiaogang223
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: This PR extends the new parquet reader with page-level skip support and the first set of observability counters needed to validate it. Page index planning now produces page skip plans that Arrow page readers consume, and skip accounting is adjusted so that rows dropped by page filtering are not skipped a second time by row-range pruning. The reader profile now reports scheduler/read path counters, reader read/skip/select rows, Arrow RecordReader timing, materialization timing, skipped page counts and bytes, and row group/page index planning timing. Temporary implementation plan docs were removed after the feature work landed.

Release note

None

Check List (For Author)

  • Test: Unit Test / Manual test
    • Fedora DEBUG build: BUILD_TYPE=DEBUG ./build.sh --be
    • Fedora BE UT: ./run-be-ut.sh -j 8 --run --filter="NewParquetReaderTest.*" (24/24 passed)
  • Behavior changed: No
  • Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

suxiaogang223 marked this pull request as ready for review June 8, 2026 08:34
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 8, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64214

Problem Summary: The new parquet page skip profile mixed logical pruning information with physical reader work. Page skip counters used generic names that could be interpreted as all page-index-pruned pages, while they are updated only when Arrow's data page filter callback actually skips a page. ReaderSkipRows also counted scheduler-level logical skips, including rows already removed by page filtering, so it could overstate the actual RecordReader::SkipRecords work. This change renames the page skip counters to data-page-filter-specific names and updates ReaderSkipRows only for rows actually passed to Arrow RecordReader::SkipRecords. Parent complex readers and the synthetic row-position reader no longer add logical read/skip rows to the physical reader counters.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Pending: NewParquetReaderTest.*
- Behavior changed: No
- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Jun 8, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64214

Problem Summary: The new parquet data page filter setup was implemented as an inline closure inside RecordReader creation, which made the page ordinal tracking, profile updates and page skip plan lookup hard to read. This refactors the filter into a small DataPageSkipFilter helper, centralizes page skip plan lookup/filter installation, and documents the important page-index invariants around data-page ordinals, non-repeated leaves, and double-skip accounting. The remaining touched reader files are formatting-only changes from the project clang-format script.
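The helper described above can be pictured roughly as follows. This is a minimal sketch, assuming the filter callback is invoked once per data page in file order and returns true to skip that page. `PageStatsStub`, `PageSkipCounters`, `PlannedPage`, and all member names are hypothetical stand-ins, not the types used in this PR or in Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-page statistics handed to a page filter.
struct PageStatsStub {
    int64_t num_values = 0;
};

// Hypothetical profile counters; updated only when a page is really skipped.
struct PageSkipCounters {
    int64_t skipped_pages = 0;
    int64_t skipped_compressed_bytes = 0;
};

// One planned data page: whether to skip it and its compressed size.
struct PlannedPage {
    bool skip;
    int64_t compressed_bytes;
};

class DataPageSkipFilter {
public:
    DataPageSkipFilter(std::vector<PlannedPage> plan, PageSkipCounters* counters)
            : _plan(std::move(plan)), _counters(counters) {
        assert(_counters != nullptr);
    }

    // Assumed contract: called once per data page, in file order.
    // Returns true to skip the page, false to read it.
    bool operator()(const PageStatsStub& /*stats*/) {
        const size_t ordinal = _next_ordinal++;
        // A page without a plan entry is always read.
        if (ordinal >= _plan.size() || !_plan[ordinal].skip) {
            return false;
        }
        _counters->skipped_pages += 1;
        _counters->skipped_compressed_bytes += _plan[ordinal].compressed_bytes;
        return true;
    }

private:
    std::vector<PlannedPage> _plan;
    PageSkipCounters* _counters;
    size_t _next_ordinal = 0; // data-page ordinal (assumes only data pages reach the filter)
};
```

Keeping the ordinal, the plan lookup, and the counter updates in one small class makes the invariant easy to check: the counters move only on the branch that returns true.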

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Pending: NewParquetReaderTest.*
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new parquet reader already uses the page index to narrow row ranges, but gaps between selected ranges were still skipped through RecordReader::SkipRecords(). As a result, page bodies in page-aligned gaps may still be read and decompressed. Directly enabling Arrow PageReader data page filtering would double skip rows, because the scheduler also calls SkipRecords() for the same logical gap. This change carries a per-leaf page skip plan from page-index planning to row group readers, injects Arrow PageReader::set_data_page_filter() before RecordReader::SetPageReader(), and adjusts ScalarColumnReader skip accounting so page-filtered rows are not passed to SkipRecords(). The first phase only enables local page skipping for primitive non-repeated leaves and keeps LIST/MAP/repeated leaves on existing row-range pruning.
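The double-skip problem described above can be illustrated with a small sketch. It assumes non-repeated leaves, so that one value corresponds to one row, and half-open row ranges within a single row group. `RowRange`, `PageRows`, `build_page_skip_plan`, and `rows_for_skip_records` are hypothetical names for illustration, not the functions in this change.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open row interval [begin, end) within one row group.
struct RowRange {
    int64_t begin;
    int64_t end;
};

// Row span of one data page, as it could be derived from the offset index.
struct PageRows {
    int64_t first_row;
    int64_t num_rows;
};

// A page is skippable only if it overlaps no selected row range.
std::vector<bool> build_page_skip_plan(const std::vector<PageRows>& pages,
                                       const std::vector<RowRange>& selected) {
    std::vector<bool> skip(pages.size(), true);
    for (size_t i = 0; i < pages.size(); ++i) {
        const int64_t page_begin = pages[i].first_row;
        const int64_t page_end = page_begin + pages[i].num_rows;
        for (const RowRange& r : selected) {
            assert(r.begin <= r.end);
            if (r.begin < page_end && page_begin < r.end) {
                skip[i] = false;
                break;
            }
        }
    }
    return skip;
}

// Rows of the gap [gap_begin, gap_end) that must still be skipped explicitly.
// Rows inside page-filtered pages never reach the record reader, so passing
// them to a SkipRecords-style call as well would skip them twice.
int64_t rows_for_skip_records(const std::vector<PageRows>& pages,
                              const std::vector<bool>& skip,
                              int64_t gap_begin, int64_t gap_end) {
    assert(pages.size() == skip.size());
    assert(gap_begin <= gap_end);
    int64_t remaining = gap_end - gap_begin;
    for (size_t i = 0; i < pages.size(); ++i) {
        if (!skip[i]) continue;
        const int64_t lo = std::max(pages[i].first_row, gap_begin);
        const int64_t hi = std::min(pages[i].first_row + pages[i].num_rows, gap_end);
        remaining -= std::max<int64_t>(0, hi - lo);
    }
    return remaining;
}
```

With four 100-row pages and selected ranges [50, 60) and [350, 360), the two middle pages are filtered, so only 90 of the 290 rows in the gap [60, 350) would need an explicit skip.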

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran targeted clang-format v16 through build-support/run_clang_format.py for modified C++ files
    - Ran git diff --check and git diff --cached --check
    - Attempted ./run-be-ut.sh --run --filter='NewParquetReaderTest.*', but local CMake compiler check failed before Doris code compilation because /opt/homebrew/opt/llvm@16/bin/clang++ cannot link a simple program: ld: library 'c++' not found
- Behavior changed: No
- Does this need documentation: Yes (updated docs/page-level-skip-plan.md)
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Page-level skip filtering in the new Parquet reader did not expose profile counters for the actual pages skipped by Arrow's data_page_filter callback or the compressed bytes associated with those skipped pages. This makes it hard to validate page-level pruning effectiveness from runtime profiles. This change wires page skip profile counters from ParquetReader through ParquetScanScheduler to ParquetColumnReaderFactory, updates them only when the callback actually skips a page, and records skipped compressed bytes from OffsetIndex page locations.
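As a rough illustration of how skipped bytes and rows can be derived from offset index entries, the sketch below mirrors the three fields of a Parquet `PageLocation` (`offset`, `compressed_page_size`, `first_row_index`). The struct and function names are hypothetical, and the totals computed here are the kind of expected values a unit test might compare against the profile counters, not the counter update path itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the fields of one Parquet OffsetIndex PageLocation entry.
struct PageLocationStub {
    int64_t offset;
    int32_t compressed_page_size;
    int64_t first_row_index;
};

// Rows in page i: distance to the next page's first row, or to the end of
// the row group for the last page.
int64_t page_num_rows(const std::vector<PageLocationStub>& locs, size_t i,
                      int64_t row_group_rows) {
    assert(i < locs.size());
    const int64_t next =
            (i + 1 < locs.size()) ? locs[i + 1].first_row_index : row_group_rows;
    assert(next >= locs[i].first_row_index);
    return next - locs[i].first_row_index;
}

struct SkippedTotals {
    int64_t pages;
    int64_t compressed_bytes;
    int64_t rows;
};

// Sums pages, compressed bytes, and rows over the pages marked as skipped.
SkippedTotals total_skipped(const std::vector<PageLocationStub>& locs,
                            const std::vector<bool>& skip, int64_t row_group_rows) {
    assert(locs.size() == skip.size());
    SkippedTotals totals{0, 0, 0};
    for (size_t i = 0; i < locs.size(); ++i) {
        if (!skip[i]) continue;
        totals.pages += 1;
        totals.compressed_bytes += locs[i].compressed_page_size;
        totals.rows += page_num_rows(locs, i, row_group_rows);
    }
    return totals;
}
```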

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added NewParquetReaderTest coverage for page skip compressed bytes and profile counters. Not run locally; will run on Fedora after syncing branch.
- Behavior changed: No
- Does this need documentation: Yes (updated docs/observability-profile-plan.md)
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The observability profile plan did not clearly distinguish between counters registered by the new parquet reader and counters that currently have real update paths. This update documents the current effective profile counters, identifies registered-but-unwired counters, and records the next profile gaps to fill for scheduler, column reader, page index, nested assembler, and page-level skip observability.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation only)

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new parquet reader registered many profile counters, but several low-risk scan path counters still had no update path, making it hard to observe row-level filtering, empty batches, range-gap skips, and file reader creation from query profiles. This change publishes file reader lifecycle statistics, wires scheduler-level scan counters and timers, and documents the current profile state.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added new parquet reader unit coverage for profile counters. Fedora unit test will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The new parquet reader now needs observability for the first three profile priorities: scheduler/read path counters, column reader and Arrow adapter timing, and row group/page index planning timing. This change wires reader-level row counters, Arrow RecordReader and materialization timers, and planning timers into the existing ParquetReader profile, with unit test assertions and profile documentation updates.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Fedora DEBUG build and NewParquetReaderTest will be run after pushing this branch.
- Behavior changed: No
- Does this need documentation: Yes
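The reader-level row counters and the RecordReader and materialization timers mentioned in this commit could be wired along these lines. This is a standalone sketch using `std::chrono`. `ReaderProfileStub`, `ScopedNsTimer`, and `read_batch` are hypothetical and do not represent the project's profile API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical slice of a reader profile.
struct ReaderProfileStub {
    int64_t reader_read_rows = 0;
    int64_t record_reader_time_ns = 0;
    int64_t materialize_time_ns = 0;
};

// Adds the elapsed wall time of its scope, in nanoseconds, to *sink.
class ScopedNsTimer {
public:
    explicit ScopedNsTimer(int64_t* sink)
            : _sink(sink), _start(std::chrono::steady_clock::now()) {
        assert(_sink != nullptr);
    }
    ~ScopedNsTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - _start;
        *_sink += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    }
    ScopedNsTimer(const ScopedNsTimer&) = delete;
    ScopedNsTimer& operator=(const ScopedNsTimer&) = delete;

private:
    int64_t* _sink;
    std::chrono::steady_clock::time_point _start;
};

// Simulated batch read: each phase is timed separately, and the row counter
// is updated once per batch.
int64_t read_batch(ReaderProfileStub* profile, int64_t rows) {
    assert(profile != nullptr);
    {
        ScopedNsTimer timer(&profile->record_reader_time_ns);
        // The record reader decode step would run here.
    }
    {
        ScopedNsTimer timer(&profile->materialize_time_ns);
        // Materialization into the output column would run here.
    }
    profile->reader_read_rows += rows;
    return rows;
}
```

Separate timers per phase keep decode cost and materialization cost distinguishable in the profile, which is the distinction the commit summary draws.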
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The temporary new parquet page-level skip and observability profile planning documents described implementation work that is now completed on this branch. Keeping these design notes in docs would make the current implementation status harder to read, so this change removes the obsolete plan documents.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation cleanup only)
- Behavior changed: No
- Does this need documentation: No
suxiaogang223 force-pushed the codex/page-level-skip-filter branch from af16607 to d9c7a81 June 8, 2026 08:59