Skip to content

fix(processors): log actual page count being processed, not document total#513

Merged
bundolee merged 1 commit into
mainfrom
fix/pages-log-actual-count
May 15, 2026
Merged

fix(processors): log actual page count being processed, not document total#513
bundolee merged 1 commit into
mainfrom
fix/pages-log-actual-count

Conversation

@bundolee
Copy link
Copy Markdown
Contributor

@bundolee bundolee commented May 15, 2026

Objective

When users pass --pages with values that are out of range (e.g. --pages 99999 on a 15-page PDF), warnings correctly report that no pages will be processed, but the same run also logs Processing 15 pages with 1 threads. Users reading the log cannot tell whether processing actually happened or not, and the contradiction between WARN and INFO lines undermines trust in every other log message.

Example before this patch:

WARN: Requested pages [99999] do not exist in document (total pages: 15). Processing only existing pages: []
WARN: No valid pages to process. Document has 15 pages but requested: [99999]
INFO: Processing 15 pages with 1 threads

The same contradiction occurs with --pages 1,99999, --pages 22-30, and any other input that filters out some or all pages — the INFO line always echoes the document's total page count rather than what is actually being processed.

Approach

In DocumentProcessor.processDocument, switch the INFO log to report the size of pagesToProcess (the validated set) instead of the document's total page count. When pagesToProcess is null (no --pages filter), fall back to totalPages so full-document runs still report correctly.

This is the smallest change that resolves the contradiction. Surrounding behavior — exit code on empty results, empty-output file generation, the range auto-clamp asymmetry between 1-99999 and 22-30 — is intentionally left alone. Those belong to a separate discussion about CLI validation policy, not log accuracy.

Evidence

Built the CLI and ran five scenarios against a 15-page PDF (samples/pdf/1901.03003.pdf), grepping for the Processing N pages with 1 threads line.

Scenario Expected Before After
no --pages (full doc) 15 Processing 15 pages Processing 15 pages
--pages 99999 (all invalid) 0 Processing 15 pages Processing 0 pages
--pages 1,99999 (1 valid + 1 invalid) 1 Processing 15 pages Processing 1 pages
--pages 1-5 (5 valid) 5 Processing 15 pages Processing 5 pages
--pages 22-30 (all out of range) 0 Processing 15 pages Processing 0 pages

After the patch, the INFO line matches both the WARN messages emitted by getValidPageNumbers and the actual JSON output content.

Test plan

  • Build passes (mvn clean install -DskipTests)
  • Full-document runs still report total page count
  • All-invalid --pages reports 0
  • Mixed valid/invalid --pages reports only the valid count
  • Range input reports actual processed count

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of page processing logs to correctly reflect the number of pages selected for processing rather than always reporting the total document page count.

Review Change Stack

…total

Objective: When users pass --pages with values that are out of range
(e.g. --pages 99999 on a 15-page PDF), warnings correctly report that
no pages will be processed, but the same run also logs "Processing 15
pages with 1 threads". Users reading the log cannot tell whether
processing actually happened or not, and the contradiction between
WARN and INFO lines undermines trust in every other log message.

Approach: In DocumentProcessor.processDocument, switch the INFO log to
report the size of pagesToProcess (the validated set) instead of the
document's total page count. When pagesToProcess is null (no --pages
filter), fall back to totalPages so full-document runs still report
correctly. This is the smallest change that resolves the contradiction;
the surrounding behavior (exit code, empty-output handling, range
auto-clamp asymmetry) is left alone — those belong to separate
discussions about CLI validation policy, not log accuracy.

Evidence: Built the CLI and ran 5 scenarios against a 15-page PDF
(samples/pdf/1901.03003.pdf).

| Scenario               | Before          | After           |
|------------------------|-----------------|-----------------|
| no --pages             | "Processing 15" | "Processing 15" |
| --pages 99999          | "Processing 15" | "Processing 0"  |
| --pages 1,99999        | "Processing 15" | "Processing 1"  |
| --pages 1-5            | "Processing 15" | "Processing 5"  |
| --pages 22-30          | "Processing 15" | "Processing 0"  |

The log now matches the WARN message and the actual JSON output content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ada736a0-0007-476b-af37-14192af23859

📥 Commits

Reviewing files that changed from the base of the PR and between 56e5543 and 0eedf5d.

📒 Files selected for processing (1)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java

Walkthrough

The processDocument(...) method now logs the count of pages selected for processing when a pagesToProcess set is provided, rather than always reporting the document's total page count. The informational log message is updated to use pagesToProcess.size() when available, otherwise defaulting to totalPages.

Changes

Page Count Logging

Layer / File(s) Summary
Page count logging in processDocument
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
Logging statement now computes pagesToProcessCount from the pagesToProcess set (or falls back to totalPages when null) and reports that count alongside the thread count in the informational log message.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested reviewers

  • MaximPlusov
  • LonelyMidoriya
  • hyunhee-jo
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: logging actual page count being processed instead of document total when page filters are applied.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bundolee bundolee merged commit 2e585ac into main May 15, 2026
8 checks passed
@bundolee bundolee deleted the fix/pages-log-actual-count branch May 15, 2026 07:00
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant