From 0eedf5d00a62aaf03db261721e0ca0009c62c263 Mon Sep 17 00:00:00 2001
From: Bundo Lee
Date: Fri, 15 May 2026 15:55:59 +0900
Subject: [PATCH] fix(processors): log actual page count being processed, not document total
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Objective:
When users pass --pages with values that are out of range (e.g.
--pages 99999 on a 15-page PDF), warnings correctly report that no
pages will be processed, but the same run also logs "Processing 15
pages with 1 threads". Users reading the log cannot tell whether
processing actually happened or not, and the contradiction between
WARN and INFO lines undermines trust in every other log message.

Approach:
In DocumentProcessor.processDocument, switch the INFO log to report
the size of pagesToProcess (the validated set) instead of the
document's total page count. When pagesToProcess is null (no --pages
filter), fall back to totalPages so full-document runs still report
correctly. This is the smallest change that resolves the
contradiction; the surrounding behavior (exit code, empty-output
handling, range auto-clamp asymmetry) is left alone, since those
belong to separate discussions about CLI validation policy, not log
accuracy.

Evidence:
Built the CLI and ran 5 scenarios against a 15-page PDF
(samples/pdf/1901.03003.pdf).

| Scenario               | Before          | After           |
|------------------------|-----------------|-----------------|
| no --pages             | "Processing 15" | "Processing 15" |
| --pages 99999          | "Processing 15" | "Processing 0"  |
| --pages 1,99999        | "Processing 15" | "Processing 1"  |
| --pages 1-5            | "Processing 15" | "Processing 5"  |
| --pages 22-30          | "Processing 15" | "Processing 0"  |

The log now matches the WARN message and the actual JSON output
content.

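For reviewers who want to check the fallback in isolation, the count
selection described under Approach can be sketched as a standalone
helper. This is only an illustration: PageCountSketch and its method
are hypothetical names that do not exist in the codebase, and the
element type of pagesToProcess is assumed to be a collection of
Integer page numbers.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of DocumentProcessor: it isolates the
// count selection so the null fallback can be exercised on its own.
final class PageCountSketch {
    // Returns the page count the INFO log should report.
    // Assumption: pagesToProcess == null means no --pages filter.
    static int pagesToProcessCount(Collection<Integer> pagesToProcess,
                                   int totalPages) {
        return (pagesToProcess != null)
                ? pagesToProcess.size()
                : totalPages;
    }

    public static void main(String[] args) {
        int totalPages = 15;
        // No filter: falls back to the document total.
        System.out.println(pagesToProcessCount(null, totalPages));
        // Filter validated down to nothing (e.g. --pages 99999).
        System.out.println(pagesToProcessCount(List.of(), totalPages));
        // One page survives validation (e.g. --pages 1,99999).
        System.out.println(pagesToProcessCount(List.of(1), totalPages));
    }
}
```

Note that an empty (non-null) collection yields 0 rather than the
total, which is what distinguishes "filter matched nothing" from "no
filter given".
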
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../org/opendataloader/pdf/processors/DocumentProcessor.java | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java b/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
index 24a31d6c9..c0671339d 100644
--- a/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
+++ b/java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java
@@ -283,7 +283,8 @@ private static List> processDocument(String inputPdfName, Config c
         int parallelism = config.getThreads();
         ForkJoinPool pool = new ForkJoinPool(parallelism);
 
-        LOGGER.log(Level.INFO, "Processing {0} pages with {1} threads", new Object[]{totalPages, parallelism});
+        int pagesToProcessCount = (pagesToProcess != null) ? pagesToProcess.size() : totalPages;
+        LOGGER.log(Level.INFO, "Processing {0} pages with {1} threads", new Object[]{pagesToProcessCount, parallelism});
 
         try {
             // Loop 1: ContentFilter per-page (largest bottleneck)