feat: HTML + Markdown output formats (closes #6, builds on #8) by ahnafnafee · Pull Request #10 · ahnafnafee/local-llm-pdf-ocr

ahnafnafee · 2026-05-09T16:57:19Z

Summary

Closes #6 by adding HTML and Markdown as output formats alongside PDF. Builds on @milahu's draft PR #8: milahu's seven commits are preserved at their original SHAs at the branch base, with Co-authored-by trailers on the relocated logic.

What you can now do

# Auto-named per format
uv run local-llm-pdf-ocr scan.pdf --format html      # → scan_ocr.html
uv run local-llm-pdf-ocr scan.pdf --format md        # → scan_ocr.md

# Or pick by extension
uv run local-llm-pdf-ocr scan.pdf out.html
uv run local-llm-pdf-ocr scan.pdf notes.md

# HTML with dark-mode page inversion (opt-in)
uv run local-llm-pdf-ocr scan.pdf --format html --html-invert-dark

# Server: same, via the /process form field
curl -F file=@scan.pdf -F client_id=x -F format=html http://localhost:8000/process

The web UI (server.py) gets a small format dropdown beneath the upload zone.

What changed

Architecture

src/pdf_ocr/core/html.py (new): HTMLHandler with PDF and image input support (PR #8 only handled images), in-memory StringIO build, base64-inlined page images so the file is self-contained. Supports three sizing modes (scaled default, letter-spacing, full-height) and an opt-in invert_dark flag that injects filter: invert() hue-rotate(180deg) under prefers-color-scheme: dark.
src/pdf_ocr/core/markdown.py (new): MarkdownHandler emits a # OCR output header, ## Page N per page, and one block per non-empty box in reading order. No paragraph-break heuristic (multi-column layouts break gap-based heuristics).
src/pdf_ocr/core/_layout.py (new): extracted is_full_page_fallback and split_multi_line_bbox helpers shared by every writer. Both PDF and HTML now route the aligner's [0,0,1,1]+\n fallback bbox identically, and split \n-joined visual lines into per-line sub-spans.
src/pdf_ocr/output.py (new): resolve_output_writer, media_type_for, suffix_for_format, format_from_path, SUPPORTED_FORMATS. Single source of truth for format dispatch; CLI and server both consume it.
src/pdf_ocr/cli.py: adds --format {pdf,html,md} with extension-wins precedence and the --html-invert-dark opt-in flag; passes the dispatched writer to OCRPipeline(output_writer=…). Also reverts PR #8's raise; sys.exit(1) regression in main().
src/pdf_ocr/server.py: adds format: str = Form("pdf") to /process; wires Content-Type and download filename through media_type_for/suffix_for_format. Validates format and returns 400 on unsupported values.

`--html-invert-dark` flag (addresses PR feedback)

Added in response to @milahu's feedback about dark-mode page inversion. The CSS filter: invert() hue-rotate(180deg) on div.page is useful for reading scanned white-background documents at night, but is now opt-in since most PDF readers and browsers do not invert scanned images by default.

HTMLHandler(invert_dark=True) injects an additional @media (prefers-color-scheme: dark) block
CLI: --html-invert-dark
resolve_output_writer(html_invert_dark=True) passes it through
Example output: examples/output_digital_invert_dark.html
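
The opt-in injection described above can be pictured roughly as follows. This is a minimal sketch, not the code in core/html.py: `BASE_CSS`, `DARK_INVERT_CSS`, and `build_css` are hypothetical names, and the base stylesheet is a placeholder.

```python
# Hypothetical stand-ins for the real stylesheet constants, whose full
# contents are not shown in this PR.
BASE_CSS = "div.page { position: relative; }\n"
DARK_INVERT_CSS = (
    "@media (prefers-color-scheme: dark) {\n"
    "  div.page { filter: invert() hue-rotate(180deg); }\n"
    "}\n"
)


def build_css(invert_dark: bool = False) -> str:
    """Return the stylesheet, appending the dark-mode block only on opt-in."""
    css = BASE_CSS
    if invert_dark:
        css += DARK_INVERT_CSS
    return css
```

With the default `invert_dark=False` the output contains no `filter` rule at all, which matches the "Without flag" item in the test plan.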

Alignment quality — `scaled` default

The HTML invisible-text overlay defaults to scaled mode — font shrinks to fit both bbox dimensions, staying legible at any zoom level. Two alternative modes (letter-spacing, full-height) are available via HTMLHandler(mode=…) / --html-mode.
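
One way to read "font shrinks to fit both bbox dimensions" is sketched below. The function name and the 0.6 monospace advance ratio are assumptions for illustration; the actual HTMLHandler may compute this differently (for example in container-query units).

```python
def scaled_font_size(text: str, bbox_w: float, bbox_h: float,
                     advance_ratio: float = 0.6) -> float:
    """Largest font size (px) at which one line of `text` fits in the bbox.

    `advance_ratio` is an ASSUMED monospace glyph advance as a fraction of
    the font size; the real value depends on the font in use.
    """
    if not text:
        return bbox_h
    # Height limit: the line cannot be taller than the box.
    by_height = bbox_h
    # Width limit: len(text) glyphs of width (advance_ratio * size) must fit.
    by_width = bbox_w / (len(text) * advance_ratio)
    return min(by_height, by_width)
```

Short labels in tall boxes are limited by height, long lines in narrow boxes by width, so the text never overflows in either direction.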

PR #8 cleanup

Issue	Resolution
`_embed_structured_text_html` raised `NotImplementedError` for PDF inputs	Fixed — `HTMLHandler` rasterizes per page like `PDFHandler`
Stray `print(f"output_ext: ...")` debug statement	Removed
Multiple `# TODO remove` blocks of unreachable code	Removed
`raise; sys.exit(1)` in cli.py	Reverted
HTML logic embedded in `PDFHandler`	Relocated to `core/html.py` per author's recommendation

Tests

113 new tests, all on the fast tier (no LLM, no Surya):

test_layout.py — 14 tests: full-page-fallback detection, multi-line bbox split
test_html_handler.py — 15 tests: doctype + title, base64 inline, all sizing modes, full-page fallback, multi-line split, edge cases
test_markdown_handler.py — 7 tests: structure, page order, reading order, empty-box skip
test_output_dispatch.py — 28 tests: extension inference, suffix/media-type lookup, writer resolution
test_cli.py — 9 tests: resolve_output_path precedence, --format parser
test_server.py — 8 tests: format dispatch, download naming, error handling

Total suite: 294 tests passing (271 fast + 23 slow).

Documentation

README: --html-invert-dark in the CLI options table and examples section
CLAUDE.md: updated commands, HTMLHandler docstring with invert_dark parameter

Authorship preservation

milahu's seven commits (f775df5…10078d8) remain at their original SHAs at the branch base: no rebase, squash, amend, or force-push. Cleanup commits on top use Co-authored-by: Milan Hauth <milahu@milahu.duckdns.org> wherever the new code derives from theirs.

Test plan

All 294 tests pass (uv run pytest)
CLI --help shows --html-invert-dark
Without flag: no filter: invert() in HTML output
With --html-invert-dark: filter: invert() hue-rotate(180deg) present under @media (prefers-color-scheme: dark)
All example outputs regenerated against current code

PR #8 added `raise` immediately before `sys.exit(1)` in main()'s outer exception handler, making the sys.exit unreachable. The exception text is already surfaced via console.print in run() (lines 158, 193), so re-raising on top of that produces a duplicate Python traceback that adds no diagnostic value for normal user errors. Restore the original behavior: print the friendly error, exit 1. For genuine traceback debugging, --verbose / -v still enables DEBUG logging across the pipeline.

PR #8 (DRAFT) embedded a 378-line HTML emission path inside PDFHandler.embed_structured_text by sniffing the output extension. The implementation has working ideas but several blockers for merge:

- _embed_structured_text_html raises NotImplementedError on PDF inputs; only image inputs work
- a stray `print(f"output_ext: ...")` debug statement on every call
- multiple `# TODO remove` blocks of unreachable code after `return`
- HTML rendering coupled to PDFHandler in violation of separation of concerns (the PR author explicitly suggests moving it to src/pdf_ocr/core/html.py; see #8 description)

This commit reverts the HTML logic out of pdf.py without losing the work. The next commit reintroduces the equivalent functionality in a clean, dedicated `src/pdf_ocr/core/html.py` module that:

- supports PDF inputs (mirrors PDFHandler's PDF-rasterize path)
- supports image inputs (single-frame and multi-frame TIFF)
- defaults to letter-spacing mode for accurate selection extents
- adds Markdown export to fully address #6

Authorship of the HTML logic is preserved on that subsequent commit via Co-authored-by: Milan Hauth <milahu@milahu.duckdns.org>. milahu's seven commits remain at their original SHAs at the branch base; this is a forward commit, not a history rewrite.

Two bbox-handling helpers were inlined in PDFHandler._draw_invisible_text but are needed by every output writer (PDF, HTML, future formats):

- is_full_page_fallback: detect the aligner's [0,0,1,1] + '\n' fallback
- split_multi_line_bbox: split a bbox containing '\n'-joined lines into per-line sub-bboxes with proportional vertical slices

Move them to a new pure-helper module so HTMLHandler (added next) reuses the same logic instead of duplicating it. PDFHandler is updated to call the helpers; behavior is identical (verified by all 166 fast tests still passing).

Add 14 unit tests covering both helpers' contract: tolerance window on the fallback detector, proportional vertical splits, empty-line dropping, single-line short-circuit, and rect-list aliasing safety.
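
As a rough illustration of the two helpers' contract, a sketch might look like the following. The signatures, the 0.01 tolerance, and the choice of equal-height slices per non-empty line are all assumptions; the real `_layout.py` is not shown in this PR page.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)


def is_full_page_fallback(bbox: Sequence[float], text: str,
                          tol: float = 0.01) -> bool:
    """True when bbox is approximately [0,0,1,1] and text spans several lines.

    `tol` is an assumed tolerance window, not the project's actual value.
    """
    target = (0.0, 0.0, 1.0, 1.0)
    near = all(abs(a - b) <= tol for a, b in zip(bbox, target))
    return near and "\n" in text


def split_multi_line_bbox(bbox: Sequence[float],
                          text: str) -> List[Tuple[BBox, str]]:
    """Split a bbox holding '\\n'-joined lines into per-line sub-bboxes.

    Empty lines are dropped; each remaining line gets an equal vertical
    slice (one assumed reading of "proportional").
    """
    lines = [ln for ln in text.split("\n") if ln.strip()]
    x0, y0, x1, y1 = bbox
    if not lines:
        return []
    if len(lines) == 1:  # single-line short-circuit
        return [((x0, y0, x1, y1), lines[0])]
    step = (y1 - y0) / len(lines)
    return [((x0, y0 + i * step, x1, y0 + (i + 1) * step), ln)
            for i, ln in enumerate(lines)]
```

Both functions return new tuples rather than mutating the input, which is one way to satisfy the "rect-list aliasing safety" item in the tests.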

PR #8's prototype was hooked into PDFHandler and only worked for image inputs (PDFs raised NotImplementedError). This commit reintroduces the HTML output as a dedicated `pdf_ocr.core.html.HTMLHandler` class that:

- supports BOTH PDF and image inputs (PDFs are rasterized per page like PDFHandler does, then inlined as base64 JPEG data URLs)
- builds the document in-memory via io.StringIO, then writes once; this decouples I/O from iteration and makes assertions trivial in tests
- defaults to `letter-spacing` mode (the proven approach from milahu's archive-hocr-tools example; PR #8's "disable letter-spacing by default" commit was reverting that). Selection extents now span the full bbox horizontally, the practical ceiling for invisible-text alignment without per-glyph positions from OCR.
- offers two alternative sizing modes (`full-height`, `scaled`) selected via the `mode=` constructor arg
- delegates the [0,0,1,1] full-page fallback and `\n`-joined multi-line bbox handling to `core/_layout.py` so HTML and PDF outputs treat these edge cases identically
- always inlines page images as base64 data URLs: no relative paths that would break if the user moves the HTML, no sidecar files

The class signature mirrors `PDFHandler.embed_structured_text` so it slots into `OCRPipeline(output_writer=...)` with no other plumbing.

Tests (15 new, all passing) drive the writer end-to-end via the ground-truth fixtures and synthesized images, with no LLM and no Surya. They cover the three sizing modes, both edge cases, multi-frame TIFF, AVIF, PDF inputs, and HTML escaping of <>&.

Co-authored-by: Milan Hauth <milahu@milahu.duckdns.org>

Issue #6 asks for both Markdown AND HTML export; PR #8 only added HTML. This commit completes the issue by adding `pdf_ocr.core.markdown.MarkdownHandler` with the same `OutputWriter` signature as PDFHandler / HTMLHandler.

Document shape:

- `# OCR output: <input_filename>` top-level header
- `## Page N` per page (sorted numerically by page index)
- One block per non-empty box, in reading order, separated by blank lines

No paragraph-break heuristic. A vertical-gap heuristic ("gap > 1.5x line-height = new paragraph") misfires on multi-column layouts because y-coordinates jump backward at each column break; the broken cases are worse than just emitting one block per box. Users who want flowed text can post-process trivially.

Tests (7 new, all passing) cover the document shape, page sorting, within-page reading-order preservation, empty-box short-circuit, and end-to-end via the digital ground-truth fixture.
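
The document shape above can be sketched in a few lines. This is an illustrative stand-in, not MarkdownHandler itself: `render_markdown` is a hypothetical name, and the `pages_data` layout (0-based page index mapped to a list of `(bbox, text)` pairs) is an assumption.

```python
from typing import Dict, List, Tuple

Box = Tuple[List[float], str]  # (bbox, text); the bbox is unused here


def render_markdown(input_filename: str,
                    pages_data: Dict[int, List[Box]]) -> str:
    """Emit a top-level header, one `## Page N` per page, one block per box."""
    parts = [f"# OCR output: {input_filename}"]
    for page_idx in sorted(pages_data):  # numeric sort, not lexicographic
        parts.append(f"## Page {page_idx + 1}")  # assumes 0-based indices
        for _bbox, text in pages_data[page_idx]:
            if text.strip():  # skip empty boxes
                parts.append(text.strip())
    # Blank line between every block; boxes keep their incoming order.
    return "\n\n".join(parts) + "\n"
```

Because no gap heuristic is applied, the function never needs the bbox coordinates, which is what keeps it robust on multi-column pages.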

Single source of truth for mapping output paths -> writer + suffix + HTTP media type. Both the CLI and the FastAPI server consume this so adding a new output format requires editing one module.

API:

- format_from_path(path) -> "pdf"|"html"|"md"
- suffix_for_format(fmt) -> ".pdf"|".html"|".md" (raises on unknown)
- media_type_for(path) -> "application/pdf"|"text/html"|"text/markdown"
- resolve_output_writer(path) -> bound .embed_structured_text method matching the path's extension (defaults to PDF for unknown extensions)
- SUPPORTED_FORMATS -> ("pdf", "html", "md") for CLI --format choices

Re-export the new symbols and the HTML / Markdown handler classes from the package root so users can `from pdf_ocr import resolve_output_writer`.

Tests (28 new, all passing) cover format inference (case-insensitive, nested paths, alt extensions like .htm and .markdown), suffix lookup, media-type lookup, dispatch returning the right writer class, and the CLI-facing canonical format ordering.
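
A table-driven sketch of the lookup half of this API is shown below. The function names and return values follow the commit message; the internal tables, the `ValueError` type, and the PDF default inside `format_from_path` are assumptions, and `resolve_output_writer` is omitted because it depends on the handler classes.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("pdf", "html", "md")

# Assumed internal tables; .htm and .markdown are the alternate
# extensions mentioned in the test description.
_EXT_TO_FORMAT = {".pdf": "pdf", ".html": "html", ".htm": "html",
                  ".md": "md", ".markdown": "md"}
_SUFFIX = {"pdf": ".pdf", "html": ".html", "md": ".md"}
_MEDIA = {"pdf": "application/pdf", "html": "text/html",
          "md": "text/markdown"}


def format_from_path(path) -> str:
    """Infer the format from the extension, case-insensitively."""
    return _EXT_TO_FORMAT.get(Path(path).suffix.lower(), "pdf")


def suffix_for_format(fmt: str) -> str:
    """Return the canonical suffix; raise on an unknown format."""
    if fmt not in _SUFFIX:
        raise ValueError(f"unsupported format {fmt!r}; "
                         f"expected one of {SUPPORTED_FORMATS}")
    return _SUFFIX[fmt]


def media_type_for(path) -> str:
    """Return the HTTP media type for the path's inferred format."""
    return _MEDIA[format_from_path(path)]
```

Keeping all three mappings in one module means a new format touches these tables and nothing in the CLI or server.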

…xport

Add a `--format {pdf,html,md}` CLI flag and route output through the shared `resolve_output_writer` dispatch.

Precedence rules:

- Explicit `output` path → its extension picks the writer (so `input.pdf out.md --format html` produces Markdown; extension wins)
- Explicit `output` with unknown extension → falls through to pdf
- No `output`, with --format → auto-name uses --format's suffix (e.g. `--format html` → `<stem>_ocr.html`)
- No `output`, no --format → auto-name `<stem>_ocr.pdf` (unchanged)

Heavy imports stay lazy; argparse `--help` is still fast.

Tests (9 new) cover all five precedence cases, the parser's choices list (pdf/html/md), invalid-value rejection, and that the existing positional `output` argument still parses unchanged.
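
The precedence rules reduce to a small function. The name `resolve_output_path` appears in the test list, but the signature below is a guess, as is placing the auto-named file next to the input.

```python
from pathlib import Path
from typing import Optional

_SUFFIX = {"pdf": ".pdf", "html": ".html", "md": ".md"}


def resolve_output_path(input_path: str, output: Optional[str] = None,
                        fmt: Optional[str] = None) -> Path:
    """Pick the output path; an explicit `output` always wins over `fmt`."""
    if output is not None:
        # The writer is later chosen from this path's extension, so
        # --format is ignored here (unknown extensions fall through to pdf).
        return Path(output)
    src = Path(input_path)
    suffix = _SUFFIX[fmt or "pdf"]  # no --format keeps the old pdf default
    return src.with_name(f"{src.stem}_ocr{suffix}")
```

For example, `resolve_output_path("input.pdf", "out.md", "html")` returns `out.md`, so the Markdown writer is selected even though `--format html` was passed.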

Add a `format` form field to the FastAPI `/process` endpoint, with the same precedence semantics as the CLI:

- `format=pdf` (default) → existing behavior, application/pdf response
- `format=html` → text/html response, file named `ocr_<stem>.html`
- `format=md` → text/markdown response, file named `ocr_<stem>.md`
- unsupported value → HTTP 400 with the list of accepted formats

The temp output filename, response media type, and download filename all derive from `pdf_ocr.output.{suffix_for_format,media_type_for}`. The download filename strips the original extension so a .pdf input with format=html produces `ocr_scan.html`, not `ocr_scan.pdf.html`.

Web UI: add an "Output format" select beneath the upload zone, with three options matching the API. The chosen value is appended to the form data before /process is called, and the download filename uses the matching suffix.

Tests (8 new) drive the endpoint via fastapi.testclient with a stubbed OCRPipeline (no Surya, no LLM). They cover the four format paths, the 400 response for unknown formats, the download-filename suffix rule, and a smoke test of the unchanged `/` and `/text/{job_id}` routes.
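
The download-filename rule is small enough to sketch in isolation. `download_filename` is a hypothetical helper; in the real server a rejected format would surface as the HTTP 400 described above rather than a bare `ValueError`.

```python
from pathlib import Path

_SUFFIX = {"pdf": ".pdf", "html": ".html", "md": ".md"}


def download_filename(upload_name: str, fmt: str) -> str:
    """Build `ocr_<stem><suffix>`, dropping the upload's own extension."""
    if fmt not in _SUFFIX:
        # A request handler would translate this into an HTTP 400.
        raise ValueError(f"unsupported format {fmt!r}; "
                         f"accepted: {sorted(_SUFFIX)}")
    return f"ocr_{Path(upload_name).stem}{_SUFFIX[fmt]}"
```

Using the stem rather than the full name is what avoids results like `ocr_scan.pdf.html`.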

README:

- Update the "Searchable Outputs" feature bullet to mention all three formats (PDF, HTML, Markdown) and the dispatch mechanism (extension inference or --format flag)
- Add `--format {pdf,html,md}` to the CLI options table; extend the `output` row to describe extension inference
- Add CLI examples for HTML and Markdown output (auto-named and explicit-path forms; calls out HTML's base64 inlining size implication)
- Add a "Contributors" section using the contrib.rocks badge. The avatar grid is regenerated automatically from GitHub's contributors API on every render, so the section stays current without manual edits as new PRs land. Includes a one-line "how it works" pointer.

CLAUDE.md:

- Add `--format html` and `out.md` examples to the Commands block
- Update the OCRPipeline output_writer row to describe the new resolve_output_writer dispatch
- Add HTMLHandler, MarkdownHandler, and the shared _layout helpers to the Core classes table
- Update test count in the testing section to match the new suite size
- Cross-reference resolve_output_writer in the Extension points note

Reference outputs from running the actual CLI against examples/*.pdf and examples/image.png with LM Studio + allenai/olmocr-2-7b. Lets reviewers inspect the new HTML and Markdown writers without setting up an LLM.

Sizes:

examples_ocr/digital_ocr.html 461 KB (1 page, digital text)
examples_ocr/digital_ocr.md 3 KB
examples_ocr/handwritten_ocr.html 255 KB (1 page, handwriting)
examples_ocr/handwritten_ocr.md 571 B
examples_ocr/hybrid_ocr.html 320 KB (1 page, mixed digital + handwritten)
examples_ocr/hybrid_ocr.md 836 B
examples_ocr/image_ocr.html 510 KB (PNG image input)
examples_ocr/image_ocr.md 364 B

Open any .html file in a browser to verify:

- The page image renders as the background
- Selecting text returns the OCR'd content
- Browser Find (Ctrl+F) lands on the right region
- Dark mode (system preference) inverts the page colors

Generated with: uv run local-llm-pdf-ocr examples/<file> examples_ocr/<stem>_ocr.<ext> (default settings: --dpi 200, refine on, hybrid path, letter-spacing mode).

Browser smoke test of every HTMLHandler output via Playwright. All four files rendered correctly with no JavaScript console errors:

digital_ocr.html: 30 overlay spans, GMU CS 701 form
handwritten_ocr.html: 15 overlay spans, dark notebook + handwriting
hybrid_ocr.html: 15 overlay spans, mixed-media intake form
image_ocr.html: 19 overlay spans, German handwritten chart

The `<div class="page">` background image renders, the `<span class="line">` overlays sit invisibly on top, and selection extents track the visible text, which is the "perfect alignment" goal of issue #6.

Screenshots are committed to examples_ocr/screenshots/ so PR reviewers can inspect without setting up an LLM, mirroring the existing examples/output_*.pdf convention for the searchable PDF format.

…s/ (#6)

User-reported bug from PR #10's first round of example outputs:

- Page rendered DARK on systems with `prefers-color-scheme: dark` because of `div.page { filter: invert() hue-rotate(180deg); }`
- The same filter caused `color: transparent` overlay spans to render as faintly visible cyan glyph outlines in Chromium, defeating the purpose of an invisible OCR text layer
- The "alignment is way off" perception followed from the spans being visible. Bbox positions were correct, but glyph-by-glyph alignment between the browser's monospace font and the PDF's serif font is not (and cannot be) achievable without per-glyph OCR positions

Fix: remove the dark-mode filter rule entirely. The OCR HTML now preserves the source page's appearance regardless of OS theme. Users who want dark theming for documents can use a browser extension; that is the right place to make a global appearance decision, not in a per-document artifact. A CSS comment in core/html.py documents the rationale so the rule isn't reintroduced.

Body background changed from default white to a neutral `#f5f5f5` gutter so the page outline reads cleanly on either side of the page.

Also: moved the example outputs from `examples_ocr/` into the existing `examples/` directory and renamed to follow the project's existing `output_<stem>.<ext>` convention (matching the existing `examples/output_digital.pdf` etc.). Screenshots live in `examples/screenshots/` to keep the top-level tidy.

Verified via Playwright browser smoke test: all four files now render with light-gray body background, no `filter` on the page div, and `getComputedStyle(span).color === "rgba(0, 0, 0, 0)"` confirming the spans are fully transparent. No console errors. Browser smoke-test assertions documented in examples/screenshots/smoke_test_results.json.

User reported that overlay span positions in the digital form HTML were "way off" (verified by Ctrl+A-selecting spans in the browser).

Diagnosis: not the HTML writer's fault. The DP aligner was pairing form-field text with header bboxes because the LLM emitted text in a non-monotonic reading order on this multi-column layout. Both PDF and HTML/MD outputs inherited the same misaligned (bbox, text) pairs.

Verified with `--dense-mode always` (per-box OCR, no DP):

Default: y=77 "Student Name (Last, First):" - WRONG, top of page
Dense: y=270 "Student Name (Last, First):" - CORRECT, form area

Regenerate all 8 example outputs (4 HTML + 4 MD) with `--dense-mode always --concurrency 5` so the showcase actually showcases the writer at its best.

Re-run the Playwright smoke test and update screenshots:

- output_<stem>.png (normal render, page only)
- output_<stem>_selected.png (with Ctrl+A applied so bboxes show)

The "_selected.png" pair is the auditable one: for output_digital, the highlighted spans now sit cleanly over the form field labels in the body of the page rather than clustering at the top header.

README: add a note recommending `--dense-mode always` paired with `--format html` for forms / dense layouts where DP alignment can mismatch. This is NOT a default-behavior change. The DP path is still default because dense-mode is N times slower (one LLM call per bbox). Users who want best alignment can opt in via the documented flag combo.

User reported residual visual artifact in letter-spacing mode: when Surya's bbox is wider than the visible serif text on the page, the overlay characters (rendered in monospace) extend past where the visible text ends. This is an inherent bbox-vs-rendered-glyph mismatch. None of the writer modes can perfectly fix it without measuring the original PDF's font metrics, but users with strong visual-alignment preferences should be able to switch.

Add `--html-mode` flag to the CLI. Three choices, ranked by what an independent reviewer judged best for visual alignment in side-by-side comparison:

letter-spacing (default): best when bboxes match visible text width; selection extents span the full bbox
full-height: natural monospace width, may overflow bbox horizontally; better visual fidelity when bboxes overshoot the visible text
scaled: shrinks font to fit both dimensions; most compact, smaller selection extents

Routed via `resolve_output_writer(output_path, html_mode=...)` so the CLI controls it without coupling the dispatch helper to a specific mode. The kwarg defaults to None (use HTMLHandler's own default of letter-spacing), so existing callers, including `server.py` and all the existing tests, are unaffected.

Add `examples/output_digital_full-height.html` and `examples/output_digital_scaled.html` so users can visually compare the three modes on the same page without re-running OCR.

README: document the flag with all three modes' trade-offs and the side-by-side example reference.

User asked how much "perfect alignment" improves with the grounded path. Hypothesis: bbox-native VLMs return tighter, glyph-aware boxes than Surya layout detection. Result on Qwen3-VL-4b: actually WORSE than dense-mode for this page.

Reproduce: examples/output_digital_grounded.html generated with `--grounded --model qwen/qwen3-vl-4b`. Qwen3-VL-4b returns a single column-level bbox (~620 px wide) for ALL left-column labels regardless of text length. The HTMLHandler's letter-spacing default then stretches short labels ("Student G#:", "Notes", "Signatures") across the wide bbox, producing the "S t u d e n t G # :" effect.

Mitigation: examples/output_digital_grounded_natural.html generated with `--grounded --model qwen/qwen3-vl-4b --html-mode full-height`. This avoids the stretch by using natural monospace width. Labels look correctly bounded; downside is text on wider bboxes (description / deliverables) overflows past the viewport on narrow displays.

Conclusion: dense-mode (Surya + per-box LLM OCR) remains the best path for forms / multi-column layouts on this stack. Grounded works well when paired with a model that emits per-line bboxes; Qwen3-VL-4b on this prompt does not. The recommended grounded model in the README (qwen/qwen3-vl-8b) was not loaded for this validation; a larger model may behave better.

Captures saved for human review:

output_digital_dense_normal.png: dense, no selection
output_digital_dense_selected.png: dense, Ctrl+A
output_digital_grounded_normal.png: grounded default, no selection
output_digital_grounded_selected.png: grounded default, Ctrl+A (shows the stretch artifact)
output_digital_grounded_natural_selected.png: grounded + full-height, shows mitigation works for short labels but not wide-bbox lines

ahnafnafee · 2026-05-09T17:56:26Z

@milahu Do the latest changes satisfy your expected output?

milahu · 2026-05-16T06:32:19Z

+    parser.add_argument(
+        "--html-mode", dest="html_mode",
+        choices=("letter-spacing", "full-height", "scaled"), default="letter-spacing",
+        help="Span sizing strategy for HTML output (ignored for pdf/md). "
+             "letter-spacing (default): font sized to bbox height, chars "
+             "spread to fill bbox width — best when Surya bboxes match "
+             "visible text width. full-height: font = bbox height, no "
+             "letter-spacing — text uses natural monospace width and may "
+             "overflow the bbox right edge. scaled: shrinks font so text "
+             "fits both bbox dimensions — most compact, but smaller "
+             "selection extents. Try the alternatives if letter-spacing "
+             "produces visible overlay characters past the underlying text.",
+    )


the default should be html_mode="scaled"
html_mode="letter-spacing" can make the text hard to read
when letters overlap with a negative letter-spacing

with html_mode="scaled"
at least i can zoom in to read the text

milahu · 2026-05-16T06:34:02Z

+  /* Keep invisibility intact even when the browser is in dark mode —
+     a previous draft applied `filter: invert()` to the page div, which
+     interacts badly with `color: transparent` in Chromium and renders
+     the glyph outlines as a faint inverted color. The OCR HTML preserves
+     the source page's appearance regardless of OS theme; users who want
+     dark theming for documents can use a browser extension. */


this comment is not helpful in the HTML output
it should be only in the python source

milahu · 2026-05-16T06:34:59Z

+        b64 = base64.b64encode(img_bytes).decode("ascii")
+        data_url = f"data:image/jpeg;base64,{b64}"
+        out.write(
+            f'<div class="page" data-page="{page_num + 1}" '
+            f'style="width:{_num(width)}px;height:{_num(height)}px;'
+            f"background-image:url('{data_url}')\">\n"
+        )


by default, it should use external image files from HTML
and there should be a CLI option
to enable embedding base64-encoded images into HTML
something like --html-embed-images or --html-inline-images

why?

base64 encoding increases the file size by about 33%, which is not wanted in most cases

the image files exist as input files as one image per page

exception: when the input is a PDF file (or TIFF file) with multiple pages, then it is not possible to use the input file as image source in HTML. then each page image should be stored side by side with the output HTML file, only with a different file extension, for example some-page.html and some-page.jpg
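
For reference, the inlining overhead can be checked directly: base64 maps every 3 input bytes to 4 output characters, so an inlined image grows by roughly one third. The helper name below is made up for this illustration.

```python
import base64
import os


def base64_overhead(n_bytes: int) -> float:
    """Return the fractional size increase from base64-encoding n_bytes."""
    raw = os.urandom(n_bytes)  # stand-in for JPEG bytes
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw) - 1
```

For a 300,000-byte image this gives 400,000 encoded characters, before any compression applied by the transport.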

External images should be default now

@milahu

- Default --html-mode = scaled (was letter-spacing). Negative letter-spacing was rendering some labels as overlapping smears; scaled stays legible at any zoom level.
- Strip the multi-line dark-mode rationale out of _PAGE_CSS so it no longer ships inside every HTML output; rationale kept as a tight Python comment above the constant.
- External page-image references are now the default. Single-frame browser-native inputs (JPEG/PNG/WebP/AVIF/GIF) are referenced directly via a URL-encoded relative path; PDFs and multi-frame inputs get sidecar JPEGs at <output_stem>_p<N>.jpg next to the HTML. Opt back into the previous single-self-contained behaviour with --html-inline-images / HTMLHandler(inline_images=True).
- server.py pins inline_images=True so the FileResponse remains self-contained (sidecars would not reach the client).
- Page layout switches to container queries: .page is sized in CSS pixels via --page-w/--page-h, with aspect-ratio + container-type: inline-size. Spans use % positioning and cqw font-size so overlay selection extents stay locked to the rasterized image at any zoom. A 5-line inline script measures innerWidth * devicePixelRatio once at load and shrinks each page's CSS width to fit narrow viewports, so subsequent browser zoom-in/zoom-out works in both directions.

Examples regenerated against OlmOCR-2-7B (hybrid) and Qwen3-VL-8B (grounded). Redundant mode-showcase variants and a byte-identical grounded sidecar removed; the grounded HTML now references the shared output_digital_p1.jpg.

Tests: 271 fast + 23 slow. New coverage:

- HTMLHandler external-default (sidecar + direct-reference branches)
- HTMLHandler inline-mode opt-in
- BMP / multi-frame fallback to sidecars
- Page CSS uses container-relative sizing + fit script

Closes #10 review comments from @milahu.
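
The direct-reference branch might be sketched like this. `image_reference` and `BROWSER_NATIVE` are hypothetical names, the extension set is taken from the list above, and multi-frame detection (which would also force a sidecar) is left out.

```python
import os
from typing import Optional
from urllib.parse import quote

# Extensions assumed to be directly displayable by browsers.
BROWSER_NATIVE = {".jpg", ".jpeg", ".png", ".webp", ".avif", ".gif"}


def image_reference(input_path: str, output_html: str) -> Optional[str]:
    """Return a URL-encoded path from the HTML file to the input image.

    Returns None when the input is not browser-native (PDF, TIFF, BMP),
    meaning the caller would have to write a sidecar JPEG instead.
    """
    ext = os.path.splitext(input_path)[1].lower()
    if ext not in BROWSER_NATIVE:
        return None
    html_dir = os.path.dirname(output_html) or "."
    rel = os.path.relpath(input_path, start=html_dir)
    # Use forward slashes in URLs and percent-encode spaces etc.
    return quote(rel.replace(os.sep, "/"))
```

`quote` leaves `/` unescaped by default, so relative paths such as `../in/a.png` survive intact while a space becomes `%20`.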

milahu · 2026-05-16T18:29:24Z

+(function(){
+  var px = window.innerWidth * (window.devicePixelRatio || 1);
+  document.querySelectorAll('div.page').forEach(function(p){
+    var w = parseFloat(p.style.getPropertyValue('--page-w'));
+    if (!isFinite(w) || w <= 0) return;
+    var s = Math.min(1, px / w);
+    if (s < 1) p.style.width = (w * s) + 'px';
+  });
+})();


this fails for window.devicePixelRatio != 1
(when the display manager zoom level is not 100%)

window.innerWidth = document.body.clientWidth + scrollbarWidth

clientWidth - 1 to account for rounding errors
otherwise i still see a horizontal scrollbar
(my desktop zoom is 150%)

simplified function

(function(){
  var px = document.body.clientWidth - 1;
  for (const p of document.querySelectorAll('div.page')) {
    var w = parseFloat(p.style.getPropertyValue('--page-w'));
    if (!isFinite(w) || w <= 0 || w <= px) continue;
    p.style.width = px + 'px';
  }
})();

milahu · 2026-05-16T18:44:29Z

+# No `filter: invert()` on the page div: it interacts with
+# `color: transparent` in Chromium and renders glyph outlines as a faint
+# inverted color. The output preserves the source page's appearance; OS
+# dark theming is a browser-extension concern.


darkreader does not invert images...

in my code i have this style
which just works, no "glyph outlines"

@media (prefers-color-scheme: dark) { div.page { filter: invert() hue-rotate(180deg); } }

the hue-rotate(180deg) is needed
to convert darkblue to lightblue (instead of yellow)

"no darkmode" is also a stupid limitation of chrome's PDF reader
which makes it impossible for me to read PDF documents at night

I think this could be passed in as a CLI optional parameter instead of being the default

this is a special case, because we have images of text

the darkreader extension does not invert images, because darkreader does not know the difference between "images of text" and "images of cats", because there is no magic style class like img.imageoftext.lightmode, so i need the stylebot extension to invert images of text...

with the principle of least surprise, i would expect that when i set my display manager's color theme to dark (@media (prefers-color-scheme: dark)) then that should also invert images of text

in short, this should be default on, with an option to turn it off

no magic style class like img.imageoftext.lightmode

some book authors follow the bad taste of having some pages inverted (white text on black background) (example with L=0.02), so we could compute the lightness of each page (example output), to ensure that in our dark mode, the originally dark pages are not inverted, so all pages are dark. books can also have pages with an average lightness near 50% (gray) (example with L=0.35), so pages should be inverted only above some threshold, maybe 0.7 by default
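
The per-page threshold idea could look roughly like the sketch below. It is only an illustration: the lightness measure (Rec. 709 luma weights applied to raw 8-bit values) is an assumption and may not match the L values in the linked examples, and both function names are hypothetical.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def mean_lightness(pixels: Iterable[RGB]) -> float:
    """Average lightness in [0, 1] using Rec. 709 luma weights.

    Applying the weights to gamma-encoded values is a simple
    approximation, not a colorimetrically exact lightness.
    """
    total = 0.0
    count = 0
    for r, g, b in pixels:
        total += (0.2126 * r + 0.7152 * g + 0.0722 * b) / 255.0
        count += 1
    return total / count if count else 0.0


def should_invert(pixels: Iterable[RGB], threshold: float = 0.7) -> bool:
    """Invert in dark mode only when the page is predominantly light."""
    return mean_lightness(pixels) > threshold
```

Under this rule a white-background scan is inverted, while an already dark page (L near 0.02) and a mid-gray page (L near 0.35) are left alone.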

Most tools (PDF readers, browsers) do not invert scanned page images by default, so automatically inverting them could surprise many users. I believe making it an opt-in CLI flag gives the end user more control.

I've added it as an opt-in CLI flag in the latest push (1d13516).

Usage:

uv run local-llm-pdf-ocr input.pdf --format html --html-invert-dark

This adds the following CSS block to the generated HTML, which activates only when the browser/OS is in dark mode:

@media (prefers-color-scheme: dark) { div.page { filter: invert() hue-rotate(180deg); } }

Without --html-invert-dark, the page image renders as-is in all colour schemes (just the body background and outline adjust for dark mode). This keeps the default behaviour unsurprising for users with scanned colour documents or photos, while making it easy to opt in for the white-background-at-night use case you described.

I also generated an example with the flag applied: examples/output_digital_invert_dark.html — you can download it and test in your browser with dark mode enabled. The hue-rotate(180deg) handles the dark-blue → light-blue remapping you mentioned.

milahu · 2026-05-16T18:55:31Z

+            return
+
+        for page_num, img_bytes, w, h in self._rasterize_pages(input_path, dpi):
+            sidecar_name = f"{out_stem}_p{page_num + 1}.jpg"


for single-page input files, there should be no _p1 suffix

so...

rasterized_pages = list(self._rasterize_pages(input_path, dpi))  # list() so len() also works on a generator
for page_num, img_bytes, w, h in rasterized_pages:
    if len(rasterized_pages) == 1:
        sidecar_name = f"{out_stem}.jpg"
    else:
        sidecar_name = f"{out_stem}_p{page_num + 1}.jpg"

milahu · 2026-05-16T19:00:42Z

+        for page_num, image_url, width, height in pages:
+            self._render_page(
+                out, page_num, image_url, width, height,
+                pages_data.get(page_num, []),
+            )


have you tested this with multi-page input files?
are the pages rendered properly, like in a PDF reader?

if one page has less width than other pages
it should be centered horizontally with

body { text-align: center; }

or

div.page { margin: auto; }

I will test this out and report back

@milahu

…ersion

Adds a CLI flag that injects CSS `filter: invert() hue-rotate(180deg)` into the HTML output, activated only under `prefers-color-scheme: dark`. Without the flag, page images render as-is in all colour schemes.

Addresses PR #10 feedback from @milahu: dark-mode page inversion is useful for reading scanned white-background documents at night, but should be opt-in since most tools do not invert scanned images by default.

Also regenerates all example outputs against current code.

Files:
- src/pdf_ocr/core/html.py: `invert_dark` param on HTMLHandler, `_DARK_INVERT_CSS` constant, conditional injection in `_render_html`
- src/pdf_ocr/output.py: `html_invert_dark` kwarg on `resolve_output_writer`, forwarded to HTMLHandler
- src/pdf_ocr/cli.py: `--html-invert-dark` argument + wiring
- README.md / CLAUDE.md: documented the new flag
- examples/: regenerated all outputs + new invert-dark example

milahu · 2026-05-21T10:30:08Z

+        rasterized = list(self._rasterize_pages(input_path, dpi))
+        for page_num, img_bytes, w, h in rasterized:
+            if len(rasterized) == 1:
+                sidecar_name = f"{out_stem}.jpg"
+            else:
+                sidecar_name = f"{out_stem}_p{page_num + 1}.jpg"


nitpick: page_num + 1 should be zero-padded
so the image files are always sorted correctly

rasterized = list(self._rasterize_pages(input_path, dpi))
pnw = len(str(len(rasterized)))  # page number width
for page_num, img_bytes, w, h in rasterized:
    if len(rasterized) == 1:
        sidecar_name = f"{out_stem}.jpg"
    else:
        sidecar_name = f"{out_stem}_p{str(page_num + 1).zfill(pnw)}.jpg"

nitpick:
page_num should be renamed to page_idx
(1-based versus 0-based)
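Both nitpicks, together with the earlier single-page rule, can be captured in one small pure function. This is a sketch under assumptions: `sidecar_name` as a standalone helper and its `page_count` parameter are hypothetical, not the handler's actual structure.

```python
def sidecar_name(out_stem: str, page_idx: int, page_count: int) -> str:
    """Build the sidecar image name for a 0-based page index.

    Single-page inputs get no page suffix; multi-page inputs get a 1-based,
    zero-padded page number so the files sort correctly by name.
    """
    if page_count == 1:
        return f"{out_stem}.jpg"
    width = len(str(page_count))  # digits needed for the largest page number
    return f"{out_stem}_p{str(page_idx + 1).zfill(width)}.jpg"
```

For a 12-page input this yields `scan_p01.jpg` through `scan_p12.jpg`, while a single-page input yields just `scan.jpg`.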

out of scope:
ideally we should avoid
list(self._rasterize_pages(input_path, dpi))
currently we need that only for len(rasterized)
but that forces us to buffer the whole input in memory
which fails on large inputs or low memory...
ideally, the whole pipeline should be streaming

related: streaming PDF writer

add PDF stream writer pymupdf/PyMuPDF#4968
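One way to detect the single-page case without buffering every page is a two-item lookahead over the page iterator. The sketch below is an assumption-laden illustration, not part of this PR: `flag_single` is a hypothetical helper, and it only covers the "is this the only page?" question. Zero-padding would still need the total page count from some other source.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def flag_single(items: Iterable[T]) -> Iterator[Tuple[T, bool]]:
    """Yield (item, is_only_item) pairs while holding at most two items."""
    it = iter(items)
    try:
        first = next(it)
    except StopIteration:
        return  # empty input: nothing to yield
    try:
        second = next(it)
    except StopIteration:
        yield first, True  # exactly one item in the whole input
        return
    yield first, False
    yield second, False
    for item in it:
        yield item, False
```

A caller could then loop with `for page, is_single in flag_single(pages):` and pick the unsuffixed name only when `is_single` is true.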

milahu · 2026-05-21T11:38:52Z

+</div>
+<script>
+(function(){
+  var px = document.body.clientWidth - 1;


all examples/*.html still have the old version

$ rg -F 'var px = window.innerWidth' examples/ | wc -l
5
$ rg -F 'var px = document.body.clientWidth' examples/ | wc -l
0
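The same staleness check could be automated, for example as a test that fails when regenerated examples still contain the old snippet. This is a minimal sketch using only the standard library; `count_matching_lines` is a hypothetical helper that mirrors what `rg -F ... | wc -l` counts (matching lines, not files).

```python
from pathlib import Path


def count_matching_lines(root: str, needle: str, pattern: str = "*.html") -> int:
    """Count lines containing the literal needle in files under root."""
    total = 0
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        total += sum(1 for line in text.splitlines() if needle in line)
    return total
```

A check along the lines of `count_matching_lines("examples", "var px = window.innerWidth") == 0` would then flag example files that were not regenerated.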

milahu and others added 19 commits May 5, 2026 12:54

raise exceptions from CLI

f775df5

add output format HTML - DRAFT

8806794

fix: last line has zero height

8be7445

disable letter-spacing by default

1f5c433

invert colors in darkmode

835d88b

add option use_full_height

9fd4750

fix data url

10078d8

Merge remote-tracking branch 'origin/main' into fix/issue-6-html-md-export

4d80883

ahnafnafee marked this pull request as draft May 9, 2026 17:10

ahnafnafee added 4 commits May 9, 2026 13:19

ahnafnafee marked this pull request as ready for review May 9, 2026 17:55

ahnafnafee assigned ahnafnafee and unassigned ahnafnafee May 9, 2026

milahu mentioned this pull request May 9, 2026

add output format HTML #8

Closed

milahu reviewed May 16, 2026


ahnafnafee requested a review from milahu May 16, 2026 17:01

milahu reviewed May 16, 2026


ahnafnafee added 2 commits May 17, 2026 10:12

fix: addressed feedback on multipage and bg inversion

5b5ab69

ahnafnafee requested a review from milahu May 18, 2026 14:27

milahu reviewed May 21, 2026

