feat: HTML + Markdown output formats (closes #6, builds on #8) #10

Open
ahnafnafee wants to merge 26 commits into main from fix/issue-6-html-md-export

@ahnafnafee
Owner

@ahnafnafee ahnafnafee commented May 9, 2026

Summary

Closes #6 by adding HTML and Markdown as output formats alongside PDF. Builds on @milahu's draft PR #8: milahu's seven commits are preserved at their original SHAs at the branch base, with Co-authored-by trailers on the relocated logic.

What you can now do

# Auto-named per format
uv run local-llm-pdf-ocr scan.pdf --format html      # → scan_ocr.html
uv run local-llm-pdf-ocr scan.pdf --format md        # → scan_ocr.md

# Or pick by extension
uv run local-llm-pdf-ocr scan.pdf out.html
uv run local-llm-pdf-ocr scan.pdf notes.md

# HTML with dark-mode page inversion (opt-in)
uv run local-llm-pdf-ocr scan.pdf --format html --html-invert-dark

# Server: same, via the /process form field
curl -F file=@scan.pdf -F client_id=x -F format=html http://localhost:8000/process

The web UI (server.py) gets a small format dropdown beneath the upload zone.

What changed

Architecture

  • src/pdf_ocr/core/html.py (new): HTMLHandler with PDF and image input support (PR #8 only handled images), in-memory StringIO build, base64-inlined page images so the file is self-contained. Supports three sizing modes (scaled default, letter-spacing, full-height) and an opt-in invert_dark flag that injects filter: invert() hue-rotate(180deg) under prefers-color-scheme: dark.
  • src/pdf_ocr/core/markdown.py (new): MarkdownHandler emits a # OCR output header, ## Page N per page, and one block per non-empty box in reading order. No paragraph-break heuristic (multi-column layouts break gap-based heuristics).
  • src/pdf_ocr/core/_layout.py (new): extracted is_full_page_fallback and split_multi_line_bbox helpers shared by every writer. Both PDF and HTML now route the aligner's [0,0,1,1]+\n fallback bbox identically, and split \n-joined visual lines into per-line sub-spans.
  • src/pdf_ocr/output.py (new): resolve_output_writer, media_type_for, suffix_for_format, format_from_path, SUPPORTED_FORMATS. Single source of truth for format dispatch; CLI and server both consume it.
  • src/pdf_ocr/cli.py: adds --format {pdf,html,md} with extension-wins precedence and the --html-invert-dark opt-in flag; passes the dispatched writer to OCRPipeline(output_writer=…). Also reverts PR #8's raise; sys.exit(1) regression in main().
  • src/pdf_ocr/server.py: adds format: str = Form("pdf") to /process; wires Content-Type and download filename through media_type_for/suffix_for_format. Validates format and returns 400 on unsupported values.

--html-invert-dark flag (addresses PR feedback)

Added in response to @milahu's feedback about dark-mode page inversion. The CSS filter: invert() hue-rotate(180deg) on div.page is useful for reading scanned white-background documents at night, but is now opt-in since most PDF readers and browsers do not invert scanned images by default.

  • HTMLHandler(invert_dark=True) injects an additional @media (prefers-color-scheme: dark) block
  • CLI: --html-invert-dark
  • resolve_output_writer(html_invert_dark=True) passes it through
  • Example output: examples/output_digital_invert_dark.html
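A minimal sketch of how an opt-in flag like this could gate the extra rule. The names `build_page_css`, `BASE_CSS`, and `INVERT_DARK_CSS` are hypothetical and the base rule is a placeholder; only the `@media` block itself is taken from this PR.

```python
# Placeholder base rule; the real stylesheet in core/html.py is not reproduced here.
BASE_CSS = "div.page { position: relative; }\n"

# The dark-mode rule quoted in this PR.
INVERT_DARK_CSS = (
    "@media (prefers-color-scheme: dark) {\n"
    "  div.page {\n"
    "    filter: invert() hue-rotate(180deg);\n"
    "  }\n"
    "}\n"
)


def build_page_css(invert_dark: bool = False) -> str:
    """Return the page CSS, appending the dark-mode rule only when requested."""
    css = BASE_CSS
    if invert_dark:
        css += INVERT_DARK_CSS
    return css
```

With the flag off, the generated stylesheet contains no `filter` rule at all, which matches the "without flag" item in the test plan.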

Alignment quality — scaled default

The HTML invisible-text overlay defaults to scaled mode — font shrinks to fit both bbox dimensions, staying legible at any zoom level. Two alternative modes (letter-spacing, full-height) are available via HTMLHandler(mode=…) / --html-mode.

PR #8 cleanup

| Issue | Resolution |
| --- | --- |
| _embed_structured_text_html raised NotImplementedError for PDF inputs | Fixed: HTMLHandler rasterizes per page like PDFHandler |
| Stray print(f"output_ext: ...") debug statement | Removed |
| Multiple # TODO remove blocks of unreachable code | Removed |
| raise; sys.exit(1) in cli.py | Reverted |
| HTML logic embedded in PDFHandler | Relocated to core/html.py per author's recommendation |

Tests

113 new tests, all on the fast tier (no LLM, no Surya):

  • test_layout.py — 14 tests: full-page-fallback detection, multi-line bbox split
  • test_html_handler.py — 15 tests: doctype + title, base64 inline, all sizing modes, full-page fallback, multi-line split, edge cases
  • test_markdown_handler.py — 7 tests: structure, page order, reading order, empty-box skip
  • test_output_dispatch.py — 28 tests: extension inference, suffix/media-type lookup, writer resolution
  • test_cli.py — 9 tests: resolve_output_path precedence, --format parser
  • test_server.py — 8 tests: format dispatch, download naming, error handling

Total suite: 294 tests passing (271 fast + 23 slow).

Documentation

  • README: --html-invert-dark in the CLI options table and examples section
  • CLAUDE.md: updated commands, HTMLHandler docstring with invert_dark parameter

Authorship preservation

milahu's seven commits (f775df5 to 10078d8) remain at their original SHAs at the branch base: no rebase, squash, amend, or force-push. Cleanup commits on top use Co-authored-by: Milan Hauth <milahu@milahu.duckdns.org> wherever the new code derives from theirs.

Test plan

  • All 294 tests pass (uv run pytest)
  • CLI --help shows --html-invert-dark
  • Without flag: no filter: invert() in HTML output
  • With --html-invert-dark: filter: invert() hue-rotate(180deg) present under @media (prefers-color-scheme: dark)
  • All example outputs regenerated against current code

milahu and others added 19 commits May 5, 2026 12:54
PR #8 added `raise` immediately before `sys.exit(1)` in main()'s outer
exception handler, making the sys.exit unreachable. The exception text
is already surfaced via console.print in run() (lines 158, 193), so
re-raising on top of that produces a duplicate Python traceback that
adds no diagnostic value for normal user errors.

Restore the original behavior: print the friendly error, exit 1.
For genuine traceback debugging, --verbose / -v still enables DEBUG
logging across the pipeline.
PR #8 (DRAFT) embedded a 378-line HTML emission path inside
PDFHandler.embed_structured_text by sniffing the output extension.
The implementation has working ideas but several blockers for merge:

- _embed_structured_text_html raises NotImplementedError on PDF
  inputs — only image inputs work
- a stray `print(f"output_ext: ...")` debug statement on every call
- multiple `# TODO remove` blocks of unreachable code after `return`
- HTML rendering coupled to PDFHandler in violation of separation
  of concerns (the PR author explicitly suggests moving it to
  src/pdf_ocr/core/html.py — see #8 description)

This commit reverts the HTML logic out of pdf.py without losing the
work. The next commit reintroduces the equivalent functionality in a
clean, dedicated `src/pdf_ocr/core/html.py` module that:

- supports PDF inputs (mirrors PDFHandler's PDF-rasterize path)
- supports image inputs (single-frame and multi-frame TIFF)
- defaults to letter-spacing mode for accurate selection extents
- adds Markdown export to fully address #6

Authorship of the HTML logic is preserved on that subsequent commit
via Co-authored-by: Milan Hauth <milahu@milahu.duckdns.org>.

milahu's seven commits remain at their original SHAs at the branch
base — this is a forward commit, not a history rewrite.
Two bbox-handling helpers were inlined in PDFHandler._draw_invisible_text
but are needed by every output writer (PDF, HTML, future formats):

- is_full_page_fallback: detect the aligner's [0,0,1,1] + '\n' fallback
- split_multi_line_bbox: split a bbox containing '\n'-joined lines into
  per-line sub-bboxes with proportional vertical slices

Move them to a new pure-helper module so HTMLHandler (added next)
reuses the same logic instead of duplicating it. PDFHandler is updated
to call the helpers; behavior is identical (verified by all 166 fast
tests still passing).

Add 14 unit tests covering both helpers' contract: tolerance window
on the fallback detector, proportional vertical splits, empty-line
dropping, single-line short-circuit, and rect-list aliasing safety.
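The two helpers described above might look roughly like the following. This is a sketch under assumptions, not the code in `core/_layout.py`: the signatures, the `tol` default, the requirement that fallback text contains a newline, and the choice of equal-height slices for the "proportional" split are all guesses for illustration.

```python
from typing import List, Sequence, Tuple

BBox = Sequence[float]  # [x0, y0, x1, y1], normalized to the 0..1 page space


def is_full_page_fallback(bbox: BBox, text: str, tol: float = 0.01) -> bool:
    """Detect a bbox covering (almost) the whole page that carries newline-joined text."""
    x0, y0, x1, y1 = bbox
    covers_page = (
        abs(x0) <= tol and abs(y0) <= tol
        and abs(x1 - 1.0) <= tol and abs(y1 - 1.0) <= tol
    )
    return covers_page and "\n" in text


def split_multi_line_bbox(bbox: BBox, text: str) -> List[Tuple[List[float], str]]:
    """Split a bbox holding newline-joined lines into one sub-bbox per non-empty line."""
    x0, y0, x1, y1 = bbox
    lines = [line for line in text.split("\n") if line.strip()]  # drop empty lines
    if not lines:
        return []
    if len(lines) == 1:
        # Single-line short-circuit: copy the bbox so callers cannot alias the input.
        return [(list(bbox), lines[0])]
    slice_h = (y1 - y0) / len(lines)  # assumption: each line gets an equal vertical slice
    return [
        ([x0, y0 + i * slice_h, x1, y0 + (i + 1) * slice_h], line)
        for i, line in enumerate(lines)
    ]
```

Keeping both functions free of any PDF or HTML types is what lets every writer share them.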
PR #8's prototype was hooked into PDFHandler and only worked for image
inputs (PDFs raised NotImplementedError). This commit reintroduces the
HTML output as a dedicated `pdf_ocr.core.html.HTMLHandler` class that:

- supports BOTH PDF and image inputs (PDFs are rasterized per page like
  PDFHandler does, then inlined as base64 JPEG data URLs)
- builds the document in-memory via io.StringIO, then writes once —
  decouples I/O from iteration and makes assertions trivial in tests
- defaults to `letter-spacing` mode (the proven approach from milahu's
  archive-hocr-tools example; PR #8's "disable letter-spacing by
  default" commit was reverting that). Selection extents now span the
  full bbox horizontally — the practical ceiling for invisible-text
  alignment without per-glyph positions from OCR.
- offers two alternative sizing modes (`full-height`, `scaled`)
  selected via the `mode=` constructor arg
- delegates the [0,0,1,1] full-page fallback and `\n`-joined multi-line
  bbox handling to `core/_layout.py` so HTML and PDF outputs treat
  these edge cases identically
- always inlines page images as base64 data URLs — no relative paths
  that would break if the user moves the HTML, no sidecar files

The class signature mirrors `PDFHandler.embed_structured_text` so it
slots into `OCRPipeline(output_writer=...)` with no other plumbing.

Tests (15 new, all passing) drive the writer end-to-end via the
ground-truth fixtures and synthesized images — no LLM, no Surya. They
cover the three sizing modes, both edge cases, multi-frame TIFF, AVIF,
PDF inputs, and HTML escaping of <>&.

Co-authored-by: Milan Hauth <milahu@milahu.duckdns.org>
Issue #6 asks for both Markdown AND HTML export; PR #8 only added HTML.
This commit completes the issue by adding `pdf_ocr.core.markdown.MarkdownHandler`
with the same `OutputWriter` signature as PDFHandler / HTMLHandler.

Document shape:
- `# OCR output: <input_filename>` top-level header
- `## Page N` per page (sorted numerically by page index)
- One block per non-empty box, in reading order, separated by blank lines

No paragraph-break heuristic. A vertical-gap heuristic
("gap > 1.5x line-height = new paragraph") misfires on multi-column
layouts because y-coordinates jump backward at each column break;
the broken cases are worse than just emitting one block per box. Users
who want flowed text can post-process trivially.

Tests (7 new, all passing) cover the document shape, page sorting,
within-page reading-order preservation, empty-box short-circuit, and
end-to-end via the digital ground-truth fixture.
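The document shape listed above can be sketched in a few lines. The function name `render_markdown` and its input type (a mapping from 0-based page index to box texts already in reading order) are assumptions for illustration; the real handler follows the `OutputWriter` signature instead.

```python
from typing import Dict, List


def render_markdown(input_filename: str, pages: Dict[int, List[str]]) -> str:
    """Build the Markdown document: title, one heading per page, one block per box."""
    parts = [f"# OCR output: {input_filename}"]
    for page_index in sorted(pages):  # numeric sort, so page 10 follows page 9
        parts.append(f"## Page {page_index + 1}")  # assumption: indices are 0-based
        for text in pages[page_index]:
            text = text.strip()
            if text:  # skip empty boxes
                parts.append(text)
    # Blank line between every block; no paragraph-merging heuristic.
    return "\n\n".join(parts) + "\n"
```

For example, `render_markdown("scan.pdf", {1: ["b"], 0: ["a", "", "c"]})` yields a `## Page 1` section with blocks `a` and `c`, followed by a `## Page 2` section with `b`.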
Single source of truth for mapping output paths -> writer + suffix +
HTTP media type. Both the CLI and the FastAPI server consume this so
adding a new output format requires editing one module.

API:
- format_from_path(path) -> "pdf"|"html"|"md"
- suffix_for_format(fmt) -> ".pdf"|".html"|".md" (raises on unknown)
- media_type_for(path) -> "application/pdf"|"text/html"|"text/markdown"
- resolve_output_writer(path) -> bound .embed_structured_text method
  matching the path's extension (defaults to PDF for unknown extensions)
- SUPPORTED_FORMATS -> ("pdf", "html", "md") for CLI --format choices

Re-export the new symbols and the HTML / Markdown handler classes from
the package root so users can `from pdf_ocr import resolve_output_writer`.

Tests (28 new, all passing) cover format inference (case-insensitive,
nested paths, alt extensions like .htm and .markdown), suffix lookup,
media-type lookup, dispatch returning the right writer class, and
the CLI-facing canonical format ordering.
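A rough sketch of the lookup half of this API, assuming plain dictionaries behind each function. The table names, the `ValueError` for unknown formats, and the decision to default to PDF inside `format_from_path` are assumptions; `resolve_output_writer` is omitted because it depends on the handler classes.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("pdf", "html", "md")

# Hypothetical lookup tables; alternate extensions follow the test description above.
_EXT_TO_FORMAT = {
    ".pdf": "pdf",
    ".html": "html", ".htm": "html",
    ".md": "md", ".markdown": "md",
}
_SUFFIX = {"pdf": ".pdf", "html": ".html", "md": ".md"}
_MEDIA_TYPE = {"pdf": "application/pdf", "html": "text/html", "md": "text/markdown"}


def format_from_path(path) -> str:
    """Infer the format from the extension, case-insensitively; unknown means PDF."""
    return _EXT_TO_FORMAT.get(Path(path).suffix.lower(), "pdf")


def suffix_for_format(fmt: str) -> str:
    """Return the canonical suffix; raise on an unknown format name."""
    if fmt not in _SUFFIX:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {SUPPORTED_FORMATS}")
    return _SUFFIX[fmt]


def media_type_for(path) -> str:
    """Return the HTTP media type matching the path's inferred format."""
    return _MEDIA_TYPE[format_from_path(path)]
```

Because every table is keyed by the same three format names, adding a format in this sketch means adding one entry to each table in a single module.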
Add a `--format {pdf,html,md}` CLI flag and route output through the
shared `resolve_output_writer` dispatch. Precedence rules:

- Explicit `output` path → its extension picks the writer (so
  `input.pdf out.md --format html` produces Markdown — extension wins)
- Explicit `output` with unknown extension → falls through to pdf
- No `output`, with --format → auto-name uses --format's suffix
  (e.g. `--format html` → `<stem>_ocr.html`)
- No `output`, no --format → auto-name `<stem>_ocr.pdf` (unchanged)

Heavy imports stay lazy; argparse `--help` is still fast.

Tests (9 new) cover all five precedence cases, the parser's choices
list (pdf/html/md), invalid-value rejection, and that the existing
positional `output` argument still parses unchanged.
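The precedence rules above reduce to a small function. This is a sketch: `resolve_output_path` is named in the test list, but its real signature is not shown in this PR, so the parameters here are assumed, as is placing the auto-named file next to the input.

```python
from pathlib import Path
from typing import Optional

_SUFFIX = {"pdf": ".pdf", "html": ".html", "md": ".md"}


def resolve_output_path(input_path: str, output: Optional[str], fmt: Optional[str]) -> Path:
    """Pick the output path; an explicit path always wins over --format."""
    if output:
        # The extension of this path later selects the writer (unknown means PDF).
        return Path(output)
    suffix = _SUFFIX[fmt or "pdf"]  # no --format keeps the original PDF behavior
    src = Path(input_path)
    return src.with_name(f"{src.stem}_ocr{suffix}")
```

Under these assumptions, `resolve_output_path("input.pdf", "out.md", "html")` returns `out.md`, so the Markdown writer is chosen even though `--format html` was passed.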
Add a `format` form field to the FastAPI `/process` endpoint, with
the same precedence semantics as the CLI:

- `format=pdf` (default) → existing behavior, application/pdf response
- `format=html` → text/html response, file named `ocr_<stem>.html`
- `format=md` → text/markdown response, file named `ocr_<stem>.md`
- unsupported value → HTTP 400 with the list of accepted formats

The temp output filename, response media type, and download filename
all derive from `pdf_ocr.output.{suffix_for_format,media_type_for}`.
The download filename strips the original extension so a .pdf input
with format=html produces `ocr_scan.html`, not `ocr_scan.pdf.html`.

Web UI: add an "Output format" select beneath the upload zone, with
three options matching the API. The chosen value is appended to the
form data before /process is called, and the download filename uses
the matching suffix.

Tests (8 new) drive the endpoint via fastapi.testclient with a stubbed
OCRPipeline (no Surya, no LLM). They cover the four format paths,
the 400 response for unknown formats, the download-filename suffix
rule, and a smoke test of the unchanged `/` and `/text/{job_id}` routes.
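The validation and download-naming rule can be illustrated without FastAPI. In this sketch `download_name` is a hypothetical helper; the server would translate the `ValueError` into the HTTP 400 response described above.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("pdf", "html", "md")
_SUFFIX = {"pdf": ".pdf", "html": ".html", "md": ".md"}


def download_name(upload_filename: str, fmt: str) -> str:
    """Build the download filename, replacing the upload's extension."""
    if fmt not in SUPPORTED_FORMATS:
        # The endpoint would map this to HTTP 400 with the accepted formats listed.
        raise ValueError(
            f"unsupported format {fmt!r}; expected one of {', '.join(SUPPORTED_FORMATS)}"
        )
    stem = Path(upload_filename).stem  # "scan.pdf" -> "scan"
    return f"ocr_{stem}{_SUFFIX[fmt]}"
```

Using the stem rather than the full name is what avoids `ocr_scan.pdf.html`.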
README:
- Update the "Searchable Outputs" feature bullet to mention all three
  formats (PDF, HTML, Markdown) and the dispatch mechanism (extension
  inference or --format flag)
- Add `--format {pdf,html,md}` to the CLI options table; extend the
  `output` row to describe extension inference
- Add CLI examples for HTML and Markdown output (auto-named and
  explicit-path forms; calls out HTML's base64 inlining size implication)
- Add a "Contributors" section using the contrib.rocks badge — the
  avatar grid is regenerated automatically from GitHub's contributors
  API on every render, so the section stays current without manual
  edits as new PRs land. Includes a one-line "how it works" pointer.

CLAUDE.md:
- Add `--format html` and `out.md` examples to the Commands block
- Update the OCRPipeline output_writer row to describe the new
  resolve_output_writer dispatch
- Add HTMLHandler, MarkdownHandler, and the shared _layout helpers to
  the Core classes table
- Update test count in the testing section to match the new suite size
- Cross-reference resolve_output_writer in the Extension points note
Reference outputs from running the actual CLI against examples/*.pdf
and examples/image.png with LM Studio + allenai/olmocr-2-7b. Lets
reviewers inspect the new HTML and Markdown writers without setting
up an LLM. Sizes:

  examples_ocr/digital_ocr.html       461 KB  (1 page, digital text)
  examples_ocr/digital_ocr.md           3 KB
  examples_ocr/handwritten_ocr.html   255 KB  (1 page, handwriting)
  examples_ocr/handwritten_ocr.md     571  B
  examples_ocr/hybrid_ocr.html        320 KB  (1 page, mixed digital + handwritten)
  examples_ocr/hybrid_ocr.md          836  B
  examples_ocr/image_ocr.html         510 KB  (PNG image input)
  examples_ocr/image_ocr.md           364  B

Open any .html file in a browser to verify:
  - The page image renders as the background
  - Selecting text returns the OCR'd content
  - Browser Find (Ctrl+F) lands on the right region
  - Dark mode (system preference) inverts the page colors

Generated with: uv run local-llm-pdf-ocr examples/<file> examples_ocr/<stem>_ocr.<ext>
(default settings: --dpi 200, refine on, hybrid path, letter-spacing mode).
Browser smoke test of every HTMLHandler output via Playwright. All
four files rendered correctly with no JavaScript console errors:

  digital_ocr.html      — 30 overlay spans, GMU CS 701 form
  handwritten_ocr.html  — 15 overlay spans, dark notebook + handwriting
  hybrid_ocr.html       — 15 overlay spans, mixed-media intake form
  image_ocr.html        — 19 overlay spans, German handwritten chart

The `<div class="page">` background image renders, the
`<span class="line">` overlays sit invisibly on top, and selection
extents track the visible text — the "perfect alignment" goal of
issue #6.

Screenshots are committed to examples_ocr/screenshots/ so PR reviewers
can inspect without setting up an LLM, mirroring the existing
examples/output_*.pdf convention for the searchable PDF format.
@ahnafnafee ahnafnafee marked this pull request as draft May 9, 2026 17:10
ahnafnafee added 4 commits May 9, 2026 13:19
…s/ (#6)

User-reported bug from PR #10's first round of example outputs:

- Page rendered DARK on systems with `prefers-color-scheme: dark`
  because of `div.page { filter: invert() hue-rotate(180deg); }`
- The same filter caused `color: transparent` overlay spans to render
  as faintly visible cyan glyph outlines in Chromium, defeating the
  purpose of an invisible OCR text layer
- The "alignment is way off" perception followed from the spans being
  visible — bbox positions were correct, but glyph-by-glyph alignment
  between the browser's monospace font and the PDF's serif font is
  not (and cannot be) achievable without per-glyph OCR positions

Fix: remove the dark-mode filter rule entirely. The OCR HTML now
preserves the source page's appearance regardless of OS theme. Users
who want dark theming for documents can use a browser extension —
that's the right place to make a global appearance decision, not in
a per-document artifact. A CSS comment in core/html.py documents
the rationale so the rule isn't reintroduced.

Body background changed from default white to a neutral `#f5f5f5`
gutter so the page outline reads cleanly on either side of the page.

Also: moved the example outputs from `examples_ocr/` into the
existing `examples/` directory and renamed to follow the project's
existing `output_<stem>.<ext>` convention (matching the existing
`examples/output_digital.pdf` etc.). Screenshots live in
`examples/screenshots/` to keep the top-level tidy.

Verified via Playwright browser smoke test — all four files now
render with light-gray body background, no `filter` on the page div,
and `getComputedStyle(span).color === "rgba(0, 0, 0, 0)"` confirming
the spans are fully transparent. No console errors.

Browser smoke-test assertions documented in
examples/screenshots/smoke_test_results.json.
User reported that overlay span positions in the digital form HTML
were "way off" (verified by Ctrl+A-selecting spans in the browser).
Diagnosis: not the HTML writer's fault — the DP aligner was pairing
form-field text with header bboxes because the LLM emitted text in
a non-monotonic reading order on this multi-column layout. Both PDF
and HTML/MD outputs inherited the same misaligned (bbox, text) pairs.

Verified with `--dense-mode always` (per-box OCR, no DP):

  Default: y=77   "Student Name (Last, First):" - WRONG, top of page
  Dense:   y=270  "Student Name (Last, First):" - CORRECT, form area

Regenerate all 8 example outputs (4 HTML + 4 MD) with
`--dense-mode always --concurrency 5` so the showcase actually
showcases the writer at its best. Re-run the Playwright smoke test
and update screenshots:

  - output_<stem>.png           (normal render — page only)
  - output_<stem>_selected.png  (with Ctrl+A applied so bboxes show)

The "_selected.png" pair is the auditable one — for output_digital,
the highlighted spans now sit cleanly over the form field labels in
the body of the page rather than clustering at the top header.

README: add a note recommending `--dense-mode always` paired with
`--format html` for forms / dense layouts where DP alignment can
mismatch.

This is NOT a default-behavior change — the DP path is still default
because dense-mode is N times slower (one LLM call per bbox). Users
who want best alignment can opt in via the documented flag combo.
User reported residual visual artifact in letter-spacing mode: when
Surya's bbox is wider than the visible serif text on the page, the
overlay characters (rendered in monospace) extend past where the
visible text ends. This is an inherent bbox-vs-rendered-glyph
mismatch — none of the writer modes can perfectly fix it without
measuring the original PDF's font metrics, but users with strong
visual-alignment preferences should be able to switch.

Add `--html-mode` flag to the CLI. Three choices, ranked by what an
independent reviewer judged best for visual alignment in side-by-side
comparison:

  letter-spacing (default) — best when bboxes match visible text width;
                              selection extents span the full bbox
  full-height              — natural monospace width, may overflow
                              bbox horizontally; better visual fidelity
                              when bboxes overshoot the visible text
  scaled                   — shrinks font to fit both dimensions;
                              most compact, smaller selection extents

Routed via `resolve_output_writer(output_path, html_mode=...)` so the
CLI controls it without coupling the dispatch helper to a specific
mode. The kwarg defaults to None (use HTMLHandler's own default of
letter-spacing), so existing callers — including `server.py` and
all the existing tests — are unaffected.

Add `examples/output_digital_full-height.html` and
`examples/output_digital_scaled.html` so users can visually compare
the three modes on the same page without re-running OCR.

README: document the flag with all three modes' trade-offs and the
side-by-side example reference.
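To make the trade-offs between the three modes concrete, here is a sketch of the arithmetic each one implies. It assumes a monospace advance width of 0.6 em (`MONO_ADVANCE_EM`) and that letter-spacing is applied once per character; `span_style` and `SpanStyle` are hypothetical names, and the real HTMLHandler may compute these values differently.

```python
from dataclasses import dataclass

MONO_ADVANCE_EM = 0.6  # assumed advance width of one monospace glyph, in em


@dataclass
class SpanStyle:
    font_size: float       # px
    letter_spacing: float  # px added after each character


def span_style(mode: str, text: str, bbox_w: float, bbox_h: float) -> SpanStyle:
    """Compute font-size and letter-spacing for one overlay span."""
    n = max(len(text), 1)
    natural_w = n * MONO_ADVANCE_EM * bbox_h  # text width at font-size == bbox height
    if mode == "full-height":
        return SpanStyle(bbox_h, 0.0)  # may overflow the bbox horizontally
    if mode == "letter-spacing":
        # Spread (or squeeze) characters so the run spans the bbox width.
        return SpanStyle(bbox_h, (bbox_w - natural_w) / n)
    if mode == "scaled":
        # Shrink the font until the run fits both dimensions; never enlarge it.
        scale = min(1.0, bbox_w / natural_w) if natural_w > 0 else 1.0
        return SpanStyle(bbox_h * scale, 0.0)
    raise ValueError(f"unknown mode: {mode!r}")
```

With a 10-character string in a 30 px wide, 10 px tall bbox, letter-spacing mode produces a spacing of about -3 px per character, so glyphs overlap, while scaled mode halves the font size instead.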
User asked how much "perfect alignment" improves with the grounded
path. Hypothesis: bbox-native VLMs return tighter, glyph-aware boxes
than Surya layout detection. Result on Qwen3-VL-4b: actually WORSE
than dense-mode for this page.

Reproduce: examples/output_digital_grounded.html generated with
`--grounded --model qwen/qwen3-vl-4b`. Qwen3-VL-4b returns a single
column-level bbox (~620 px wide) for ALL left-column labels regardless
of text length. The HTMLHandler's letter-spacing default then stretches
short labels ("Student G#:", "Notes", "Signatures") across the wide
bbox, producing the "S t u d e n t  G # :" effect.

Mitigation: examples/output_digital_grounded_natural.html generated
with `--grounded --model qwen/qwen3-vl-4b --html-mode full-height`.
This avoids the stretch by using natural monospace width. Labels look
correctly bounded; downside is text on wider bboxes (description /
deliverables) overflows past the viewport on narrow displays.

Conclusion: dense-mode (Surya + per-box LLM OCR) remains the best
path for forms / multi-column layouts on this stack. Grounded works
well when paired with a model that emits per-line bboxes — Qwen3-VL-4b
on this prompt does not. The recommended grounded model in the README
(qwen/qwen3-vl-8b) was not loaded for this validation; a larger model
may behave better.

Captures saved for human review:

  output_digital_dense_normal.png        — dense, no selection
  output_digital_dense_selected.png      — dense, Ctrl+A
  output_digital_grounded_normal.png     — grounded default, no selection
  output_digital_grounded_selected.png   — grounded default, Ctrl+A
                                            (shows the stretch artifact)
  output_digital_grounded_natural_selected.png — grounded + full-height,
                                            shows mitigation works for
                                            short labels but not
                                            wide-bbox lines
@ahnafnafee ahnafnafee marked this pull request as ready for review May 9, 2026 17:55
@ahnafnafee ahnafnafee assigned ahnafnafee and unassigned ahnafnafee May 9, 2026
@ahnafnafee
Owner Author

@milahu Do the latest changes satisfy your expected output?

@milahu milahu mentioned this pull request May 9, 2026
Comment thread src/pdf_ocr/cli.py
Comment on lines +98 to +110
parser.add_argument(
    "--html-mode", dest="html_mode",
    choices=("letter-spacing", "full-height", "scaled"), default="letter-spacing",
    help="Span sizing strategy for HTML output (ignored for pdf/md). "
         "letter-spacing (default): font sized to bbox height, chars "
         "spread to fill bbox width — best when Surya bboxes match "
         "visible text width. full-height: font = bbox height, no "
         "letter-spacing — text uses natural monospace width and may "
         "overflow the bbox right edge. scaled: shrinks font so text "
         "fits both bbox dimensions — most compact, but smaller "
         "selection extents. Try the alternatives if letter-spacing "
         "produces visible overlay characters past the underlying text.",
)

the default should be html_mode="scaled"
html_mode="letter-spacing" can make the text hard to read
when letters overlap with a negative letter-spacing

with html_mode="scaled"
at least i can zoom in to read the text

Comment thread src/pdf_ocr/core/html.py Outdated
Comment on lines +79 to +84
/* Keep invisibility intact even when the browser is in dark mode —
a previous draft applied `filter: invert()` to the page div, which
interacts badly with `color: transparent` in Chromium and renders
the glyph outlines as a faint inverted color. The OCR HTML preserves
the source page's appearance regardless of OS theme; users who want
dark theming for documents can use a browser extension. */

this comment is not helpful in the HTML output
it should be only in the python source

Comment thread src/pdf_ocr/core/html.py Outdated
Comment on lines +176 to +182
b64 = base64.b64encode(img_bytes).decode("ascii")
data_url = f"data:image/jpeg;base64,{b64}"
out.write(
    f'<div class="page" data-page="{page_num + 1}" '
    f'style="width:{_num(width)}px;height:{_num(height)}px;'
    f"background-image:url('{data_url}')\">\n"
)

by default, it should use external image files from HTML
and there should be a CLI option
to enable embedding base64-encoded images into HTML
something like --html-embed-images or --html-inline-images

why?

  • base64 encoding increases the image data size by about 33%, which is not wanted in most cases
  • the image files exist as input files as one image per page
    • exception: when the input is a PDF file (or TIFF file) with multiple pages, then it is not possible to use the input file as image source in HTML. then each page image should be stored side by side with the output HTML file, only with a different file extension, for example some-page.html and some-page.jpg

Owner Author


External images should be default now

- Default --html-mode = scaled (was letter-spacing). Negative
  letter-spacing was rendering some labels as overlapping smears;
  scaled stays legible at any zoom level.
- Strip the multi-line dark-mode rationale out of _PAGE_CSS so it no
  longer ships inside every HTML output; rationale kept as a tight
  Python comment above the constant.
- External page-image references are now the default. Single-frame
  browser-native inputs (JPEG/PNG/WebP/AVIF/GIF) are referenced
  directly via a URL-encoded relative path; PDFs and multi-frame
  inputs get sidecar JPEGs at <output_stem>_p<N>.jpg next to the
  HTML. Opt back into the previous single-self-contained behaviour
  with --html-inline-images / HTMLHandler(inline_images=True).
- server.py pins inline_images=True so the FileResponse remains
  self-contained (sidecars would not reach the client).
- Page layout switches to container queries: .page is sized in CSS
  pixels via --page-w/--page-h, with aspect-ratio + container-type:
  inline-size. Spans use % positioning and cqw font-size so overlay
  selection extents stay locked to the rasterized image at any zoom.
  A 5-line inline script measures innerWidth * devicePixelRatio
  once at load and shrinks each page's CSS width to fit narrow
  viewports, so subsequent browser zoom-in/zoom-out works in both
  directions.
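The image-reference rules in this commit message can be sketched as one function. `page_image_ref` and its parameters are hypothetical; the sketch assumes 1-based page numbers in sidecar names and that a directly referenced input is addressed relative to the HTML file's directory.

```python
import base64
import os
from pathlib import Path
from urllib.parse import quote

# Formats a browser can display directly, per the commit message.
BROWSER_NATIVE = {".jpg", ".jpeg", ".png", ".webp", ".avif", ".gif"}


def page_image_ref(input_path: str, output_path: str, page_num: int,
                   n_pages: int, jpeg_bytes: bytes = b"",
                   inline: bool = False) -> str:
    """Return the URL used for one page image in the generated HTML."""
    if inline:
        # Opt-in: embed the rasterized JPEG as a base64 data URL.
        return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")
    src, out = Path(input_path), Path(output_path)
    if n_pages == 1 and src.suffix.lower() in BROWSER_NATIVE:
        # Single-frame native image: reference the input file itself.
        rel = os.path.relpath(src, out.parent)
        return quote(Path(rel).as_posix())
    # PDFs, multi-frame inputs, and non-native formats get a sidecar JPEG.
    return quote(f"{out.stem}_p{page_num}.jpg")
```

Writing the sidecar file itself is left out; this only shows how the reference string would be chosen.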

Examples regenerated against OlmOCR-2-7B (hybrid) and Qwen3-VL-8B
(grounded). Redundant mode-showcase variants and a byte-identical
grounded sidecar removed; the grounded HTML now references the
shared output_digital_p1.jpg.

Tests: 271 fast + 23 slow. New coverage:
- HTMLHandler external-default (sidecar + direct-reference branches)
- HTMLHandler inline-mode opt-in
- BMP / multi-frame fallback to sidecars
- Page CSS uses container-relative sizing + fit script

Closes #10 review comments from @milahu.
@ahnafnafee ahnafnafee requested a review from milahu May 16, 2026 17:01
Comment thread src/pdf_ocr/core/html.py
Comment on lines +89 to +97
(function(){
  var px = window.innerWidth * (window.devicePixelRatio || 1);
  document.querySelectorAll('div.page').forEach(function(p){
    var w = parseFloat(p.style.getPropertyValue('--page-w'));
    if (!isFinite(w) || w <= 0) return;
    var s = Math.min(1, px / w);
    if (s < 1) p.style.width = (w * s) + 'px';
  });
})();

@milahu milahu May 16, 2026


this fails for window.devicePixelRatio != 1
(when the display manager zoom level is not 100%)

window.innerWidth = document.body.clientWidth + scrollbarWidth

clientWidth - 1 to account for rounding errors
otherwise i still see a horizontal scrollbar
(my desktop zoom is 150%)

simplified function

(function(){
  var px = document.body.clientWidth - 1;
  for (const p of document.querySelectorAll('div.page')) {
    var w = parseFloat(p.style.getPropertyValue('--page-w'));
    if (!isFinite(w) || w <= 0 || w <= px) continue;
    p.style.width = px + 'px';
  }  
})();

Comment thread src/pdf_ocr/core/html.py Outdated
Comment on lines +45 to +48
# No `filter: invert()` on the page div: it interacts with
# `color: transparent` in Chromium and renders glyph outlines as a faint
# inverted color. The output preserves the source page's appearance; OS
# dark theming is a browser-extension concern.

darkreader does not invert images...

in my code i have this style
which just works, no "glyph outlines"

@media (prefers-color-scheme: dark) {
  div.page {
    filter: invert() hue-rotate(180deg);
  }
}

the hue-rotate(180deg) is needed
to convert darkblue to lightblue (instead of yellow)

"no darkmode" is also a stupid limitation of chrome's PDF reader
which makes it impossible for me to read PDF documents at night

Owner Author


I think this could be passed in as a CLI optional parameter instead of being the default


this is a special case, because we have images of text

the darkreader extension does not invert images, because darkreader does not know the difference between "images of text" and "images of cats", because there is no magic style class like img.imageoftext.lightmode, so i need the stylebot extension to invert images of text...

with the principle of least surprise, i would expect that when i set my display manager's color theme to dark (@media (prefers-color-scheme: dark)) then that should also invert images of text

in short, this should be default on, with an option to turn it off

no magic style class like img.imageoftext.lightmode

some book authors follow the bad taste of having some pages inverted (white text on black background) (example with L=0.02). so we could compute the lightness of each page (example output), to ensure that in our dark mode the originally dark pages are not inverted, and all pages end up dark. books can also have pages with an average lightness near 50% (gray) (example with L=0.35), so pages should be inverted only above some lightness threshold, maybe 0.7 by default
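
A minimal Python sketch of that per-page check, assuming "lightness" means the mean HLS lightness of the page's pixels on a 0 to 1 scale (the linked examples may use a different colour model). The function names and the plain list-of-RGB-tuples input are hypothetical; only the 0.7 default comes from the comment above.

```python
import colorsys


def mean_lightness(pixels):
    """Average HLS lightness (0.0 = black, 1.0 = white) of (r, g, b) byte triples."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        _h, lightness, _s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        total += lightness
        count += 1
    if count == 0:
        raise ValueError("page has no pixels")
    return total / count


def should_invert_in_dark_mode(pixels, threshold=0.7):
    """Invert only pages that are clearly light; dark and gray pages stay as-is."""
    return mean_lightness(pixels) > threshold


white_page = [(255, 255, 255)] * 100  # typical scan background
black_page = [(5, 5, 5)] * 100        # page the author already printed dark
gray_page = [(89, 89, 89)] * 100      # lightness near 0.35
```

Pages for which should_invert_in_dark_mode returns True could then get a marker class that the dark-mode CSS targets, so that already-dark pages are left alone.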

Owner Author

Most tools (PDF readers, browsers) do not invert scanned page images by default, so inverting them automatically could surprise many users. I believe making it an opt-in CLI flag gives the end user more control.

Owner Author

I've added it as an opt-in CLI flag in the latest push (1d13516).

Usage:

uv run local-llm-pdf-ocr input.pdf --format html --html-invert-dark

This adds the following CSS block to the generated HTML, which activates only when the browser/OS is in dark mode:

@media (prefers-color-scheme: dark) {
  div.page {
    filter: invert() hue-rotate(180deg);
  }
}

Without --html-invert-dark, the page image renders as-is in all colour schemes (just the body background and outline adjust for dark mode). This keeps the default behaviour unsurprising for users with scanned colour documents or photos, while making it easy to opt in for the white-background-at-night use case you described.

I also generated an example with the flag applied: examples/output_digital_invert_dark.html — you can download it and test in your browser with dark mode enabled. The hue-rotate(180deg) handles the dark-blue → light-blue remapping you mentioned.
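
For readers skimming the thread, the opt-in wiring can be pictured roughly as follows. This is a simplified sketch, not the code from the push: build_style, parse_args, BASE_CSS, and the pdf/html/md choices list are assumptions made for illustration; only the flag name and the CSS block are taken from the comment above.

```python
import argparse

# CSS block quoted in this thread; injected only when the flag is set.
DARK_INVERT_CSS = """\
@media (prefers-color-scheme: dark) {
  div.page {
    filter: invert() hue-rotate(180deg);
  }
}
"""

# Hypothetical placeholder for the styles that are always emitted.
BASE_CSS = "div.page { position: relative; }\n"


def build_style(invert_dark=False):
    """Return the <style> element, adding the dark-mode rule on request."""
    css = BASE_CSS
    if invert_dark:
        css += DARK_INVERT_CSS
    return "<style>\n" + css + "</style>"


def parse_args(argv=None):
    """Parse a cut-down version of the CLI (assumed shape, not the real parser)."""
    parser = argparse.ArgumentParser(prog="local-llm-pdf-ocr")
    parser.add_argument("input")
    parser.add_argument("--format", choices=["pdf", "html", "md"], default="pdf")
    parser.add_argument(
        "--html-invert-dark",
        action="store_true",
        help="invert page images under prefers-color-scheme: dark",
    )
    return parser.parse_args(argv)


args = parse_args(["input.pdf", "--format", "html", "--html-invert-dark"])
style = build_style(invert_dark=args.html_invert_dark)
```

Because the flag uses store_true, leaving it off yields invert_dark=False and the page images stay untouched in every colour scheme.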

Comment thread src/pdf_ocr/core/html.py Outdated
return

for page_num, img_bytes, w, h in self._rasterize_pages(input_path, dpi):
sidecar_name = f"{out_stem}_p{page_num + 1}.jpg"

for single-page input files, there should be no _p1 suffix

so...

        # list() so len() works even if _rasterize_pages yields pages lazily
        rasterized_pages = list(self._rasterize_pages(input_path, dpi))
        for page_num, img_bytes, w, h in rasterized_pages:
            if len(rasterized_pages) == 1:
                sidecar_name = f"{out_stem}.jpg"
            else:
                sidecar_name = f"{out_stem}_p{page_num + 1}.jpg"

Comment thread src/pdf_ocr/core/html.py
Comment on lines +245 to +249
for page_num, image_url, width, height in pages:
self._render_page(
out, page_num, image_url, width, height,
pages_data.get(page_num, []),
)

have you tested this with multi-page input files?
are the pages rendered properly, like in a PDF reader?

if one page has less width than other pages
it should be centered horizontally with

body {
  text-align: center;
}

or

div.page {
  margin: auto;
}

Owner Author

@ahnafnafee ahnafnafee May 16, 2026

I will test this out and report back

…ersion

Adds a CLI flag that injects CSS `filter: invert() hue-rotate(180deg)`
into the HTML output, activated only under `prefers-color-scheme: dark`.
Without the flag, page images render as-is in all colour schemes.

Addresses PR #10 feedback from @milahu: dark-mode page inversion is
useful for reading scanned white-background documents at night, but
should be opt-in since most tools do not invert scanned images by
default.

Also regenerates all example outputs against current code.

Files:
- src/pdf_ocr/core/html.py: `invert_dark` param on HTMLHandler,
  `_DARK_INVERT_CSS` constant, conditional injection in `_render_html`
- src/pdf_ocr/output.py: `html_invert_dark` kwarg on
  `resolve_output_writer`, forwarded to HTMLHandler
- src/pdf_ocr/cli.py: `--html-invert-dark` argument + wiring
- README.md / CLAUDE.md: documented the new flag
- examples/: regenerated all outputs + new invert-dark example
@ahnafnafee ahnafnafee requested a review from milahu May 18, 2026 14:27
Comment thread src/pdf_ocr/core/html.py
Comment on lines +190 to +195
rasterized = list(self._rasterize_pages(input_path, dpi))
for page_num, img_bytes, w, h in rasterized:
if len(rasterized) == 1:
sidecar_name = f"{out_stem}.jpg"
else:
sidecar_name = f"{out_stem}_p{page_num + 1}.jpg"

nitpick: page_num + 1 should be zero-padded
so the image files are always sorted correctly

        rasterized = list(self._rasterize_pages(input_path, dpi))
        pnw = len(str(len(rasterized))) # page number width
        for page_num, img_bytes, w, h in rasterized:
            if len(rasterized) == 1:
                sidecar_name = f"{out_stem}.jpg"
            else:
                sidecar_name = f"{out_stem}_p{str(page_num + 1).zfill(pnw)}.jpg"

nitpick:
page_num should be renamed to page_idx
(1-based versus 0-based)


out of scope:
ideally we should avoid
list(self._rasterize_pages(input_path, dpi))
currently we need that only for len(rasterized)
but that forces us to buffer the whole input in memory
which fails on large inputs or low memory...
ideally, the whole pipeline should be streaming

related: streaming PDF writer
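
One way to combine the two nitpicks with the streaming concern is to make the naming a pure function of the page count, so the rasterized pages never need to be collected into a list. The sketch below assumes the page count can be obtained up front (for example from the document itself); sidecar_name is a hypothetical helper, not part of the PR.

```python
def sidecar_name(out_stem, page_idx, page_count):
    """Build the image file name for a 0-based page index.

    Single-page inputs get no page suffix; multi-page inputs get a
    1-based page number zero-padded to the width of the last page
    number, so the files sort correctly by name.
    """
    if page_count == 1:
        return f"{out_stem}.jpg"
    width = len(str(page_count))
    return f"{out_stem}_p{page_idx + 1:0{width}d}.jpg"


names = [sidecar_name("scan", i, 12) for i in range(12)]
```

With this shape, the loop can consume _rasterize_pages lazily and only needs the total page count, not the buffered pages.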

</div>
<script>
(function(){
var px = document.body.clientWidth - 1;

@milahu milahu May 21, 2026

all examples/*.html still have the old version

$ rg -F 'var px = window.innerWidth' examples/ | wc -l
5

$ rg -F 'var px = document.body.clientWidth' examples/ | wc -l
0

Development

Successfully merging this pull request may close these issues.

feat: add option for exporting OCR output to markdown + HTML

2 participants