add output format HTML #8

Closed
milahu wants to merge 7 commits into ahnafnafee:main from milahu:add-output-html
Conversation

@milahu milahu commented May 5, 2026

fix #6 (comment)

example output with grounded olmocr
2026-03-26.22-00.olmocr.html

this is a minimal-diff patch, hooking into the PDF writer module
ideally this should be a separate module in src/pdf_ocr/core/html.py
but that requires patching some call sites
for example, by replacing the pdf_handler kwarg
with input_handler and output_handler
or src_handler and dst_handler
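
A rough sketch of what that split could look like. Everything here is hypothetical: `InputHandler`, `OutputHandler`, `FakePdfInput`, `HtmlOutput` and `convert` are illustration names, not code from this repository, and the real call sites may pass different arguments.

```python
import html
from typing import Protocol


class InputHandler(Protocol):
    """Hypothetical interface: reads a source document into per-page text."""

    def read_pages(self, src: str) -> list[str]: ...


class OutputHandler(Protocol):
    """Hypothetical interface: renders per-page text into one output document."""

    def render(self, pages: list[str]) -> str: ...


class FakePdfInput:
    """Stand-in for a PDF reader plus OCR step; returns one string per page."""

    def read_pages(self, src: str) -> list[str]:
        return [f"text of {src} page 1", f"text of {src} page 2"]


class HtmlOutput:
    """Renders each page as a <div class="page"> inside a minimal HTML document."""

    def render(self, pages: list[str]) -> str:
        body = "\n".join(
            f'<div class="page">{html.escape(page)}</div>' for page in pages
        )
        return f"<!doctype html>\n<html><body>\n{body}\n</body></html>"


def convert(src: str, input_handler: InputHandler, output_handler: OutputHandler) -> str:
    """The call site only knows the two interfaces, not the concrete formats."""
    pages = input_handler.read_pages(src)
    return output_handler.render(pages)
```

With this shape, adding another output format would mean adding another `OutputHandler` implementation rather than hooking into the PDF writer.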

use letter-spacing to stretch text to the line width
this also works on span elements with display:inline
transform:scaleX would require display:block or display:inline-block,
because CSS transforms do not apply to inline boxes
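
A minimal sketch of the letter-spacing calculation, not the code from this patch. It assumes the caller already knows the natural (unstretched) width of the text in the chosen font and the target line width from the OCR bounding box. `stretched_span` and its parameters are hypothetical names. It also assumes the extra space is distributed over the gaps between characters; a renderer that also adds spacing after the last character would need to divide by `len(text)` instead.

```python
import html


def stretched_span(text: str, natural_width_px: float, target_width_px: float) -> str:
    """Return an inline <span> whose letter-spacing stretches text to the target width.

    Assumption: spacing is added in the len(text) - 1 gaps between characters.
    A negative result compresses the text, which CSS letter-spacing allows.
    """
    gaps = max(len(text) - 1, 1)  # avoid division by zero for 0 or 1 characters
    spacing = (target_width_px - natural_width_px) / gaps
    return (
        f'<span style="letter-spacing:{spacing:.3f}px">'
        f"{html.escape(text)}</span>"
    )
```

For example, `stretched_span("abcde", 40.0, 48.0)` spreads the missing 8px over 4 gaps, giving `letter-spacing:2.000px`.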

@milahu changed the title from "Add output html" to "add output format HTML" May 5, 2026
@milahu force-pushed the add-output-html branch from 54ef4c9 to ea4d0f4 May 5, 2026 12:55
@milahu force-pushed the add-output-html branch from ea4d0f4 to 8be7445 May 5, 2026 13:10
milahu commented May 9, 2026

closing in favor of #10

@milahu closed this May 9, 2026
Development

Successfully merging this pull request may close these issues.

feat: add option for exporting OCR output to markdown + HTML