
Commit 31ac44a

Eliott Gandiolle and claude committed
Support scanned PDFs: OCR fallback, figure/table cropping, text cleanup
Make the reflow engine work on scanned papers (e.g. the MIT Hackman/Oldham 1976 scan) that ship a page bitmap behind an OCR text layer.

- Drop full-page scan "backdrops" so the embedded text layer reads through instead of coming back as one giant image.
- Recover real figures/tables baked into the scan by their caption ("FIG. 1", "TABLE 2"): infer the artwork box, suppress the gibberish labels inside it, and crop it as an image. Skip the table's title block by width so a same-font title doesn't collapse the crop.
- Render crops in the page's native (unrotated) orientation so landscape tables on rotated pages come out upright and uncut.
- Judge "needs OCR" from the raw text-layer size (not post-suppression), so a table whose text sits inside its figure box isn't force-OCR'd into garbage.
- Tesseract.js OCR fallback for pages with no text layer at all; OCR output is run through the same gibberish/letter-spacing cleaning.
- Text quality: collapse per-glyph spacing ("o f j o b" -> "of job"), drop OCR gibberish, stitch sentences across page breaks, char-weight + width-gate body-size detection and add a length guard so body paragraphs aren't styled as headings on size-jittery scans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
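The letter-spacing repair mentioned in the last bullet can be illustrated with a minimal sketch. The commit's actual heuristic is not shown on this page, so everything here is an assumption: the `Glyph` shape, the `collapseLetterSpacing` name, and the rule that a gap wider than the median glyph width marks a word break are all hypothetical.

```typescript
// Hypothetical glyph record: one glyph's text plus its horizontal extent.
interface Glyph {
  str: string;
  x: number;     // left edge, in page units
  width: number; // advance width, in page units
}

// Join per-glyph items into words. A gap wider than `wordGapRatio` times the
// median glyph width is treated as a word break; narrower gaps are closed up.
// (Assumed rule for illustration, not the repository's implementation.)
function collapseLetterSpacing(glyphs: Glyph[], wordGapRatio = 1.0): string {
  if (glyphs.length === 0) return "";
  const sorted = [...glyphs].sort((a, b) => a.x - b.x);
  const widths = sorted.map((g) => g.width).sort((a, b) => a - b);
  const median = widths[Math.floor(widths.length / 2)];
  let out = sorted[0].str;
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const gap = sorted[i].x - (prev.x + prev.width);
    out += (gap > median * wordGapRatio ? " " : "") + sorted[i].str;
  }
  return out;
}

// "o f j o b" with a visibly wider gap between "f" and "j".
const spaced: Glyph[] = [
  { str: "o", x: 0, width: 5 },
  { str: "f", x: 8, width: 5 },
  { str: "j", x: 20, width: 5 },
  { str: "o", x: 28, width: 5 },
  { str: "b", x: 36, width: 5 },
];
console.log(collapseLetterSpacing(spaced)); // of job
```

Working from glyph positions rather than the flattened string matters here: once the text is reduced to "o f j o b" with uniform spaces, the word boundary can no longer be recovered without a dictionary.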
1 parent 7b8e209 commit 31ac44a

5 files changed

Lines changed: 637 additions & 37 deletions


README.md

Lines changed: 21 additions & 4 deletions
@@ -29,7 +29,20 @@ The default URL is the Deci/Olafsen/Ryan SDT paper — just press **Make readabl
 1- vs 2-column layouts to recover reading order;
 - groups lines into paragraphs/headings (de-hyphenating line breaks);
 - walks the page operator list while tracking the transform matrix to find and
-rasterise each embedded figure to a PNG.
+rasterise each embedded figure to a PNG;
+- ignores full-page **scan backdrops** (a page-sized image sitting behind an
+invisible text layer) so the real text reads through instead of coming back
+as one giant image — while recovering the real **figures/tables** baked into
+that scan by their captions ("FIG. 1", "TABLE 2") and cropping them as
+images;
+- suppresses **garbled OCR debris** that scanned diagrams paint over their
+artwork (and crops full-page scanned **tables** as images instead of leaking
+their cells as text), stitches sentences that **continue across a page
+break** back into one paragraph, repairs **letter-spaced** OCR words
+("o f j o b" → "of job"), and avoids mistaking a large-font body paragraph
+for a heading on size-jittery scans;
+- falls back to **OCR** ([tesseract.js](https://github.com/naptha/tesseract.js),
+in-browser) for pages that have no text layer at all.
 - **`src/App.tsx` / `src/Lightbox.tsx`** — re-typesets the result and provides a
 full-screen zoomable image viewer (click a figure, click again for actual
 size, `Esc` to close).
@@ -47,8 +60,12 @@ Pages.
 
 ## Notes / limits
 
-- Works on PDFs that have a real text layer (most digital papers). Pure scans
-with no text layer would need OCR — not included here.
+- Works best on PDFs with a real text layer (most digital papers), including
+scanned papers that carry an invisible OCR text layer behind a full-page image
+(the backdrop is dropped and the text reads through).
+- Pages with no text layer at all are OCR'd in the browser with tesseract.js.
+This is slow (a few seconds per page) and downloads the language model from a
+CDN the first time; accuracy depends on scan quality.
 - Column detection is heuristic; unusual layouts may interleave oddly.
-- The pdf.js worker is loaded from jsDelivr at the exact installed version, so
+- The pdf.js worker (and the tesseract.js core/model) are loaded from a CDN, so
 the page needs internet access the first time.
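The backdrop rule and the "needs OCR" check described in the README changes can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the `Box` type, both function names, the 90% coverage threshold, and the 20-character minimum are hypothetical values chosen for the example.

```typescript
// Hypothetical axis-aligned box in page units.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Treat an image as a scan backdrop when it covers at least `minCoverage`
// of the page area. (The 0.9 threshold is an assumption.)
function isScanBackdrop(image: Box, page: Box, minCoverage = 0.9): boolean {
  const overlapW = Math.max(
    0,
    Math.min(image.x + image.width, page.x + page.width) - Math.max(image.x, page.x),
  );
  const overlapH = Math.max(
    0,
    Math.min(image.y + image.height, page.y + page.height) - Math.max(image.y, page.y),
  );
  const pageArea = page.width * page.height;
  return pageArea > 0 && (overlapW * overlapH) / pageArea >= minCoverage;
}

// Decide on OCR from the RAW text-layer size, measured before any text inside
// figure boxes is suppressed. (The 20-character minimum is an assumption.)
function needsOcr(rawTextLayerChars: number, minChars = 20): boolean {
  return rawTextLayerChars < minChars;
}

const letterPage: Box = { x: 0, y: 0, width: 612, height: 792 };
const fullScan: Box = { x: 0, y: 0, width: 612, height: 792 };
const smallFigure: Box = { x: 100, y: 300, width: 200, height: 150 };

console.log(isScanBackdrop(fullScan, letterPage));    // true
console.log(isScanBackdrop(smallFigure, letterPage)); // false
console.log(needsOcr(0));    // true
console.log(needsOcr(1500)); // false
```

Measuring the raw layer is the point of the second function: a full-page table may have all of its text suppressed because it sits inside the figure box, yet the page still has a text layer and should not be sent to OCR.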

bun.lock

Lines changed: 27 additions & 0 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 2 additions & 1 deletion
@@ -9,9 +9,10 @@
     "build": "bun build ./index.html --outdir=dist --minify"
   },
   "dependencies": {
+    "pdfjs-dist": "4.8.69",
     "react": "^18.3.1",
     "react-dom": "^18.3.1",
-    "pdfjs-dist": "4.8.69"
+    "tesseract.js": "^7.0.0"
   },
   "devDependencies": {
     "@types/bun": "^1.3.14",
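With tesseract.js added as a dependency, the per-page fallback might be wired roughly as below. This is a sketch only: `PageSource`, `Recognise`, `textForPage`, the default cleaner, and the 20-character threshold are hypothetical, and the recogniser is injected so the example runs without the library. In the app it could wrap a tesseract.js worker, which is not shown here.

```typescript
// Hypothetical recogniser: takes a rendered page image and resolves to text.
type Recognise<Image> = (image: Image) => Promise<string>;

// Hypothetical page handle: the raw text layer plus a lazy renderer.
interface PageSource<Image> {
  rawText: string;
  render: () => Promise<Image>;
}

// Placeholder cleaner (whitespace only). The commit runs OCR output through
// the same gibberish/letter-spacing cleaning as the text layer.
const defaultClean = (s: string): string => s.replace(/\s+/g, " ").trim();

// Use the text layer when it has enough characters; otherwise render the page
// and OCR it. Both paths go through the same cleaner.
async function textForPage<Image>(
  page: PageSource<Image>,
  recognise: Recognise<Image>,
  clean: (s: string) => string = defaultClean,
  minChars = 20, // assumed threshold
): Promise<{ text: string; usedOcr: boolean }> {
  if (page.rawText.trim().length >= minChars) {
    return { text: clean(page.rawText), usedOcr: false };
  }
  const image = await page.render();
  return { text: clean(await recognise(image)), usedOcr: true };
}

// Stub recogniser standing in for a real OCR engine.
const stubOcr: Recognise<string> = async () => "  scanned   text ";

const scannedPage: PageSource<string> = { rawText: "", render: async () => "bitmap" };
const digitalPage: PageSource<string> = {
  rawText: "A page with a real text layer of sufficient length.",
  render: async () => "bitmap",
};

textForPage(scannedPage, stubOcr).then((r) => console.log(r.usedOcr, r.text));
textForPage(digitalPage, stubOcr).then((r) => console.log(r.usedOcr, r.text));
```

Rendering lazily keeps the slow path (rasterise, then recognise) off pages that already have usable text.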
