Commit 31ac44a
Support scanned PDFs: OCR fallback, figure/table cropping, text cleanup
Make the reflow engine work on scanned papers (e.g. the MIT Hackman/Oldham
1976 scan) that ship a page bitmap behind an OCR text layer.
- Drop full-page scan "backdrops" so the embedded text layer reads through
instead of coming back as one giant image.
- Recover real figures/tables baked into the scan by their caption
("FIG. 1", "TABLE 2"): infer the artwork box, suppress the gibberish labels
inside it, and crop it as an image. Skip the table's title block by width so
a same-font title doesn't collapse the crop.
- Render crops in the page's native (unrotated) orientation so landscape
tables on rotated pages come out upright and uncut.
- Judge "needs OCR" from the raw text-layer size (not post-suppression), so a
table whose text sits inside its figure box isn't force-OCR'd into garbage.
- Add a Tesseract.js OCR fallback for pages with no text layer at all; OCR
output is run through the same gibberish/letter-spacing cleaning.
- Text quality: collapse per-glyph spacing ("o f j o b" -> "of job"), drop
OCR gibberish, and stitch sentences across page breaks. Weight body-size
detection by character count and gate it by width, and add a length guard so
body paragraphs aren't styled as headings on size-jittery scans.
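
The per-glyph spacing cleanup in the last bullet can be illustrated with a minimal sketch. This is not the code from this commit: `collapseLetterSpacing`, `minGlyphs`, and `minRatio` are hypothetical names, and the sketch assumes the text layer keeps a wider gap (two or more spaces) between words than between glyphs, which the real heuristic may not rely on.

```typescript
// Hypothetical helper, not the implementation in this commit.
// Assumption: in a letter-spaced line, glyphs are separated by one space
// and words by two or more spaces.
function collapseLetterSpacing(
  line: string,
  minGlyphs = 4, // ignore very short lines such as "a b"
  minRatio = 0.8 // share of single-character tokens needed to trigger
): string {
  const tokens = line.trim().split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length < minGlyphs) return line;

  // Count tokens that are exactly one character (code-point aware).
  const singles = tokens.filter((t) => [...t].length === 1).length;
  if (singles / tokens.length < minRatio) return line;

  // Wide gaps mark word boundaries; single spaces inside a word are dropped.
  return line
    .trim()
    .split(/\s{2,}/)
    .map((word) => word.replace(/\s+/g, ""))
    .join(" ");
}

console.log(collapseLetterSpacing("o f  j o b"));
console.log(collapseLetterSpacing("the design of a job"));
```

The ratio check leaves ordinary prose alone, since normal sentences contain few one-letter words.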
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 7b8e209 commit 31ac44a
5 files changed
Lines changed: 637 additions & 37 deletions