| name | pdf-extractor |
|---|---|
| description | Extract spatial text (liteparse grid projection), structured JSON with per-page bounding boxes, a heuristic Markdown reconstruction (with headings, lists, tables, hyperlinks, and bookmark-driven heading hierarchy), the PDF outline / Table of Contents, or per-page PNG renders (with optional bbox overlay + full-text search highlight) from a PDF using the Secutils.dev PDF Extractor tool. Optionally runs in-browser OCR for scanned PDFs via Tesseract.js. Hand the user https://tools.secutils.dev/pdf so they can drop the PDF, click **Parse**, switch to the **Text** / **JSON** / **Markdown** / **Outline** / **Screenshots** tab they want, and then **Share** / **Copy** / **Download** the result. PDFs are NEVER uploaded -- parsing runs entirely in the user's browser, and the share URL carries only the extracted output. Trigger when the user asks to "extract text from a PDF", "convert PDF to JSON with bounding boxes", "convert PDF to Markdown", "extract tables from a PDF", "extract the table of contents from a PDF", "find text inside a PDF visually", "OCR a scanned PDF in the browser", "get structured PDF output without uploading", or anything that names secutils.dev/pdf or run-llama liteparse. |
In-browser PDF parser. Bundles the upstream liteparse engine, PDF.js renderer, and tesseract.js OCR into one HTML file (~3 MB inlined). No server-side parsing, no uploads of the PDF bytes.
Five result tabs:
- Text -- liteparse's grid-projected output. Plain UTF-8, preserves column / table layout via fixed-width whitespace better than naive `pdf.js` text extraction. Suitable as the input to a Markdown converter (the page has a one-click "open in Markdown to HTML" handoff).
- JSON -- structured output: `{ pages: [{ page, text, items, boundingBoxes }] }`, with per-item rectangles, font sizes, page rotation, and (when OCR ran) confidence scores.
- Markdown -- heuristic reconstruction built from the JSON tab's spatial data. Lazy on first click and cached after. Detects:
  - Bookmark-driven headings (`#` ... `######`) when the PDF carries an outline / Table of Contents -- a line whose normalized plain text matches a bookmark on its page is emitted at the bookmark's level (0->H1, 1->H2, ..., capped at H6). This is authoritative: it overrides the font-size heuristic and catches the common case where a section heading is typeset at the body font size.
  - Headings (`#`, `##`, `###`) from items whose `fontSize` exceeds the document-wide body median (1.45x / 1.20x / 1.08x cutoffs). Used as a fallback when the bookmark match fails (most PDFs have no outline -- bookmarks are an authoring-time feature most LaTeX / Office workflows skip).
  - Bullet lists (lines starting with `•`, `·`, `-`, `*`, `–`, `—`) and numbered lists (lines starting with `1.`, `a)`, `iv.`, ...).
  - Tables -- runs of at least 3 paragraph-classified lines whose left-edge x-anchors line up within +/-5 PDF points across at least 2 columns become GitHub-flavored markdown tables; first row is the header.
  - Inline bold / italic from PDF font names (`Bold` / `Black` / `Heavy`, `Italic` / `Oblique`).
  - Hyperlinks via a separate PDF.js annotation pass over the original bytes -- only when the user parsed the PDF locally, NOT when hydrating from a shared URL (the bytes are out of scope).
  - Page breaks as `---` horizontal rules between every page.

  The "open in Markdown to HTML" handoff button works from this tab too.
- Outline -- the PDF's bookmark tree / Table of Contents, with destinations resolved to 1-indexed page numbers. Lazy on first click, cached after. Clicking an entry jumps to that page in the Screenshots tab (which renders it on demand). Most PDFs don't have an outline -- bookmarks are a PDF-authoring feature most LaTeX / Office workflows skip. When present, the outline drives the Markdown tab's heading hierarchy (see above).
- Screenshots -- per-page PNG renders at 150 DPI, generated lazily the first time the user clicks the tab. Each page streams in as it finishes (PDF.js canvas renderer, no PDFium). Per-page download links sit in each figure caption. Two interactive overlays sit in a sticky toolbar above the page stack:
  - Search -- case-insensitive substring match across every `JsonTextItem.text` in `result.json`. Matched items get a translucent yellow highlight on the rendered PNGs (canvas overlay positioned over each `<img>`, scaled from PDF points to render pixels via `SHOTS_DPI / 72`). Match count shows next to the input ("23 matches" / "No matches"); Enter scrolls the first match into view.
  - Show boxes -- outline every text item with its bounding box. Useful for visualising the spatial nature of PDF text extraction and debugging why a particular run was (or wasn't) detected as a heading / list / table in the Markdown tab.

  Share / copy / download in the toolbar are disabled while this tab is active (no URL state for screenshots -- each page is a file).
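The two heading rules of the Markdown tab (bookmark match first, font-size ratio as the fallback) can be sketched roughly as below. This is an illustration, not the tool's actual code: the function name `headingLevel`, its argument shape, the pairing of the three cutoffs with H1 / H2 / H3 in descending order, and treating each cutoff as inclusive are all assumptions. Only the cutoff values, the 0->H1 mapping capped at H6, and the bookmark-over-font-size precedence come from the description above.

```javascript
// Cutoffs from the description; mapping them to H1/H2/H3 in this order is an assumption.
const FONT_CUTOFFS = [
  { ratio: 1.45, level: 1 },
  { ratio: 1.2, level: 2 },
  { ratio: 1.08, level: 3 },
];

// Returns a heading level 1..6, or null for body text.
// `bookmarkLevel` is the 0-based outline depth of a bookmark matching this line, or null.
function headingLevel({ fontSize, bodyMedian, bookmarkLevel = null }) {
  if (bookmarkLevel !== null) {
    // A bookmark match is authoritative: 0 -> H1, 1 -> H2, ..., capped at H6.
    return Math.min(bookmarkLevel + 1, 6);
  }
  // Fallback: compare the item's font size with the document-wide body median.
  const ratio = fontSize / bodyMedian;
  const hit = FONT_CUTOFFS.find((cutoff) => ratio >= cutoff.ratio);
  return hit ? hit.level : null;
}

// A body-sized line that matches a depth-1 bookmark still becomes an H2.
console.log(headingLevel({ fontSize: 10, bodyMedian: 10, bookmarkLevel: 1 }));
// Without a bookmark, a line at twice the body size falls in the largest bucket.
console.log(headingLevel({ fontSize: 20, bodyMedian: 10 }));
```

A sketch like this is mostly useful together with the Show boxes overlay: if a line was not promoted to a heading, comparing its font size with the body median usually explains why.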
Three export paths from the result pane:
- Share -- copies a `tools.secutils.dev/pdf#<encoded>` URL with the result (Text, JSON, Markdown, or Outline, whichever tab is active) round-tripped through the URL fragment. The PDF outline tree, when small (<= 4 KB JSON), opportunistically piggy-backs on every other tab's share URL so the recipient's Outline + Markdown tabs work without re-deriving it from PDF bytes they don't have. Disabled when the payload is over ~64 KB and while the heuristic Markdown engine is still computing.
- Copy -- copies the active tab's payload to the clipboard.
- Export -- dropdown with two contextual actions:
  - Download saves `<src>.txt`, `<src>.json`, `<src>.md`, or `<src>.json` (for Outline) depending on the active tab.
  - Open in Markdown to HTML hands the active payload off to the Markdown to HTML tool in a new tab. Enabled only on the Text and Markdown tabs (md-to-html renders the latter natively; the spatial text travels verbatim).
The post-parse stats line (<N> pages · <M> bbox · <T>ms) appears
inside the dropzone underneath the file pill rather than in the result
toolbar, so the right-hand action buttons stay uncrowded. When the page
is hydrated from a shared link (no actual PDF bytes), the dropzone shows
a "Loaded from shared link" pseudo-file with the page count, and dropping
a real PDF replaces it cleanly.
| Field | Type | Default | Notes |
|---|---|---|---|
| PDF file | binary | required | Dropped on the dropzone or chosen via file picker. NOT uploaded. |
| OCR mode | `auto` \| `always` \| `never` | `auto` | Run OCR only on text-sparse pages / always / never. |
| OCR languages | chip multi-select | `eng` | One or more Tesseract.js codes picked from the searchable catalog (~120 entries: `eng`, `deu`, `fra`, `chi_sim`, ...). Multiple selections join with `+` (e.g. `eng+deu`) and are downloaded in parallel on first use. |
Options live in the gear popover next to Parse. Defaults are reasonable for any Latin-script PDF.
The shared canonical encoding every Secutils.dev tool uses:
| 4 bytes uncompressed-length (LE u32) | N bytes raw DEFLATE of UTF-8 string |
Pipeline: UTF-8 bytes of `JSON.stringify(state)` -> deflate-raw -> prepend
the 4-byte LE u32 of the uncompressed length -> base64url (`+` -> `-`,
`/` -> `_`, strip `=`).
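The pipeline is easy to reverse, which is handy for checking a generated link before handing it over. The sketch below is a minimal Node.js round trip under a few assumptions: the helper names `encodeState` and `decodeState` are made up for illustration, Node's built-in `base64url` Buffer encoding (available in current Node releases) stands in for the manual character replacement, and rejecting a mismatched length prefix is this sketch's choice rather than documented tool behaviour.

```javascript
const zlib = require("node:zlib");

// Encode: JSON -> UTF-8 -> raw DEFLATE -> 4-byte LE length prefix -> base64url.
function encodeState(state) {
  const utf8 = Buffer.from(JSON.stringify(state), "utf8");
  const header = Buffer.alloc(4);
  header.writeUInt32LE(utf8.length, 0); // uncompressed length
  return Buffer.concat([header, zlib.deflateRawSync(utf8)]).toString("base64url");
}

// Decode: reverse each step and sanity-check the length prefix.
function decodeState(fragment) {
  const raw = Buffer.from(fragment.replace(/^#/, ""), "base64url");
  if (raw.length < 4) throw new Error("fragment too short");
  const expectedLength = raw.readUInt32LE(0);
  const utf8 = zlib.inflateRawSync(raw.subarray(4));
  if (utf8.length !== expectedLength) throw new Error("length prefix mismatch");
  return JSON.parse(utf8.toString("utf8"));
}

const fragment = encodeState({ v: 3, f: "text", s: "my-document", t: "hello" });
console.log(decodeState(fragment).t); // prints "hello"
```

Decoding the fragment locally confirms that nothing was truncated when a URL was copied between tools.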
Unlike md-to-html (which puts the raw Markdown directly in the URL), the
PDF Extractor wraps its state in a JSON envelope because the URL has to
carry both the result and a flag for which tab to open on the destination:
```ts
type PdfOutlineItem = {
  title: string;
  level: number;            // 0 = top-level
  page: number | null;      // 1-indexed; null when unresolvable
  children: PdfOutlineItem[];
};

type SharedState = {
  v: 3;                                   // schema version (v1 + v2 still accepted)
  f: 'text' | 'json' | 'md' | 'outline';  // which tab to open
  s: string;                              // source PDF filename (no .pdf)
  t?: string;                             // text body, present when f='text'
  j?: ParseResultJson;                    // structured json, present when f='json'
  m?: string;                             // rendered markdown, present when f='md'
  o?: PdfOutlineItem[];                   // outline tree; present when f='outline',
                                          // OR piggy-backed on any other tab when
                                          // the JSON serialization is <= 4 KB
};
```

Schema history:

- v1: `{ v, f: 'text'|'json', s, t?, j? }`
- v2: adds `'md'` to `f` and the `m` (rendered markdown) field
- v3: adds `'outline'` to `f` and the `o` (resolved outline tree) field
The `m` payload is the rendered Markdown text, not a recipe -- the
heuristic engine is free to evolve between releases, so the link carries
the finished string and the recipient sees exactly what the sender saw.
The `o` payload is the resolved outline (destinations already mapped to
1-indexed page numbers) so the recipient renders it without holding the
PDF bytes. v1 / v2 share links keep working.
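As a rough illustration of how the envelope fields fit together, the sketch below assembles a v3 state for a given tab and applies the outline piggy-back rule. The helper name `buildSharedState` is hypothetical, and measuring the 4 KB limit in UTF-8 bytes of the serialized outline is an assumption; the source only says "<= 4 KB JSON".

```javascript
// "<= 4 KB JSON" from the schema comment; counting UTF-8 bytes is an assumption.
const OUTLINE_PIGGYBACK_MAX_BYTES = 4 * 1024;

// Hypothetical helper: builds a v3 SharedState for the tab named in `tab`.
function buildSharedState(tab, source, payload, outline) {
  const bodyKey = { text: "t", json: "j", md: "m", outline: "o" }[tab];
  if (!bodyKey) throw new Error(`unknown tab: ${tab}`);
  const state = { v: 3, f: tab, s: source, [bodyKey]: payload };
  // On non-outline tabs, attach the outline only when it is small enough.
  if (tab !== "outline" && Array.isArray(outline) && outline.length > 0) {
    const bytes = Buffer.byteLength(JSON.stringify(outline), "utf8");
    if (bytes <= OUTLINE_PIGGYBACK_MAX_BYTES) state.o = outline;
  }
  return state;
}

const outline = [{ title: "Introduction", level: 0, page: 1, children: [] }];
console.log(buildSharedState("md", "my-document", "# Introduction\n", outline));
```

With a small outline attached, a recipient who opens an `md` link can still use the Outline tab even though they never had the PDF.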
Practical cap: ~64 KB of UTF-8 (matching the rest of the toolkit). Larger results stay in the user's tab but Share is disabled with a tooltip pointing them at Copy / Export instead.
Default: hand them the bare URL and let them drop the file themselves (this is the common case because PDFs are large and not transferable through chat):
https://tools.secutils.dev/pdf
If you already have an extracted text or JSON payload from a previous turn (e.g. you parsed the PDF yourself with another tool) and want to give the user a pre-filled, shareable view in the browser, encode it into the fragment using the same wire format as every other Secutils.dev tool.
```bash
# Pre-fill the Text tab with extracted plain text.
node -e '
  const zlib = require("node:zlib");
  const state = JSON.stringify({ v: 1, f: "text", s: "my-document", t: process.argv[1] });
  const utf8 = Buffer.from(state, "utf8");
  const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
  out.writeUInt32LE(utf8.length, 0);
  const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
  console.log("https://tools.secutils.dev/pdf#" + enc);
' "$(cat /tmp/extracted.txt)"
```

```bash
# Pre-fill the JSON tab with a structured liteparse-shaped object.
node -e '
  const zlib = require("node:zlib");
  const json = JSON.parse(require("node:fs").readFileSync(process.argv[1], "utf8"));
  const state = JSON.stringify({ v: 2, f: "json", s: "my-document", j: json });
  const utf8 = Buffer.from(state, "utf8");
  const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
  out.writeUInt32LE(utf8.length, 0);
  const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
  console.log("https://tools.secutils.dev/pdf#" + enc);
' /tmp/extracted.json
```

```bash
# Pre-fill the Markdown tab with a rendered Markdown document.
node -e '
  const zlib = require("node:zlib");
  const md = require("node:fs").readFileSync(process.argv[1], "utf8");
  const state = JSON.stringify({ v: 2, f: "md", s: "my-document", m: md });
  const utf8 = Buffer.from(state, "utf8");
  const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
  out.writeUInt32LE(utf8.length, 0);
  const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
  console.log("https://tools.secutils.dev/pdf#" + enc);
' /tmp/extracted.md
```

```bash
# Pre-fill the Outline tab with a resolved bookmark tree.
# `outline.json` is an array of PdfOutlineItem (`title`, `level`,
# `page`, `children`); destinations must already be resolved to
# 1-indexed page numbers because the recipient does not hold the PDF
# bytes. To get this from an existing PDF, run liteparse / PDF.js
# locally and call getPdfOutline(bytes).
node -e '
  const zlib = require("node:zlib");
  const outline = JSON.parse(require("node:fs").readFileSync(process.argv[1], "utf8"));
  const state = JSON.stringify({ v: 3, f: "outline", s: "my-document", o: outline });
  const utf8 = Buffer.from(state, "utf8");
  const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
  out.writeUInt32LE(utf8.length, 0);
  const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
  console.log("https://tools.secutils.dev/pdf#" + enc);
' /tmp/outline.json
```

Always print the full URL -- the fragment is opaque and dropping a single character breaks decoding.
If the JSON / text is bigger than ~64 KB, the destination page will refuse to re-Share it (because the fragment can't round-trip something larger than the source it came in on), but it will still load and the user can Copy / Download.
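Before generating a pre-filled link it can help to check whether the payload is likely to stay under the cap. The check below is only a sketch: the name `fitsShareCap` is hypothetical, and both the exact threshold (taken here as 64 * 1024 bytes) and the choice to measure the UTF-8 size of the serialized state are assumptions, since the source gives the cap only as "~64 KB of UTF-8".

```javascript
// "~64 KB" per the text; the exact value and what is measured are assumptions.
const SHARE_CAP_BYTES = 64 * 1024;

// Returns true when the serialized state is within the assumed cap.
function fitsShareCap(state) {
  return Buffer.byteLength(JSON.stringify(state), "utf8") <= SHARE_CAP_BYTES;
}

const state = { v: 3, f: "text", s: "my-document", t: "short extract" };
if (!fitsShareCap(state)) {
  console.warn("Payload is above ~64 KB: the link should load, but re-sharing will be disabled.");
}
console.log(fitsShareCap(state)); // prints true
```

When the check fails, handing over the bare tool URL and letting the user parse the PDF themselves is usually the simpler route.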
If you have direct access to the PDF bytes and need the text right now
(not as a polished, shareable artefact), parse it with any local PDF
library: pdfjs-dist, pdftotext, pdfplumber, pdf-parse, or even
liteparse itself on Node. Use this tool when the user wants to:
- Avoid uploading the PDF to anything.
- Get structured JSON with bounding boxes, not just text.
- OCR a scanned PDF without standing up Tesseract themselves.
- Hand the extracted output to a teammate via a single URL.
- Pipe the text into the Markdown to HTML converter for a polished export (a one-click action lives inside the Export dropdown on the result pane).
The result pane's Export dropdown contains an Open in Markdown to HTML action that hands the current Text or Markdown payload off to the Markdown to HTML tool in a new tab. Use it when the user asks "now convert this to a nice PDF / HTML / one-page doc" -- the two tools share the same URL-fragment wire format for their text payloads, so the handoff is a single click with no copy/paste.
If you've handed over the URL, that's the whole interaction -- the user takes it from there in the browser. No follow-up encoding required.
- The PDF bytes only ever exist client-side -- the URL fragment (everything after `#`) is never sent to the Secutils.dev server, and the dropzone reads the file via `File.arrayBuffer()` directly into a Web Worker. The share link is therefore safe for content the user wouldn't want logged, but anyone who receives the link can read the extracted output.
- OCR fetches from public CDNs. When OCR runs, tesseract.js downloads its Web Worker and the requested language data from the jsDelivr NPM mirror -- specifically `cdn.jsdelivr.net/npm/@tesseract.js-data/<lang>/4.0.0_best_int/<lang>.traineddata.gz` for each selected language (the LSTM-only "tessdata best integerized" corpus, ~1-15 MB per language). The PDF content itself is never sent to those hosts -- only the static asset URLs are requested. Set OCR mode to `Never` in Options to guarantee zero third-party contact. The previously-documented `tessdata.projectnaptha.com` GitHub-Pages origin is deprecated upstream and no longer touched by tesseract.js.
- First parse is slow. The bundled engine is ~3 MB inlined; the first call to Parse Blob-URLs it and `import()`s the module (one-time ~200 ms cost on a modern laptop, longer on mobile). After that it stays in memory until the tab is closed.
- No file conversions. DOCX / XLSX / HTML / images are rejected at the dropzone -- liteparse normally shells out to libreoffice for those and there's no browser equivalent.
- No cmaps shipped. Latin scripts render perfectly; CJK and some specialised PDFs may fall back to substitute glyphs. Bundling cmaps (~4 MB more) is a future enhancement once there's user demand.
- URL state cap is ~64 KB. Big documents fit easily as Text (a 100-page PDF is usually <100 KB of UTF-8) but the JSON variant exceeds the cap surprisingly quickly because of per-item bounding boxes. The page disables Share above the cap and points at Copy / Export instead.
- Screenshots require the original PDF. When the user lands via a shared URL (which only carries the extracted Text, JSON, Markdown, or Outline, never the PDF bytes), the Screenshots tab shows a "drop the PDF to enable" prompt instead of rendering anything. Rendering also only kicks off on the first click of the tab -- pages stream in one at a time so a 50-page PDF doesn't pin the main thread before the first page is visible.
- Screenshots search + box overlay are JSON-driven, not pixel-driven. Both the search highlight and the "Show boxes" overlay use `result.json.pages[*].textItems` for coordinates; they don't OCR the rendered PNG. This means search hits exactly what the spatial parser saw, which is faster and more accurate than image-level search on a text PDF -- but on a scanned PDF the matches depend on the OCR pass having found the text in the first place. Run Parse with OCR enabled (auto / always) before relying on the search overlay for scanned input.
- Outline overlap with bookmarks. A PDF outline / TOC is not the same thing as a "visible Table of Contents page" rendered into the PDF body. Many academic PDFs have the latter but not the former; the Outline tab shows the empty state for those even though the document visually contains a TOC. The fix is for the author to add bookmarks (LaTeX: `hyperref`'s `bookmark` package; Word/Docs: heading styles carry through to the export). When neither is present, the Markdown tab still has the font-size-based heading heuristic as a fallback.
- The Markdown tab is heuristic, not lossless. It works well for documents with clear text-based structure (headings, bullet lists, data tables with column-aligned text). It will miss:
  - Bordered tables whose cells are not also x-aligned (PDF border primitives are not in liteparse's JSON; only text geometry is).
  - Multi-column flow layouts (newspaper-style; columns get glued into a single paragraph because line clustering is single-axis).
  - Math, formulae, footnote markers (treated as inline text).
  - Raster images (the Screenshots tab is the right place for those).
- Links only appear when the PDF is parsed locally. The hyperlink pass is a separate PDF.js annotation extraction that runs against the in-memory `pdfFile.bytes`. Shared URLs carry only the rendered Markdown string (not the recipe), so link reconstruction happens on the sender's side at Markdown-render time, and the recipient just sees the already-`[text](url)`-wrapped output. If the same JSON is rendered on the recipient's side (e.g. they came in via an `f: 'json'` share link and then clicked the Markdown tab), the output will be link-free.