name	pdf-extractor
description	Extract spatial text (liteparse grid projection), structured JSON with per-page bounding boxes, a heuristic Markdown reconstruction (with headings, lists, tables, hyperlinks, and bookmark-driven heading hierarchy), the PDF outline / Table of Contents, or per-page PNG renders (with optional bbox overlay + full-text search highlight) from a PDF using the Secutils.dev PDF Extractor tool. Optionally runs in-browser OCR for scanned PDFs via Tesseract.js. Hand the user https://tools.secutils.dev/pdf so they can drop the PDF, click Parse, switch to the Text / JSON / Markdown / Outline / Screenshots tab they want, and then Share / Copy / Download the result. PDFs are NEVER uploaded -- parsing runs entirely in the user's browser, and the share URL carries only the extracted output. Trigger when the user asks to "extract text from a PDF", "convert PDF to JSON with bounding boxes", "convert PDF to Markdown", "extract tables from a PDF", "extract the table of contents from a PDF", "find text inside a PDF visually", "OCR a scanned PDF in the browser", "get structured PDF output without uploading", or anything that names secutils.dev/pdf or run-llama liteparse.

PDF Extractor (Secutils.dev)

In-browser PDF parser. Bundles the upstream liteparse engine, PDF.js renderer, and tesseract.js OCR into one HTML file (~3 MB inlined). No server-side parsing, no uploads of the PDF bytes.

Five result tabs:

- Text -- liteparse's grid-projected output. Plain UTF-8, preserves column / table layout via fixed-width whitespace better than naive pdf.js text extraction. Suitable as the input to a Markdown converter (the page has a one-click "open in Markdown to HTML" handoff).
- JSON -- structured output: { pages: [{ page, text, items, boundingBoxes }] }, with per-item rectangles, font sizes, page rotation, and (when OCR ran) confidence scores.
- Markdown -- heuristic reconstruction built from the JSON tab's spatial data. Lazy on first click and cached after. Detects:
  - Bookmark-driven headings (#...######) when the PDF carries an outline / Table of Contents -- a line whose normalized plain text matches a bookmark on its page is emitted at the bookmark's level (0->H1, 1->H2, ..., capped at H6). This is authoritative: it overrides the font-size heuristic and catches the common case where a section heading is typeset at the body font size.
  - Headings (#, ##, ###) from items whose fontSize exceeds the document-wide body median (1.45x / 1.20x / 1.08x cutoffs). Used as a fallback when the bookmark match fails (most PDFs have no outline -- bookmarks are an authoring-time feature most LaTeX / Office workflows skip).
  - Bullet lists (lines starting with •, ·, -, *, –, —) and numbered lists (lines starting with 1., a), iv., ...).
  - Tables -- runs of at least 3 paragraph-classified lines whose left-edge x-anchors line up within +/-5 PDF points across at least 2 columns become GitHub-flavored markdown tables; first row is the header.
  - Inline bold / italic from PDF font names (Bold / Black / Heavy, Italic / Oblique).
  - Hyperlinks via a separate PDF.js annotation pass over the original bytes -- only when the user parsed the PDF locally, NOT when hydrating from a shared URL (the bytes are out of scope).
  - Page breaks as --- horizontal rules between every page.

  The "open in Markdown to HTML" handoff button works from this tab too.
- Outline -- the PDF's bookmark tree / Table of Contents, with destinations resolved to 1-indexed page numbers. Lazy on first click, cached after. Clicking an entry jumps to that page in the Screenshots tab (which renders it on demand). Most PDFs don't have an outline -- bookmarks are a PDF-authoring feature most LaTeX / Office workflows skip. When present, the outline drives the Markdown tab's heading hierarchy (see above).
- Screenshots -- per-page PNG renders at 150 DPI, generated lazily the first time the user clicks the tab. Each page streams in as it finishes (PDF.js canvas renderer, no PDFium). Per-page download links sit in each figure caption. Two interactive overlays sit in a sticky toolbar above the page stack:
  - Search -- case-insensitive substring match across every JsonTextItem.text in result.json. Matched items get a translucent yellow highlight on the rendered PNGs (canvas overlay positioned over each <img>, scaled from PDF points to render pixels via SHOTS_DPI / 72). Match count shows next to the input ("23 matches" / "No matches"); Enter scrolls the first match into view.
  - Show boxes -- outline every text item with its bounding box. Useful for visualising the spatial nature of PDF text extraction and debugging why a particular run was (or wasn't) detected as a heading / list / table in the Markdown tab.

  Share / copy / download in the toolbar are disabled while this tab is active (no URL state for screenshots -- each page is a file).
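The two heading rules described for the Markdown tab can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the function names are hypothetical, and it assumes the three cutoffs map to H1 / H2 / H3 in that order and are inclusive, neither of which the description states.

```javascript
// Bookmark level -> Markdown heading level: 0 -> H1, 1 -> H2, ..., capped at H6.
function headingFromBookmark(level) {
  return Math.min(level + 1, 6);
}

// Median of a list of font sizes (used as the document-wide body size).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Font-size fallback. ASSUMPTION: 1.45x -> H1, 1.20x -> H2, 1.08x -> H3,
// with inclusive comparisons. Returns 0 for body text.
function headingFromFontSize(fontSize, bodyMedian) {
  const ratio = fontSize / bodyMedian;
  if (ratio >= 1.45) return 1;
  if (ratio >= 1.2) return 2;
  if (ratio >= 1.08) return 3;
  return 0;
}

// A matching bookmark is authoritative; otherwise fall back to font size.
// `bookmarkLevel` is null when no bookmark matched the line.
function headingLevel(line, bookmarkLevel, bodyMedian) {
  if (bookmarkLevel !== null && bookmarkLevel !== undefined) {
    return headingFromBookmark(bookmarkLevel);
  }
  return headingFromFontSize(line.fontSize, bodyMedian);
}
```

Keeping the bookmark check first mirrors the point made above: a section heading typeset at body size would score 0 on the font-size rule, yet still comes out as a heading when a bookmark names it.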

Three export paths from the result pane:

- Share -- copies a tools.secutils.dev/pdf#<encoded> URL with the result (Text, JSON, Markdown, or Outline, whichever tab is active) round-tripped through the URL fragment. The PDF outline tree, when small (<= 4 KB JSON), opportunistically piggy-backs on every other tab's share URL so the recipient's Outline + Markdown tabs work without re-deriving it from PDF bytes they don't have. Disabled when the payload is over ~64 KB and while the heuristic Markdown engine is still computing.
- Copy -- copies the active tab's payload to the clipboard.
- Export -- dropdown with two contextual actions:
  - Download saves <src>.txt, <src>.json, <src>.md, or <src>.json (for Outline) depending on the active tab.
  - Open in Markdown to HTML hands the active payload off to the Markdown to HTML tool in a new tab. Enabled only on the Text and Markdown tabs (md-to-html renders the latter natively; the spatial text travels verbatim).

The post-parse stats line (<N> pages · <M> bbox · <T>ms) appears inside the dropzone underneath the file pill rather than in the result toolbar, so the right-hand action buttons stay uncrowded. When the page is hydrated from a shared link (no actual PDF bytes), the dropzone shows a "Loaded from shared link" pseudo-file with the page count, and dropping a real PDF replaces it cleanly.

Inputs

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| PDF file | binary | required | Dropped on the dropzone or chosen via file picker. NOT uploaded. |
| OCR mode | `auto` \| `always` \| `never` | `auto` | Run OCR only on text-sparse pages / always / never. |
| OCR languages | chip multi-select | `eng` | One or more Tesseract.js codes picked from the searchable catalog (~120 entries: `eng`, `deu`, `fra`, `chi_sim`, ...). Multiple selections join with `+` (e.g. `eng+deu`) and are downloaded in parallel on first use. |

Options live in the gear popover next to Parse. Defaults are reasonable for any Latin-script PDF.
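As a small illustration of the OCR languages field, the sketch below joins the selected codes with `+` and lists the language-data files that would be fetched, using the URL pattern quoted under Caveats. The helper names are hypothetical, the `https://` scheme is assumed, and the de-duplication is an added convenience rather than documented behaviour.

```javascript
// Join selected Tesseract.js language codes, e.g. ["eng", "deu"] -> "eng+deu".
// Falls back to "eng", the documented default, when nothing is selected.
function ocrLanguageString(codes) {
  const unique = [...new Set(codes)];
  return unique.length ? unique.join("+") : "eng";
}

// One traineddata URL per selected language, following the jsDelivr path
// pattern given in the Caveats section (scheme assumed to be https).
function trainedDataUrls(codes) {
  return [...new Set(codes)].map(
    (lang) =>
      `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0_best_int/${lang}.traineddata.gz`
  );
}

console.log(ocrLanguageString(["eng", "deu"]));
console.log(trainedDataUrls(["eng", "deu"]));
```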

Wire format (URL state)

The shared canonical encoding every Secutils.dev tool uses:

| 4 bytes uncompressed-length (LE u32) | N bytes raw DEFLATE of UTF-8 string |

Pipeline: UTF-8 bytes of JSON.stringify(state) -> deflate-raw -> prepend the 4-byte LE u32 of the uncompressed length -> base64url (+ -> -, / -> _, strip =).
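To inspect an existing share link, the pipeline can be run in reverse. The following is a minimal Node sketch under the assumption that the fragment follows exactly the layout above; `decodeFragment` and `encodeState` are hypothetical helper names, not part of the tool.

```javascript
const zlib = require("node:zlib");

// Forward direction: JSON -> UTF-8 -> raw DEFLATE, prefixed with the
// uncompressed length as a 4-byte little-endian u32, then base64url.
function encodeState(state) {
  const utf8 = Buffer.from(JSON.stringify(state), "utf8");
  const header = Buffer.alloc(4);
  header.writeUInt32LE(utf8.length, 0);
  // Node's "base64url" encoding already uses - and _ and omits padding.
  return Buffer.concat([header, zlib.deflateRawSync(utf8)]).toString("base64url");
}

// Reverse direction: base64url -> bytes -> inflate -> JSON.
function decodeFragment(fragment) {
  const raw = Buffer.from(fragment.replace(/^#/, ""), "base64url");
  if (raw.length < 4) throw new Error("fragment too short");
  const expectedLength = raw.readUInt32LE(0);
  const utf8 = zlib.inflateRawSync(raw.subarray(4));
  if (utf8.length !== expectedLength) {
    throw new Error("uncompressed length does not match the header");
  }
  return JSON.parse(utf8.toString("utf8"));
}

const fragment = encodeState({ v: 1, f: "text", s: "my-document", t: "hello" });
console.log(decodeFragment(fragment));
```

Checking the length prefix after inflating is a cheap way to catch a truncated or hand-edited fragment before trying to parse it as JSON.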

Unlike md-to-html (which puts the raw Markdown directly in the URL), the PDF Extractor wraps its state in a JSON envelope because the URL has to carry both the result and a flag for which tab to open on the destination:

type PdfOutlineItem = {
  title: string;
  level: number;                  // 0 = top-level
  page: number | null;            // 1-indexed; null when unresolvable
  children: PdfOutlineItem[];
};
type SharedState = {
  v: 3;                                    // schema version (v1 + v2 still accepted)
  f: 'text' | 'json' | 'md' | 'outline';   // which tab to open
  s: string;                               // source PDF filename (no .pdf)
  t?: string;                              // text body, present when f='text'
  j?: ParseResultJson;                     // structured json, present when f='json'
  m?: string;                              // rendered markdown, present when f='md'
  o?: PdfOutlineItem[];                    // outline tree; present when f='outline',
                                           // OR piggy-backed on any other tab when
                                           // the JSON serialization is <= 4 KB
};

Schema history:

v1: { v, f: 'text'|'json', s, t?, j? }
v2: adds 'md' to f and the m (rendered markdown) field
v3: adds 'outline' to f and the o (resolved outline tree) field

The m payload is the rendered Markdown text, not a recipe -- the heuristic engine is free to evolve between releases, so we share the finished string so the recipient sees what the sender saw. The o payload is the resolved outline (destinations already mapped to 1-indexed page numbers) so the recipient renders it without holding the PDF bytes. v1 / v2 share links keep working.
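The schema history suggests a simple consistency check before handing a link to a user. The sketch below is one way to express it; `validateSharedState` is a hypothetical helper, and whether the page itself validates this strictly is an assumption.

```javascript
// Tabs each schema version may name in `f`, per the schema history above.
const TABS_BY_VERSION = new Map([
  [1, ["text", "json"]],
  [2, ["text", "json", "md"]],
  [3, ["text", "json", "md", "outline"]],
]);

// Field expected to carry the payload for each tab.
const PAYLOAD_FIELD = { text: "t", json: "j", md: "m", outline: "o" };

// Returns null when the state looks consistent, otherwise a short reason.
function validateSharedState(state) {
  if (state === null || typeof state !== "object") return "state is not an object";
  const tabs = TABS_BY_VERSION.get(state.v);
  if (!tabs) return `unsupported schema version: ${state.v}`;
  if (!tabs.includes(state.f)) return `tab "${state.f}" is not valid in v${state.v}`;
  if (typeof state.s !== "string") return "missing source filename (s)";
  const field = PAYLOAD_FIELD[state.f];
  if (state[field] === undefined) return `missing payload field "${field}"`;
  return null;
}

console.log(validateSharedState({ v: 1, f: "md", s: "my-document", m: "# Title" }));
```

The example call reports a problem because `md` only became a valid tab in v2.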

Practical cap: ~64 KB of UTF-8 (matching the rest of the toolkit). Larger results stay in the user's tab but Share is disabled with a tooltip pointing them at Copy / Export instead.
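Before generating a link it can help to estimate whether the result will be shareable at all. The sketch below applies the two size rules mentioned in this document (the ~64 KB cap and the <= 4 KB outline piggy-back). The exact byte thresholds, the choice to measure the UTF-8 size of `JSON.stringify(state)`, and both function names are assumptions made for illustration.

```javascript
// ASSUMPTION: "~64 KB" and "<= 4 KB" are treated as exact binary sizes here.
const SHARE_CAP_BYTES = 64 * 1024;
const OUTLINE_PIGGYBACK_BYTES = 4 * 1024;

// UTF-8 size of a value once serialized as JSON.
function jsonBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Attach a small outline to a non-outline state. The `o` field belongs to
// schema v3, so the version is bumped when the outline is added.
function withOutline(state, outline) {
  if (!outline || state.f === "outline") return state;
  if (jsonBytes(outline) > OUTLINE_PIGGYBACK_BYTES) return state;
  return { ...state, v: 3, o: outline };
}

// ASSUMPTION: the cap is checked against the serialized state before compression.
function canShare(state) {
  return jsonBytes(state) <= SHARE_CAP_BYTES;
}

const state = withOutline(
  { v: 1, f: "text", s: "my-document", t: "extracted text" },
  [{ title: "Introduction", level: 0, page: 1, children: [] }]
);
console.log(canShare(state), state.v);
```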

How to direct the user

Default: hand them the bare URL and let them drop the file themselves (this is the common case because PDFs are large and not transferable through chat):

https://tools.secutils.dev/pdf

If you already have an extracted text, JSON, Markdown, or outline payload from a previous turn (e.g. you parsed the PDF yourself with another tool) and want to give the user a pre-filled, shareable view in the browser, encode it into the fragment using the same wire format as every other Secutils.dev tool:

# Pre-fill the Text tab with extracted plain text.
node -e '
const zlib = require("node:zlib");
const state = JSON.stringify({ v: 1, f: "text", s: "my-document", t: process.argv[1] });
const utf8 = Buffer.from(state, "utf8");
const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
out.writeUInt32LE(utf8.length, 0);
const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
console.log("https://tools.secutils.dev/pdf#" + enc);
' "$(cat /tmp/extracted.txt)"

# Pre-fill the JSON tab with a structured liteparse-shaped object.
node -e '
const zlib = require("node:zlib");
const json = JSON.parse(require("node:fs").readFileSync(process.argv[1], "utf8"));
const state = JSON.stringify({ v: 2, f: "json", s: "my-document", j: json });
const utf8 = Buffer.from(state, "utf8");
const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
out.writeUInt32LE(utf8.length, 0);
const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
console.log("https://tools.secutils.dev/pdf#" + enc);
' /tmp/extracted.json

# Pre-fill the Markdown tab with a rendered Markdown document.
node -e '
const zlib = require("node:zlib");
const md = require("node:fs").readFileSync(process.argv[1], "utf8");
const state = JSON.stringify({ v: 2, f: "md", s: "my-document", m: md });
const utf8 = Buffer.from(state, "utf8");
const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
out.writeUInt32LE(utf8.length, 0);
const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
console.log("https://tools.secutils.dev/pdf#" + enc);
' /tmp/extracted.md

# Pre-fill the Outline tab with a resolved bookmark tree.
# `outline.json` is an array of PdfOutlineItem (`title`, `level`,
# `page`, `children`); destinations must already be resolved to
# 1-indexed page numbers because the recipient does not hold the PDF
# bytes. To get this from an existing PDF, run liteparse / PDF.js
# locally and call getPdfOutline(bytes).
node -e '
const zlib = require("node:zlib");
const outline = JSON.parse(require("node:fs").readFileSync(process.argv[1], "utf8"));
const state = JSON.stringify({ v: 3, f: "outline", s: "my-document", o: outline });
const utf8 = Buffer.from(state, "utf8");
const out = Buffer.concat([Buffer.alloc(4), zlib.deflateRawSync(utf8)]);
out.writeUInt32LE(utf8.length, 0);
const enc = out.toString("base64").replace(/\+/g,"-").replace(/\//g,"_").replace(/=+$/,"");
console.log("https://tools.secutils.dev/pdf#" + enc);
' /tmp/outline.json

Always print the full URL -- the fragment is opaque and dropping a single character breaks decoding.

If the payload is bigger than ~64 KB, the destination page will still load it and the user can Copy / Download, but Share stays disabled there because the result exceeds the URL state cap.

Inline alternative (no tool needed)

If you have direct access to the PDF bytes and need the text right now (not as a polished, shareable artefact), parse it with any local PDF library: pdfjs-dist, pdftotext, pdfplumber, pdf-parse, or even liteparse itself on Node. Use this tool when the user wants to:

- Avoid uploading the PDF to anything.
- Get structured JSON with bounding boxes, not just text.
- OCR a scanned PDF without standing up Tesseract themselves.
- Hand the extracted output to a teammate via a single URL.
- Pipe the text into the Markdown to HTML converter for a polished export (a one-click action lives inside the Export dropdown on the result pane).

Companion: Markdown to HTML

The result pane's Export dropdown contains an Open in Markdown to HTML action that hands the current Text or Markdown payload off to the Markdown to HTML tool in a new tab. Use it when the user asks "now convert this to a nice PDF / HTML / one-page doc" -- the two tools share the same URL-fragment wire format for their text payloads, so the handoff is a single click with no copy/paste.

After producing

If you've handed over the URL, that's the whole interaction -- the user takes it from there in the browser. No follow-up encoding required.

Caveats

The PDF bytes only ever exist client-side -- the URL fragment (everything after #) is never sent to the Secutils.dev server, and the dropzone reads the file via File.arrayBuffer() directly into a Web Worker. The share link is therefore safe for content the user wouldn't want logged, but anyone who receives the link can read the extracted output.
OCR fetches from public CDNs. When OCR runs, tesseract.js downloads its Web Worker and the requested language data from the jsDelivr NPM mirror -- specifically cdn.jsdelivr.net/npm/@tesseract.js-data/<lang>/4.0.0_best_int/<lang>.traineddata.gz for each selected language (the LSTM-only "tessdata best integerized" corpus, ~1-15 MB per language). The PDF content itself is never sent to those hosts -- only the static asset URLs are requested. Set OCR mode to Never in Options to guarantee zero third-party contact. The previously-documented tessdata.projectnaptha.com GitHub-Pages origin is deprecated upstream and no longer touched by tesseract.js.
First parse is slow. The bundled engine is ~3 MB inlined; the first call to Parse Blob-URLs it and import()s the module (one-time ~200 ms cost on a modern laptop, longer on mobile). After that it stays in memory until the tab is closed.
No file conversions. DOCX / XLSX / HTML / images are rejected at the dropzone -- liteparse normally shells out to libreoffice for those and there's no browser equivalent.
No cmaps shipped. Latin scripts render perfectly; CJK and some specialised PDFs may fall back to substitute glyphs. Bundling cmaps (~4 MB more) is a future enhancement once there's user demand.
URL state cap is ~64 KB. Big documents fit easily as Text (a 100-page PDF is usually <100 KB of UTF-8) but the JSON variant exceeds the cap surprisingly quickly because of per-item bounding boxes. The page disables Share above the cap and points at Copy / Export instead.
Screenshots require the original PDF. When the user lands via a shared URL (which carries only the extracted Text, JSON, Markdown, or Outline output, never the PDF bytes), the Screenshots tab shows a "drop the PDF to enable" prompt instead of rendering anything. Rendering also only kicks off on the first click of the tab -- pages stream in one at a time so a 50-page PDF doesn't pin the main thread before the first page is visible.
Screenshots search + box overlay are JSON-driven, not pixel-driven. Both the search highlight and the "Show boxes" overlay use result.json.pages[*].textItems for coordinates; they don't OCR the rendered PNG. This means search hits exactly what the spatial parser saw, which is faster and more accurate than image-level search on a text PDF -- but on a scanned PDF the matches depend on the OCR pass having found the text in the first place. Run Parse with OCR enabled (auto / always) before relying on the search overlay for scanned input.
Outline vs. visible Table of Contents. A PDF outline (the bookmark tree) is not the same thing as a visible Table of Contents page rendered into the PDF body. Many academic PDFs have the latter but not the former; the Outline tab shows the empty state for those even though the document visually contains a TOC. The fix is for the author to add bookmarks (LaTeX: the hyperref or bookmark package; Word/Docs: heading styles carry through to the export). When the PDF has no outline, the Markdown tab still has the font-size-based heading heuristic as a fallback.
The Markdown tab is heuristic, not lossless. It works well for documents with clear text-based structure (headings, bullet lists, data tables with column-aligned text). It will miss:
- Bordered tables whose cells are not also x-aligned (PDF border primitives are not in liteparse's JSON; only text geometry is).
- Multi-column flow layouts (newspaper-style; columns get glued into a single paragraph because line clustering is single-axis).
- Math, formulae, footnote markers (treated as inline text).
- Raster images (the Screenshots tab is the right place for those).
Links only appear when the PDF is parsed locally. The hyperlink pass is a separate PDF.js annotation extraction that runs against the in-memory pdfFile.bytes. Shared URLs carry only the rendered Markdown string (not the recipe), so link reconstruction happens on the sender's side at Markdown-render time, and the recipient just sees the already-[text](url)-wrapped output. If the Markdown is instead rendered on the recipient's side (e.g. they came in via an f: 'json' share link and then clicked the Markdown tab), the output will be link-free.
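The JSON-driven search overlay described in the caveats above comes down to a substring filter plus a unit conversion. The sketch below is illustrative only: the item field names (`text`, `x`, `y`, `width`, `height`) are assumptions, it accepts either `textItems` or `items` because this document uses both names, and it assumes item coordinates already share the image's top-left origin.

```javascript
const SHOTS_DPI = 150; // render resolution named in the Screenshots description

// PDF points are 1/72 inch, so pixels = points * DPI / 72.
function toPixels(points) {
  return (points * SHOTS_DPI) / 72;
}

// Case-insensitive substring match over every text item; returns highlight
// rectangles in render pixels, tagged with the page they belong to.
function findMatches(pages, query) {
  const needle = query.toLowerCase();
  if (!needle) return [];
  const hits = [];
  for (const page of pages) {
    for (const item of page.textItems ?? page.items ?? []) {
      if (!item.text.toLowerCase().includes(needle)) continue;
      hits.push({
        page: page.page,
        x: toPixels(item.x),
        y: toPixels(item.y),
        width: toPixels(item.width),
        height: toPixels(item.height),
      });
    }
  }
  return hits;
}
```

Because the rectangles come from the parsed items rather than the pixels, a scanned page with no OCR pass simply yields an empty item list and therefore no matches.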
