feat(engines): sanitizeDocument — strip XMP, JavaScript, thumbnails, attachments (#673) #674

Open
Phauks wants to merge 1 commit into embedpdf:main from Phauks:pr/document-sanitization-engine

Phauks commented Jun 13, 2026

Implements the engine/TypeScript side of #673 — a sanitizeDocument(doc, options) method that strips hidden vectors for defensible redaction.

Depends on embedpdf/pdfium#27, which adds the EPDF_* C++ exports. The pdfium-src submodule here points at that PR's commit; once it merges I'll re-point the submodule to the merged SHA. (The vendored pdfium.wasm / functions.ts are rebuilt from that C++ so the tests run as-is; regenerate from your own build if you prefer.)

What this adds

  • @embedpdf/models: SanitizeOptions + sanitizeDocument on the PdfEngine and IPdfiumExecutor interfaces.
  • @embedpdf/engines: PdfiumNative.sanitizeDocument composes the three new exports (EPDF_RemoveXMPMetadata / RemoveEmbeddedThumbnails / RemoveAllJavaScript) with the existing removeAttachment loop and a non-incremental saveAsCopy. Each vector is opt-out via options (all default on). Wired through the orchestrator, RemoteExecutor, WebWorkerEngine, and the worker runner.
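For reviewers who want the composition in one place, here is a minimal sketch of the opt-out flow described above. It is not the code in this PR: the option field names (`removeXmp`, `removeJavaScript`, `removeThumbnails`, `removeAttachments`), the `NativeDocSketch` interface, and the order of the steps are all assumptions made for illustration. Only the overall shape (three removals, an attachment loop, then a non-incremental save, with every vector on by default) comes from the description.

```typescript
// Hypothetical option names; the PR only states that each vector is
// opt-out and that all of them default to on.
interface SanitizeOptionsSketch {
  removeXmp?: boolean;
  removeJavaScript?: boolean;
  removeThumbnails?: boolean;
  removeAttachments?: boolean;
}

// Hypothetical stand-in for the native layer, so the sketch is self-contained.
interface NativeDocSketch {
  removeXmpMetadata(): void;
  removeEmbeddedThumbnails(): void;
  removeAllJavaScript(): void;
  attachmentCount(): number;
  removeAttachment(index: number): void;
  saveAsCopy(incremental: boolean): Uint8Array;
}

function sanitizeSketch(
  doc: NativeDocSketch,
  options: SanitizeOptionsSketch = {},
): Uint8Array {
  // `?? true` keeps a vector enabled when its flag is omitted or undefined.
  if (options.removeXmp ?? true) doc.removeXmpMetadata();
  if (options.removeThumbnails ?? true) doc.removeEmbeddedThumbnails();
  if (options.removeJavaScript ?? true) doc.removeAllJavaScript();

  if (options.removeAttachments ?? true) {
    // Walk from the last index down so earlier indices stay valid
    // while entries are being removed.
    for (let i = doc.attachmentCount() - 1; i >= 0; i--) {
      doc.removeAttachment(i);
    }
  }

  // Non-incremental save: the output is rewritten rather than appended to,
  // so removed objects do not linger in an earlier revision.
  return doc.saveAsCopy(false);
}
```

Under these assumptions, `sanitizeSketch(doc)` runs every step, while `sanitizeSketch(doc, { removeJavaScript: false })` skips only the JavaScript removal.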

Tests

packages/engines/examples/node/sanitize/ — a crafted "dirty" fixture (XMP, document JS, page /Thumb, an embedded file) plus:

  • test-sanitize-document.mjs — full scrub: every vector gone, single-revision output.
  • test-vector-isolation.mjs — each options flag removes only its vector, others preserved.
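One way to read the isolation test is as a small matrix: for each vector, enable only that vector and expect every other one to survive. The sketch below builds that matrix. The vector keys are hypothetical labels chosen for illustration, not the flag names used in the PR, and the real test may be organised differently.

```typescript
// Hypothetical vector labels; the actual option names in the PR may differ.
const VECTORS = ["xmp", "javaScript", "thumbnails", "attachments"] as const;
type Vector = (typeof VECTORS)[number];
type Flags = Record<Vector, boolean>;

interface IsolationCase {
  target: Vector;              // the only vector that should be removed
  options: Flags;              // target on, every other vector off
  expectedRemaining: Vector[]; // vectors that must still be present afterwards
}

function isolationCases(): IsolationCase[] {
  return VECTORS.map((target) => {
    const options: Flags = {
      xmp: false,
      javaScript: false,
      thumbnails: false,
      attachments: false,
    };
    options[target] = true;
    return {
      target,
      options,
      expectedRemaining: VECTORS.filter((v) => v !== target),
    };
  });
}
```

Each case would then be run against a fresh copy of the dirty fixture, asserting that `target` is gone and that everything in `expectedRemaining` is still there.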

Run: node test-sanitize-document.mjs && node test-vector-isolation.mjs (after pnpm --filter @embedpdf/engines... build). pdf-lib is added as a devDependency for fixture authorship + independent re-parse in assertions.
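The "single-revision output" assertion could be approximated without a PDF parser by counting `%%EOF` markers in the saved bytes, since each incremental update appends its own trailer. This is a heuristic sketch, not necessarily what the tests here do: the helper names are hypothetical, linearized files can legitimately contain two markers, and an uncompressed stream could contain the byte sequence by coincidence.

```typescript
// Count non-overlapping-agnostic occurrences of an ASCII marker in raw bytes.
function countOccurrences(bytes: Uint8Array, marker: string): number {
  const needle: number[] = [];
  for (let k = 0; k < marker.length; k++) needle.push(marker.charCodeAt(k));

  let count = 0;
  for (let i = 0; i + needle.length <= bytes.length; i++) {
    let match = true;
    for (let j = 0; j < needle.length; j++) {
      if (bytes[i + j] !== needle[j]) {
        match = false;
        break;
      }
    }
    if (match) count++;
  }
  return count;
}

// Heuristic: a file written in one pass is expected to end with a single
// %%EOF marker; each incremental update would add another one.
function looksSingleRevision(bytes: Uint8Array): boolean {
  return countOccurrences(bytes, "%%EOF") === 1;
}
```

A structural re-parse (for example with pdf-lib, as the PR does for its assertions) remains the stronger check; the byte count is only a quick sanity test.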

Notes / questions (also in #673)

  • API shape: granular exports as here, or a single EPDF_SanitizeDocument? Easy to fold.
  • Hidden OCG (optional-content) layers are intentionally not here — they need content excision, not just dropping /OCProperties — and will be a separate follow-up PR.

…attachments

Implements the engine/TS side of embedpdf#673 (depends on embedpdf/pdfium#27,
which adds the EPDF_* exports; the pdfium-src submodule here points at that PR's
commit and will re-point to the merged SHA).

- models: SanitizeOptions + sanitizeDocument on PdfEngine and IPdfiumExecutor.
- PdfiumNative.sanitizeDocument composes EPDF_RemoveXMPMetadata /
  RemoveEmbeddedThumbnails / RemoveAllJavaScript with the existing
  removeAttachment loop and a non-incremental saveAsCopy; wired through the
  orchestrator, RemoteExecutor, WebWorkerEngine, and the worker runner.
- tests (packages/engines/examples/node/sanitize): a crafted dirty fixture plus
  full-scrub and per-vector-isolation tests asserting each vector is removed and
  unrelated content preserved. pdf-lib added as a devDependency for fixtures.
- vendored pdfium.wasm + functions.ts rebuilt to include the new exports.
vercel Bot commented Jun 13, 2026

@Phauks is attempting to deploy a commit to the OpenBook Team on Vercel.

A member of the Team first needs to authorize it.
