feat(engines): OCG hidden-layer removal (optionalContentGroups) — follow-up to #673 #675

Open
Phauks wants to merge 2 commits into embedpdf:main from Phauks:pr/ocg-removal-engine

Phauks commented Jun 13, 2026

Follow-up to #673 — the engine/TS side of the fourth sanitization vector: removing content hidden behind OFF optional-content groups (layers).

Stacked on #674 (and depends on embedpdf/pdfium#28 for the C++). This branch builds on #674's commit; the OCG-specific commit is the last one here. Please land #674 / #28 first — I'll rebase so this shows only the OCG delta.

What this adds

  • optionalContentGroups flag on SanitizeOptions / sanitizeDocument (default on), wiring EPDF_RemoveOptionalContentGroups.
  • test-remove-ocg.mjs + its fixture builder: a page with a hidden OFF-layer; after the scrub, the hidden text is gone (verified via extractText), /OCProperties is removed, and the visible text remains.
  • Vendored pdfium.wasm / functions.ts rebuilt to include the 4th export.
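
As a rough illustration of the first bullet, the flag could resolve along these lines. This is a sketch under assumptions, not the code in this branch: only `optionalContentGroups` and the `EPDF_*` export names come from this PR and #673. The other option names, the all-on defaults for them, the step order, and the helpers `resolveSanitizeOptions` and `plannedRemovals` are hypothetical.

```typescript
// Sketch only. `optionalContentGroups` is the flag this PR adds; the other
// field names are hypothetical stand-ins for the vectors from #673.
interface SanitizeOptions {
  metadata?: boolean;
  thumbnails?: boolean;
  javascript?: boolean;
  attachments?: boolean;
  optionalContentGroups?: boolean;
}

type ResolvedSanitizeOptions = Required<SanitizeOptions>;

// Assumption: every vector defaults to on. The PR only states this for
// `optionalContentGroups`.
const DEFAULTS: ResolvedSanitizeOptions = {
  metadata: true,
  thumbnails: true,
  javascript: true,
  attachments: true,
  optionalContentGroups: true,
};

// Hypothetical helper: fill in defaults, ignoring keys left undefined.
function resolveSanitizeOptions(options: SanitizeOptions = {}): ResolvedSanitizeOptions {
  const resolved: ResolvedSanitizeOptions = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS) as (keyof SanitizeOptions)[]) {
    const value = options[key];
    if (value !== undefined) resolved[key] = value;
  }
  return resolved;
}

// Hypothetical helper: name the removal steps a given option set would run.
// The real exports operate on a document handle; their signatures are not
// reproduced here, and the order below is illustrative.
function plannedRemovals(options: SanitizeOptions = {}): string[] {
  const o = resolveSanitizeOptions(options);
  const steps: string[] = [];
  if (o.metadata) steps.push('EPDF_RemoveXMPMetadata');
  if (o.thumbnails) steps.push('EPDF_RemoveEmbeddedThumbnails');
  if (o.javascript) steps.push('EPDF_RemoveAllJavaScript');
  if (o.attachments) steps.push('removeAttachment loop');
  if (o.optionalContentGroups) steps.push('EPDF_RemoveOptionalContentGroups');
  return steps;
}
```

With these assumptions, a caller who wants the earlier vectors but not the layer excision would pass `{ optionalContentGroups: false }`, and omitting the flag keeps the OCG step in the plan.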

This is the deliberate separate follow-up promised in #673 (kept out of the main sanitization PR because hidden layers need content excision, not just dropping /OCProperties). Same open scope note as embedpdf/pdfium#28 re: form-XObject / annotation /OC edge cases.
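
The distinction above is what the test has to pin down: if only `/OCProperties` were dropped, the hidden text would still be in the content stream and still extractable. A minimal sketch of that three-part check follows, assuming a hypothetical `ScrubObservation` shape and `checkOcgScrub` helper. Neither exists in this branch; the real assertions live in `test-remove-ocg.mjs`.

```typescript
// Hypothetical: what a test would observe on the document after the scrub.
interface ScrubObservation {
  extractedText: string;    // text returned by extractText on the saved copy
  hasOCProperties: boolean; // whether the saved copy still carries /OCProperties
}

// Hypothetical: marker strings the fixture builder placed on the page.
interface ScrubExpectation {
  hiddenText: string;  // text drawn inside the OFF layer
  visibleText: string; // text drawn outside any hidden layer
}

// Returns a list of failure messages; an empty list means the scrub passed.
function checkOcgScrub(observed: ScrubObservation, expected: ScrubExpectation): string[] {
  const failures: string[] = [];
  if (observed.extractedText.includes(expected.hiddenText)) {
    failures.push('hidden OFF-layer text is still extractable');
  }
  if (observed.hasOCProperties) {
    failures.push('/OCProperties is still present');
  }
  if (!observed.extractedText.includes(expected.visibleText)) {
    failures.push('visible text was lost');
  }
  return failures;
}
```

A document where only the `/OCProperties` entry was removed would fail the first check, which is the case that motivated keeping this work out of #673.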

Phauks added 2 commits June 13, 2026 12:08
…attachments

Implements the engine/TS side of embedpdf#673 (depends on embedpdf/pdfium#27,
which adds the EPDF_* exports; the pdfium-src submodule here points at that PR's
commit and will re-point to the merged SHA).

- models: SanitizeOptions + sanitizeDocument on PdfEngine and IPdfiumExecutor.
- PdfiumNative.sanitizeDocument composes EPDF_RemoveXMPMetadata /
  RemoveEmbeddedThumbnails / RemoveAllJavaScript with the existing
  removeAttachment loop and a non-incremental saveAsCopy; wired through the
  orchestrator, RemoteExecutor, WebWorkerEngine, and the worker runner.
- tests (packages/engines/examples/node/sanitize): a crafted dirty fixture plus
  full-scrub and per-vector-isolation tests asserting each vector is removed and
  unrelated content preserved. pdf-lib added as a devDependency for fixtures.
- vendored pdfium.wasm + functions.ts rebuilt to include the new exports.

Adds the optionalContentGroups flag to sanitizeDocument (default on), wiring
EPDF_RemoveOptionalContentGroups; a test (test-remove-ocg.mjs) asserting hidden
OCG-layer text is removed (via extractText) and /OCProperties dropped while
visible content is preserved, with its fixture builder; and the vendored wasm
rebuilt to include the 4th export. Stacks on the sanitization PR.

vercel Bot commented Jun 13, 2026

@Phauks is attempting to deploy a commit to the OpenBook Team on Vercel.

A member of the Team first needs to authorize it.
