feat(engines): OCG hidden-layer removal (optionalContentGroups) — follow-up to #673 #675
Open
Phauks wants to merge 2 commits into
…attachments

Implements the engine/TS side of embedpdf#673 (depends on embedpdf/pdfium#27, which adds the EPDF_* exports; the pdfium-src submodule here points at that PR's commit and will re-point to the merged SHA).
- models: SanitizeOptions + sanitizeDocument on PdfEngine and IPdfiumExecutor.
- PdfiumNative.sanitizeDocument composes EPDF_RemoveXMPMetadata / RemoveEmbeddedThumbnails / RemoveAllJavaScript with the existing removeAttachment loop and a non-incremental saveAsCopy; wired through the orchestrator, RemoteExecutor, WebWorkerEngine, and the worker runner.
- tests (packages/engines/examples/node/sanitize): a crafted dirty fixture plus full-scrub and per-vector-isolation tests asserting each vector is removed and unrelated content preserved. pdf-lib added as a devDependency for fixtures.
- vendored pdfium.wasm + functions.ts rebuilt to include the new exports.
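As a rough illustration of the composition described in this commit, the sketch below runs each removal step behind a flag and finishes with a non-incremental save. It is not the PR's code: the EPDF_* names are taken from the commit message, but their signatures, the `attachmentCount` helper, the `removeAttachment` / `saveAsCopy` signatures, and the flag names are all assumptions made for the example.

```typescript
// Hypothetical stand-in for the wasm bindings. The EPDF_* names come from the
// commit message; their real signatures (and the helper methods) are assumptions.
interface NativeSketch {
  EPDF_RemoveXMPMetadata(doc: number): boolean;
  EPDF_RemoveEmbeddedThumbnails(doc: number): boolean;
  EPDF_RemoveAllJavaScript(doc: number): boolean;
  attachmentCount(doc: number): number; // hypothetical helper
  removeAttachment(doc: number, index: number): boolean; // hypothetical signature
  saveAsCopy(doc: number, incremental: boolean): Uint8Array; // hypothetical signature
}

// Illustrative flag names; the PR's SanitizeOptions fields may be named differently.
interface VectorFlags {
  xmpMetadata: boolean;
  thumbnails: boolean;
  javaScript: boolean;
  attachments: boolean;
}

function sanitizeSketch(native: NativeSketch, doc: number, flags: VectorFlags): Uint8Array {
  if (flags.xmpMetadata) native.EPDF_RemoveXMPMetadata(doc);
  if (flags.thumbnails) native.EPDF_RemoveEmbeddedThumbnails(doc);
  if (flags.javaScript) native.EPDF_RemoveAllJavaScript(doc);
  if (flags.attachments) {
    // Walk backwards so removing one attachment does not shift the indices still to visit.
    for (let i = native.attachmentCount(doc) - 1; i >= 0; i--) {
      native.removeAttachment(doc, i);
    }
  }
  // Non-incremental save: the file is rewritten rather than appended to, so the
  // removed objects are not carried along in an earlier revision.
  return native.saveAsCopy(doc, false);
}
```

Keeping each vector behind its own flag is what makes per-vector-isolation tests possible: a test can enable one step, disable the rest, and check that only that vector changed.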
Adds the optionalContentGroups flag to sanitizeDocument (default on), wiring EPDF_RemoveOptionalContentGroups; a test (test-remove-ocg.mjs) asserting hidden OCG-layer text is removed (via extractText) and /OCProperties dropped while visible content is preserved, with its fixture builder; and the vendored wasm rebuilt to include the 4th export. Stacks on the sanitization PR.
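A minimal sketch of what "default on" could mean for the new flag: the OCG step runs unless the caller passes an explicit `false`. Only the `optionalContentGroups` name comes from this commit; the interface name and the helper function are hypothetical.

```typescript
// Only optionalContentGroups is named in the PR; the interface name and helper are illustrative.
interface OcgSanitizeOptions {
  optionalContentGroups?: boolean;
}

// Default-on: the vector is skipped only when the caller passes an explicit false.
function shouldRemoveOcg(options: OcgSanitizeOptions = {}): boolean {
  return options.optionalContentGroups ?? true;
}
```

Under this reading, existing callers that pass no options would pick up hidden-layer removal automatically, and opting out takes `{ optionalContentGroups: false }`.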
@Phauks is attempting to deploy a commit to the OpenBook Team on Vercel. A member of the Team first needs to authorize it.
Follow-up to #673 — the engine/TS side of the fourth sanitization vector: removing content hidden behind OFF optional-content groups (layers).
Stacked on #674 (and depends on embedpdf/pdfium#28 for the C++). This branch builds on #674's commit; the OCG-specific commit is the last one here. Please land #674 / #28 first — I'll rebase so this shows only the OCG delta.
What this adds
- optionalContentGroups flag on SanitizeOptions / sanitizeDocument (default on), wiring EPDF_RemoveOptionalContentGroups.
- test-remove-ocg.mjs + its fixture builder: a page with a hidden OFF-layer; after the scrub, the hidden text is gone (verified via extractText), /OCProperties is removed, and the visible text remains.
- pdfium.wasm / functions.ts rebuilt to include the 4th export.

This is the deliberate separate follow-up promised in #673 (kept out of the main sanitization PR because hidden layers need content excision, not just dropping /OCProperties). Same open scope note as embedpdf/pdfium#28 re: form-XObject / annotation /OC edge cases.
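The three checks the test makes can be sketched as a small helper. This is an illustration only, not the contents of test-remove-ocg.mjs: the `ScrubResult` shape, the function names, and the raw byte search for `/OCProperties` are assumptions, and the byte search would miss a catalog stored inside a compressed object stream.

```typescript
// Hypothetical shape for what a test would collect after the scrub.
interface ScrubResult {
  text: string; // text extracted from the sanitized page
  bytes: Uint8Array; // the saved, sanitized PDF
}

// Plain byte-wise search for an ASCII token in binary data.
function containsAscii(bytes: Uint8Array, needle: string): boolean {
  outer: for (let i = 0; i + needle.length <= bytes.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (bytes[i + j] !== needle.charCodeAt(j)) continue outer;
    }
    return true;
  }
  return false;
}

// Returns a list of failed expectations; an empty list means the scrub looks correct.
function checkOcgScrub(result: ScrubResult, hiddenText: string, visibleText: string): string[] {
  const failures: string[] = [];
  if (result.text.includes(hiddenText)) failures.push("hidden layer text still extractable");
  if (!result.text.includes(visibleText)) failures.push("visible text was lost");
  // Only sees uncompressed objects; a real test might parse the catalog instead.
  if (containsAscii(result.bytes, "/OCProperties")) failures.push("/OCProperties still present");
  return failures;
}
```

Checking the extracted text as well as the dictionary entry matters here because the hidden content has to be excised from the page, not only dereferenced from the catalog.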