fix(tagged-pdf): refuse encrypted documents with friendly message #514
Objective: Running --format tagged-pdf on an encrypted PDF (any AES
variant, with or without owner password / signatures / CJK fonts) fails
deep in the save step with a Java stack trace ("AES initializing vector
is not fully read" or "Error writing document"). For users, the failure
mode is opaque: the tool produces a stub file, exits non-zero, and dumps
internal frames. Affects every encrypted document — signed or not, owner-
password-restricted or not — regardless of the actual PDF content.
Approach: Refuse tagged-pdf at the entry of AutoTaggingProcessor when the
trailer has an Encrypt entry, throwing a typed exception that the CLI
surfaces as a one-line "Error: ..." message — the same pattern PR #504
used for the password case. Per maintainer policy ("ODL should not
generate Annotated PDF or perform auto-tagging of encrypted documents"),
this is a blanket refusal that sidesteps the AES re-serialization bug
without touching veraPDF or the saveAs path. The bit-4-set carve-out and
owner-password authentication are intentionally out of scope.
The guard is placed in createTaggedPDF (the disk-writing entry point) so
in-memory tagging via tagDocument is unaffected; the default JSON path
on encrypted PDFs also continues to work since it never re-serializes.
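
A minimal sketch of the guard placement described above, under stated assumptions. Only the names `createTaggedPDF`, `tagDocument`, and the trailer's `Encrypt` entry come from this PR; the `PdfModel` class, the method signatures, the `EncryptedDocumentException` name, and the placeholder return value are hypothetical stand-ins, not the project's actual code.

```java
import java.util.HashMap;
import java.util.Map;

class TaggingGuardSketch {

    // Hypothetical typed exception; the PR's real class name is not shown here.
    static class EncryptedDocumentException extends RuntimeException {
        EncryptedDocumentException(String path) {
            super("'" + path + "' is encrypted; tagged-pdf conversion is not "
                    + "supported for encrypted documents.");
        }
    }

    // Toy document model (assumption): a trailer dictionary plus a "tagged" flag.
    static class PdfModel {
        final String path;
        final Map<String, Object> trailer = new HashMap<>();
        boolean tagged;

        PdfModel(String path) {
            this.path = path;
        }
    }

    // In-memory tagging: deliberately carries no encryption guard.
    static void tagDocument(PdfModel doc) {
        doc.tagged = true;
    }

    // Disk-writing entry point: refuses before any tagging or saving happens.
    static String createTaggedPDF(PdfModel doc) {
        if (doc.trailer.containsKey("Encrypt")) {
            throw new EncryptedDocumentException(doc.path);
        }
        tagDocument(doc);
        return doc.path + ".tagged"; // placeholder for the real save step
    }

    public static void main(String[] args) {
        PdfModel encrypted = new PdfModel("secret.pdf");
        encrypted.trailer.put("Encrypt", "<encryption dictionary>");
        try {
            createTaggedPDF(encrypted);
        } catch (EncryptedDocumentException e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}
```

Because the check sits at the top of the disk-writing method, the refusal happens before any re-serialization is attempted, while callers of the in-memory method see no change.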
Evidence: Built the CLI and ran 17 scenarios covering every combination
of encryption type (none / AES-128 / AES-256 / RC4) × user-password state
× --password input × the original reporter PDF, plus regression checks
for non-tagged-pdf paths.
| Scenario | Before | After |
|---|---|---|
| Plain PDF + tagged-pdf | OK | OK |
| AES-256 + various pw inputs (10 cases) | saveAs stack trace OR rejected | Friendly "Error: ... is encrypted; tagged-pdf conversion is not supported" |
| RC4 + no pw | OK (silently ignored permissions) | Friendly refusal (aligns with policy) |
| Original PDFDLOSP-12 reporter PDF | Stack trace at AES IV | Friendly refusal |
| AES-256 PDF + default JSON (regression) | OK | OK |
| Plain PDF + default JSON (regression) | OK | OK |
No stack trace surfaces in any encrypted-PDF scenario after the patch.
Non-tagged-pdf paths and unencrypted PDFs are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Objective
Running `--format tagged-pdf` on an encrypted PDF fails deep in the save step with a Java stack trace ("AES initializing vector is not fully read" or "Error writing document"). The failure mode is opaque: the tool produces a stub file, exits non-zero, and dumps internal frames. This affects every encrypted document, signed or not and owner-password-restricted or not, regardless of the actual content.
Fixes opendataloader-project/opendataloader-pdf-tasks#461
Approach
Refuse tagged-pdf at the entry of `AutoTaggingProcessor.createTaggedPDF` when the trailer has an `Encrypt` entry, throwing a typed exception that the CLI surfaces as a one-line `Error: ...` message, the same pattern PR #504 used for the password case.
Per maintainer policy (opendataloader-project/opendataloader-pdf-tasks#462: "ODL should not generate Annotated PDF or perform auto-tagging of encrypted documents"), this is a blanket refusal that sidesteps the AES re-serialization bug without touching veraPDF or the saveAs path. The bit-4-set carve-out and owner-password authentication are intentionally out of scope; those can be revisited if real demand surfaces.
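
The one-line error surfacing described above can be sketched as follows. This is an illustration under assumptions, not the CLI's actual code: the `CliErrorSketch` class, the `run` helper, the `EncryptedDocumentException` name, and the exit code `1` are all hypothetical. The only behavior taken from this PR is that a typed exception becomes a single `Error: ...` line with no stack trace.

```java
import java.io.PrintStream;

class CliErrorSketch {

    // Hypothetical typed exception standing in for the one this PR adds.
    static class EncryptedDocumentException extends RuntimeException {
        EncryptedDocumentException(String message) {
            super(message);
        }
    }

    // Runs a conversion and maps the typed exception to one line on `err`.
    // Returns the process exit code (the value 1 is an assumption).
    static int run(Runnable conversion, PrintStream err) {
        try {
            conversion.run();
            return 0;
        } catch (EncryptedDocumentException e) {
            // Print only the message: no stack trace, no internal frames.
            err.println("Error: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        int code = run(() -> {
            throw new EncryptedDocumentException(
                    "'secret.pdf' is encrypted; tagged-pdf conversion is not "
                            + "supported for encrypted documents.");
        }, System.err);
        System.out.println("exit code: " + code);
    }
}
```

Catching only the typed exception keeps unexpected failures visible as before, so the friendly path cannot mask unrelated bugs.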
The guard is placed in `createTaggedPDF` (the disk-writing entry point) so in-memory tagging via `tagDocument` is unaffected. The default JSON path on encrypted PDFs continues to work since it never re-serializes.
Evidence
Built the CLI and ran 17 scenarios covering every combination of encryption type × user-password state × `--password` input, plus the original reporter PDF and regression checks for non-tagged-pdf paths. Full matrix (with PDF synthesis details) is in the #461 comment thread.
Refusal message after the patch: `Error: '...' is encrypted; tagged-pdf conversion is not supported for encrypted documents.`
No stack trace surfaces in any encrypted-PDF tagged-pdf scenario after the patch. Non-tagged-pdf paths and unencrypted PDFs are unchanged.
Test plan
`mvn clean install -DskipTests`