
fix(tagged-pdf): refuse encrypted documents with friendly message #514

Merged: bundolee merged 1 commit into main from fix/tagged-pdf-refuse-encrypted on May 15, 2026
@bundolee (Contributor) commented May 15, 2026

Objective

Running --format tagged-pdf on an encrypted PDF fails deep in the save step with a Java stack trace ("AES initializing vector is not fully read" or "Error writing document"). The failure mode is opaque: the tool produces a stub file, exits non-zero, and dumps internal frames. This affects every AES-encrypted document, whether signed or not and whether owner-password-restricted or not, regardless of the actual content. RC4-encrypted documents did not crash, but they were converted while silently bypassing their permissions.

Fixes opendataloader-project/opendataloader-pdf-tasks#461

Approach

Refuse tagged-pdf at the entry of AutoTaggingProcessor.createTaggedPDF when the trailer has an Encrypt entry, throwing a typed exception that the CLI surfaces as a one-line Error: ... message — the same pattern PR #504 used for the password case.

Per maintainer policy (opendataloader-project/opendataloader-pdf-tasks#462: "ODL should not generate Annotated PDF or perform auto-tagging of encrypted documents"), this is a blanket refusal that sidesteps the AES re-serialization bug without touching veraPDF or the saveAs path. The bit-4-set carve-out and owner-password authentication are intentionally out of scope — those can be revisited if real demand surfaces.

The guard is placed in createTaggedPDF (the disk-writing entry point) so in-memory tagging via tagDocument is unaffected. The default JSON path on encrypted PDFs continues to work since it never re-serializes.
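
For readers who want a concrete picture of the guard, here is a minimal sketch. It is not the code from this PR: the trailer is modelled as a plain Map rather than a veraPDF object, refuseIfEncrypted is a hypothetical helper name, and the exception is a local stand-in for the real EncryptedTaggedPdfNotSupportedException. The file names are made up for illustration.

```java
import java.io.IOException;
import java.util.Map;

public class EncryptGuardSketch {

    /** Local stand-in for the typed exception added by the PR (it extends IOException there too). */
    static class EncryptedTaggedPdfNotSupportedException extends IOException {
        private static final long serialVersionUID = 1L;

        EncryptedTaggedPdfNotSupportedException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical guard. ASSUMPTION: the trailer is a plain Map keyed by entry name;
     * the real processor inspects the parsed PDF trailer instead.
     * Throws when an Encrypt entry is present and non-empty, otherwise returns normally.
     */
    static void refuseIfEncrypted(String fileName, Map<String, Object> trailer)
            throws EncryptedTaggedPdfNotSupportedException {
        Object encrypt = trailer.get("Encrypt");
        boolean emptyDictionary = encrypt instanceof Map && ((Map<?, ?>) encrypt).isEmpty();
        if (encrypt != null && !emptyDictionary) {
            throw new EncryptedTaggedPdfNotSupportedException(
                    "'" + fileName + "' is encrypted; tagged-pdf conversion is not supported"
                            + " for encrypted documents.");
        }
    }

    public static void main(String[] args) {
        try {
            // Unencrypted trailer: the guard returns and tagging would proceed.
            refuseIfEncrypted("plain.pdf", Map.of());
            System.out.println("plain.pdf: guard passed");

            // Trailer with an Encrypt dictionary: the guard refuses before any save step.
            refuseIfEncrypted("secret.pdf", Map.of("Encrypt", Map.of("Filter", "Standard")));
        } catch (EncryptedTaggedPdfNotSupportedException e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}
```

Because the check runs before any output is written, a refusal of this shape should leave no stub file behind, which is the point of guarding at the disk-writing entry point.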

Evidence

Built the CLI and ran 17 scenarios covering every combination of encryption type × user-password state × --password input, plus the original reporter PDF and regression checks for non-tagged-pdf paths. The table below shows representative rows; the full matrix (with PDF synthesis details) is in the #461 comment thread.

| Scenario | Expected | Before | After |
|---|---|---|---|
| Plain PDF + tagged-pdf | OK | OK | OK |
| AES-256 + no pw | Friendly refusal | saveAs fails (stack trace) | Error: '...' is encrypted; tagged-pdf conversion is not supported for encrypted documents. |
| AES-256 + owner pw | Friendly refusal | saveAs fails (stack trace) | Friendly refusal |
| AES-128 + user pw | Friendly refusal | saveAs fails | Friendly refusal |
| RC4 + no pw | Friendly refusal | OK (silently bypassed permissions) | Friendly refusal |
| Original PDFDLOSP-12 PDF | Friendly refusal | Stack trace at AES IV read | Friendly refusal |
| Encrypted PDF + default JSON (regression) | OK | OK | OK |
| Plain PDF + default JSON (regression) | OK | OK | OK |

No stack trace surfaces in any encrypted-PDF tagged-pdf scenario after the patch. Non-tagged-pdf paths and unencrypted PDFs are unchanged.

Test plan

  • Build passes (mvn clean install -DskipTests)
  • Plain PDF tagged-pdf still works
  • AES-encrypted PDF (all variants) produces friendly refusal, no stack trace
  • RC4-encrypted PDF produces friendly refusal (aligns with maintainer policy)
  • Default JSON conversion on encrypted PDFs unchanged
  • Original reporter PDF no longer produces stack trace

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Added validation to reject encrypted PDF files when tagged PDF output is requested, with improved error messaging and handling during processing.


Objective: Running --format tagged-pdf on an encrypted PDF (any AES
variant, with or without owner password / signatures / CJK fonts) fails
deep in the save step with a Java stack trace ("AES initializing vector
is not fully read" or "Error writing document"). For users, the failure
mode is opaque: the tool produces a stub file, exits non-zero, and dumps
internal frames. Affects every encrypted document — signed or not, owner-
password-restricted or not — regardless of the actual PDF content.

Approach: Refuse tagged-pdf at the entry of AutoTaggingProcessor when the
trailer has an Encrypt entry, throwing a typed exception that the CLI
surfaces as a one-line "Error: ..." message — the same pattern PR #504
used for the password case. Per maintainer policy ("ODL should not
generate Annotated PDF or perform auto-tagging of encrypted documents"),
this is a blanket refusal that sidesteps the AES re-serialization bug
without touching veraPDF or the saveAs path. The bit-4-set carve-out and
owner-password authentication are intentionally out of scope.

The guard is placed in createTaggedPDF (the disk-writing entry point) so
in-memory tagging via tagDocument is unaffected; the default JSON path
on encrypted PDFs also continues to work since it never re-serializes.

Evidence: Built the CLI and ran 17 scenarios covering every combination
of encryption type (none / AES-128 / AES-256 / RC4) × user-password state
× --password input × the original reporter PDF, plus regression checks
for non-tagged-pdf paths.

| Scenario | Before | After |
|---|---|---|
| Plain PDF + tagged-pdf | OK | OK |
| AES-256 + various pw inputs (10 cases) | saveAs stack trace OR rejected | Friendly "Error: ... is encrypted; tagged-pdf conversion is not supported" |
| RC4 + no pw | OK (silently ignored permissions) | Friendly refusal (aligns with policy) |
| Original PDFDLOSP-12 reporter PDF | Stack trace at AES IV | Friendly refusal |
| AES-256 PDF + default JSON (regression) | OK | OK |
| Plain PDF + default JSON (regression) | OK | OK |

No stack trace surfaces in any encrypted-PDF scenario after the patch.
Non-tagged-pdf paths and unencrypted PDFs are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai (Bot, Contributor) commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef8691b7-1483-4807-9cb9-fddce39ea166

📥 Commits

Reviewing files that changed from the base of the PR and between 2e585ac and 0a50642.

📒 Files selected for processing (3)
  • java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/exceptions/EncryptedTaggedPdfNotSupportedException.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/AutoTaggingProcessor.java

Walkthrough

The PR adds a new exception type (EncryptedTaggedPdfNotSupportedException) and uses it to reject encrypted PDFs when tagged-PDF output is requested. The core processor validates input during tagged-PDF creation and throws the exception, while the CLI catches it and reports a user-friendly error without failing the entire batch.

Changes

Encrypted PDF rejection for tagged-PDF output

| Layer / File(s) | Summary |
|---|---|
| Exception definition: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/exceptions/EncryptedTaggedPdfNotSupportedException.java | New EncryptedTaggedPdfNotSupportedException class extends IOException, provides standard serialization and message constructor, with Javadoc explaining it is thrown when tagged-PDF output is requested for encrypted documents. |
| Core processor encryption validation: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/AutoTaggingProcessor.java | Added import and encryption check in createTaggedPDF(...); inspects the PDF trailer's /Encrypt entry and throws EncryptedTaggedPdfNotSupportedException when present and non-empty. |
| CLI exception handling: java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIMain.java | Added import and dedicated catch block in processFile to catch the exception, print Error: <message>, and return false to mark the file as failed. |
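
The CLI-side behavior summarized above (one-line error, file marked as failed, batch continues) can be illustrated with a small sketch. This is an assumption-laden illustration rather than the actual CLIMain code: the Converter interface, processAll, and the file names are hypothetical, and the exception is again a local stand-in for the class in opendataloader-pdf-core.

```java
import java.io.IOException;
import java.util.List;

public class CliRefusalSketch {

    /** Local stand-in for the typed exception defined in the core module. */
    static class EncryptedTaggedPdfNotSupportedException extends IOException {
        private static final long serialVersionUID = 1L;

        EncryptedTaggedPdfNotSupportedException(String message) {
            super(message);
        }
    }

    /** HYPOTHETICAL stand-in for the call into the core processor. */
    interface Converter {
        void convert(String path) throws IOException;
    }

    /** Returns true on success and false when the file fails; it never rethrows. */
    static boolean processFile(String path, Converter converter) {
        try {
            converter.convert(path);
            return true;
        } catch (EncryptedTaggedPdfNotSupportedException e) {
            // Dedicated catch: one friendly line, no stack trace.
            System.err.println("Error: " + e.getMessage());
            return false;
        } catch (IOException e) {
            // Any other I/O failure is still reported as a failed file.
            System.err.println("Error: unexpected failure for " + path + ": " + e);
            return false;
        }
    }

    /** Processes every path and returns how many failed, so one refusal does not stop the batch. */
    static int processAll(List<String> paths, Converter converter) {
        int failures = 0;
        for (String path : paths) {
            if (!processFile(path, converter)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Toy converter: refuses any file whose name starts with "enc".
        Converter toy = path -> {
            if (path.startsWith("enc")) {
                throw new EncryptedTaggedPdfNotSupportedException(
                        "'" + path + "' is encrypted; tagged-pdf conversion is not supported"
                                + " for encrypted documents.");
            }
        };
        int failures = processAll(List.of("a.pdf", "enc.pdf", "b.pdf"), toy);
        System.exit(failures == 0 ? 0 : 1);
    }
}
```

Catching the typed exception before the generic IOException is what lets the friendly message win; since the stand-in extends IOException, reversing the catch order would not compile.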

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • opendataloader-project/opendataloader-pdf#504: Both PRs modify CLIMain to catch and handle a specific encryption-related exception differently (friendly single-line error + returning false), so the CLI's encrypted/PDF error-path behavior changes are directly related.

Suggested reviewers

  • MaximPlusov
  • LonelyMidoriya
  • hnc-jglee
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: detecting encrypted PDFs early and refusing tagged-PDF conversion with a user-friendly error message instead of a stack trace. |
| Linked Issues check | ✅ Passed | The changes fully address issue #461 by detecting encrypted PDFs at the tagged-PDF entry point and providing a friendly error message instead of runtime failures. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue: new exception class, encryption detection in AutoTaggingProcessor, and CLI error handling; no unrelated modifications. |


@bundolee bundolee merged commit 37e3b09 into main May 15, 2026
7 of 8 checks passed
@bundolee bundolee deleted the fix/tagged-pdf-refuse-encrypted branch May 15, 2026 08:07