Skip to content

Fix encrypted PDF string decryption when ciphertext starts with BOM bytes#1306

Merged
BobLd merged 8 commits into
UglyToad:masterfrom
VarunSaiTeja:patch-1
May 25, 2026
Merged

Fix encrypted PDF string decryption when ciphertext starts with BOM bytes#1306
BobLd merged 8 commits into
UglyToad:masterfrom
VarunSaiTeja:patch-1

Conversation

@VarunSaiTeja
Copy link
Copy Markdown
Contributor

@VarunSaiTeja VarunSaiTeja commented May 25, 2026

Problem

When an encrypted PDF's string value has ciphertext bytes starting with FF FE (UTF-16 LE BOM) or FE FF (UTF-16 BE BOM), the StringTokenizer prematurely detects the BOM on still-encrypted bytes. It then decodes the raw ciphertext as UTF-16 text and strips the BOM via .Substring(1).

When EncryptionHandler.DecryptInternal() later calls StringToken.GetBytes() to reconstruct the original bytes for decryption, two problems occur:

  1. Missing BOM prefix (UTF-16 LE only): The Utf16 case in GetBytes() did not re-add the FF FE prefix, losing 2 bytes. The Utf16BE case already correctly re-added FE FF.

  2. Lossy UTF-16 round-trip: Interpreting arbitrary encrypted bytes as UTF-16 and then re-encoding them back is lossy — some byte pairs form invalid surrogate pairs or get normalized, producing different bytes than the original ciphertext.

Combined, these cause RC4/AES decryption to receive corrupted input, producing garbage output for affected strings.

Fix

  • StringToken.cs: Added a constructor overload that accepts and stores the original raw bytes. GetBytes() now returns the stored raw bytes directly when available, bypassing the lossy encode/decode round-trip entirely. Also added FF FE BOM prefix to the Utf16 fallback path to match the existing Utf16BE behaviour.

  • StringTokenizer.cs: When a BOM is detected (FE FF or FF FE), the original raw bytes are now preserved and passed to the new StringToken constructor. This ensures GetBytes() always returns the exact original bytes for decryption.

Test

Added test-bom-encrypted.pdf — a synthetic RC4-128 encrypted PDF (no password) where the /Keywords string's ciphertext starts with FF FE. On master this decrypts to garbage; with the fix it correctly returns sample keywords for testing encrypted PDF string decryption.

Impact

Fixes decryption of string values in encrypted PDFs where the ciphertext happens to start with BOM-like bytes. This is probabilistic (~1/65536 per encrypted string) but affects real-world documents.

Related

Refactor UTF-16 encoding to include BOM in byte array.
Added rawBytes field to preserve original byte data from the PDF file.
Comment thread src/UglyToad.PdfPig.Tokenization/StringTokenizer.cs Outdated
@BobLd
Copy link
Copy Markdown
Collaborator

BobLd commented May 25, 2026

@VarunSaiTeja I've put a comment. Also, can you add an integration test and a document that was previously failing?

Simplified assignment of originalRawBytes by removing cloning.
Added a test to verify decryption of strings with BOM in encrypted PDFs.
Update keyword assertion for encrypted PDF test.
@VarunSaiTeja VarunSaiTeja changed the title Fix UTF-16 LE BOM not preserved in StringToken.GetBytes() for encrypted PDF decryption Fix encrypted PDF string decryption when ciphertext starts with BOM bytes May 25, 2026
@VarunSaiTeja
Copy link
Copy Markdown
Contributor Author

Added in commits c57377a and e9602c2. The test PDF (test-bom-encrypted.pdf) is a synthetic RC4-128 encrypted PDF (no password) where the /Keywords string's ciphertext starts with FF FE. On master, the Keywords decrypt to garbage (60 → 58 bytes, corrupted); with the fix, it correctly returns "sample keywords for testing encrypted PDF string decryption". The integration test CanDecryptStringWhenCiphertextStartsWithBom in EncryptedDocumentTests.cs validates this.

@VarunSaiTeja VarunSaiTeja requested a review from BobLd May 25, 2026 13:24
@BobLd BobLd merged commit 450b855 into UglyToad:master May 25, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants