Fix encrypted PDF string decryption when ciphertext starts with BOM bytes by VarunSaiTeja · Pull Request #1306 · UglyToad/PdfPig

VarunSaiTeja · 2026-05-25T05:53:13Z

Problem

When an encrypted PDF's string value has ciphertext bytes starting with FF FE (UTF-16 LE BOM) or FE FF (UTF-16 BE BOM), the StringTokenizer prematurely detects the BOM on still-encrypted bytes. It then decodes the raw ciphertext as UTF-16 text and strips the BOM via .Substring(1).

When EncryptionHandler.DecryptInternal() later calls StringToken.GetBytes() to reconstruct the original bytes for decryption, two problems occur:

Missing BOM prefix (UTF-16 LE only): The Utf16 case in GetBytes() did not re-add the FF FE prefix, losing 2 bytes. The Utf16BE case already correctly re-added FE FF.
Lossy UTF-16 round-trip: Interpreting arbitrary encrypted bytes as UTF-16 and then re-encoding them is lossy. Byte pairs that form unpaired or out-of-order surrogates are not valid UTF-16, so a decoder typically substitutes a replacement character for them, and re-encoding then produces different bytes than the original ciphertext.

Combined, these cause RC4/AES decryption to receive corrupted input, producing garbage output for affected strings.
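The lossy round-trip can be reproduced outside PdfPig. The following is a minimal Python sketch, not PdfPig code: the function name `lossy_round_trip` is hypothetical, and `errors="replace"` is an assumption standing in for a decoder that substitutes U+FFFD for invalid sequences instead of throwing.

```python
def lossy_round_trip(raw: bytes) -> bytes:
    """Mimic the pre-fix path: detect FF FE, decode the rest as UTF-16 LE,
    then later re-encode the text to recover 'the original bytes'."""
    if raw[:2] != b"\xff\xfe":
        raise ValueError("sketch only handles the UTF-16 LE BOM case")
    # Assumption: invalid sequences are replaced with U+FFFD, not rejected.
    text = raw[2:].decode("utf-16-le", errors="replace")
    # Re-encode and put the BOM back, as the fallback path is meant to do.
    return b"\xff\xfe" + text.encode("utf-16-le")


# Well-formed UTF-16 survives the round-trip unchanged.
benign = bytes.fromhex("fffe41004200")          # BOM + "AB"
# Ciphertext-like bytes: BOM, then a lone high surrogate (00 D8), then 41 00.
ciphertext = bytes.fromhex("fffe00d84100")

print(lossy_round_trip(benign) == benign)
print(lossy_round_trip(ciphertext) == ciphertext)
```

The second comparison fails because the lone surrogate cannot be represented as text, so the bytes handed to the cipher no longer match what was read from the file.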

Fix

StringToken.cs: Added a constructor overload that accepts and stores the original raw bytes. GetBytes() now returns the stored raw bytes directly when available, bypassing the lossy encode/decode round-trip entirely. Also added FF FE BOM prefix to the Utf16 fallback path to match the existing Utf16BE behaviour.
StringTokenizer.cs: When a BOM is detected (FE FF or FF FE), the original raw bytes are now preserved and passed to the new StringToken constructor. This ensures GetBytes() always returns the exact original bytes for decryption.
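The shape of the fix can be sketched as follows. This is an illustrative Python model, not the actual C# change: `StringTokenSketch`, `tokenize_sketch`, and the `latin-1` fallback are hypothetical simplifications introduced here. It only shows the idea of carrying the raw bytes alongside the decoded text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StringTokenSketch:
    text: str
    encoding: str                      # "utf-16-le", "utf-16-be" or "latin-1"
    raw_bytes: Optional[bytes] = None  # exact bytes read from the file, if kept

    def get_bytes(self) -> bytes:
        # Prefer the stored raw bytes: no encode/decode round-trip at all.
        if self.raw_bytes is not None:
            return self.raw_bytes
        # Fallback paths re-add the BOM for both UTF-16 variants.
        if self.encoding == "utf-16-le":
            return b"\xff\xfe" + self.text.encode("utf-16-le")
        if self.encoding == "utf-16-be":
            return b"\xfe\xff" + self.text.encode("utf-16-be")
        return self.text.encode("latin-1")


def tokenize_sketch(raw: bytes) -> StringTokenSketch:
    """When a BOM is detected, keep the original bytes with the token."""
    if raw[:2] == b"\xff\xfe":
        text = raw[2:].decode("utf-16-le", errors="replace")
        return StringTokenSketch(text, "utf-16-le", raw)
    if raw[:2] == b"\xfe\xff":
        text = raw[2:].decode("utf-16-be", errors="replace")
        return StringTokenSketch(text, "utf-16-be", raw)
    # Simplification: treat everything else as single-byte text.
    return StringTokenSketch(raw.decode("latin-1"), "latin-1")


ciphertext = bytes.fromhex("fffe00d84100")
token = tokenize_sketch(ciphertext)
print(token.get_bytes() == ciphertext)
```

With the raw bytes stored, decryption receives exactly what was in the file even when the decoded text is meaningless.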

Test

Added test-bom-encrypted.pdf, a synthetic RC4-128 encrypted PDF (no password) where the /Keywords string's ciphertext starts with FF FE. On master this decrypts to garbage; with the fix it correctly returns "sample keywords for testing encrypted PDF string decryption".

Impact

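Fixes decryption of string values in encrypted PDFs where the ciphertext happens to start with BOM-like bytes. The trigger is probabilistic: if the ciphertext behaves like uniformly random bytes, a given two-byte BOM pattern appears at the start of an encrypted string with probability of about 1/65536. It is rare, but it affects real-world documents.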
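As a rough back-of-the-envelope check, the per-string figure can be scaled up to a whole document. The sketch below assumes ciphertext bytes are uniformly random and independent, that every string is at least two bytes long, and that only the two UTF-16 BOM patterns matter. The function names and the example string count are hypothetical.

```python
def p_affected_per_string(bom_patterns: int = 2) -> float:
    """Chance that the first two ciphertext bytes match one of the BOM patterns."""
    return bom_patterns / 256 ** 2


def p_document_affected(n_strings: int, p: float) -> float:
    """Chance that at least one of n independent strings is affected."""
    return 1 - (1 - p) ** n_strings


one_pattern = p_affected_per_string(1)   # FF FE only: 1/65536
either_bom = p_affected_per_string(2)    # FF FE or FE FF: 2/65536
print(one_pattern, either_bom)
print(p_document_affected(1000, either_bom))  # hypothetical 1000-string document
```

Under these assumptions, a document with on the order of a thousand encrypted strings has a chance of a few percent of containing at least one affected string, which is consistent with the issue showing up in real files.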

The test was added in commits c57377a and e9602c2. The test PDF (test-bom-encrypted.pdf) is a synthetic RC4-128 encrypted PDF (no password) where the /Keywords string's ciphertext starts with FF FE. On master, the Keywords decrypt to garbage (60 → 58 bytes, corrupted); with the fix, it correctly returns "sample keywords for testing encrypted PDF string decryption". The integration test CanDecryptStringWhenCiphertextStartsWithBom in EncryptedDocumentTests.cs validates this.

VarunSaiTeja added 4 commits May 25, 2026 11:21

Add BOM to UTF-16 encoded byte array

5168615

Refactor UTF-16 encoding to include BOM in byte array.

Enhance StringToken to store raw byte data

e6a9f4d

Added rawBytes field to preserve original byte data from the PDF file.

Add original raw bytes to StringToken constructor

ea80007

Merge branch 'master' into patch-1

4b2737b

BobLd requested changes May 25, 2026

Comment thread src/UglyToad.PdfPig.Tokenization/StringTokenizer.cs Outdated

VarunSaiTeja added 4 commits May 25, 2026 18:31

Simplify originalRawBytes assignment in StringTokenizer

661460c

Simplified assignment of originalRawBytes by removing cloning.

Add test for decrypting BOM-prefixed ciphertext

c57377a

Added a test to verify decryption of strings with BOM in encrypted PDFs.

Fix keyword assertion in EncryptedDocumentTests

863be9d

Update keyword assertion for encrypted PDF test.

Added test file for bom-encrypted

e9602c2

VarunSaiTeja changed the title ~~Fix UTF-16 LE BOM not preserved in StringToken.GetBytes() for encrypted PDF decryption~~ Fix encrypted PDF string decryption when ciphertext starts with BOM bytes May 25, 2026

VarunSaiTeja requested a review from BobLd May 25, 2026 13:24

BobLd approved these changes May 25, 2026

BobLd merged commit 450b855 into UglyToad:master May 25, 2026
2 checks passed

BobLd merged 8 commits into UglyToad:master from VarunSaiTeja:patch-1
