Fix encrypted PDF string decryption when ciphertext starts with BOM bytes#1306
Merged
Conversation
Refactor UTF-16 encoding to include BOM in byte array.
Added rawBytes field to preserve original byte data from the PDF file.
BobLd
requested changes
May 25, 2026
Collaborator
|
@VarunSaiTeja I've put a comment. Also, can you add an integration test and a document that was previously failing? |
Simplified assignment of originalRawBytes by removing cloning.
Added a test to verify decryption of strings with BOM in encrypted PDFs.
Update keyword assertion for encrypted PDF test.
Contributor
Author
|
Added in commits c57377a and e9602c2. The test PDF (test-bom-encrypted.pdf) is a synthetic RC4-128 encrypted PDF (no password) where the /Keywords string's ciphertext starts with FF FE. On master, the Keywords decrypt to garbage (60 → 58 bytes, corrupted); with the fix, it correctly returns "sample keywords for testing encrypted PDF string decryption". The integration test CanDecryptStringWhenCiphertextStartsWithBom in EncryptedDocumentTests.cs validates this. |
BobLd
approved these changes
May 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When an encrypted PDF's string value has ciphertext bytes starting with
FF FE(UTF-16 LE BOM) orFE FF(UTF-16 BE BOM), theStringTokenizerprematurely detects the BOM on still-encrypted bytes. It then decodes the raw ciphertext as UTF-16 text and strips the BOM via.Substring(1).When
EncryptionHandler.DecryptInternal()later callsStringToken.GetBytes()to reconstruct the original bytes for decryption, two problems occur:Missing BOM prefix (UTF-16 LE only): The
Utf16case inGetBytes()did not re-add theFF FEprefix, losing 2 bytes. TheUtf16BEcase already correctly re-addedFE FF.Lossy UTF-16 round-trip: Interpreting arbitrary encrypted bytes as UTF-16 and then re-encoding them back is lossy — some byte pairs form invalid surrogate pairs or get normalized, producing different bytes than the original ciphertext.
Combined, these cause RC4/AES decryption to receive corrupted input, producing garbage output for affected strings.
Fix
StringToken.cs: Added a constructor overload that accepts and stores the original raw bytes.GetBytes()now returns the stored raw bytes directly when available, bypassing the lossy encode/decode round-trip entirely. Also addedFF FEBOM prefix to theUtf16fallback path to match the existingUtf16BEbehaviour.StringTokenizer.cs: When a BOM is detected (FE FForFF FE), the original raw bytes are now preserved and passed to the newStringTokenconstructor. This ensuresGetBytes()always returns the exact original bytes for decryption.Test
Added
test-bom-encrypted.pdf— a synthetic RC4-128 encrypted PDF (no password) where the/Keywordsstring's ciphertext starts withFF FE. On master this decrypts to garbage; with the fix it correctly returnssample keywords for testing encrypted PDF string decryption.Impact
Fixes decryption of string values in encrypted PDFs where the ciphertext happens to start with BOM-like bytes. This is probabilistic (~1/65536 per encrypted string) but affects real-world documents.
Related
GetBytes()round-trip)