pypdf fails to decrypt AES-256 encrypted PDFs when the password contains non-ASCII
(Unicode) characters, even if the password is correct.
There is a TODO in pypdf/_encryption.py at line 1009 inside verify_v5:
-TODO: use SASLprep process
The PDF specification for AES-256 (Revision 5/6) requires passwords to be normalized
using SASLprep (RFC 4013) before UTF-8 encoding. Currently _encode_password tries
latin-1 first then falls back to utf-8 without any SASLprep normalization, so the
byte representation does not match what a spec-compliant PDF creator produced.
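To make the mismatch concrete, here is a small sketch of the latin-1-first behaviour described above. It is a stand-in for illustration only, not pypdf's actual code, and `encode_latin1_first` is a hypothetical name:

```python
def encode_latin1_first(password: str) -> bytes:
    """Hypothetical stand-in: try latin-1 first, fall back to UTF-8."""
    try:
        return password.encode("latin-1")
    except UnicodeEncodeError:
        return password.encode("utf-8")


password = "p\u00e4ssw\u00f6rt"  # "pässwört", written with escapes to pin the code points

legacy_bytes = encode_latin1_first(password)
utf8_bytes = password.encode("utf-8")

print(legacy_bytes)                # b'p\xe4ssw\xf6rt'
print(utf8_bytes)                  # b'p\xc3\xa4ssw\xc3\xb6rt'
print(legacy_bytes == utf8_bytes)  # False
```

Every character of this password fits in latin-1, so the UTF-8 fallback is never reached and the bytes differ from the UTF-8 form that R=5/6 expects.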
I reproduced this and confirmed that decryption fails. The script I used is included below.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.19.14-200.fc43.x86_64-x86_64-with-glibc2.42
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.11.0, crypt_provider=('cryptography', '47.0.0'), PIL=12.2.0
Code + PDF
This is a minimal, complete example that shows the issue:
import pikepdf
from pypdf import PdfReader, PasswordType

# Create an empty PDF to encrypt.
pdf = pikepdf.Pdf.new()
pdf.save("sample.pdf")

# Re-save it with AES-256 (R=6) and a non-ASCII user password.
with pikepdf.open("sample.pdf") as pdf:
    pdf.save(
        "encrypted_unicode.pdf",
        encryption=pikepdf.Encryption(
            owner="owner",
            user="pässwört",
            R=6,
        ),
    )

reader = PdfReader("encrypted_unicode.pdf")
result = reader.decrypt("pässwört")
print(result)
# Output: 0 (PasswordType.NOT_DECRYPTED), from the script as written
# Expected: 1 (PasswordType.USER_PASSWORD), which is what I get with user="password"
Possible Fix
The fix would likely go in _encode_password: normalize the password using
SASLprep before UTF-8 encoding, but only for the AES-256 encryption revisions
(R=5/6).
Python's standard library exposes stringprep tables but does not provide a
complete RFC 4013 SASLprep implementation.
Before working on it, I wanted to ask which direction would be preferred:
- adding a lightweight dependency such as saslprep
- implementing a minimal RFC 4013-compatible normalization layer, perhaps using
unicodedata + stringprep, to keep pypdf dependency-free
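
For reference, the second option could look roughly like the sketch below. It follows the RFC 4013 steps (map, NFKC-normalize, check prohibited output, bidi check) using only the standard library. The names `saslprep_sketch` and `allow_unassigned` are hypothetical, not existing pypdf API, and the sketch has not been validated against real encrypted PDFs:

```python
import stringprep
import unicodedata

# Prohibited-output tables listed in RFC 4013, section 2.3.
_PROHIBITED = (
    stringprep.in_table_c12,      # non-ASCII space characters
    stringprep.in_table_c21_c22,  # ASCII and non-ASCII control characters
    stringprep.in_table_c3,       # private use
    stringprep.in_table_c4,       # non-character code points
    stringprep.in_table_c5,       # surrogate codes
    stringprep.in_table_c6,       # inappropriate for plain text
    stringprep.in_table_c7,       # inappropriate for canonical representation
    stringprep.in_table_c8,       # change display properties / deprecated
    stringprep.in_table_c9,       # tagging characters
)


def saslprep_sketch(password: str, allow_unassigned: bool = True) -> str:
    """Hypothetical minimal SASLprep (RFC 4013); raises ValueError on rejection."""
    # Step 1 (map): non-ASCII spaces become U+0020, table B.1 characters are dropped.
    mapped = "".join(
        " " if stringprep.in_table_c12(ch) else ch
        for ch in password
        if not stringprep.in_table_b1(ch)
    )

    # Step 2 (normalize): NFKC, using the Unicode 3.2 data that stringprep is based on.
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)

    # Step 3 (prohibit): reject prohibited and, optionally, unassigned code points.
    for ch in normalized:
        if any(in_table(ch) for in_table in _PROHIBITED):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
        if not allow_unassigned and stringprep.in_table_a1(ch):
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")

    # Step 4 (bidi): if any right-to-left character is present, no left-to-right
    # character may appear, and the first and last characters must be right-to-left.
    if any(stringprep.in_table_d1(ch) for ch in normalized):
        if any(stringprep.in_table_d2(ch) for ch in normalized):
            raise ValueError("mixed right-to-left and left-to-right characters")
        if not (
            stringprep.in_table_d1(normalized[0])
            and stringprep.in_table_d1(normalized[-1])
        ):
            raise ValueError("right-to-left string must start and end with RandALCat")

    return normalized
```

Under this sketch, the AES-256 path would call `saslprep_sketch(password).encode("utf-8")` instead of trying latin-1 first. Whether unassigned code points should be allowed when verifying a password is an open question for the maintainers; the default here allows them.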