Implement SASLprep (RFC 4013) for AES-256 password normalization #3777

@adityamoolya

Description

pypdf fails to decrypt AES-256 encrypted PDFs when the password contains non-ASCII
Unicode characters, even if the password is correct.

There is a TODO in pypdf/_encryption.py at line 1009 inside verify_v5:

-TODO: use SASLprep process

The PDF specification for AES-256 (Revision 5/6) requires passwords to be normalized
using SASLprep (RFC 4013) before UTF-8 encoding. Currently _encode_password tries
latin-1 first then falls back to utf-8 without any SASLprep normalization, so the
byte representation does not match what a spec-compliant PDF creator produced.
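To make the mismatch concrete, here is a small sketch of the latin-1-first behaviour described above. The function name encode_latin1_first is a hypothetical stand-in written for this issue, not pypdf's actual code; it only illustrates why the bytes differ from a plain UTF-8 encoding.

```python
def encode_latin1_first(password: str) -> bytes:
    """Hypothetical stand-in for the behaviour described above (not pypdf's code)."""
    try:
        # Succeeds for any password whose characters are all below U+0100.
        return password.encode("latin-1")
    except UnicodeEncodeError:
        return password.encode("utf-8")


password = "pässwört"
legacy_bytes = encode_latin1_first(password)  # one byte per character
utf8_bytes = password.encode("utf-8")         # two bytes each for ä and ö

print(legacy_bytes)  # b'p\xe4ssw\xf6rt'
print(utf8_bytes)    # b'p\xc3\xa4ssw\xc3\xb6rt'
```

Because "pässwört" is fully representable in latin-1, the UTF-8 fallback is never reached, so the bytes handed to the R=5/6 verification differ from the UTF-8 bytes a spec-compliant writer would have used.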

I reproduced the failure; the script I used is included below.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.19.14-200.fc43.x86_64-x86_64-with-glibc2.42

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.11.0, crypt_provider=('cryptography', '47.0.0'), PIL=12.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

import pikepdf
from pypdf import PdfReader, PasswordType

pdf = pikepdf.Pdf.new()
pdf.save("sample.pdf")

with pikepdf.open("sample.pdf") as pdf:
    pdf.save(
        "encrypted_unicode.pdf",
        encryption=pikepdf.Encryption(
            owner="owner",
            user="pässwört",
            R=6
        )
    )

reader = PdfReader("encrypted_unicode.pdf")
result = reader.decrypt("pässwört")
print(result)
# Actual:   0 (PasswordType.NOT_DECRYPTED), printed by the script as written
# Expected: 1 (PasswordType.USER_PASSWORD), which is what I get with user="password"

Possible Fix

The fix would likely go in _encode_password: normalize the password with
SASLprep before UTF-8 encoding, but only for the AES-256 encryption revisions
(R=5/6).

Python's standard library exposes stringprep tables but does not provide a
complete RFC 4013 SASLprep implementation.

Before working on it, I wanted to ask which direction would be preferred:

  • adding a lightweight dependency such as saslprep
  • or implementing a minimal RFC 4013-compatible normalization layer, for example
    with unicodedata + stringprep, to keep pypdf dependency-free
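
For discussion, the second option could look roughly like the sketch below. It follows the four RFC 4013 steps (map, normalize, prohibit, bidi check) using only stringprep and unicodedata. The name saslprep_sketch and the allow_unassigned flag are hypothetical, and whether unassigned code points should be accepted for PDF passwords is an assumption that would need checking against the PDF specification.

```python
import stringprep
from unicodedata import ucd_3_2_0 as ucd  # stringprep tables are defined on Unicode 3.2

# RFC 4013 section 2.3: tables whose characters are prohibited in the output.
_PROHIBITED = (
    stringprep.in_table_c12,      # non-ASCII space characters
    stringprep.in_table_c21_c22,  # ASCII and non-ASCII control characters
    stringprep.in_table_c3,       # private use
    stringprep.in_table_c4,       # non-character code points
    stringprep.in_table_c5,       # surrogate codes
    stringprep.in_table_c6,       # inappropriate for plain text
    stringprep.in_table_c7,       # inappropriate for canonical representation
    stringprep.in_table_c8,       # change display properties or deprecated
    stringprep.in_table_c9,       # tagging characters
)


def saslprep_sketch(password: str, allow_unassigned: bool = True) -> str:
    """Hypothetical helper: a minimal RFC 4013 profile, not a vetted implementation."""
    # 1. Map: non-ASCII spaces become U+0020, "mapped to nothing" characters are dropped.
    mapped = "".join(
        " " if stringprep.in_table_c12(ch) else ch
        for ch in password
        if not stringprep.in_table_b1(ch)
    )

    # 2. Normalize with NFKC.
    normalized = ucd.normalize("NFKC", mapped)

    # 3. Prohibit: reject any character from the prohibited tables.
    for ch in normalized:
        if any(check(ch) for check in _PROHIBITED):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
        # Assumption: unassigned code points are only rejected when asked to.
        if not allow_unassigned and stringprep.in_table_a1(ch):
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")

    # 4. Bidi: a string with RandALCat characters must not contain LCat
    #    characters and must both start and end with a RandALCat character.
    if any(stringprep.in_table_d1(ch) for ch in normalized):
        if any(stringprep.in_table_d2(ch) for ch in normalized):
            raise ValueError("mixed right-to-left and left-to-right characters")
        if not (stringprep.in_table_d1(normalized[0])
                and stringprep.in_table_d1(normalized[-1])):
            raise ValueError("right-to-left string must start and end with RandALCat")

    return normalized


# A decomposed spelling normalizes to the precomposed form before encoding.
key_bytes = saslprep_sketch("pa\u0308sswo\u0308rt").encode("utf-8")
```

Under this approach the R=5/6 path would call the helper and then encode the result as UTF-8, while older revisions would keep the existing encoding logic untouched.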

Labels: workflow-encryption (from a user's perspective, encryption is the affected feature/workflow)