fix: Preserve Unicode filenames in sanitization and persist original names#89

Merged
usnavy13 merged 4 commits into dev from filename-sani-fix
May 6, 2026

usnavy13 (Owner) commented May 6, 2026

Summary

  • Updated sanitize_filename regex from [^a-zA-Z0-9.-] to [^\w.\-] (in Python 3, \w matches Unicode letters and digits across all scripts, so they are preserved, while path separators, control characters, and shell metacharacters are still replaced)
  • Added original_filename field to FileInfo model and Redis metadata so pre-sanitization filenames are recoverable
  • Passed original filename through /upload and /upload/batch endpoints to Redis storage
  • Updated GET /files/{session_id} to return the true original filename in metadata.original-filename
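
A minimal sketch of the character-class change described above. Only the two patterns come from this PR; the function name `sanitize_sketch` and the `_` replacement character are assumptions for illustration, and the real `sanitize_filename` also handles path traversal, length limits, and leading dots, which are omitted here.

```python
import re

# Pattern from the PR description: anything that is not a word character,
# a dot, or a hyphen gets replaced. In Python 3, \w on str is Unicode-aware.
_UNSAFE = re.compile(r"[^\w.\-]")


def sanitize_sketch(name: str, replacement: str = "_") -> str:
    """Hypothetical stand-in for sanitize_filename (replacement char assumed)."""
    return _UNSAFE.sub(replacement, name)


print(sanitize_sketch("отчет 2024.csv"))   # Cyrillic letters kept, space replaced
print(sanitize_sketch("报告/final.xlsx"))    # CJK kept, path separator replaced
print(sanitize_sketch("a;rm -rf.txt"))      # shell metacharacters replaced
print(sanitize_sketch("📊.csv"))            # emoji is not matched by \w, so it is replaced
```

The last call shows a limit of a pure `\w` approach: emoji are not word characters, so they are still replaced.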

Context

Filed danny-avila/LibreChat#12975 for the matching client-side fix. Danny opened LibreChat#12977 which updates LC's sanitizeFilename to preserve Unicode and adds RFC 8187 Content-Disposition headers.

Do not merge until LibreChat#12977 is merged and we can test the full pipeline end-to-end — both sides need to preserve Unicode for filenames to survive the round trip.

What changed

| File | Change |
| --- | --- |
| src/services/execution/output.py | Regex [^a-zA-Z0-9.-] → [^\w.\-]; preserves CJK, Cyrillic, Arabic, accented Latin, etc. |
| src/models/files.py | Added original_filename: Optional[str] to FileInfo |
| src/services/file.py | Persists original_filename in Redis via store_uploaded_file, propagates through get_file_info and link_file_into_session |
| src/api/files.py | Passes original filename from upload endpoints; returns it from file listing |
| tests/unit/test_output_processor.py | Updated Unicode assertion, added 10 new tests (CJK, Cyrillic, Korean, Arabic, mixed, dangerous chars) |

Backward compatibility

  • ASCII-only filenames behave identically (on ASCII input, \w matches [a-zA-Z0-9_], and replacing _ with _ was already a no-op)
  • Existing Redis entries without original_filename degrade gracefully via .get() fallback
  • Path traversal protection, length limits, and leading-dot handling are all unchanged
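
The first two points can be illustrated with a rough sketch. Everything named here is hypothetical: `FileInfoSketch` is a plain dataclass rather than the project's `FileInfo` model, the metadata is an ordinary dict standing in for a Redis hash, the `_` replacement character is assumed, and falling back to the sanitized name in `display_name` is one possible policy, not necessarily what the endpoint does.

```python
import re
from dataclasses import dataclass
from typing import Optional

OLD = re.compile(r"[^a-zA-Z0-9.-]")   # pattern before this PR
NEW = re.compile(r"[^\w.\-]")         # pattern after the first commit


def ascii_behaviour_unchanged() -> bool:
    """Check every ASCII code point: both patterns yield the same output
    when the replacement is '_' (an assumption about the real sanitizer)."""
    return all(OLD.sub("_", chr(c)) == NEW.sub("_", chr(c)) for c in range(128))


@dataclass
class FileInfoSketch:
    filename: str
    original_filename: Optional[str] = None  # absent on pre-existing entries


def file_info_from_metadata(meta: dict) -> FileInfoSketch:
    # .get() returns None for entries stored before the field existed,
    # so older metadata loads without a KeyError.
    return FileInfoSketch(
        filename=meta["filename"],
        original_filename=meta.get("original_filename"),
    )


def display_name(info: FileInfoSketch) -> str:
    # Assumed policy: prefer the original name, else the sanitized one.
    return info.original_filename or info.filename


legacy = file_info_from_metadata({"filename": "report.csv"})
print(display_name(legacy), ascii_behaviour_unchanged())
```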

Test plan

  • pytest tests/unit/test_output_processor.py — 41 passed
  • pytest tests/unit/ — 496 passed, 0 failures
  • flake8 src/ — 0 errors
  • black src/ tests/ --check — clean
  • mypy src/ — success, 0 issues
  • bandit -r src/ -s B104,B108 --severity-level high — 0 high-severity issues
  • End-to-end test with LibreChat after #12977 merges (upload Unicode filename → execute → download → verify name preserved)

🤖 Generated with Claude Code

usnavy13 added 2 commits May 6, 2026 17:12
- Introduced `original_filename` field in the FileInfo model to store pre-sanitization filenames.
- Updated file upload and batch upload functions to include the original filename in metadata.
- Enhanced file listing to return the original filename if available, improving metadata accuracy.
- Adjusted file service methods to handle the new original filename parameter for better file management.
usnavy13 closed this May 6, 2026
usnavy13 reopened this May 6, 2026
…-pass approach

Align sanitize_filename with LibreChat#12977's sanitizeFilenameSegment:
- NFC-normalize before sanitizing (handles decomposed accents)
- Two-pass: strict ASCII [a-zA-Z0-9._-], permissive non-ASCII (only
  blocks C1 controls U+0080-U+009F)
- Preserves emoji (📊) and ZWJ sequences that \w alone would strip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
usnavy13 (Owner, Author) commented May 6, 2026

Danny's fix landed in LibreChat (#12977 merged to dev). I've reviewed his final implementation and updated our sanitizer to match his approach:

What changed in this follow-up commit:

  • Switched from \w regex to a two-pass approach matching LC's sanitizeFilenameSegment — strict for ASCII [a-zA-Z0-9._-], permissive for non-ASCII (blocks only C1 controls)
  • Added NFC normalization before sanitizing (handles decomposed accents like e + U+0301 → é)
  • Emoji and ZWJ sequences now preserved (LC allows \p{Emoji} + ZWJ, our \w didn't)

Compatibility verified: both sides now produce identical output for CJK, Cyrillic, Korean, Arabic, accented Latin, emoji, and all dangerous ASCII chars.

Ready for end-to-end testing once LC's dev is deployed — the sanitization logic is now aligned.
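
For readers following along, here is a rough sketch of the two-pass idea described in this comment. It is not the merged implementation: the function name `sanitize_two_pass` and the `_` replacement character are assumptions, and path-traversal, length, and leading-dot handling are left out.

```python
import re
import unicodedata

# Strict allow-list for ASCII, as stated in the commit message.
_ASCII_SAFE = re.compile(r"[a-zA-Z0-9._-]")


def sanitize_two_pass(name: str, replacement: str = "_") -> str:
    """Hypothetical two-pass sanitizer: strict ASCII, permissive non-ASCII."""
    # NFC first, so "e" + U+0301 becomes the single code point U+00E9.
    normalized = unicodedata.normalize("NFC", name)
    out = []
    for ch in normalized:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII: keep only the allow-listed characters.
            out.append(ch if _ASCII_SAFE.fullmatch(ch) else replacement)
        elif cp <= 0x9F:
            # C1 control range U+0080-U+009F is the only blocked non-ASCII.
            out.append(replacement)
        else:
            # Everything else (letters, emoji, ZWJ) passes through.
            out.append(ch)
    return "".join(out)


print(sanitize_two_pass("cafe\u0301 menu.txt"))
print(sanitize_two_pass("📊 report.csv"))
print(sanitize_two_pass("a/b\\c.txt"))
```

Compared with a single `\w`-based pattern, this keeps emoji and zero-width joiners because non-ASCII characters are only checked against the C1 range.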

The fake_store function in TestLibreChatUploadBatch had a fixed
parameter list missing the new original_filename kwarg, causing
a TypeError when the endpoint passed it through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
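
The failure mode in that commit message can be reproduced in isolation. This is only a sketch: the parameter lists below are invented for illustration and are not the real signature of the stubbed store function, and `call_like_endpoint` is a hypothetical stand-in for the upload endpoint.

```python
from typing import Any, List, Optional

recorded: List[Optional[str]] = []


def fake_store_old(session_id: str, filename: str, content: bytes) -> str:
    """Test double with a fixed parameter list (no original_filename)."""
    return "file-id"


def fake_store_fixed(
    session_id: str,
    filename: str,
    content: bytes,
    original_filename: Optional[str] = None,
    **_extra: Any,
) -> str:
    """Test double that accepts the new kwarg and tolerates future ones."""
    recorded.append(original_filename)
    return "file-id"


def call_like_endpoint(store) -> str:
    # Stand-in for the endpoint, which now passes the original name through.
    return store("sess-1", "_.csv", b"data", original_filename="отчет.csv")


try:
    call_like_endpoint(fake_store_old)
except TypeError as exc:
    print("old fake fails:", exc)

print(call_like_endpoint(fake_store_fixed), recorded)
```

Accepting `**_extra` in the double is a design choice that keeps the test from breaking the next time a keyword argument is added, at the cost of not catching misspelled kwargs.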
usnavy13 merged commit ff69603 into dev May 6, 2026
9 checks passed
usnavy13 deleted the filename-sani-fix branch May 7, 2026 02:20