fix: Preserve Unicode filenames in sanitization and persist original names by usnavy13 · Pull Request #89 · usnavy13/LibreCodeInterpreter

usnavy13 · 2026-05-06T18:17:21Z

Summary

Updated the sanitize_filename regex from [^a-zA-Z0-9.-] to [^\w.\-] (in Python 3, \w matches Unicode letters and digits across all scripts, so they are preserved, while path separators, control chars, and shell metacharacters are still blocked)
Added original_filename field to FileInfo model and Redis metadata so pre-sanitization filenames are recoverable
Passed original filename through /upload and /upload/batch endpoints to Redis storage
Updated GET /files/{session_id} to return the true original filename in metadata.original-filename
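
To make the effect of the `[^\w.\-]` character class concrete, here is a minimal sketch. It is not the code from `output.py`: the function name `sanitize_sketch` and the `_` replacement character are assumptions, and the real sanitizer also handles path traversal, length limits, and leading dots.

```python
import re

# Anything that is not a word character, a dot, or a hyphen is replaced.
# On Python 3 str patterns, \w covers Unicode letters and digits plus "_".
_UNSAFE = re.compile(r"[^\w.\-]")


def sanitize_sketch(name: str) -> str:
    """Hypothetical stand-in for sanitize_filename; assumes "_" as replacement."""
    return _UNSAFE.sub("_", name)


print(sanitize_sketch("报告 2024.csv"))   # CJK kept, space replaced
print(sanitize_sketch("../etc/passwd"))   # slashes replaced
print(sanitize_sketch("data;rm -rf.txt")) # shell metacharacters replaced
```

Note that emoji are not word characters, so this pattern replaces them as well.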

Context

Filed danny-avila/LibreChat#12975 for the matching client-side fix. Danny opened LibreChat#12977 which updates LC's sanitizeFilename to preserve Unicode and adds RFC 8187 Content-Disposition headers.
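
For readers unfamiliar with RFC 8187, the sketch below shows the general shape of such a header in Python. It is illustrative only: LibreChat's implementation is not shown in this PR, and the function name `content_disposition` and the ASCII fallback rule are assumptions.

```python
import re
from urllib.parse import quote


def content_disposition(name: str) -> str:
    """Build a header with an ASCII fallback plus an RFC 8187 filename* value."""
    # Assumed fallback rule: replace anything outside a conservative ASCII set.
    ascii_fallback = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Percent-encode the UTF-8 bytes; safe="" leaves only unreserved characters.
    encoded = quote(name, safe="")
    return f"attachment; filename=\"{ascii_fallback}\"; filename*=UTF-8''{encoded}"


print(content_disposition("caf\u00e9.txt"))
```

Clients that understand `filename*` get the Unicode name; older clients fall back to the plain `filename` parameter.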

Do not merge until LibreChat#12977 is merged and we can test the full pipeline end-to-end — both sides need to preserve Unicode for filenames to survive the round trip.

What changed

| File | Change |
| --- | --- |
| `src/services/execution/output.py` | Regex `[^a-zA-Z0-9.-]` → `[^\w.\-]`; preserves CJK, Cyrillic, Arabic, accented Latin, etc. |
| `src/models/files.py` | Added `original_filename: Optional[str]` to `FileInfo` |
| `src/services/file.py` | Persists `original_filename` in Redis via `store_uploaded_file`, propagates through `get_file_info` and `link_file_into_session` |
| `src/api/files.py` | Passes original filename from upload endpoints; returns it from file listing |
| `tests/unit/test_output_processor.py` | Updated Unicode assertion, added 10 new tests (CJK, Cyrillic, Korean, Arabic, mixed, dangerous chars) |

Backward compatibility

ASCII-only filenames behave identically: on ASCII input \w matches [a-zA-Z0-9_], and the only newly allowed character, _, was previously replaced with _ (already a no-op)
Existing Redis entries without original_filename degrade gracefully via .get() fallback
Path traversal protection, length limits, and leading-dot handling are all unchanged
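
The `.get()` fallback for older Redis entries can be sketched as follows. This is an assumption-laden illustration: `FileInfoSketch` is a plain dataclass standing in for the real `FileInfo` model, and the `filename` metadata key and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class FileInfoSketch:
    """Hypothetical stand-in for FileInfo with the new optional field."""
    filename: str
    original_filename: Optional[str] = None


def file_info_from_metadata(meta: Mapping[str, str]) -> FileInfoSketch:
    # .get() returns None for entries written before the field existed.
    return FileInfoSketch(
        filename=meta["filename"],
        original_filename=meta.get("original_filename"),
    )


def display_name(info: FileInfoSketch) -> str:
    # Prefer the pre-sanitization name; fall back to the stored one.
    return info.original_filename or info.filename


old_entry = {"filename": "report_2024.csv"}
new_entry = {"filename": "报告_2024.csv", "original_filename": "报告 2024.csv"}
print(display_name(file_info_from_metadata(old_entry)))
print(display_name(file_info_from_metadata(new_entry)))
```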

Test plan

pytest tests/unit/test_output_processor.py — 41 passed
pytest tests/unit/ — 496 passed, 0 failures
flake8 src/ — 0 errors
black src/ tests/ --check — clean
mypy src/ — success, 0 issues
bandit -r src/ -s B104,B108 --severity-level high — 0 high-severity issues
End-to-end test with LibreChat after #12977 merges (upload Unicode filename → execute → download → verify name preserved)

🤖 Generated with Claude Code

- Introduced `original_filename` field in the FileInfo model to store pre-sanitization filenames.
- Updated file upload and batch upload functions to include the original filename in metadata.
- Enhanced file listing to return the original filename if available, improving metadata accuracy.
- Adjusted file service methods to handle the new original filename parameter for better file management.

…-pass approach

Align sanitize_filename with LibreChat#12977's sanitizeFilenameSegment:
- NFC-normalize before sanitizing (handles decomposed accents)
- Two-pass: strict ASCII [a-zA-Z0-9._-], permissive non-ASCII (only blocks C1 controls U+0080-U+009F)
- Preserves emoji (📊) and ZWJ sequences that \w alone would strip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

usnavy13 · 2026-05-06T19:21:39Z

Danny's fix landed in LibreChat (#12977 merged to dev). I've reviewed his final implementation and updated our sanitizer to match his approach:

What changed in this follow-up commit:

Switched from \w regex to a two-pass approach matching LC's sanitizeFilenameSegment — strict for ASCII [a-zA-Z0-9._-], permissive for non-ASCII (blocks only C1 controls)
Added NFC normalization before sanitizing (handles decomposed accents like e + U+0301 → é)
Emoji and ZWJ sequences are now preserved (LC allows \p{Emoji} plus the zero-width joiner U+200D; our \w did not)
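
The rules in this follow-up can be sketched as below. This is a simplified illustration, not the merged implementation: the function name `sanitize_two_pass` and the `_` replacement are assumptions, the ASCII and non-ASCII rules are applied per character in one loop rather than as two literal passes, and the path traversal, length, and leading-dot handling are omitted.

```python
import re
import unicodedata

# Strict allow-list for ASCII characters.
_ASCII_SAFE = re.compile(r"[A-Za-z0-9._-]")


def sanitize_two_pass(name: str, replacement: str = "_") -> str:
    """Hypothetical sketch: NFC-normalize, then strict ASCII / permissive non-ASCII."""
    normalized = unicodedata.normalize("NFC", name)
    out = []
    for ch in normalized:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII: keep only letters, digits, dot, underscore, hyphen.
            out.append(ch if _ASCII_SAFE.fullmatch(ch) else replacement)
        elif cp <= 0x9F:
            # C1 control characters U+0080-U+009F are blocked.
            out.append(replacement)
        else:
            # Everything else (CJK, Cyrillic, emoji, ZWJ) passes through.
            out.append(ch)
    return "".join(out)


print(sanitize_two_pass("cafe\u0301.txt"))      # decomposed accent is composed
print(sanitize_two_pass("📊 report.csv"))       # emoji kept, space replaced
print(sanitize_two_pass("👩\u200d💻.png"))       # ZWJ sequence kept intact
```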

Compatibility verified: both sides now produce identical output for CJK, Cyrillic, Korean, Arabic, accented Latin, emoji, and all dangerous ASCII chars.

Ready for end-to-end testing once LC's dev is deployed — the sanitization logic is now aligned.

The fake_store function in TestLibreChatUploadBatch had a fixed parameter list missing the new original_filename kwarg, causing a TypeError when the endpoint passed it through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
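
The failure mode described in this commit message can be reproduced in isolation. The sketch below is hypothetical: the parameter names and the `call_like_endpoint` helper are assumptions and do not mirror the actual test code.

```python
def fake_store_fixed_params(session_id, filename, content):
    """Test double with a fixed signature; rejects unknown keyword arguments."""
    return "file-id"


def fake_store_tolerant(session_id, filename, content, **kwargs):
    """Test double that accepts new keyword arguments such as original_filename."""
    return kwargs.get("original_filename")


def call_like_endpoint(store):
    # Assumed call shape: the endpoint now forwards original_filename as a kwarg.
    return store("s1", "report.csv", b"data", original_filename="报告.csv")


try:
    call_like_endpoint(fake_store_fixed_params)
    raised = False
except TypeError:
    raised = True

print(raised)
print(call_like_endpoint(fake_store_tolerant))
```

Accepting `**kwargs` in a test double keeps it from breaking each time the real function gains an optional parameter, at the cost of not catching misspelled keyword names.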

usnavy13 added 2 commits May 6, 2026 17:12

chore: Remove outdated repository guidelines and add CLAUDE.md reference

458d9a7

usnavy13 closed this May 6, 2026

usnavy13 reopened this May 6, 2026

usnavy13 merged commit ff69603 into dev May 6, 2026
9 checks passed

usnavy13 deleted the filename-sani-fix branch May 7, 2026 02:20

usnavy13 merged 4 commits into dev from filename-sani-fix