fix: Preserve Unicode filenames in sanitization and persist original names #89
Merged
- Introduced `original_filename` field in the FileInfo model to store pre-sanitization filenames.
- Updated file upload and batch upload functions to include the original filename in metadata.
- Enhanced file listing to return the original filename if available, improving metadata accuracy.
- Adjusted file service methods to handle the new original filename parameter for better file management.
…-pass approach

Align `sanitize_filename` with LibreChat#12977's `sanitizeFilenameSegment`:
- NFC-normalize before sanitizing (handles decomposed accents)
- Two-pass: strict ASCII `[a-zA-Z0-9._-]`, permissive non-ASCII (only blocks C1 controls U+0080-U+009F)
- Preserves emoji (📊) and ZWJ sequences that `\w` alone would strip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
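The two-pass approach described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `sanitize_filename_sketch`, the pattern names, and the `_` replacement character are assumptions, and the real `sanitize_filename` may handle further cases (such as length limits or leading dots) that are not shown here.

```python
import re
import unicodedata

# Pass 1 (strict ASCII): match any ASCII character outside [a-zA-Z0-9._-].
# Code points from U+0080 upward are excluded from this pass entirely.
_ASCII_UNSAFE = re.compile(r"[^a-zA-Z0-9._\-\x80-\U0010ffff]")

# Pass 2 (permissive non-ASCII): match only the C1 control range.
_C1_CONTROLS = re.compile(r"[\x80-\x9f]")


def sanitize_filename_sketch(name: str, replacement: str = "_") -> str:
    """Hypothetical two-pass sanitizer; the replacement char is an assumption."""
    # NFC first, so "e" + combining acute becomes the single code point U+00E9.
    name = unicodedata.normalize("NFC", name)
    name = _ASCII_UNSAFE.sub(replacement, name)
    return _C1_CONTROLS.sub(replacement, name)
```

Because the second pass blocks only U+0080-U+009F, emoji and the zero-width joiner (U+200D) pass through untouched, which a `\w`-based character class would not allow.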
Danny's fix landed in LibreChat (#12977 merged to …).

What changed in this follow-up commit: …

Compatibility verified: both sides now produce identical output for CJK, Cyrillic, Korean, Arabic, accented Latin, emoji, and all dangerous ASCII chars. Ready for end-to-end testing once LC's …
The `fake_store` function in `TestLibreChatUploadBatch` had a fixed parameter list missing the new `original_filename` kwarg, causing a `TypeError` when the endpoint passed it through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
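The failure mode and its fix can be illustrated with a pair of stand-in stubs. The signatures below are hypothetical: the real `fake_store` in the test suite is not shown in this PR page, and it may well be async or take different positional parameters.

```python
calls = []


def old_fake_store(session_id, filename, content):
    # Fixed parameter list: any extra keyword argument raises TypeError.
    return "file-1"


def fake_store(session_id, filename, content, original_filename=None, **kwargs):
    # Accepts the new kwarg explicitly and tolerates later additions.
    calls.append({"filename": filename, "original_filename": original_filename})
    return "file-1"
```

Accepting `**kwargs` as well is a design choice for the sketch: it keeps the test double from breaking again the next time the endpoint grows a keyword argument.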
Summary
- Changed the `sanitize_filename` regex from `[^a-zA-Z0-9.-]` to `[^\w.\-]` (Python 3 `\w` preserves Unicode letters/digits across all scripts while still blocking path separators, control chars, and shell metacharacters)
- Added an `original_filename` field to the `FileInfo` model and Redis metadata so pre-sanitization filenames are recoverable
- Passed the original filename from the `/upload` and `/upload/batch` endpoints to Redis storage
- Updated `GET /files/{session_id}` to return the true original filename in `metadata.original-filename`

Context
Filed danny-avila/LibreChat#12975 for the matching client-side fix. Danny opened LibreChat#12977, which updates LC's `sanitizeFilename` to preserve Unicode and adds RFC 8187 Content-Disposition headers.

Do not merge until LibreChat#12977 is merged and we can test the full pipeline end-to-end: both sides need to preserve Unicode for filenames to survive the round trip.
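For reference, an RFC 8187 style `Content-Disposition` value carries the Unicode name in a percent-encoded `filename*` parameter alongside an ASCII fallback. The sketch below shows the header shape in Python only as an illustration; it is not LibreChat's implementation, and the helper name and the `?` fallback strategy are assumptions.

```python
from urllib.parse import quote

# Punctuation that RFC 8187 attr-char allows unescaped, besides letters/digits.
_ATTR_CHAR_SAFE = "!#$&+-.^_`|~"


def content_disposition_sketch(filename: str) -> str:
    """Hypothetical helper: build an attachment header with both name forms."""
    # ASCII fallback: non-ASCII characters become "?", quotes/backslashes "_".
    fallback = filename.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace('"', "_").replace("\\", "_")
    # Extended form: UTF-8 bytes, percent-encoded outside the attr-char set.
    encoded = quote(filename, safe=_ATTR_CHAR_SAFE)
    return f'attachment; filename="{fallback}"; ' + f"filename*=UTF-8''{encoded}"
```

A client that understands `filename*` recovers the original Unicode name; older clients fall back to the plain `filename` value.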
What changed
- `src/services/execution/output.py`: `[^a-zA-Z0-9.-]` → `[^\w.\-]`, which preserves CJK, Cyrillic, Arabic, accented Latin, etc.
- `src/models/files.py`: adds `original_filename: Optional[str]` to `FileInfo`
- `src/services/file.py`: stores `original_filename` in Redis via `store_uploaded_file`, propagates through `get_file_info` and `link_file_into_session`
- `src/api/files.py`
- `tests/unit/test_output_processor.py`

Backward compatibility
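To make the regex change concrete, here is a minimal sketch comparing the old and new character classes. The function names are hypothetical, and the `_` replacement character is inferred from the `_` → `_` remark in this description rather than taken from the source file.

```python
import re

OLD_PATTERN = re.compile(r"[^a-zA-Z0-9.-]")  # ASCII-only allowlist
NEW_PATTERN = re.compile(r"[^\w.\-]")        # \w is Unicode-aware for str in Python 3


def sanitize_old(name: str) -> str:
    return OLD_PATTERN.sub("_", name)


def sanitize_new(name: str) -> str:
    return NEW_PATTERN.sub("_", name)


# ASCII input: both versions agree, including on path separators.
print(sanitize_old("../etc/passwd"), sanitize_new("../etc/passwd"))
# Non-ASCII letters: only the new pattern keeps them.
print(sanitize_old("报告.csv"), sanitize_new("报告.csv"))
```

Note that `\w` does not match emoji, so `sanitize_new` still replaces a character such as 📊; that gap is what the later two-pass commit addresses.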
- ASCII filenames are unaffected (`\w` on ASCII = `[a-zA-Z0-9_]`, and `_` → `_` was already a no-op)
- Entries without `original_filename` degrade gracefully via `.get()` fallback

Test plan
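The `.get()` fallback can be exercised with a small check along these lines. The helper name and the choice to fall back to the stored (sanitized) filename are assumptions for illustration; the PR page does not show the actual lookup code or the shape of the Redis record.

```python
from typing import Mapping, Optional


def resolve_original_filename(metadata: Mapping[str, str], stored_name: str) -> str:
    """Hypothetical lookup: prefer the persisted original name when present."""
    original: Optional[str] = metadata.get("original_filename")
    # Records written before the field existed have no key, so .get() yields
    # None and the stored name is returned instead of raising KeyError.
    return original if original else stored_name


legacy_record = {"filename": "__.csv"}
new_record = {"filename": "__.csv", "original_filename": "报告.csv"}

print(resolve_original_filename(legacy_record, "__.csv"))
print(resolve_original_filename(new_record, "__.csv"))
```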
- `pytest tests/unit/test_output_processor.py`: 41 passed
- `pytest tests/unit/`: 496 passed, 0 failures
- `flake8 src/`: 0 errors
- `black src/ tests/ --check`: clean
- `mypy src/`: success, 0 issues
- `bandit -r src/ -s B104,B108 --severity-level high`: 0 high-severity issues

🤖 Generated with Claude Code