Skip to content

Commit ff69603

Browse files
authored
Merge pull request #89 from usnavy13/filename-sani-fix
fix: Preserve Unicode filenames in sanitization and persist original names
2 parents 48dee34 + 836a352 commit ff69603

7 files changed

Lines changed: 108 additions & 55 deletions

File tree

AGENTS.md

Lines changed: 1 addition & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -1,43 +1 @@
1-
# Repository Guidelines
2-
3-
## Project Structure & Module Organization
4-
Core application code lives in `src/`. Use `src/api/` for FastAPI routes, `src/services/` for orchestration and business logic, `src/services/sandbox/` and `src/services/container/` for execution backends, `src/models/` for request/response models, and `src/config/` for environment-driven settings. Supporting docs are in `docs/`, dashboard assets in `dashboard/`, container/runtime files in `docker/`, and helper scripts in `scripts/`.
5-
6-
Tests are split by scope: `tests/unit/` for isolated service logic, `tests/integration/` for API and dependency-backed flows, `tests/functional/` for live endpoint testing, and `tests/snapshots/` for stored response fixtures.
7-
8-
## Build, Test, and Development Commands
9-
Set up a local environment with:
10-
11-
```bash
12-
python -m venv .venv
13-
source .venv/bin/activate
14-
pip install -r requirements.txt
15-
cp .env.example .env
16-
```
17-
18-
Run locally with `uvicorn src.main:app --reload`. Start required services with `docker compose up -d`, and build the sandbox image with `docker build -t code-interpreter:nsjail .`.
19-
20-
Key verification commands:
21-
22-
```bash
23-
pytest tests/unit/
24-
pytest tests/integration/
25-
pytest tests/functional/ -v
26-
pytest --cov=src tests/
27-
black src/ --check
28-
flake8 src/
29-
mypy src/
30-
bandit -r src/ -s B104,B108 --severity-level high
31-
```
32-
33-
## Coding Style & Naming Conventions
34-
Target Python 3.11+ with 4-space indentation, explicit type hints, and small async-friendly service boundaries. Follow Black formatting and keep code Flake8- and MyPy-clean. Use `snake_case` for modules, functions, and variables; `PascalCase` for classes and Pydantic models; and `UPPER_SNAKE_CASE` for constants and env names.
35-
36-
## Testing Guidelines
37-
Pytest, `pytest-asyncio`, and `pytest-cov` are the standard tools. Name files `test_*.py` and mirror the component under test where practical, for example `tests/unit/test_session_service.py`. Add unit coverage for new logic first, then integration coverage for endpoint or storage changes. Functional tests use `API_BASE`, `API_KEY`, and `API_TIMEOUT`; keep them stable against a real running API.
38-
39-
## Commit & Pull Request Guidelines
40-
Recent history uses short imperative subjects with prefixes such as `fix:`, `docs:`, `chore(...)`, and `feat:`. Keep the first line under 72 characters and reference issues in the body when relevant. Pull requests should explain behavior changes, note config or API contract impacts, and include the commands you ran. Add screenshots when changing the admin dashboard or other visible UI.
41-
42-
## Security & Configuration Tips
43-
Never commit populated `.env` files, API keys, or storage credentials. Use `.env.example` as the template, and review `docs/CONFIGURATION.md` and `docs/SECURITY.md` before changing auth, sandboxing, Redis, or MinIO behavior.
1+
Read CLAUDE.md

src/api/files.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -142,6 +142,7 @@ async def upload_file(
142142
content=content,
143143
content_type=file.content_type,
144144
is_agent_file=is_agent_file,
145+
original_filename=file.filename,
145146
)
146147

147148
uploaded_files.append(
@@ -295,6 +296,7 @@ async def upload_files_batch(
295296
content_type=upload.content_type,
296297
is_agent_file=is_agent_file,
297298
is_read_only=is_read_only,
299+
original_filename=original_filename,
298300
)
299301

300302
results.append(
@@ -413,7 +415,8 @@ async def list_files(
413415
"etag": f'"{file_info.file_id}"',
414416
"metadata": {
415417
"content-type": file_info.content_type,
416-
"original-filename": file_info.filename,
418+
"original-filename": file_info.original_filename
419+
or file_info.filename,
417420
},
418421
"contentType": file_info.content_type,
419422
}

src/models/files.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@ class FileInfo(BaseModel):
4040
content_type: str
4141
created_at: datetime
4242
path: str = Field(..., description="File path in the session")
43+
original_filename: Optional[str] = None
4344

4445
class Config:
4546
json_encoders = {datetime: lambda v: v.isoformat()}

src/services/execution/output.py

Lines changed: 25 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
import os
44
import re
55
import secrets
6+
import unicodedata
67
from pathlib import Path
78
from typing import Any, Dict
89

@@ -213,13 +214,27 @@ def format_error_message(cls, exit_code: int, stderr: str) -> str:
213214

214215
return f"Execution failed (exit code {exit_code}):\n{stderr_clean}"
215216

217+
# ASCII chars safe in filenames — matches LibreChat's ASCII_FILENAME_SAFE_PATTERN.
218+
_ASCII_SAFE = re.compile(r"[a-zA-Z0-9._\-]")
219+
# C1 control characters (U+0080–U+009F) — unsafe in filenames.
220+
_C1_CONTROLS = re.compile(r"[\x80-\x9f]")
221+
222+
@classmethod
223+
def _sanitize_char(cls, char: str) -> str:
224+
"""Replace unsafe ASCII; preserve Unicode letters, marks, numbers, and emoji."""
225+
if ord(char) <= 0x7F:
226+
return char if cls._ASCII_SAFE.match(char) else "_"
227+
return "_" if cls._C1_CONTROLS.match(char) else char
228+
216229
@classmethod
217230
def sanitize_filename(cls, input_name: str) -> str:
218-
"""Sanitize filename to match LibreChat's sanitization logic.
231+
"""Sanitize filename while preserving Unicode letters, digits, and emoji.
219232
220-
Replaces all non-alphanumeric characters (except '.' and '-') with
221-
underscores. This ensures filenames on disk match what LibreChat
222-
reports in the system prompt.
233+
NFC-normalizes, then applies a two-pass approach matching
234+
LibreChat's ``sanitizeFilenameSegment``: strict for ASCII
235+
(only ``[a-zA-Z0-9._-]``), permissive for non-ASCII (keeps
236+
Unicode letters, combining marks, numbers, emoji — blocks
237+
only C1 control characters).
223238
224239
Args:
225240
input_name: Original filename (may include path components)
@@ -234,8 +249,12 @@ def sanitize_filename(cls, input_name: str) -> str:
234249
# Remove any directory components (path traversal prevention)
235250
name = os.path.basename(input_name)
236251

237-
# Replace any non-alphanumeric characters except for '.' and '-'
238-
name = re.sub(r"[^a-zA-Z0-9.-]", "_", name)
252+
# NFC-normalize so decomposed sequences (e + U+0301) become
253+
# precomposed (é) before the regex runs.
254+
name = unicodedata.normalize("NFC", name)
255+
256+
# Two-pass sanitization: strict ASCII, permissive Unicode.
257+
name = "".join(cls._sanitize_char(c) for c in name)
239258

240259
# Ensure the name doesn't start with a dot (hidden file in Unix)
241260
if name.startswith(".") or name == "":

src/services/file.py

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -360,6 +360,7 @@ async def get_file_info(self, session_id: str, file_id: str) -> Optional[FileInf
360360
content_type=metadata["content_type"],
361361
created_at=metadata["created_at"],
362362
path=metadata["path"],
363+
original_filename=metadata.get("original_filename"),
363364
)
364365

365366
async def list_files(self, session_id: str) -> List[FileInfo]:
@@ -419,6 +420,9 @@ async def link_file_into_session(
419420
"source_session_id": source_session_id,
420421
"source_file_id": source_file_id,
421422
"is_read_only": "1",
423+
"original_filename": source_metadata.get(
424+
"original_filename", source_metadata["filename"]
425+
),
422426
}
423427

424428
await self._store_file_metadata(target_session_id, linked_file_id, metadata)
@@ -444,6 +448,7 @@ async def link_file_into_session(
444448
content_type=metadata["content_type"],
445449
created_at=datetime.fromisoformat(metadata["created_at"]),
446450
path=metadata["path"],
451+
original_filename=metadata.get("original_filename"),
447452
)
448453

449454
async def download_file(self, session_id: str, file_id: str) -> Optional[str]:
@@ -771,16 +776,18 @@ async def store_uploaded_file(
771776
content_type: Optional[str] = None,
772777
is_agent_file: bool = False,
773778
is_read_only: bool = False,
779+
original_filename: Optional[str] = None,
774780
) -> str:
775781
"""Store an uploaded file directly.
776782
777783
Args:
778784
session_id: Session identifier
779-
filename: Original filename
785+
filename: Sanitized filename used for storage and sandbox mounting
780786
content: File content as bytes
781787
content_type: MIME type of the file
782788
is_agent_file: If True, marks the file as read-only (agent-assigned)
783789
is_read_only: If True, mounted file should be chmod 444 in sandbox
790+
original_filename: Pre-sanitization filename for metadata recovery
784791
785792
Returns:
786793
The generated file_id
@@ -825,6 +832,7 @@ async def store_uploaded_file(
825832
"1" if is_agent_file else "0"
826833
), # Read-only if agent file
827834
"is_read_only": "1" if (is_read_only or is_agent_file) else "0",
835+
"original_filename": original_filename or filename,
828836
}
829837

830838
await self._store_file_metadata(session_id, file_id, metadata)

tests/integration/test_librechat_compat.py

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1901,7 +1901,13 @@ def setup_mocks(self):
19011901
stored_filenames = []
19021902

19031903
async def fake_store(
1904-
session_id, filename, content, content_type, is_agent_file, is_read_only=False
1904+
session_id,
1905+
filename,
1906+
content,
1907+
content_type,
1908+
is_agent_file,
1909+
is_read_only=False,
1910+
original_filename=None,
19051911
):
19061912
stored_filenames.append(filename)
19071913
return f"fid-{len(stored_filenames)}"

tests/unit/test_output_processor.py

Lines changed: 61 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -62,10 +62,61 @@ def test_absolute_path_stripped(self):
6262
result = OutputProcessor.sanitize_filename("/absolute/path/file.txt")
6363
assert result == "file.txt"
6464

65-
def test_unicode_characters_replaced(self):
66-
"""Test that non-ASCII characters are replaced."""
65+
def test_unicode_characters_preserved(self):
66+
"""Test that Unicode letters are preserved."""
6767
result = OutputProcessor.sanitize_filename("résumé.docx")
68-
assert result == "r_sum_.docx"
68+
assert result == "résumé.docx"
69+
70+
def test_cjk_characters_preserved(self):
71+
"""Test that CJK characters are preserved."""
72+
result = OutputProcessor.sanitize_filename("日本語レポート.xlsx")
73+
assert result == "日本語レポート.xlsx"
74+
75+
def test_cyrillic_characters_preserved(self):
76+
"""Test that Cyrillic characters are preserved."""
77+
result = OutputProcessor.sanitize_filename("файл.txt")
78+
assert result == "файл.txt"
79+
80+
def test_korean_characters_preserved(self):
81+
"""Test that Korean characters are preserved."""
82+
result = OutputProcessor.sanitize_filename("보고서.xlsx")
83+
assert result == "보고서.xlsx"
84+
85+
def test_arabic_characters_preserved(self):
86+
"""Test that Arabic characters are preserved."""
87+
result = OutputProcessor.sanitize_filename("تقرير.pdf")
88+
assert result == "تقرير.pdf"
89+
90+
def test_mixed_unicode_and_ascii(self):
91+
"""Test mixed Unicode and ASCII filename."""
92+
result = OutputProcessor.sanitize_filename("report_2024_報告.pdf")
93+
assert result == "report_2024_報告.pdf"
94+
95+
def test_unicode_with_spaces_sanitized(self):
96+
"""Test that spaces in Unicode filenames are still replaced."""
97+
result = OutputProcessor.sanitize_filename("日本語 レポート.xlsx")
98+
assert result == "日本語_レポート.xlsx"
99+
100+
def test_dangerous_chars_still_blocked(self):
101+
"""Test that shell metacharacters are still replaced."""
102+
result = OutputProcessor.sanitize_filename("file<>|&;$().txt")
103+
assert result == "file________.txt"
104+
105+
def test_underscores_preserved(self):
106+
"""Test that underscores are preserved."""
107+
result = OutputProcessor.sanitize_filename("my_file_name.txt")
108+
assert result == "my_file_name.txt"
109+
110+
def test_emoji_preserved(self):
111+
"""Test that emoji are preserved (matches LibreChat's \\p{Emoji})."""
112+
result = OutputProcessor.sanitize_filename("chart\U0001F4CA.csv")
113+
assert result == "chart\U0001F4CA.csv"
114+
115+
def test_nfd_normalized_to_nfc(self):
116+
"""Test that decomposed Unicode is NFC-normalized before sanitizing."""
117+
# e + combining acute (U+0301) -> precomposed e-acute
118+
result = OutputProcessor.sanitize_filename("Café.txt")
119+
assert result == "Café.txt"
69120

70121
def test_brackets_replaced(self):
71122
"""Test that brackets are replaced with underscores."""
@@ -169,6 +220,13 @@ def test_librechat_skill_bundle_pattern(self):
169220
== "skills/foo/SKILL.md"
170221
)
171222

223+
def test_unicode_segments_preserved(self):
224+
"""Test that Unicode directory and file names are preserved."""
225+
assert (
226+
OutputProcessor.sanitize_relative_path("報告/2024年/レポート.xlsx")
227+
== "報告/2024年/レポート.xlsx"
228+
)
229+
172230
def test_sanitize_filename_unchanged_for_basename_callers(self):
173231
"""Regression: sanitize_filename still flattens (legacy upload behavior)."""
174232
# Existing single-call sites rely on this.

0 commit comments

Comments
 (0)