fix(beta/filesystem): pin read/write encoding to UTF-8#2818
Conversation
The FilesystemToolkit's read_file, write_file, and update_file helpers called Path.read_text() / Path.write_text() without an explicit encoding. Both default to locale.getpreferredencoding(), which is 'cp1252' on most Windows installs and raises: - UnicodeDecodeError on read for any file whose bytes are not valid cp1252 (most modern source files, JSON/YAML with non-ASCII strings, Markdown with smart quotes, anything with emoji or CJK glyphs). - UnicodeEncodeError on write for any payload the model generates that is not representable in cp1252 (the same set of characters). Update file then reads with the platform default and writes with the platform default in the same call, so a round-trip on Windows silently transcodes the file every time the model edits it. Pass encoding='utf-8' to all three call sites so the toolkit produces the same bytes on Linux, macOS, and Windows. Adds three regression tests that drive the toolkit through Agent.ask and assert byte-level equality against the UTF-8 encoding of the payload: - test_write_file_encodes_as_utf8 pins write_file - test_read_file_decodes_as_utf8 pins read_file - test_update_file_preserves_utf8 pins the read+write pair in update_file (with a non-cp1252 replacement string) Existing locale-default reads in older tests (test_write_file, test_update_file) keep their assertions; they pass on POSIX where the default is already UTF-8 and the new tests pin the contract on every platform.
Codecov Report✅ All modified and coverable lines are covered by tests.
... and 62 files with indirect coverage changes 🚀 New features to boost your workflow:
|
marklysze
left a comment
There was a problem hiding this comment.
Clean fix — pinning encoding="utf-8" on all three file-touching operations (read_file, write_file, update_file) is the correct solution for the Windows cp1252 default. The three new tests cover the non-ASCII roundtrip contract exactly right: write the bytes directly, read back and compare at the byte level. LGTM.
marklysze
left a comment
There was a problem hiding this comment.
Correct and consistent with the rest of the series (#2819, #2825, #2826, #2827, #2814, #2815).
#2818 — FilesystemToolkit read/write/update encoding
Three-method fix — _read_file, _write_file, _update_file — all pinned to UTF-8. The update_file change is the most important: without matching encoding on both the read and write side, a Windows machine iterating on a UTF-8 source file would silently re-encode output as cp1252 each edit cycle.
Tests pin the contract at the byte level — write non-ASCII payload, read back raw bytes and compare against payload.encode("utf-8"). Same proven pattern as the other PRs in this series.
One note: the inline comments in the production code are more verbose than necessary (the PR body already carries this context), but they don't affect correctness.
Summary
FilesystemToolkitexposesread_file,write_file, andupdate_filetools that the model uses to manipulate user files on disk. All three calledPath.read_text()/Path.write_text()without an explicit encoding, so each operation inheritedlocale.getpreferredencoding()—cp1252on most Windows installs:read_fileraisedUnicodeDecodeErroron any file whose bytes weren't valid cp1252 (most modern source files, JSON / YAML with non-ASCII strings, Markdown with smart quotes, anything with emoji or CJK glyphs).write_fileraisedUnicodeEncodeErroron any payload the model generated that wasn't representable in cp1252 (the same set of characters).update_filedid the read and the write in the same call against the platform default, so a round-trip on Windows silently transcoded the file every time the model edited it.Same class of bug as #1731 / #2814 / #2815, this time on the LLM-facing beta toolkit.
Fix
Pass
encoding=\"utf-8\"explicitly at all three call sites so the toolkit produces the same bytes on Linux, macOS, and Windows. No change to the public tool signatures; no new options. Inline comments explain why the kwarg is required so a future refactor doesn't drop it again.Test
test/beta/tools/test_filesystem.pygets three regression tests that drive the toolkit throughAgent.ask(pertest/beta/CLAUDE.md's public-API guidance) and assert byte-level equality against the explicit UTF-8 encoding of the payload — not againstPath.read_text(), which would mask the bug on POSIX:test_write_file_encodes_as_utf8test_read_file_decodes_as_utf8test_update_file_preserves_utf8— replaces an ASCII token with a CJK glyph so the read+write pair has to round-trip non-cp1252 content.npm testequivalent:.venv/bin/python -m pytest test/beta/tools/test_filesystem.pyreports 16 passed.