autogen-studio: pin utf-8 encoding on production text-file open() calls (refs #5566) #7723
adv0r wants to merge 1 commit into
Refs microsoft#5566. Continuation of the same encoding sweep started in microsoft#6094 (which fixed the original `playwright_controller.py` site) and continued in the `magentic-one-cli` PR. The reporter of microsoft#5566 explicitly flagged that *"there will be some similar issues in the codebase while using open function"*. This PR closes the autogen-studio production code paths that read or write text files without specifying an encoding.

On a non-UTF-8 default locale (e.g. cp950 on Traditional Chinese Windows, cp1252 on Western European Windows), Python's `open(..., "r")` falls back to the platform encoding and can crash with `UnicodeDecodeError` when the file contains non-ASCII bytes that are invalid in that encoding; writes of non-ASCII text can fail the same way with `UnicodeEncodeError`. For autogen-studio that can happen whenever:

- `schema_manager.py` reads or writes Alembic templates (`env.py`, `script.py.mako`, `alembic.ini`) that may contain non-ASCII paths or comments
- `cli.py` / `lite/studio.py` write the runtime `.env` file (project paths can contain user/folder names with accented characters)
- `web/auth/manager.py` loads a user-supplied YAML config
- `gallery/builder.py` writes `gallery_default.json`

Files touched (11 lines, 5 files):

| File | open() sites fixed |
|------|--------------------|
| autogenstudio/cli.py | 1 |
| autogenstudio/lite/studio.py | 1 |
| autogenstudio/database/schema_manager.py | 6 |
| autogenstudio/web/auth/manager.py | 1 |
| autogenstudio/gallery/builder.py | 1 |

For every site the change is the same shape:

```python
- with open(path, "r") as f:
+ with open(path, "r", encoding="utf-8") as f:
```

Scope deliberately narrowed:

- **Production-code only**: no test fixtures.
- **Skipped `aiofiles.open` in `teammanager.py`**: the API is slightly different and that one deserves its own audited PR.
- **Did NOT sweep `agbench/benchmarks/*`**: those are user-facing scenario scripts that read JSONL produced by other agents; forcing UTF-8 there could mask issues upstream.
No behaviour change for users whose locale is already UTF-8 (UTF-8 is what Python opens these files as on macOS/Linux today). All five files re-parsed cleanly via `ast.parse(...)` after the rewrite.

AI-assisted via Cursor (Claude Opus 4.7). Personal token-burn initiative by @adv0r to use up an expiring Cursor subscription budget on small, useful upstream contributions.

Co-authored-by: Cursor <cursoragent@cursor.com>
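The failure mode described above can be reproduced on any machine by passing the legacy codec explicitly instead of relying on the platform default. The following is a minimal sketch, not code from this PR: the helper names `read_as` and `write_as`, the sample path, and the sample strings are all made up for illustration, and cp1252 stands in for "a non-UTF-8 default locale".

```python
import tempfile
from pathlib import Path


def read_as(path, encoding):
    """Read a text file with an explicit codec; return the text or the error name."""
    try:
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        return type(exc).__name__


def write_as(path, text, encoding):
    """Write text with an explicit codec; return "ok" or the error name."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        return "ok"
    except UnicodeEncodeError as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    # Hypothetical .env content. "Á" is bytes C3 81 in UTF-8, and byte 0x81
    # has no mapping in Python's cp1252 codec, so decoding fails outright.
    env_path.write_bytes("PROJECT_DIR=C:/Users/Ána\n".encode("utf-8"))

    # Simulates open(path, "r") on a machine whose default encoding is cp1252.
    cp1252_result = read_as(env_path, "cp1252")
    # The pinned form from this PR: always decode as UTF-8.
    utf8_result = read_as(env_path, "utf-8")
    # Writing text that the legacy codec cannot represent fails as well.
    write_result = write_as(Path(tmp) / "out.env", "NAME=測試\n", "cp1252")
```

Not every non-ASCII byte raises under a legacy codec; many decode silently to the wrong characters, which is arguably worse than a crash. Pinning `encoding="utf-8"` avoids both outcomes.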
Heads up: the CLA reply needs to come from the human account holder (@adv0r) directly, which I can't auto-post on their behalf in good conscience, since the magic-phrase reply is a binding legal acceptance. I've flagged it on the user's side as a manual TODO and the CLA acceptance should land here shortly. The companion PR #7722 already shows
@adv0r please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

Contributor License Agreement

Contribution License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Why

Refs #5566. Continuation of the encoding sweep started in #6094 (`playwright_controller.py`) and continued in the `magentic-one-cli` PR.

The original report explicitly flagged that "there will be some similar issues in the codebase while using open function". On a non-UTF-8 default locale (cp950 on Traditional Chinese Windows, cp1252 on Western European Windows, …), Python's `open(..., "r")` falls back to the platform encoding and can crash with `UnicodeDecodeError` on non-ASCII bytes that are invalid in that encoding.

This PR closes the autogen-studio production code paths that read or write text files without an explicit encoding.

What changed

5 files, 11 `open()` call sites, all of the same shape (an explicit `encoding="utf-8"` argument added to the call):

- `autogenstudio/cli.py` (`.env`)
- `autogenstudio/lite/studio.py` (`.env`)
- `autogenstudio/database/schema_manager.py` (`env.py` / `script.py.mako` / `alembic.ini`)
- `autogenstudio/web/auth/manager.py`
- `autogenstudio/gallery/builder.py` (`gallery_default.json`)

Why these specific sites

- `env.py` and the other Alembic templates can legitimately contain non-ASCII comments / paths and are read and rewritten on schema upgrades.
- `.env` writers are called with the user's project path. Folder names with accented characters (very common on Windows) would crash the first run.
- `gallery_default.json` can contain non-ASCII strings.

Scope deliberately narrowed

- Skipped `aiofiles.open` in `teammanager.py`: the async API signature is slightly different and deserves its own audited PR.
- Did not sweep `agbench/benchmarks/*`: those are scenario scripts that consume JSONL produced by other agents; forcing UTF-8 there could mask issues upstream.

Verification

- `ast.parse(...)` clean on all 5 touched files (no syntax break).
- No `encoding=...encoding=` (double-add) anywhere.

Recommended next sweep (for a separate PR)

- `agbench` (mixed: some files read agent-emitted JSONL, others are user scripts; needs case-by-case audit)
- `aiofiles.open` sites

AI-assisted via Cursor (Claude Opus 4.7). Personal token-burn initiative by @adv0r to use up an expiring Cursor subscription budget on small, useful upstream contributions.
Made with Cursor
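For the follow-up sweeps mentioned above, one possible way to locate remaining call sites is a small `ast`-based scan. This is only a sketch under stated assumptions and not the method used for this PR: `find_open_without_encoding` is a hypothetical helper, it only recognises calls spelled literally as `open(...)`, and it will miss aliased calls, `Path.open`, `aiofiles.open`, and modes that are not string literals.

```python
import ast


def find_open_without_encoding(source, filename="<string>"):
    """Return sorted line numbers of text-mode open() calls lacking encoding."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        is_open_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
        )
        if not is_open_call:
            continue

        # The mode is the second positional argument or the mode= keyword.
        mode = None
        if len(node.args) >= 2 and isinstance(node.args[1], ast.Constant):
            mode = node.args[1].value
        for kw in node.keywords:
            if kw.arg == "mode" and isinstance(kw.value, ast.Constant):
                mode = kw.value.value
        if isinstance(mode, str) and "b" in mode:
            continue  # binary mode does not accept an encoding

        # encoding is the fourth positional parameter of open(), or a keyword.
        has_encoding = len(node.args) >= 4 or any(
            kw.arg == "encoding" for kw in node.keywords
        )
        if not has_encoding:
            hits.append(node.lineno)
    return sorted(hits)


# Illustrative input only; these lines are not taken from autogen-studio.
SAMPLE = '''\
with open(path, "r") as f:
    a = f.read()
with open(path, "r", encoding="utf-8") as f:
    b = f.read()
with open(path, "rb") as f:
    c = f.read()
with open(path, "w") as f:
    f.write(a)
'''

result = find_open_without_encoding(SAMPLE)
```

On Python 3.10 and later, running the test suite with `python -X warn_default_encoding` is a complementary runtime check: it emits an `EncodingWarning` when a text file is opened without an explicit encoding.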