Skip to content

fix(coding): write user-defined functions module as UTF-8#2819

Open
genisis0x wants to merge 3 commits into
ag2ai:mainfrom
genisis0x:fix/local-executor-func-utf8
Open

fix(coding): write user-defined functions module as UTF-8#2819
genisis0x wants to merge 3 commits into
ag2ai:mainfrom
genisis0x:fix/local-executor-func-utf8

Conversation

@genisis0x
Copy link
Copy Markdown
Contributor

Summary

LocalCommandLineCodeExecutor._setup_functions writes the generated functions module to disk via:

func_file.write_text(func_file_content)

Path.write_text defaults to locale.getpreferredencoding(), which is cp1252 on most Windows installs. Any user-supplied function whose source contains a non-cp1252 glyph crashes mid-write with UnicodeEncodeError. Common triggers:

  • non-Latin docstring (CJK, Cyrillic, Arabic, accented Latin)
  • emoji in a docstring or string literal
  • smart quotes / em-dashes in an Annotated[..., Field(description=\"…\")] parameter
  • accented identifier (rare but legal in Python 3)

The first call also crashes silently in the sense that _setup_functions_complete is left at False, so the next execute_code_blocks invocation re-runs the same broken write path.

Same defect class as #1731 / #2814 / #2815, on the LocalCommandLineCodeExecutor surface.

Fix

Pass encoding=\"utf-8\" explicitly to the write_text call. Inline comment explains why so a future refactor doesn't drop the kwarg. No public API change; no new options.

Test

test/coding/test_user_defined_functions.py gets two regression tests:

  • test_setup_functions_writes_module_as_utf8 drives _setup_functions with a FunctionWithRequirements.from_str source containing Café, em-dash, ☕, 例, and 🚀; reads the resulting module as raw bytes and decodes as UTF-8 explicitly.
  • test_setup_functions_passes_utf8_encoding_to_write_text patches Path.write_text and asserts encoding='utf-8' is in the kwargs of the first call inside _setup_functions. This guards the contract on POSIX, where the platform default is already UTF-8 and the byte-level test would otherwise pass even if the kwarg were dropped.

Both tests tolerate the downstream subprocess syntax-check call _setup_functions runs after the write — it depends on a python executable being on PATH, not guaranteed in every environment, and the write we care about has already happened by the time the subprocess fires.

.venv/bin/python -m pytest test/coding/test_user_defined_functions.py::test_setup_functions_writes_module_as_utf8 test/coding/test_user_defined_functions.py::test_setup_functions_passes_utf8_encoding_to_write_text reports 2 passed. The byte-level test confirms the round-trip; the patch-based test confirms the kwarg, and reverting the production change makes the second one fail on POSIX.

genisis0x added 2 commits May 14, 2026 13:52
LocalCommandLineCodeExecutor._setup_functions wrote the generated
functions module via Path.write_text without an explicit encoding,
inheriting locale.getpreferredencoding() (cp1252 on most Windows
installs). Any user-supplied function whose source contained a
non-cp1252 glyph — non-Latin docstring, emoji, smart quotes in an
Annotated description, accented identifiers — would crash mid-write
with UnicodeEncodeError, leaving _setup_functions_complete=False and
the next execute_code_blocks call to silently re-run the same broken
path.

Pass encoding='utf-8' to the write_text call. Inline comment explains
why so a future refactor doesn't drop the kwarg.

Tests:

- test_setup_functions_writes_module_as_utf8 drives _setup_functions
  with a FunctionWithRequirements.from_str source that contains
  Café, em-dash, ☕, 例, and 🚀; reads the resulting module file as
  raw bytes and asserts the UTF-8 round-trip succeeds.

- test_setup_functions_passes_utf8_encoding_to_write_text patches
  Path.write_text and asserts encoding='utf-8' is in the kwargs of
  the first call inside _setup_functions. This guards the contract
  on POSIX, where the platform default is already UTF-8 and the
  byte-level test would otherwise pass even if the kwarg were
  dropped.

Both tests tolerate the downstream subprocess syntax-check call that
_setup_functions runs after the write — it depends on a 'python'
executable being on PATH, which is not guaranteed in every
environment, and the write we care about has already happened by
the time the subprocess runs.
Copy link
Copy Markdown
Collaborator

@marklysze marklysze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct fix — func_file.write_text(func_file_content, encoding="utf-8") is the minimal change needed to survive any non-cp1252 glyph in a user-supplied function's source. The parametrized test with a realistic payload (emoji, CJK, diacritics) is solid. LGTM.

Copy link
Copy Markdown
Collaborator

@marklysze marklysze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, targeted fix. Path.write_text defaulting to locale.getpreferredencoding() is a well-known Windows footgun — good catch. The test correctly mocks locale.getpreferredencoding to simulate a cp1252 environment and verifies the write succeeds with non-cp1252 content. Approved.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
autogen/coding/local_commandline_code_executor.py 89.47% <100.00%> (-1.98%) ⬇️

... and 67 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants