fix: preserve utf-8 characters in JSONL output#1762
fix: preserve utf-8 characters in JSONL output#1762sarahyurick merged 6 commits intoNVIDIA-NeMo:mainfrom
Conversation
Greptile SummaryThis PR fixes UTF-8 character corruption in Confidence Score: 5/5Safe to merge — minimal, targeted fix with correct test coverage and no regressions. The change is a single-line default addition that is overridable, the previously flagged issues (unclosed file handle, wrong assertion literal) are both resolved, and the two new tests correctly verify the default and override paths. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["JsonlWriter.write_data(task, file_path)"] --> B["Build default write_kwargs\n{lines: True, orient: records, force_ascii: False}"]
B --> C["write_kwargs.update(self.write_kwargs)\n(user overrides applied here)"]
C --> D{"force_ascii in\nself.write_kwargs?"}
D -- "Yes (e.g. True)" --> E["ASCII-escaped output\n(\\u4f60 etc.)"]
D -- "No" --> F["UTF-8 literal output\n(你好, 世界 preserved)"]
E --> G["df.to_json(file_path, **write_kwargs)"]
F --> G
Reviews (6): Last reviewed commit: "Merge branch 'main' into fix/issue-1582" | Re-trigger Greptile |
| result = writer.process(batch) | ||
|
|
||
| file_path = result.data[0] | ||
| payload = open(file_path, encoding="utf-8").read() |
There was a problem hiding this comment.
open(...) is called without a context manager, leaving the file handle open until GC collects it. On some platforms or when the test runner runs many tests back-to-back, this can exhaust file descriptors.
| payload = open(file_path, encoding="utf-8").read() | |
| with open(file_path, encoding="utf-8") as f: | |
| payload = f.read() |
There was a problem hiding this comment.
Let's apply this suggestion please.
There was a problem hiding this comment.
| payload = open(file_path, encoding="utf-8").read() | |
| with open(file_path, encoding="utf-8") as f: | |
| payload = f.read() |
Tip: You can customize Greptile's behavior for this repo with .greptile/rules and .greptile/config.
sarahyurick
left a comment
There was a problem hiding this comment.
Thanks @nightcityblade !
| result = writer.process(batch) | ||
|
|
||
| file_path = result.data[0] | ||
| payload = open(file_path, encoding="utf-8").read() |
There was a problem hiding this comment.
Let's apply this suggestion please.
|
Pushed an update in bdbcce8. Changes made:
Validation: the environment here is missing |
| with open(file_path, encoding="utf-8") as f: | ||
| payload = f.read() | ||
| assert "你好, 世界" not in payload | ||
| assert "\u4f60" in payload |
There was a problem hiding this comment.
Assertion checks the wrong value — test will always fail
"\u4f60" is a Python Unicode escape; Python evaluates it to the single character 你 at parse time. So assert "\u4f60" in payload is identical to assert "你" in payload. But when force_ascii=True the file contains the literal six-character sequence \u4f60 — the character 你 is absent — so the assertion will always fail.
Use a raw backslash to match the literal escape sequence in the file:
| assert "\u4f60" in payload | |
| assert "\\u4f60" in payload |
|
/ok to test 291898c |
|
/ok to test 16ec3c9 |
Fixes #1582
Summary
JsonlWritertoforce_ascii=FalseTesting