Skip to content

fix: preserve utf-8 characters in JSONL output#1762

Merged
sarahyurick merged 6 commits intoNVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1582
Apr 13, 2026
Merged

fix: preserve utf-8 characters in JSONL output#1762
sarahyurick merged 6 commits intoNVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1582

Conversation

@nightcityblade
Copy link
Copy Markdown
Contributor

Fixes #1582

Summary

  • default JsonlWriter to force_ascii=False
  • add a regression test for UTF-8 output preservation

Testing

  • python3 -m compileall nemo_curator/stages/text/io/writer/jsonl.py tests/stages/text/io/writer/test_jsonl.py

@nightcityblade nightcityblade requested a review from a team as a code owner April 8, 2026 03:15
@nightcityblade nightcityblade requested review from meatybobby and removed request for a team April 8, 2026 03:15
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 8, 2026

Greptile Summary

This PR fixes UTF-8 character corruption in JsonlWriter output by defaulting force_ascii to False in the pandas.DataFrame.to_json call, while still allowing callers to override it via write_kwargs. Two regression tests cover both the new default and the override path.

Confidence Score: 5/5

Safe to merge — minimal, targeted fix with correct test coverage and no regressions.

The change is a single-line default addition that is overridable, the previously flagged issues (unclosed file handle, wrong assertion literal) are both resolved, and the two new tests correctly verify the default and override paths.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/stages/text/io/writer/jsonl.py Single-line fix adding force_ascii: False to the default write_kwargs dict, allowing write_kwargs.update(self.write_kwargs) to override it; correct and minimal.
tests/stages/text/io/writer/test_jsonl.py Two new tests verify default UTF-8 preservation and force_ascii=True override; context managers used for file I/O and assertions are correctly written with "\\u4f60" raw literals.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["JsonlWriter.write_data(task, file_path)"] --> B["Build default write_kwargs\n{lines: True, orient: records, force_ascii: False}"]
    B --> C["write_kwargs.update(self.write_kwargs)\n(user overrides applied here)"]
    C --> D{"force_ascii in\nself.write_kwargs?"}
    D -- "Yes (e.g. True)" --> E["ASCII-escaped output\n(\\u4f60 etc.)"]
    D -- "No" --> F["UTF-8 literal output\n(你好, 世界 preserved)"]
    E --> G["df.to_json(file_path, **write_kwargs)"]
    F --> G
Loading

Reviews (6): Last reviewed commit: "Merge branch 'main' into fix/issue-1582" | Re-trigger Greptile

result = writer.process(batch)

file_path = result.data[0]
payload = open(file_path, encoding="utf-8").read()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unclosed file handle

open(...) is called without a context manager, leaving the file handle open until GC collects it. On some platforms or when the test runner runs many tests back-to-back, this can exhaust file descriptors.

Suggested change
payload = open(file_path, encoding="utf-8").read()
with open(file_path, encoding="utf-8") as f:
payload = f.read()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's apply this suggestion please.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
payload = open(file_path, encoding="utf-8").read()
with open(file_path, encoding="utf-8") as f:
payload = f.read()

Tip: You can customize Greptile's behavior for this repo with .greptile/rules and .greptile/config.

Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nightcityblade !

result = writer.process(batch)

file_path = result.data[0]
payload = open(file_path, encoding="utf-8").read()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's apply this suggestion please.

Comment thread tests/stages/text/io/writer/test_jsonl.py
@nightcityblade
Copy link
Copy Markdown
Contributor Author

Pushed an update in bdbcce8.

Changes made:

  • applied the with open(...) suggestion in the UTF-8 regression test
  • added a focused override test to confirm user-provided force_ascii=True still wins over the new default

Validation: the environment here is missing ray, so the full pytest target could not be imported; I ran focused local sanity checks on the touched test logic instead.

with open(file_path, encoding="utf-8") as f:
payload = f.read()
assert "你好, 世界" not in payload
assert "\u4f60" in payload
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Assertion checks the wrong value — test will always fail

"\u4f60" is a Python Unicode escape; Python evaluates it to the single character at parse time. So assert "\u4f60" in payload is identical to assert "你" in payload. But when force_ascii=True the file contains the literal six-character sequence \u4f60 — the character is absent — so the assertion will always fail.

Use a raw backslash to match the literal escape sequence in the file:

Suggested change
assert "\u4f60" in payload
assert "\\u4f60" in payload

@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test 291898c

@sarahyurick
Copy link
Copy Markdown
Contributor

/ok to test 16ec3c9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default jsonl_writer to write data with kwarg force_ascii=False

3 participants