Commit eaa42e3

hyunhee-jobundolee
authored and committed
fix(runner): use utf-8 encoding for subprocess I/O on Windows
Objective: On Korean Windows (and other non-UTF-8 Windows locales), running opendataloader-pdf fails immediately with UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2, because the JAR outputs UTF-8 but the Python wrapper reads it using the system locale encoding (cp949).

Approach: Replace locale.getpreferredencoding(False) with a hard-coded "utf-8" for both subprocess.run (quiet mode) and subprocess.Popen (streaming mode). The JAR always outputs UTF-8 regardless of OS locale, so tying the decoder to the system locale was always wrong. Also switch sys.stdout.write(line) to sys.stdout.buffer.write with utf-8 encoding, so the decoded text reaches the terminal correctly on Windows, where stdout may also default to cp949.

Evidence: Verified source after patch: encoding="utf-8" appears in both call sites (2 occurrences), locale.getpreferredencoding is removed (0 occurrences), and stdout.buffer.write is added (1 occurrence). Before: UnicodeDecodeError on the first byte of a UTF-8 multibyte sequence. After: the subprocess reads and writes UTF-8 correctly on cp949 Windows.
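The failure described above can be reproduced without the JAR. The following is a minimal sketch, assuming the JAR's output contains a character such as U+2022 (bullet), whose UTF-8 encoding starts with byte 0xe2; the sample bytes and the helper name decode_or_report are illustrative and not part of the project.

```python
# Hypothetical stand-in for bytes the JAR might emit: "• ok\n" encoded as UTF-8.
# The bullet (U+2022) encodes to 0xE2 0x80 0xA2, so the first byte is 0xe2.
JAR_BYTES = b"\xe2\x80\xa2 ok\n"


def decode_or_report(data: bytes, encoding: str) -> str:
    """Decode data with the given codec, or return a short error description."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError with {encoding}: {exc.reason}"


if __name__ == "__main__":
    # What the wrapper effectively did on a Korean Windows locale.
    print(decode_or_report(JAR_BYTES, "cp949"))
    # What the wrapper does after the patch.
    print(decode_or_report(JAR_BYTES, "utf-8"), end="")
```

Decoding with cp949 fails because 0xe2 followed by 0x80 is not a valid cp949 double-byte sequence, while the same bytes are well-formed UTF-8.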
1 parent ad6e906 commit eaa42e3

1 file changed

Lines changed: 3 additions & 3 deletions

python/opendataloader-pdf/src/opendataloader_pdf/runner.py

@@ -26,7 +26,7 @@ def run_jar(args: List[str], quiet: bool = False) -> str:
                 capture_output=True,
                 text=True,
                 check=True,
-                encoding=locale.getpreferredencoding(False),
+                encoding="utf-8",
             )
             return result.stdout
 
@@ -36,11 +36,11 @@ def run_jar(args: List[str], quiet: bool = False) -> str:
             stdout=subprocess.PIPE,
             stderr=subprocess.STDOUT,
             text=True,
-            encoding=locale.getpreferredencoding(False),
+            encoding="utf-8",
         ) as process:
             output_lines: List[str] = []
             for line in process.stdout:
-                sys.stdout.write(line)
+                sys.stdout.buffer.write(line.encode("utf-8", errors="replace"))
                 output_lines.append(line)
 
             return_code = process.wait()
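
For context, the streaming pattern in the second hunk can be sketched as a standalone function. This is an illustration under stated assumptions, not the project's run_jar: the function name stream_utf8, the sink parameter, and the child command (a Python one-liner standing in for the JAR) are all hypothetical.

```python
import io
import subprocess
import sys
from typing import BinaryIO, List, Optional


def stream_utf8(command: List[str], sink: Optional[BinaryIO] = None) -> str:
    """Run command, echo its UTF-8 output to a binary sink, and return the text."""
    # Write bytes directly so the terminal's locale encoding is not involved.
    out = sink if sink is not None else sys.stdout.buffer
    output_lines: List[str] = []
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        encoding="utf-8",  # decode the child's output as UTF-8, not the locale codec
    ) as process:
        for line in process.stdout:
            out.write(line.encode("utf-8", errors="replace"))
            output_lines.append(line)
        return_code = process.wait()
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, command)
    return "".join(output_lines)


# Hypothetical child process that prints a bullet and Hangul text as UTF-8 bytes.
CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write('\\u2022 \\ud55c\\uae00\\n'.encode('utf-8'))",
]

if __name__ == "__main__":
    captured = io.BytesIO()
    text = stream_utf8(CHILD, captured)
    print(repr(text), captured.getvalue())
```

One design note: writing to sys.stdout.buffer bypasses the text layer's own buffer, so code that also calls print or sys.stdout.write in the same process may want to flush sys.stdout first to keep output in order.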
