fix(runner): use utf-8 encoding for subprocess I/O on Windows

hyunhee-jo · bundolee · commit eaa42e3c76e6 · 2026-05-06T14:46:52.000+09:00
Objective: On Korean Windows (and other non-UTF-8 Windows locales),
running opendataloader-pdf fails immediately with
UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2
because the JAR outputs UTF-8 but the Python wrapper reads it
using the system locale encoding (cp949).
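
For reference, the failure can be reproduced without the JAR. The snippet
below is a minimal sketch: the sample bytes and the try_decode helper are
illustrative stand-ins, not part of runner.py.

```python
# Stand-in for JAR output: a UTF-8 string whose first character (U+2713)
# encodes to the three bytes e2 9c 93.
sample = "\u2713 done".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data, returning the error type and reason instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc.reason}"


# Decoding with the Korean Windows locale codec fails on the 0xe2 lead byte.
cp949_result = try_decode(sample, "cp949")

# Decoding with the encoding the bytes were actually written in succeeds.
utf8_result = try_decode(sample, "utf-8")

print(cp949_result)
print(utf8_result)
```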

Approach: Replace locale.getpreferredencoding(False) with a hard-coded
"utf-8" in both subprocess.run (quiet mode) and subprocess.Popen
(streaming mode). The JAR always emits UTF-8 regardless of the OS locale,
so decoding its output with the system locale encoding was incorrect.
Also replace sys.stdout.write(line) with
sys.stdout.buffer.write(line.encode("utf-8", errors="replace")) so the
echoed output is written as UTF-8 bytes instead of being re-encoded with
stdout's default encoding, which may also be cp949 on Windows.
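
The streaming path can be sketched in isolation as follows. This is an
illustration under assumptions, not the code in runner.py: stream_utf8,
CHILD_CODE, and the sink parameter are hypothetical names, and a Python
child process stands in for the JAR.

```python
import io
import subprocess
import sys
from typing import BinaryIO, List


def stream_utf8(cmd: List[str], sink: BinaryIO) -> str:
    """Run cmd, echo each output line to sink as UTF-8 bytes, return the text."""
    lines: List[str] = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        encoding="utf-8",  # decode the child's bytes as UTF-8, not the locale
    ) as process:
        assert process.stdout is not None
        for line in process.stdout:
            # Write bytes so the sink's own text encoding is never involved.
            sink.write(line.encode("utf-8", errors="replace"))
            lines.append(line)
        return_code = process.wait()
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, cmd)
    return "".join(lines)


# Hypothetical child that prints UTF-8 bytes, as the JAR is described to do.
# The raw string keeps the escapes intact so the child interprets them.
CHILD_CODE = (
    r"import sys; "
    r"sys.stdout.buffer.write('\ud55c\uae00 \u2713\n'.encode('utf-8'))"
)

captured = io.BytesIO()
text = stream_utf8([sys.executable, "-c", CHILD_CODE], captured)
```

Passing sys.stdout.buffer as the sink would mirror the patched call site;
io.BytesIO is used here only so the written bytes can be inspected.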

Evidence: Verified the source after the patch: encoding="utf-8" appears
at both call sites (2 occurrences), locale.getpreferredencoding is gone
(0 occurrences), and stdout.buffer.write is present (1 occurrence).
Before: UnicodeDecodeError on first byte of UTF-8 multibyte sequence.
After: subprocess reads and writes UTF-8 correctly on cp949 Windows.
diff --git a/python/opendataloader-pdf/src/opendataloader_pdf/runner.py b/python/opendataloader-pdf/src/opendataloader_pdf/runner.py
@@ -26,7 +26,7 @@ def run_jar(args: List[str], quiet: bool = False) -> str:
                     capture_output=True,
                     text=True,
                     check=True,
-                    encoding=locale.getpreferredencoding(False),
+                    encoding="utf-8",
                 )
                 return result.stdout
 
@@ -36,11 +36,11 @@ def run_jar(args: List[str], quiet: bool = False) -> str:
                 stdout=subprocess.PIPE,
                 stderr=subprocess.STDOUT,
                 text=True,
-                encoding=locale.getpreferredencoding(False),
+                encoding="utf-8",
             ) as process:
                 output_lines: List[str] = []
                 for line in process.stdout:
-                    sys.stdout.write(line)
+                    sys.stdout.buffer.write(line.encode("utf-8", errors="replace"))
                     output_lines.append(line)
 
                 return_code = process.wait()