fix: support non-UTF-8 encodings in eval data loading #4100
CodeForgeNet wants to merge 1 commit into microsoft:main
Conversation
Fixes microsoft#3670

pd.read_json defaulted to UTF-8 only; files encoded with utf-8-sig (BOM) raised ValueError: Expected object or value.

- Added _detect_encoding() BOM detection in load_data.py, _evaluate.py, _utils.py
- Added fallback encoding chain: utf-8, utf-8-sig, latin-1, cp1252
- Improved error messages to show which encodings were attempted
- Added a test case and a utf-8-sig encoded test data file
@microsoft-github-policy-service agree
Hi, thank you for your interest in helping to improve the prompt flow experience and for your contribution. We've noticed that there hasn't been recent engagement on this pull request. If this is still an active work stream, please let us know by pushing some changes or leaving a comment.
Still active. Ready for review and merge whenever the team has bandwidth.
Hi, thank you for your contribution. Since there has not been recent engagement, we are going to close this out. Feel free to reopen if you'd like to continue working on these changes. Please be sure to remove the
Fixes #3670
The eval SDK only read JSONL files as UTF-8. If your data had a BOM (utf-8-sig),
which is common for multilingual content generated on Windows, loading failed
immediately with ValueError: Expected object or value. Not helpful.

The fix adds BOM detection before reading and a fallback chain
(utf-8 → utf-8-sig → latin-1 → cp1252) so the loader handles real-world
files without requiring users to re-encode their data.
Three files touched:

- promptflow/_utils/load_data.py: _pd_read_file() now detects encoding before calling pd.read_json() on .jsonl files
- evaluate/_evaluate.py: _validate_and_load_data() gets the same treatment
- evaluate/_utils.py: load_jsonl() updated with BOM detection + fallback

Added a utf-8-sig encoded test file with multilingual content and a unit test
that would have caught this from the start.
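A sketch of the kind of regression test described above, using a simplified stand-in for load_jsonl() (the real loader lives in evaluate/_utils.py and does more): write multilingual JSONL with a UTF-8 BOM, as Windows tools often do, and assert it round-trips.

```python
import json
import os
import tempfile

def load_jsonl(path):
    # Simplified stand-in for the fixed loader: utf-8-sig also decodes
    # plain UTF-8, so one open() call handles both BOM and non-BOM files.
    with open(path, "r", encoding="utf-8-sig") as f:
        return [json.loads(line) for line in f if line.strip()]

def test_load_jsonl_with_bom():
    rows = [{"question": "¿Qué hora es?"}, {"question": "何時ですか"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.jsonl")
        # utf-8-sig on write prepends the BOM that triggered the original bug.
        with open(path, "w", encoding="utf-8-sig") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        assert load_jsonl(path) == rows
```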
Checklist