fix: support non-UTF-8 encodings in eval data loading #4100

Open
CodeForgeNet wants to merge 1 commit into microsoft:main from CodeForgeNet:fix/eval-utf8-encoding-support

@CodeForgeNet

Fixes #3670

The eval SDK only read JSONL files as UTF-8. If your data had a BOM (utf-8-sig)
— common for multilingual content generated on Windows — it failed immediately
with ValueError: Expected object or value. Not helpful.
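
The failure mode can be reproduced without the SDK. The sketch below uses the standard-library `json` module rather than `pd.read_json`, so the exact error text differs from the one quoted above; the payload and the `try_parse` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical one-line JSONL payload, encoded the way a BOM-writing tool saves it.
raw = '{"query": "héllo"}\n'.encode("utf-8-sig")  # bytes begin with EF BB BF


def try_parse(data: bytes, encoding: str):
    """Decode with the given codec, then parse; return (ok, result_or_error_text)."""
    try:
        return True, json.loads(data.decode(encoding))
    except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError are both ValueErrors
        return False, str(exc)


# Plain utf-8 decodes the bytes but keeps the BOM as a leading U+FEFF character,
# which the JSON parser rejects.
plain = try_parse(raw, "utf-8")

# utf-8-sig strips the BOM during decoding, so the same bytes parse cleanly.
with_sig = try_parse(raw, "utf-8-sig")
```

Note that decoding itself does not fail under plain `utf-8`; the parser is what trips over the leftover BOM character.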

The fix adds BOM detection before reading and a fallback chain
(utf-8 → utf-8-sig → latin-1 → cp1252) so the loader handles real-world
files without requiring users to re-encode their data.
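
A minimal sketch of the approach described above, assuming a standalone loader built on the standard library. The names `detect_bom_encoding`, `load_jsonl_with_fallback`, and `FALLBACK_ENCODINGS` are hypothetical and are not taken from the diff; the actual change wires this logic into the existing pandas-based loaders.

```python
import codecs
import json
from pathlib import Path

# Fallback order as stated in the PR description.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "latin-1", "cp1252")


def detect_bom_encoding(raw: bytes):
    """Return 'utf-8-sig' if the bytes start with a UTF-8 BOM, else None."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None


def load_jsonl_with_fallback(path):
    """Parse a JSONL file, trying the BOM-implied codec first, then the fallback chain."""
    raw = Path(path).read_bytes()
    bom = detect_bom_encoding(raw)
    candidates = ([bom] if bom else []) + [e for e in FALLBACK_ENCODINGS if e != bom]

    attempts = []
    for enc in candidates:
        try:
            text = raw.decode(enc)
            # Split on "\n" only; blank lines are skipped.
            return [json.loads(line) for line in text.split("\n") if line.strip()]
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            attempts.append(f"{enc}: {exc}")

    # Report every encoding that was attempted, not just the last failure.
    raise ValueError(f"Could not load {path}; tried: " + "; ".join(attempts))
```

One design point worth checking in review: `latin-1` maps every possible byte to a character, so decoding with it never raises. In a chain like this one, `cp1252` is reached only when parsing (not decoding) fails under `latin-1`.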

Three files touched:

  • promptflow/_utils/load_data.py: _pd_read_file() now detects encoding
    before calling pd.read_json() on .jsonl files
  • evaluate/_evaluate.py: _validate_and_load_data() gets the same treatment
  • evaluate/_utils.py: load_jsonl() updated with BOM detection + fallback

Added a utf-8-sig encoded test file with multilingual content and a unit test
that would have caught this from the start.
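
For reviewers who want to see the shape of such a test, here is a rough sketch. It is not the test from this PR: `read_jsonl` is a hypothetical stand-in for the loader under test, and the sample rows and file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path, encoding="utf-8-sig"):
    """Hypothetical stand-in for the loader under test."""
    with open(path, encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]


def test_load_utf8_sig_multilingual():
    # Illustrative multilingual rows, not the data file added in the PR.
    rows = [{"question": "¿Qué hora es?"}, {"question": "今何時ですか"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "multilingual_bom.jsonl"
        body = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows) + "\n"
        path.write_text(body, encoding="utf-8-sig")

        # Sanity check: the fixture really does start with a UTF-8 BOM.
        assert path.read_bytes().startswith(b"\xef\xbb\xbf")
        # The loader should return the original rows with no BOM artifacts.
        assert read_jsonl(path) == rows
```

Generating the fixture inside the test keeps the BOM from being silently stripped by an editor or a Git attribute, which can happen to checked-in data files.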


Checklist

  • No breaking changes
  • Read the contribution guidelines
  • New dependencies are MIT compatible
  • CHANGELOG updated
  • Test coverage included for the change

Fixes microsoft#3670

pd.read_json defaulted to UTF-8 only. Files encoded with utf-8-sig
(BOM) raised ValueError: Expected object or value.

- Added _detect_encoding() BOM detection in load_data.py, _evaluate.py, _utils.py
- Added fallback encoding chain: utf-8, utf-8-sig, latin-1, cp1252
- Improved error messages to show which encodings were attempted
- Added test case and utf-8-sig encoded test data file
@CodeForgeNet
Author

@microsoft-github-policy-service agree

@github-actions

github-actions bot commented Apr 5, 2026

Hi, thank you for your interest in helping to improve the prompt flow experience and for your contribution. We've noticed that there hasn't been recent engagement on this pull request. If this is still an active work stream, please let us know by pushing some changes or leaving a comment.

github-actions bot added the no-recent-activity label Apr 5, 2026
@CodeForgeNet
Author

CodeForgeNet commented Apr 12, 2026

Still active. Ready for review and merge whenever the team has bandwidth.

github-actions bot removed the no-recent-activity label Apr 12, 2026

Development

Successfully merging this pull request may close these issues.

[BUG]prompt flow eval only supports UTF8 encoding