fix: Sanitize CU output to prevent Unicode corruption in Citation Panel #920
chore: dev to main merge
The Content Understanding analyzeBinary API (v2025-11-01) intermittently corrupts Unicode characters by stripping the high byte (e.g. U+2019 -> U+0019). This causes apostrophes and quotes to render as box characters in the Citation Panel.
Added _sanitize_cu_output() to map known corrupted control characters back to their intended Unicode equivalents after CU processing, before saving to Search/SQL.
Fixes AB#43310
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@copilot review the PR
PR Review:
…ping Agent-Logs-Url: https://github.com/microsoft/Conversation-Knowledge-Mining-Solution-Accelerator/sessions/50f2a621-4990-4aca-8dd8-cfa5ea31ce6f Co-authored-by: Yamini-Microsoft <191316559+Yamini-Microsoft@users.noreply.github.com>
🎉 This PR is included in version 3.23.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀
Problem
The Citation Panel displays apostrophes and other special characters as □ (box characters) instead of the intended characters.
Root Cause
The Content Understanding analyzeBinary API (v2025-11-01) intermittently corrupts Unicode characters by stripping the high byte:
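As a rough illustration of what "stripping the high byte" means, the sketch below simulates the effect on a single code point. It only mimics the observed symptom in Python and is not a description of how the API behaves internally; the helper name strip_high_byte is hypothetical.

```python
def strip_high_byte(ch: str) -> str:
    """Simulate the observed corruption: keep only the low byte of the code point."""
    return chr(ord(ch) & 0xFF)


original = "\u2019"  # RIGHT SINGLE QUOTATION MARK, commonly used as an apostrophe
corrupted = strip_high_byte(original)

# The result is a C0 control character, which most UIs render as a box.
print(f"U+{ord(original):04X} -> U+{ord(corrupted):04X}")  # U+2019 -> U+0019
```

Because the corrupted value is a control character that rarely appears in normal text, it can in principle be mapped back to the intended punctuation.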
Evidence:
Fix
Added a defensive _sanitize_cu_output() function in the data ingestion scripts that maps known corrupted control characters back to their intended Unicode equivalents after CU processing, before saving to Search/SQL.
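A minimal sketch of what such a sanitizer could look like is shown here. It is not the code from this PR: the function name sanitize_cu_output_sketch and the mapping table are hypothetical. Only the U+0019 -> U+2019 pair is taken from the description; the other entries are assumed by analogy (same low byte, high byte 0x20) and would need to be confirmed against real CU output.

```python
from typing import Optional

# Hypothetical mapping of corrupted control characters to intended punctuation.
# Only U+0019 -> U+2019 is stated in the PR; the rest are assumptions.
_CONTROL_CHAR_FIXES = {
    "\x19": "\u2019",  # right single quote / apostrophe (from the PR description)
    "\x18": "\u2018",  # left single quote (assumed)
    "\x1c": "\u201c",  # left double quote (assumed)
    "\x1d": "\u201d",  # right double quote (assumed)
    "\x13": "\u2013",  # en dash (assumed)
    "\x14": "\u2014",  # em dash (assumed)
}
_TRANSLATION_TABLE = str.maketrans(_CONTROL_CHAR_FIXES)


def sanitize_cu_output_sketch(text: Optional[str]) -> Optional[str]:
    """Replace known corrupted control characters; leave everything else untouched."""
    if not text:
        return text  # pass through None and empty strings unchanged
    return text.translate(_TRANSLATION_TABLE)


# Usage: apply after CU processing and before writing to Search/SQL.
print(sanitize_cu_output_sketch("the customer\x19s account"))
```

Using a fixed translation table keeps the fix defensive: text that was never corrupted passes through unchanged, and control characters outside the table are left as-is rather than guessed at.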
Files Changed
Validation
Fixes AB#43310