fix: Sanitize CU output to prevent Unicode corruption in Citation Panel by Yamini-Microsoft · Pull Request #920 · microsoft/Conversation-Knowledge-Mining-Solution-Accelerator

Yamini-Microsoft · 2026-05-19T14:41:43Z

Problem

The Citation Panel displays apostrophes and special characters as □ (box characters) instead of proper characters.

Root Cause

The Content Understanding `analyzeBinary` API (v2025-11-01) intermittently corrupts Unicode characters by stripping the high byte:

`\u2019` (right single quote) → `\u0019` (control character)
`\u201C`/`\u201D` (double quotes) → `\u001C`/`\u001D`
`\u2014` (em dash) → `\u001E`
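
The high-byte-strip pattern can be reproduced in a few lines of Python. This is a minimal sketch for illustration only, not a description of what CU does internally; `strip_high_byte` is a hypothetical helper that keeps only the low 8 bits of each code point.

```python
def strip_high_byte(text: str) -> str:
    """Simulate the reported corruption: keep only the low byte of each code point."""
    return "".join(chr(ord(ch) & 0xFF) for ch in text)


samples = {
    "right single quote": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
}

for name, ch in samples.items():
    corrupted = strip_high_byte(ch)
    print(f"{name}: U+{ord(ch):04X} -> U+{ord(corrupted):04X}")
```

ASCII text passes through this function unchanged, which is consistent with only typographic punctuation being affected.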

Evidence:

Source blobs contain valid `\u2019` ✅
After CU processing, search index contains `\u0019` ❌
39 corrupted characters found in a single document
Previous CU API version (`2024-12-01-preview`) did not exhibit this behavior
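
A count like the one above could be gathered with a small scan for stray control characters. The sketch below is an assumption about how such a check might look, not the script actually used; `count_control_chars` and the sample text are hypothetical.

```python
from collections import Counter

# Whitespace controls that legitimately appear in text and should not be flagged.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def count_control_chars(text: str) -> Counter:
    """Count C0 control characters (U+0000-U+001F) other than tab, LF and CR."""
    return Counter(
        ch for ch in text
        if ord(ch) < 0x20 and ch not in ALLOWED_CONTROLS
    )


doc = "It\u0019s a \u001cquoted\u001d line.\nSecond line is clean."
found = count_control_chars(doc)
for ch, n in sorted(found.items()):
    print(f"U+{ord(ch):04X}: {n}")
print("total:", sum(found.values()))
```

Running the same scan on the source blob and on the indexed text would show whether the control characters were introduced during CU processing.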

Fix

Added a defensive `_sanitize_cu_output()` function in the data ingestion scripts that maps known corrupted control characters back to their intended Unicode equivalents after CU processing, before saving to Search/SQL.

Files Changed

`infra/scripts/index_scripts/03_cu_process_data_text.py`
`infra/scripts/index_scripts/04_cu_process_custom_data.py`

Validation

Verified fix by re-running full ingestion pipeline — all documents clean after sanitization
The fix is zero-cost when CU output is already correct

Fixes AB#43310

chore: dev to main merge

The Content Understanding analyzeBinary API (v2025-11-01) intermittently corrupts Unicode characters by stripping the high byte (e.g. U+2019 -> U+0019). This causes apostrophes and quotes to render as box characters in the Citation Panel. Added _sanitize_cu_output() to map known corrupted control characters back to their intended Unicode equivalents after CU processing, before saving to Search/SQL.

Fixes AB#43310

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Yamini-Microsoft · 2026-05-20T09:33:44Z

@copilot review the PR

Copilot · 2026-05-20T09:40:07Z

PR Review: `fix/cu-unicode-sanitization-43310`

Summary

The PR adds a defensive _sanitize_cu_output() function to fix Unicode corruption caused by the CU analyzeBinary API (v2025-11-01) that strips the high byte from certain Unicode characters. The approach is sound. Below are findings:

✅ Correct Approach

The _sanitize_cu_output() function is applied uniformly through get_field_value(), which is the single entry point for all CU field extractions in both files.
The if not text: return text guard correctly handles None and empty strings.
The fix is zero-cost when CU output is already clean (no corrupted characters present).

⚠️ Issues to Address

1. Questionable \u001a → \u2019 mapping
The \u001a → \u2019 mapping doesn't follow the observed corruption pattern. The CU API corrupts by stripping the high byte (\u20xx → \u00xx), so:

\u201a (single low-9 quotation mark, ‚) would corrupt to \u001a
Mapping \u001a → \u2019 (right single quote) is incorrect; it should be \u001a → \u201a

This could silently turn a legitimate \u001a control character (SUB) into a right single quotation mark. The PR description only documents \u0019 → \u2019 — the \u001a entry appears to be an oversight.

2. Code duplication
_sanitize_cu_output() is copy-pasted identically into both 03_cu_process_data_text.py and 04_cu_process_custom_data.py. This violates DRY and means future corrections must be applied twice. It should be moved to the shared content_understanding_client.py module.

3. PEP 8 spacing
Only one blank line separates create_tables() from _sanitize_cu_output(). PEP 8 requires two blank lines between top-level definitions.

4. \u001e → \u2014 mapping not mathematically consistent
\u2014 (em dash) with the high byte stripped gives \u0014, not \u001e. If \u001e is based on empirical observation from actual CU output (not just arithmetic), that's fine, but it should be noted in the comment that this is empirically observed, not derived, to avoid confusion.
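
The arithmetic behind issues 1 and 4 can be checked directly. The sketch below inverts the high-byte strip by putting `0x20` back as the high byte; `restore_high_byte` is a hypothetical helper, and the assumption that the missing high byte is always `0x20` comes from the pattern described in the PR, not from CU documentation.

```python
import unicodedata


def restore_high_byte(ch: str, high_byte: int = 0x20) -> str:
    """Return the code point whose low byte is `ch` and whose high byte is `high_byte`."""
    return chr((high_byte << 8) | ord(ch))


for corrupted in ("\u0019", "\u001a", "\u001c", "\u001d", "\u001e"):
    restored = restore_high_byte(corrupted)
    print(f"U+{ord(corrupted):04X} -> U+{ord(restored):04X} {unicodedata.name(restored)}")

# The em dash, by contrast, strips to U+0014 under the same arithmetic.
em_dash = "\u2014"
print(f"U+2014 -> U+{ord(em_dash) & 0xFF:04X}")
```

By the arithmetic alone, U+001E would come from U+201E (double low-9 quotation mark) rather than from the em dash, which is why the em dash entry is best documented as empirically observed.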

5. Coverage limited to valueString
get_field_value() only sanitizes valueString fields. If CU output ever returns corrupted text in valueArray or nested field types, those would not be sanitized. For the current use case this appears sufficient, but worth noting.
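
If broader coverage were ever needed, one option is to walk the whole field structure and sanitize every string found. This is a sketch under assumptions: `sanitize_nested` is a hypothetical helper, the nested field shape shown is illustrative rather than taken from CU documentation, and the mapping again uses only the entries from the PR description.

```python
# Mapping taken from the PR description; the exact table in the repo may differ.
_CU_REPLACEMENTS = str.maketrans({
    "\u0019": "\u2019",  # right single quotation mark
    "\u001c": "\u201c",  # left double quotation mark
    "\u001d": "\u201d",  # right double quotation mark
    "\u001e": "\u2014",  # em dash
})


def sanitize_nested(value):
    """Recursively sanitize every string inside nested dicts and lists."""
    if isinstance(value, str):
        return value.translate(_CU_REPLACEMENTS)
    if isinstance(value, dict):
        return {key: sanitize_nested(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize_nested(item) for item in value]
    return value  # numbers, booleans, None: returned untouched


# Illustrative field shape (assumed, not confirmed against the CU response schema).
field = {
    "type": "array",
    "valueArray": [
        {"type": "string", "valueString": "don\u0019t"},
        {"type": "string", "valueString": "\u001cok\u001d"},
    ],
}
clean = sanitize_nested(field)
print(clean["valueArray"][0]["valueString"])
```

The trade-off is that this touches every string in the structure, including keys' values such as `type`, so it is broader than the targeted `valueString` handling the PR uses.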

Suggested Fix for Issue #1 and #3

def _sanitize_cu_output(text):
    """Replace non-printable control characters that may appear in CU output.
    ...
    """
    if not text:
        return text
    replacements = {
        '\u0019': '\u2019',  # right single quotation mark (empirically observed)
        '\u001a': '\u201a',  # single low-9 quotation mark (high-byte-strip pattern)
        '\u001c': '\u201c',  # left double quotation mark
        '\u001d': '\u201d',  # right double quotation mark
        '\u001e': '\u2014',  # em dash (empirically observed)
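    }
    # Assumed completion: the review snippet stops at the mapping.
    for corrupted, intended in replacements.items():
        text = text.replace(corrupted, intended)
    return text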

Verdict

The fix addresses a real, well-documented bug. Issues #2 (duplication) and #1 (wrong mapping for \u001a) should be addressed before merging to prevent a subtle future regression.

…ping Agent-Logs-Url: https://github.com/microsoft/Conversation-Knowledge-Mining-Solution-Accelerator/sessions/50f2a621-4990-4aca-8dd8-cfa5ea31ce6f Co-authored-by: Yamini-Microsoft <191316559+Yamini-Microsoft@users.noreply.github.com>

github-actions · 2026-05-26T07:49:50Z

🎉 This PR is included in version 3.23.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Prajwal-Microsoft and others added 2 commits May 18, 2026 05:44

Merge pull request #914 from microsoft/dev

5f1ae0a

chore: dev to main merge

Copilot AI review requested due to automatic review settings May 19, 2026 14:41

Yamini-Microsoft requested review from Avijit-Microsoft, Prajwal-Microsoft, Roopan-Microsoft, Vinay-Microsoft, aniaroramsft, brittneek, dgp10801, nchandhi and toherman-msft as code owners May 19, 2026 14:41

Yamini-Microsoft marked this pull request as draft May 19, 2026 14:43

Yamini-Microsoft marked this pull request as ready for review May 20, 2026 09:30

Copilot started work on behalf of Yamini-Microsoft May 20, 2026 09:33

Copilot finished work on behalf of Yamini-Microsoft May 20, 2026 09:44

Avijit-Microsoft approved these changes May 20, 2026

Avijit-Microsoft merged commit e081526 into dev May 20, 2026
2 checks passed

github-actions Bot added the released label May 26, 2026

Avijit-Microsoft merged 3 commits into dev from fix/cu-unicode-sanitization-43310