fix: Sanitize CU output to prevent Unicode corruption in Citation Panel#920

Merged
Avijit-Microsoft merged 3 commits into dev from fix/cu-unicode-sanitization-43310
May 20, 2026
Conversation

@Yamini-Microsoft

Contributor

Problem

The Citation Panel displays apostrophes and special characters as □ (box characters) instead of proper characters.

Root Cause

The Content Understanding analyzeBinary API (v2025-11-01) intermittently corrupts Unicode characters by stripping the high byte:

  • \u2019 (right single quote) → \u0019 (control character)
  • \u201C/\u201D (double quotes) → \u001C/\u001D
  • \u2014 (em dash) → \u001E
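
A minimal sketch of the high-byte stripping described above, assuming the corruption is equivalent to keeping only the low 8 bits of each code point. The helper name strip_high_byte is hypothetical and is not part of this PR.

```python
def strip_high_byte(ch: str) -> str:
    """Return the character whose code point keeps only the low 8 bits of ch."""
    return chr(ord(ch) & 0xFF)


# Print what each affected character would become under this assumption.
for ch in ("\u2019", "\u201c", "\u201d", "\u2014"):
    print(f"U+{ord(ch):04X} -> U+{ord(strip_high_byte(ch)):04X}")
```

Under this assumption the em dash would become U+0014 rather than the U+001E listed above, so that entry presumably comes from what was seen in actual CU output rather than from the arithmetic.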

Evidence:

  • Source blobs contain valid \u2019 ✅
  • After CU processing, search index contains \u0019 ❌
  • 39 corrupted characters found in a single document
  • Previous CU API version (2024-12-01-preview) did not exhibit this behavior
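
One way a count like the one above could be reproduced is sketched here, assuming the indexed text is available as a Python string. The names SUSPECT_CHARS and count_suspect_chars are hypothetical, and the character set covers only the control characters named in this description.

```python
from collections import Counter

# Control characters this PR description reports seeing in CU output.
SUSPECT_CHARS = {"\u0019", "\u001c", "\u001d", "\u001e"}


def count_suspect_chars(text: str) -> Counter:
    """Count occurrences of each suspect control character in text."""
    return Counter(ch for ch in text if ch in SUSPECT_CHARS)


sample = "It\u0019s a \u001cquoted\u001d phrase"
counts = count_suspect_chars(sample)
print(sum(counts.values()), "suspect characters found")
```

Running the same check on the source blob and on the indexed text would show at which stage the characters change.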

Fix

Added a defensive _sanitize_cu_output() function in the data ingestion scripts that maps known corrupted control characters back to their intended Unicode equivalents after CU processing, before saving to Search/SQL.

Files Changed

  • infra/scripts/index_scripts/03_cu_process_data_text.py
  • infra/scripts/index_scripts/04_cu_process_custom_data.py

Validation

  • Verified fix by re-running full ingestion pipeline — all documents clean after sanitization
  • The fix is zero-cost when CU output is already correct

Fixes AB#43310

Prajwal-Microsoft and others added 2 commits May 18, 2026 05:44
chore: dev to main merge
The Content Understanding analyzeBinary API (v2025-11-01) intermittently
corrupts Unicode characters by stripping the high byte (e.g. U+2019 -> U+0019).
This causes apostrophes and quotes to render as box characters in the Citation Panel.

Added _sanitize_cu_output() to map known corrupted control characters back to
their intended Unicode equivalents after CU processing, before saving to Search/SQL.

Fixes AB#43310

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 19, 2026 14:41
@Yamini-Microsoft Yamini-Microsoft marked this pull request as draft May 19, 2026 14:43
@Yamini-Microsoft Yamini-Microsoft marked this pull request as ready for review May 20, 2026 09:30
@Yamini-Microsoft

Contributor Author

@copilot review the PR

Copilot AI commented May 20, 2026

Contributor

@copilot review the PR

PR Review: fix/cu-unicode-sanitization-43310

Summary

The PR adds a defensive _sanitize_cu_output() function to fix Unicode corruption caused by the CU analyzeBinary API (v2025-11-01) that strips the high byte from certain Unicode characters. The approach is sound. Below are findings:


✅ Correct Approach

  • The _sanitize_cu_output() function is applied uniformly through get_field_value(), which is the single entry point for all CU field extractions in both files.
  • The if not text: return text guard correctly handles None and empty strings.
  • The fix is zero-cost when CU output is already clean (no corrupted characters present).

⚠️ Issues to Address

1. Questionable \u001a → \u2019 mapping
The \u001a → \u2019 mapping doesn't follow the observed corruption pattern. The CU API corrupts by stripping the high byte (\u20xx → \u00xx), so:

  • \u201a (single low-9 quotation mark, ‚) would corrupt to \u001a
  • Mapping \u001a → \u2019 (right single quote) is incorrect; it should be \u001a → \u201a

This could silently turn a legitimate \u001a control character (SUB) into a right single quotation mark. The PR description only documents \u0019 → \u2019 — the \u001a entry appears to be an oversight.

2. Code duplication
_sanitize_cu_output() is copy-pasted identically into both 03_cu_process_data_text.py and 04_cu_process_custom_data.py. This violates DRY and means future corrections must be applied twice. It should be moved to the shared content_understanding_client.py module.

3. PEP 8 spacing
Only one blank line separates create_tables() from _sanitize_cu_output(). PEP 8 requires two blank lines between top-level definitions.

4. \u001e → \u2014 mapping not mathematically consistent
\u2014 (em dash) with the high byte stripped gives \u0014, not \u001e. If \u001e is based on empirical observation from actual CU output (not just arithmetic), that's fine, but it should be noted in the comment that this is empirically observed, not derived, to avoid confusion.

5. Coverage limited to valueString
get_field_value() only sanitizes valueString fields. If CU output ever returns corrupted text in valueArray or nested field types, those would not be sanitized. For the current use case this appears sufficient, but worth noting.
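
A rough sketch of how the coverage gap in issue 5 could be closed, assuming CU fields arrive as nested dicts and lists whose string values sit under a valueString key. The names sanitize_text and sanitize_field and the field shape are illustrative assumptions, not code from this PR. The table includes only the mappings documented in the PR description and leaves out \u001a because its target is disputed above.

```python
from typing import Any

# Assumed mapping table, limited to entries documented in the PR description.
_CU_FIXES = str.maketrans({
    "\u0019": "\u2019",  # right single quotation mark
    "\u001c": "\u201c",  # left double quotation mark
    "\u001d": "\u201d",  # right double quotation mark
    "\u001e": "\u2014",  # em dash (reported as observed, not derived)
})


def sanitize_text(text: str) -> str:
    """Replace the known corrupted control characters in one string."""
    return text.translate(_CU_FIXES)


def sanitize_field(field: Any) -> Any:
    """Return a copy of field with every nested valueString sanitized."""
    if isinstance(field, dict):
        return {
            key: sanitize_text(value)
            if key == "valueString" and isinstance(value, str)
            else sanitize_field(value)
            for key, value in field.items()
        }
    if isinstance(field, list):
        return [sanitize_field(item) for item in field]
    return field  # numbers, None, and other scalars pass through unchanged


nested = {"type": "array", "valueArray": [{"type": "string", "valueString": "It\u0019s"}]}
print(sanitize_field(nested))
```

Walking the whole structure keeps the fix in one place, at the cost of touching every field instead of only the ones read by get_field_value().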


Suggested Fix for Issues #1 and #3

def _sanitize_cu_output(text):
    """Replace non-printable control characters that may appear in CU output.
    ...
    """
    if not text:
        return text
    replacements = {
        '\u0019': '\u2019',  # right single quotation mark (empirically observed)
        '\u001a': '\u201a',  # single low-9 quotation mark (high-byte-strip pattern)
        '\u001c': '\u201c',  # left double quotation mark
        '\u001d': '\u201d',  # right double quotation mark
        '\u001e': '\u2014',  # em dash (empirically observed)
    }
    # Assumed completion: the review snippet stops at the dict, so the
    # replacement loop and return below are a guess at the remaining body.
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text

Verdict

The fix addresses a real, well-documented bug. Issues #2 (duplication) and #1 (wrong mapping for \u001a) should be addressed before merging to prevent a subtle future regression.

@Avijit-Microsoft Avijit-Microsoft merged commit e081526 into dev May 20, 2026
2 checks passed
@github-actions

Contributor

🎉 This PR is included in version 3.23.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
