fix: hoist sanitize mapping to module constant, fix U+001E em-dash mismap by Yamini-Microsoft · Pull Request #929 · microsoft/Conversation-Knowledge-Mining-Solution-Accelerator

Yamini-Microsoft · 2026-05-22T10:31:42Z

Purpose

Addresses review feedback on PR #922 for the sanitize_cu_output function in content_understanding_client.py:

Performance: Hoisted replacements dict to module-level constant (_CU_REPLACEMENTS) to avoid per-call allocation. Added _CU_BAD_CHARS set with isdisjoint() early-exit for true no-op when text has no corrupted chars.
Bug fix: U+001E was incorrectly mapped to U+2014 (em dash). Per the high-byte stripping corruption model, U+001E corresponds to U+201E (double low-9 quotation mark). Added correct U+0014 -> U+2014 mapping for actual em dash corruption.
Docstring: Removed misleading "zero-cost" claim, clarified no-op semantics.

Does this introduce a breaking change?

Yes
No

Golden Path Validation

I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

sanitize_cu_output correctly maps all corrupted control chars back to intended Unicode equivalents
No regression in CU text processing for 03_cu_process_data_text.py and 04_cu_process_custom_data.py

Other Information

Fixes review comments from PR #922 (comments by Copilot code review).

…smap - Hoisted replacements dict to _CU_REPLACEMENTS module-level constant to avoid per-call allocation overhead. - Added _CU_BAD_CHARS set for O(1) early-exit when text has no corrupted chars. - Fixed U+001E mapping: was incorrectly mapped to U+2014 (em dash), now correctly maps to U+201E (double low-9 quotation mark) per the high-byte stripping model. - Added U+0014 -> U+2014 mapping for actual em dash corruption. - Reworded docstring to clarify no-op semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Improves the sanitize_cu_output utility used by the infra indexing scripts to correct corrupted Unicode characters emitted by the Content Understanding analyzeBinary API, while reducing per-call overhead and fixing an incorrect character mapping.

Changes:

Hoisted the corrupted-control-character replacement mapping to a module-level constant and added a fast no-op early exit via set.isdisjoint().
Fixed the U+001E mapping (now to U+201E) and added the correct U+0014 → U+2014 mapping for em dashes.
Updated the docstring to remove the misleading “zero-cost” wording and clarify no-op semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings May 22, 2026 10:31

Yamini-Microsoft requested review from Avijit-Microsoft, Prajwal-Microsoft, Roopan-Microsoft, Vinay-Microsoft, aniaroramsft, brittneek, dgp10801, nchandhi and toherman-msft as code owners May 22, 2026 10:31

Copilot started reviewing on behalf of Yamini-Microsoft May 22, 2026 10:31 View session

Yamini-Microsoft mentioned this pull request May 22, 2026

refactor: dev to main - Foundry Roles and sanitize CU output for Unicode integrity #922

Open

4 tasks

Copilot AI reviewed May 22, 2026

View reviewed changes

Avijit-Microsoft approved these changes May 22, 2026

View reviewed changes

Avijit-Microsoft merged commit e68f85f into dev May 22, 2026
5 checks passed

Yamini-Microsoft deleted the fix/sanitize-cu-mapping branch May 22, 2026 12:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: hoist sanitize mapping to module constant, fix U+001E em-dash mismap#929

fix: hoist sanitize mapping to module constant, fix U+001E em-dash mismap#929
Avijit-Microsoft merged 1 commit into
devfrom
fix/sanitize-cu-mapping

Yamini-Microsoft commented May 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Yamini-Microsoft commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Does this introduce a breaking change?

Golden Path Validation

Deployment Validation

What to Check

Other Information

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Yamini-Microsoft commented May 22, 2026 •

edited

Loading