Skip to content

fix: hoist sanitize mapping to module constant, fix U+001E em-dash mismap#929

Merged
Avijit-Microsoft merged 1 commit into
devfrom
fix/sanitize-cu-mapping
May 22, 2026
Merged

fix: hoist sanitize mapping to module constant, fix U+001E em-dash mismap#929
Avijit-Microsoft merged 1 commit into
devfrom
fix/sanitize-cu-mapping

Conversation

@Yamini-Microsoft
Copy link
Copy Markdown
Contributor

@Yamini-Microsoft Yamini-Microsoft commented May 22, 2026

Purpose

Addresses review feedback on PR #922 for the sanitize_cu_output function in content_understanding_client.py:

  1. Performance: Hoisted replacements dict to module-level constant (_CU_REPLACEMENTS) to avoid per-call allocation. Added _CU_BAD_CHARS set with isdisjoint() early-exit for true no-op when text has no corrupted chars.
  2. Bug fix: U+001E was incorrectly mapped to U+2014 (em dash). Per the high-byte stripping corruption model, U+001E corresponds to U+201E (double low-9 quotation mark). Added correct U+0014 -> U+2014 mapping for actual em dash corruption.
  3. Docstring: Removed misleading "zero-cost" claim, clarified no-op semantics.

Does this introduce a breaking change?

  • Yes
  • No

Golden Path Validation

  • I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

  • I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

  • sanitize_cu_output correctly maps all corrupted control chars back to intended Unicode equivalents
  • No regression in CU text processing for 03_cu_process_data_text.py and 04_cu_process_custom_data.py

Other Information

Fixes review comments from PR #922 (comments by Copilot code review).

…smap

- Hoisted replacements dict to _CU_REPLACEMENTS module-level constant
  to avoid per-call allocation overhead.
- Added _CU_BAD_CHARS set for O(1) early-exit when text has no corrupted chars.
- Fixed U+001E mapping: was incorrectly mapped to U+2014 (em dash),
  now correctly maps to U+201E (double low-9 quotation mark) per the
  high-byte stripping model.
- Added U+0014 -> U+2014 mapping for actual em dash corruption.
- Reworded docstring to clarify no-op semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the sanitize_cu_output utility used by the infra indexing scripts to correct corrupted Unicode characters emitted by the Content Understanding analyzeBinary API, while reducing per-call overhead and fixing an incorrect character mapping.

Changes:

  • Hoisted the corrupted-control-character replacement mapping to a module-level constant and added a fast no-op early exit via set.isdisjoint().
  • Fixed the U+001E mapping (now to U+201E) and added the correct U+0014U+2014 mapping for em dashes.
  • Updated the docstring to remove the misleading “zero-cost” wording and clarify no-op semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Avijit-Microsoft Avijit-Microsoft merged commit e68f85f into dev May 22, 2026
5 checks passed
@Yamini-Microsoft Yamini-Microsoft deleted the fix/sanitize-cu-mapping branch May 22, 2026 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants