Skip to content

utf8 bug fix #47008

Merged
tvaron3 merged 4 commits into
mainfrom
users/dibahl/utf8-decoding-fix
May 26, 2026
Merged

utf8 bug fix #47008
tvaron3 merged 4 commits into
mainfrom
users/dibahl/utf8-decoding-fix

Conversation

@dibahlfi
Copy link
Copy Markdown
Member

@dibahlfi dibahlfi commented May 19, 2026

This PR introduces two independent fixes in Cosmos Python request/response handling.

Fix 1: Content-Length calculation for request bodies-
The Content-Length header is updated to use the UTF-8 encoded byte length instead of the Unicode character count.
Previously, len(body) was used, which returns character length. For non-ASCII payloads, this can under-report the actual size. For example, the string { "name": "café" } has a character length of 15 but a UTF-8 byte length of 16 because "é" is multi-byte.
This mismatch can lead to incorrect wire-level payload sizing and potential request failures. The fix ensures accurate byte-level accounting across both sync and async request paths.

Fix 2: Optional fallback for malformed UTF-8 in responses-
Adds an opt-in mechanism to handle response bodies that contain invalid UTF-8 sequences.
Default behavior remains unchanged: strict decoding raises UnicodeDecodeError when malformed UTF-8 is encountered.
An environment variable controls the fallback behavior:
COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT
REPLACE: invalid bytes are replaced with the Unicode replacement character
IGNORE: invalid bytes are skipped during decoding
This enables workloads to continue when encountering malformed data without changing default strict behavior.

Compatibility and risk-
No change to default decoding behavior (still strict).
Fallback behavior is explicitly opt-in via environment variable.
Content-Length change only corrects byte calculation and does not alter request semantics.

Copilot AI review requested due to automatic review settings May 19, 2026 23:56
@dibahlfi dibahlfi requested a review from a team as a code owner May 19, 2026 23:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Cosmos Python SDK reliability around request framing and response decoding by (1) correcting Content-Length calculation for UTF-8 payloads and (2) introducing an opt-in, environment-variable-driven fallback when decoding malformed UTF-8 response bodies.

Changes:

  • Compute Content-Length using UTF-8 byte length (len(body.encode("utf-8"))) in both sync and async request paths.
  • Add _response_decoding.decode_response_body() to keep strict UTF-8 decoding by default while enabling REPLACE/IGNORE fallback via COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT.
  • Add unit tests covering decoding behavior (strict + opt-in fallback) and regression tests for Content-Length wiring in sync/async paths.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py Uses shared decode helper for response bodies; fixes sync Content-Length UTF-8 byte accounting.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py Uses shared decode helper for response bodies; fixes async Content-Length UTF-8 byte accounting.
sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py New helper module implementing strict decode + opt-in permissive fallback with logging.
sdk/cosmos/azure-cosmos/tests/test_response_decoding.py New tests validating strict behavior, actionable error hinting, and env-var-driven fallback modes.
sdk/cosmos/azure-cosmos/tests/test_content_length_encoding.py New regression tests validating UTF-8 byte-length Content-Length (including sync/async wiring).
sdk/cosmos/azure-cosmos/CHANGELOG.md Documents the Content-Length fix and the opt-in malformed UTF-8 decode fallback.

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/test_response_decoding.py Outdated
@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi dibahlfi changed the title utf8 bug fix - initial commit utf8 bug fix May 20, 2026
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py Outdated
@xinlian12
Copy link
Copy Markdown
Member

Review complete (44:43)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/test_semantic_reranker_unit.py
Comment thread sdk/cosmos/azure-cosmos/tests/test_content_length_encoding.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/test_response_decoding.py
@xinlian12
Copy link
Copy Markdown
Member

Review complete (40:44)

Posted 6 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reinforcing @xinlian12's existing thread on decode-before-status-check (_synchronized_request.py:178 / aio/_asynchronous_request.py:142) — adding peer-SDK evidence found while reviewing this PR. Not a fresh finding; just additional grounding for shipping the structural fix in this PR rather than a follow-up.

Peer-SDK ordering — Python is the outlier

Both peer SDKs do the status-code dispatch before any body decode:

  • .NETMicrosoft.Azure.Cosmos/src/GatewayStoreClient.cs:112 checks responseMessage.StatusCode < 400 before touching the stream:
    if ((int)responseMessage.StatusCode < 400)
    {
        Stream contentStream = await GatewayStoreClient.BufferContentIfAvailableAsync(responseMessage);
        // ...
    }
  • JavaRxGatewayStoreModel.java:722-728 checks statusCode >= MINIMUM_STATUSCODE_AS_ERROR_GATEWAY before decoding the body.

Java also wraps residual decode failures inside the Cosmos exception hierarchy

When Java's decode does fail, it wraps as a CosmosException with status=400 / SubStatus=FAILED_TO_PARSE_SERVER_RESPONSE and preserves the response headers — JsonNodeStorePayload.java:84-91:

IllegalStateException innerException = new IllegalStateException("Unable to parse JSON.", e);
throw Utils.createCosmosException(
    HttpConstants.StatusCodes.BADREQUEST,
    HttpConstants.SubStatusCodes.FAILED_TO_PARSE_SERVER_RESPONSE,
    innerException,
    responseHeaders);

This keeps the failure inside the Cosmos exception hierarchy so activity_id, request_charge, sub_status, and lsn diagnostics survive. Python's current behavior raises stdlib UnicodeDecodeError, which loses all of that.

Java has the regression test we'd need to add

MalformedResponseTests.java:78-86 already covers exactly this scenario:

cosmosAsyncContainer.readItem(testObject.getId(), new PartitionKey(testObject.getMypk()), TestObject.class).block();
fail("The read operation should have failed");
} catch (CosmosException cosmosException) {
    assertThat(cosmosException.getStatusCode()).isEqualTo(HttpConstants.StatusCodes.BADREQUEST);
    assertThat(cosmosException.getSubStatusCode()).isEqualTo(HttpConstants.SubStatusCodes.FAILED_TO_PARSE_SERVER_RESPONSE);

Python has no equivalent today.

Customer-impact framing for the structural fix

In default strict mode (env var unset), a 410/1002 (PartitionKeyRangeGone), 410/1007 (CompletingSplit), 429 (RU throttling), or 503 response that happens to carry a malformed-UTF-8 body is silently converted from a retryable transient (handled by the SDK's chapter-02 / chapter-17 retry policies) into a non-retryable UnicodeDecodeError client exception that escapes the Cosmos exception hierarchy entirely. Customer middleware keyed on status_code won't see it; activity_id / sub_status / lsn diagnostics are lost. The opt-in env var doesn't help here because the default failure mode is the regression.

Recommended fix (extends @xinlian12's structural recommendation)

  1. (@xinlian12's recommendation) Dispatch typed Cosmos exceptions on raw status+substatus before the decode_response_body call.
  2. Additionally, wrap any residual UnicodeDecodeError in CosmosHttpResponseError (or introduce a typed CosmosResponseDecodeError) with status_code=400 and original response headers, mirroring Java's JsonNodeStorePayload.java:84-91 pattern.
  3. Add a Python equivalent of MalformedResponseTests.java so the contract is locked in.

Not re-litigating the structural change — just surfacing the peer-SDK precedent so the fix can land in this PR rather than as a follow-up.

@dibahlfi
Copy link
Copy Markdown
Member Author

@kushagraThapar - regarding your above comment I feel we should not be throwing 400 because it lies about wire reality. If we synthesize status_code=400 for a response that actually came back HTTP 200, dashboards that aggregate by exception status will show fake “BadRequest” spikes.
if the wire response is HTTP 200 (e.response.status_code == 200), but the exception would claim 400 (e.status_code == 400). That creates two contradictory signals in one error
object and leads to incnsistent handling across middleware, alerts, logs etc. so instead I am throwing a typed SDK exception DecodeError which is a subclass of HttpResponseError, so except HttpResponseError: block catches it; so does the broader except Exception:).
The real response object intact. e.response.status_code == 200, all headers (activity_id, request_charge, lsn, etc.) flow through. so we will still have the parity in outcome.

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tvaron3 tvaron3 merged commit 49c9034 into main May 26, 2026
51 checks passed
@tvaron3 tvaron3 deleted the users/dibahl/utf8-decoding-fix branch May 26, 2026 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants