utf8 bug fix by dibahlfi · Pull Request #47008 · Azure/azure-sdk-for-python

dibahlfi · 2026-05-19T23:56:39Z

This PR introduces two independent fixes in Cosmos Python request/response handling.

Fix 1: Content-Length calculation for request bodies-
The Content-Length header is updated to use the UTF-8 encoded byte length instead of the Unicode character count.
Previously, len(body) was used, which returns character length. For non-ASCII payloads, this can under-report the actual size. For example, the string { "name": "café" } has a character length of 15 but a UTF-8 byte length of 16 because "é" is multi-byte.
This mismatch can lead to incorrect wire-level payload sizing and potential request failures. The fix ensures accurate byte-level accounting across both sync and async request paths.

Fix 2: Optional fallback for malformed UTF-8 in responses-
Adds an opt-in mechanism to handle response bodies that contain invalid UTF-8 sequences.
Default behavior remains unchanged: strict decoding raises UnicodeDecodeError when malformed UTF-8 is encountered.
An environment variable controls the fallback behavior:
COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT
REPLACE: invalid bytes are replaced with the Unicode replacement character
IGNORE: invalid bytes are skipped during decoding
This enables workloads to continue when encountering malformed data without changing default strict behavior.

Compatibility and risk-
No change to default decoding behavior (still strict).
Fallback behavior is explicitly opt-in via environment variable.
Content-Length change only corrects byte calculation and does not alter request semantics.

Copilot

Pull request overview

This PR improves Cosmos Python SDK reliability around request framing and response decoding by (1) correcting Content-Length calculation for UTF-8 payloads and (2) introducing an opt-in, environment-variable-driven fallback when decoding malformed UTF-8 response bodies.

Changes:

Compute Content-Length using UTF-8 byte length (len(body.encode("utf-8"))) in both sync and async request paths.
Add _response_decoding.decode_response_body() to keep strict UTF-8 decoding by default while enabling REPLACE/IGNORE fallback via COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT.
Add unit tests covering decoding behavior (strict + opt-in fallback) and regression tests for Content-Length wiring in sync/async paths.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py	Uses shared decode helper for response bodies; fixes sync `Content-Length` UTF-8 byte accounting.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py	Uses shared decode helper for response bodies; fixes async `Content-Length` UTF-8 byte accounting.
sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py	New helper module implementing strict decode + opt-in permissive fallback with logging.
sdk/cosmos/azure-cosmos/tests/test_response_decoding.py	New tests validating strict behavior, actionable error hinting, and env-var-driven fallback modes.
sdk/cosmos/azure-cosmos/tests/test_content_length_encoding.py	New regression tests validating UTF-8 byte-length `Content-Length` (including sync/async wiring).
sdk/cosmos/azure-cosmos/CHANGELOG.md	Documents the Content-Length fix and the opt-in malformed UTF-8 decode fallback.

dibahlfi · 2026-05-20T00:04:44Z

@sdkReviewAgent-2

xinlian12 · 2026-05-20T00:50:07Z

✅ Review complete (44:43)

Posted 2 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-20T01:21:50Z

/azp run python - cosmos - tests

dibahlfi · 2026-05-20T01:22:05Z

@sdkReviewAgent-2

azure-pipelines · 2026-05-20T01:22:06Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-20T21:10:02Z

✅ Review complete (40:44)

Posted 6 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

kushagraThapar

Reinforcing @xinlian12's existing thread on decode-before-status-check (_synchronized_request.py:178 / aio/_asynchronous_request.py:142) — adding peer-SDK evidence found while reviewing this PR. Not a fresh finding; just additional grounding for shipping the structural fix in this PR rather than a follow-up.

Peer-SDK ordering — Python is the outlier

Both peer SDKs do the status-code dispatch before any body decode:

.NET — Microsoft.Azure.Cosmos/src/GatewayStoreClient.cs:112 checks responseMessage.StatusCode < 400 before touching the stream:

if ((int)responseMessage.StatusCode < 400)
{
    Stream contentStream = await GatewayStoreClient.BufferContentIfAvailableAsync(responseMessage);
    // ...
}

Java — RxGatewayStoreModel.java:722-728 checks statusCode >= MINIMUM_STATUSCODE_AS_ERROR_GATEWAY before decoding the body.

Java also wraps residual decode failures inside the Cosmos exception hierarchy

When Java's decode does fail, it wraps as a CosmosException with status=400 / SubStatus=FAILED_TO_PARSE_SERVER_RESPONSE and preserves the response headers — JsonNodeStorePayload.java:84-91:

IllegalStateException innerException = new IllegalStateException("Unable to parse JSON.", e);
throw Utils.createCosmosException(
    HttpConstants.StatusCodes.BADREQUEST,
    HttpConstants.SubStatusCodes.FAILED_TO_PARSE_SERVER_RESPONSE,
    innerException,
    responseHeaders);

This keeps the failure inside the Cosmos exception hierarchy so activity_id, request_charge, sub_status, and lsn diagnostics survive. Python's current behavior raises stdlib UnicodeDecodeError, which loses all of that.

Java has the regression test we'd need to add

MalformedResponseTests.java:78-86 already covers exactly this scenario:

cosmosAsyncContainer.readItem(testObject.getId(), new PartitionKey(testObject.getMypk()), TestObject.class).block();
fail("The read operation should have failed");
} catch (CosmosException cosmosException) {
    assertThat(cosmosException.getStatusCode()).isEqualTo(HttpConstants.StatusCodes.BADREQUEST);
    assertThat(cosmosException.getSubStatusCode()).isEqualTo(HttpConstants.SubStatusCodes.FAILED_TO_PARSE_SERVER_RESPONSE);

Python has no equivalent today.

Customer-impact framing for the structural fix

In default strict mode (env var unset), a 410/1002 (PartitionKeyRangeGone), 410/1007 (CompletingSplit), 429 (RU throttling), or 503 response that happens to carry a malformed-UTF-8 body is silently converted from a retryable transient (handled by the SDK's chapter-02 / chapter-17 retry policies) into a non-retryable UnicodeDecodeError client exception that escapes the Cosmos exception hierarchy entirely. Customer middleware keyed on status_code won't see it; activity_id / sub_status / lsn diagnostics are lost. The opt-in env var doesn't help here because the default failure mode is the regression.

Recommended fix (extends @xinlian12's structural recommendation)

(@xinlian12's recommendation) Dispatch typed Cosmos exceptions on raw status+substatus before the decode_response_body call.
Additionally, wrap any residual UnicodeDecodeError in CosmosHttpResponseError (or introduce a typed CosmosResponseDecodeError) with status_code=400 and original response headers, mirroring Java's JsonNodeStorePayload.java:84-91 pattern.
Add a Python equivalent of MalformedResponseTests.java so the contract is locked in.

Not re-litigating the structural change — just surfacing the peer-SDK precedent so the fix can land in this PR rather than as a follow-up.

dibahlfi · 2026-05-21T06:35:09Z

@kushagraThapar - regarding your above comment I feel we should not be throwing 400 because it lies about wire reality. If we synthesize status_code=400 for a response that actually came back HTTP 200, dashboards that aggregate by exception status will show fake “BadRequest” spikes.
if the wire response is HTTP 200 (e.response.status_code == 200), but the exception would claim 400 (e.status_code == 400). That creates two contradictory signals in one error
object and leads to incnsistent handling across middleware, alerts, logs etc. so instead I am throwing a typed SDK exception DecodeError which is a subclass of HttpResponseError, so except HttpResponseError: block catches it; so does the broader except Exception:).
The real response object intact. e.response.status_code == 200, all headers (activity_id, request_charge, lsn, etc.) flow through. so we will still have the parity in outcome.

dibahlfi · 2026-05-21T06:35:33Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-21T06:35:47Z

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3

LGTM

utf8 bug fix - initial commit

661591f

Copilot AI review requested due to automatic review settings May 19, 2026 23:56

dibahlfi requested a review from a team as a code owner May 19, 2026 23:56

github-actions Bot added the Cosmos label May 19, 2026

github-project-automation Bot added this to CosmosDB Python Eco-System May 19, 2026

Copilot started reviewing on behalf of dibahlfi May 19, 2026 23:57 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

dibahlfi changed the title ~~utf8 bug fix - initial commit~~ utf8 bug fix May 20, 2026