utf8 bug fix #47008
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves Cosmos Python SDK reliability around request framing and response decoding by (1) correcting Content-Length calculation for UTF-8 payloads and (2) introducing an opt-in, environment-variable-driven fallback when decoding malformed UTF-8 response bodies.
Changes:
- Compute
Content-Lengthusing UTF-8 byte length (len(body.encode("utf-8"))) in both sync and async request paths. - Add
_response_decoding.decode_response_body()to keep strict UTF-8 decoding by default while enablingREPLACE/IGNOREfallback viaCOSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT. - Add unit tests covering decoding behavior (strict + opt-in fallback) and regression tests for Content-Length wiring in sync/async paths.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py | Uses shared decode helper for response bodies; fixes sync Content-Length UTF-8 byte accounting. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py | Uses shared decode helper for response bodies; fixes async Content-Length UTF-8 byte accounting. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py | New helper module implementing strict decode + opt-in permissive fallback with logging. |
| sdk/cosmos/azure-cosmos/tests/test_response_decoding.py | New tests validating strict behavior, actionable error hinting, and env-var-driven fallback modes. |
| sdk/cosmos/azure-cosmos/tests/test_content_length_encoding.py | New regression tests validating UTF-8 byte-length Content-Length (including sync/async wiring). |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the Content-Length fix and the opt-in malformed UTF-8 decode fallback. |
|
@sdkReviewAgent-2 |
|
✅ Review complete (44:43) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
@sdkReviewAgent-2 |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (40:44) Posted 6 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
kushagraThapar
left a comment
There was a problem hiding this comment.
Reinforcing @xinlian12's existing thread on decode-before-status-check (_synchronized_request.py:178 / aio/_asynchronous_request.py:142) — adding peer-SDK evidence found while reviewing this PR. Not a fresh finding; just additional grounding for shipping the structural fix in this PR rather than a follow-up.
Peer-SDK ordering — Python is the outlier
Both peer SDKs do the status-code dispatch before any body decode:
- .NET —
Microsoft.Azure.Cosmos/src/GatewayStoreClient.cs:112checksresponseMessage.StatusCode < 400before touching the stream:if ((int)responseMessage.StatusCode < 400) { Stream contentStream = await GatewayStoreClient.BufferContentIfAvailableAsync(responseMessage); // ... }
- Java —
RxGatewayStoreModel.java:722-728checksstatusCode >= MINIMUM_STATUSCODE_AS_ERROR_GATEWAYbefore decoding the body.
Java also wraps residual decode failures inside the Cosmos exception hierarchy
When Java's decode does fail, it wraps as a CosmosException with status=400 / SubStatus=FAILED_TO_PARSE_SERVER_RESPONSE and preserves the response headers — JsonNodeStorePayload.java:84-91:
IllegalStateException innerException = new IllegalStateException("Unable to parse JSON.", e);
throw Utils.createCosmosException(
HttpConstants.StatusCodes.BADREQUEST,
HttpConstants.SubStatusCodes.FAILED_TO_PARSE_SERVER_RESPONSE,
innerException,
responseHeaders);This keeps the failure inside the Cosmos exception hierarchy so activity_id, request_charge, sub_status, and lsn diagnostics survive. Python's current behavior raises stdlib UnicodeDecodeError, which loses all of that.
Java has the regression test we'd need to add
MalformedResponseTests.java:78-86 already covers exactly this scenario:
cosmosAsyncContainer.readItem(testObject.getId(), new PartitionKey(testObject.getMypk()), TestObject.class).block();
fail("The read operation should have failed");
} catch (CosmosException cosmosException) {
assertThat(cosmosException.getStatusCode()).isEqualTo(HttpConstants.StatusCodes.BADREQUEST);
assertThat(cosmosException.getSubStatusCode()).isEqualTo(HttpConstants.SubStatusCodes.FAILED_TO_PARSE_SERVER_RESPONSE);Python has no equivalent today.
Customer-impact framing for the structural fix
In default strict mode (env var unset), a 410/1002 (PartitionKeyRangeGone), 410/1007 (CompletingSplit), 429 (RU throttling), or 503 response that happens to carry a malformed-UTF-8 body is silently converted from a retryable transient (handled by the SDK's chapter-02 / chapter-17 retry policies) into a non-retryable UnicodeDecodeError client exception that escapes the Cosmos exception hierarchy entirely. Customer middleware keyed on status_code won't see it; activity_id / sub_status / lsn diagnostics are lost. The opt-in env var doesn't help here because the default failure mode is the regression.
Recommended fix (extends @xinlian12's structural recommendation)
- (@xinlian12's recommendation) Dispatch typed Cosmos exceptions on raw
status+substatusbefore thedecode_response_bodycall. - Additionally, wrap any residual
UnicodeDecodeErrorinCosmosHttpResponseError(or introduce a typedCosmosResponseDecodeError) withstatus_code=400and original response headers, mirroring Java'sJsonNodeStorePayload.java:84-91pattern. - Add a Python equivalent of
MalformedResponseTests.javaso the contract is locked in.
Not re-litigating the structural change — just surfacing the peer-SDK precedent so the fix can land in this PR rather than as a follow-up.
|
@kushagraThapar - regarding your above comment I feel we should not be throwing 400 because it lies about wire reality. If we synthesize status_code=400 for a response that actually came back HTTP 200, dashboards that aggregate by exception status will show fake “BadRequest” spikes. |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This PR introduces two independent fixes in Cosmos Python request/response handling.
Fix 1: Content-Length calculation for request bodies-
The Content-Length header is updated to use the UTF-8 encoded byte length instead of the Unicode character count.
Previously, len(body) was used, which returns character length. For non-ASCII payloads, this can under-report the actual size. For example, the string { "name": "café" } has a character length of 15 but a UTF-8 byte length of 16 because "é" is multi-byte.
This mismatch can lead to incorrect wire-level payload sizing and potential request failures. The fix ensures accurate byte-level accounting across both sync and async request paths.
Fix 2: Optional fallback for malformed UTF-8 in responses-
Adds an opt-in mechanism to handle response bodies that contain invalid UTF-8 sequences.
Default behavior remains unchanged: strict decoding raises UnicodeDecodeError when malformed UTF-8 is encountered.
An environment variable controls the fallback behavior:
COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT
REPLACE: invalid bytes are replaced with the Unicode replacement character
IGNORE: invalid bytes are skipped during decoding
This enables workloads to continue when encountering malformed data without changing default strict behavior.
Compatibility and risk-
No change to default decoding behavior (still strict).
Fallback behavior is explicitly opt-in via environment variable.
Content-Length change only corrects byte calculation and does not alter request semantics.