Skip to content

[fix] Distinguish client cancellation from server errors during slice upload#290

Merged
paul-fresquet merged 5 commits into
masterfrom
fix/upload-client-timeout-distinction
Apr 25, 2026
Merged

[fix] Distinguish client cancellation from server errors during slice upload#290
paul-fresquet merged 5 commits into
masterfrom
fix/upload-client-timeout-distinction

Conversation

@paul-fresquet
Copy link
Copy Markdown
Contributor

@paul-fresquet paul-fresquet commented Apr 25, 2026

Context

A user reported a stress-test failure synchronizing a 150 GB game file with an unreliable network on the receiving side (B-SIDE). After ~6 minutes of E2EE upload of the 909 MB inventory ZIP, slice uploads started failing in cascade with SocketException (995) / TaskCanceledException, and after 6 retries on a single slice the entire RunFullInventory was aborted on B-SIDE, which then propagated to A-SIDE as Local Inventory is cancelled due to a premature end to another Session Member.

The B-SIDE log showed a clear pattern: every per-attempt timeout was reported by the upload strategy as a generic HTTP 500, which the adaptive upload controller then interpreted as a bandwidth/server error and used to reset the chunk size to its initial 500 KB on every failure — preventing any real adaptation to the degraded network and making the failure noisier than necessary.

Summary

Tier 1 of the upload-resilience plan:

  • Distinguish client-side failures from server failures. New UploadFailureKind enum (None, ServerError, ClientCancellation, ClientTimeout) carried on UploadFileResponse. New static UploadFailureClassifier shared by both upload strategies (CloudflareR2UploadStrategy, AzureBlobStorageUploadStrategy): an OperationCanceledException whose CancellationToken is already canceled is now reported as ClientCancellation instead of a synthetic HTTP 500.
  • Stop poisoning the adaptive controller with client cancellations. IAdaptiveUploadController.RecordUploadResult accepts a failureKind; AdaptiveUploadController.HandleBandwidthReset is short-circuited when the kind is ClientCancellation or ClientTimeout, so chunk sizing reflects real bandwidth signals only.
  • Refine the failure kind in the worker. FileUploadWorker upgrades a strategy-level ClientCancellation to ClientTimeout when the worker’s per-attempt CTS is the cancellation source (and not the global token).
  • Raise the per-slice attempt timeout. Floor: 30s → 60s; ceiling: 90s → 120s; constants now named (AttemptTimeoutFloorSeconds, AttemptTimeoutCeilingSeconds, SecondsPerMegabyteHeuristic). A single intermittent slice on a degraded network no longer trips the timeout in 30s.

Tests

All new and updated test classes target ≥ 85% coverage of the touched logic. Full ByteSync.Client.UnitTests suite: 1173 / 1173 passing.

  • UploadFileResponseTests (new) — 6 tests over success, server failure (status + exception), ClientCancellation, ClientTimeout, and default FailureKind.
  • UploadFailureClassifierTests (new) — 5 tests covering OCE/TCE with a canceled token, OCE with a live token, generic exceptions and HttpRequestException.
  • CloudflareR2UploadStrategyTests (new, mocked HttpMessageHandler) — 8 tests covering 200, 429, 500, HttpRequestException, IOException, in-flight cancellation via a HangingHttpMessageHandler, and OCE both with and without a canceled caller token.
  • FileUploadWorkerInternalsTests (new) — 12 tests over ComputeAttemptTimeoutSeconds (floor 60s, linear 3s/MB region, ceiling 120s), RefineFailureKind, and DetermineCancellationKind.
  • AdaptiveUploadControllerTests (extended) — 4 new tests asserting that ClientCancellation and ClientTimeout do not reset the chunk size, and that a 500 with no failureKind still does (regression guard for existing behavior).
  • FileUploadWorkerTests (extended) — 3 new tests verifying the failureKind plumbed to RecordUploadResult for Success, ClientCancellation, and server Failure(503) paths.

Notes

  • ByteSync.Client exposes internals to ByteSync.Client.UnitTests via an InternalsVisibleTo MSBuild item to keep ComputeAttemptTimeoutSeconds, RefineFailureKind and DetermineCancellationKind non-public while remaining testable.
  • This PR is intentionally limited to Tier 1 (signal correctness + reasonable timeout). Tier 2 (per-slice retry budget independent of overall upload, smarter chunk floor, jitter on attempt timeout) will follow in a separate PR.

Test plan

  • Build ByteSync.Client and ByteSync.Client.IntegrationTests.
  • Run full ByteSync.Client.UnitTests suite (1173 passing).
  • Sonar quality gate ≥ 80% on the PR.
  • Manual smoke: small/medium upload still succeeds end-to-end.

… upload

Client-side timeouts (per-attempt CancellationTokenSource) and user cancellations
were converted to a synthetic HTTP 500 by the upload strategies, which caused the
adaptive upload controller to repeatedly reset the chunk size to 500 KB and
masked the real network condition. Introduce UploadFailureKind to distinguish
ServerError, ClientCancellation and ClientTimeout, and stop downscaling the
chunk size on client-side cancellations. Raise the per-slice attempt timeout
floor from 30s to 60s so a single intermittent slice no longer kills the whole
upload on a degraded network.

Made-with: Cursor
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves upload resilience by explicitly classifying upload failures (server error vs client cancellation/timeout) and preventing client-driven cancellations/timeouts from being interpreted as bandwidth/server issues by the adaptive upload controller.

Changes:

  • Introduces UploadFailureKind and carries it through UploadFileResponse, upload strategies, worker refinement, and adaptive controller recording.
  • Adds shared UploadFailureClassifier and updates Azure/Cloudflare upload strategies to classify cancellations distinctly.
  • Adjusts per-attempt timeout computation (new named constants; 60s floor / 120s ceiling) and adds/extends unit tests across the touched components.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/ByteSync.Client.UnitTests/Services/Communications/Transfers/Uploading/FileUploadWorkerTests.cs Adds assertions that failureKind is plumbed to the adaptive controller correctly.
tests/ByteSync.Client.UnitTests/Services/Communications/Transfers/Uploading/FileUploadWorkerInternalsTests.cs New tests for attempt-timeout math and cancellation-kind refinement logic.
tests/ByteSync.Client.UnitTests/Services/Communications/Transfers/Uploading/AdaptiveUploadControllerTests.cs Adds regression coverage ensuring client failures don’t reset chunk size.
tests/ByteSync.Client.UnitTests/Services/Communications/Transfers/Strategies/UploadFailureClassifierTests.cs New tests for shared failure classification behavior.
tests/ByteSync.Client.UnitTests/Services/Communications/Transfers/Strategies/CloudflareR2UploadStrategyTests.cs New tests validating failureKind classification and status handling for R2 uploads.
tests/ByteSync.Client.UnitTests/Common/UploadFileResponseTests.cs New tests validating UploadFileResponse constructors and default FailureKind.
tests/ByteSync.Client.IntegrationTests/TestHelpers/FixedAdaptiveUploadController.cs Updates test helper to match new RecordUploadResult signature.
src/ByteSync.Common/Business/Communications/Transfers/UploadFileResponse.cs Adds FailureKind field and constructors for client cancellation/timeout responses.
src/ByteSync.Common/Business/Communications/Transfers/UploadFailureKind.cs New enum defining failure categories.
src/ByteSync.Client/Services/Communications/Transfers/Uploading/FileUploadWorker.cs Refines failureKind based on which CTS canceled; updates attempt timeout computation.
src/ByteSync.Client/Services/Communications/Transfers/Uploading/AdaptiveUploadController.cs Accepts failureKind and short-circuits bandwidth reset for client-side failures.
src/ByteSync.Client/Services/Communications/Transfers/Strategies/UploadFailureClassifier.cs New shared classifier for mapping exceptions + token state to UploadFileResponse.
src/ByteSync.Client/Services/Communications/Transfers/Strategies/CloudflareR2UploadStrategy.cs Uses shared classifier and distinguishes caller-cancellation path.
src/ByteSync.Client/Services/Communications/Transfers/Strategies/AzureBlobStorageUploadStrategy.cs Uses shared classifier and distinguishes caller-cancellation path.
src/ByteSync.Client/Interfaces/Controls/Communications/IAdaptiveUploadController.cs Adds failureKind parameter (defaulted) to RecordUploadResult.
src/ByteSync.Client/ByteSync.Client.csproj Exposes internals to unit tests via InternalsVisibleTo.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sonarqubecloud
Copy link
Copy Markdown

@paul-fresquet paul-fresquet merged commit fccfb43 into master Apr 25, 2026
25 checks passed
@paul-fresquet paul-fresquet deleted the fix/upload-client-timeout-distinction branch April 25, 2026 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants