Diagnostics: Adds gateway hedging suppression marker #5894

Open
orange-dot wants to merge 1 commit into Azure:users/ntripician/ppaf-gateway-hedging-override from orange-dot:diagnostics-gateway-hedging-suppression
@orange-dot

Description

Fixes #5871.

Adds customer-visible diagnostics for gateway-driven hedging suppression. When the gateway disable flag first suppresses hedging for a client, CosmosDiagnostics now includes db.cosmosdb.hedging_disabled_by_gateway = true.

The marker is intentionally one-shot per gateway true-cycle: it is emitted on the first suppressed request for a client, remains quiet for later suppressed requests while the flag stays true, and is emitted again after the flag cycles back to false and then true.

The marker is emitted only when the gateway suppresses an availability strategy that would otherwise hedge that specific request. The diagnostics contract uses a typed internal boolean trace datum for this marker, so the new field serializes as a JSON boolean without changing the existing fallback behavior for generic bool trace data.
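The serialization contract described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the SDK's C# code: `BooleanTraceDatum` here is a hypothetical stand-in for the internal type, and `write_trace_data` and the `some.generic.flag` key are invented for illustration. The assumption is that the typed datum is written as a JSON boolean while a raw generic bool keeps its string form.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BooleanTraceDatum:
    """Hypothetical stand-in for the typed internal boolean trace datum."""
    value: bool


def write_trace_data(data: dict) -> str:
    """Serialize trace data, keeping the generic bool fallback unchanged."""
    out = {}
    for key, datum in data.items():
        if isinstance(datum, BooleanTraceDatum):
            # Typed path: emitted as a JSON boolean (true / false).
            out[key] = datum.value
        elif isinstance(datum, bool):
            # Generic fallback: raw bools stay string output ("True" / "False").
            out[key] = str(datum)
        else:
            out[key] = datum
    return json.dumps(out)


serialized = write_trace_data({
    "db.cosmosdb.hedging_disabled_by_gateway": BooleanTraceDatum(True),
    "some.generic.flag": True,  # hypothetical pre-existing generic bool datum
})
print(serialized)
```

Keeping the two paths separate is what lets the new field be a real boolean without altering how existing generic bool data appears in diagnostics output.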

Type of change

  • Bug fix / supportability improvement (non-breaking change)
  • New feature
  • Breaking change
  • Documentation update

Changes

  • DocumentClient: tracks gateway suppression with a monotonic diagnostics generation instead of a shared bit, so stale requests from an older true-cycle cannot consume the marker for a newer true-cycle; exposes the gateway-suppressed stashed strategy through a small internal accessor.
  • RequestInvokerHandler: emits the diagnostics marker only after gateway suppression is active and the effective availability strategy would hedge the specific request; non-hedgeable requests such as document creates or non-document reads do not consume the marker.
  • TraceWriter: adds an internal BooleanTraceDatum path for JSON boolean diagnostics while preserving the existing generic bool fallback as string output.
  • Tests and baselines: adds regression coverage for first emission, no emission when disabled, no marker consumption without an enabled strategy, no marker consumption for non-hedgeable requests, re-emission after a flag cycle, concurrent first suppressed requests, observed-true / transition-to-false races, stale true -> false -> true request races, customer-visible emulator diagnostics output, typed boolean trace serialization, and raw generic bool compatibility.
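To make the generation-based tracking concrete, here is a minimal sketch of one way a monotonic generation can replace a shared bit. It is a Python illustration under stated assumptions, not the DocumentClient implementation: the class name `SuppressionTracker` and its methods are hypothetical, and a lock is used where the SDK may well use atomic operations instead.

```python
import threading


class SuppressionTracker:
    """Hypothetical one-shot marker keyed by a monotonic true-cycle generation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = False
        self._generation = 0           # bumped on every false -> true transition
        self._consumed_generation = 0  # newest generation whose marker was emitted

    def set_flag(self, value: bool) -> None:
        with self._lock:
            if value and not self._active:
                self._generation += 1  # a new true-cycle starts
            self._active = value

    def observe(self) -> int:
        """Return the generation a request sees, or 0 when suppression is inactive."""
        with self._lock:
            return self._generation if self._active else 0

    def try_consume(self, observed: int) -> bool:
        """Consume the marker only for the generation the request observed."""
        with self._lock:
            if observed == 0:
                return False  # the request never saw suppression
            if observed != self._generation:
                return False  # stale request from an older true-cycle
            if observed <= self._consumed_generation:
                return False  # marker already emitted for this cycle
            self._consumed_generation = observed
            return True
```

Because the generation only advances on a false to true transition, a request that observed suppression can still consume its marker after the flag drops to false, while a request left over from an earlier cycle no longer matches the current generation and is rejected.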

Acceptance coverage

| Acceptance criterion | Coverage |
| --- | --- |
| First suppressed request emits db.cosmosdb.hedging_disabled_by_gateway = true | RequestInvokerHandler_FlagTrue_FirstSuppressedRequest_EmitsDiagnosticsOnce; emulator test GatewayDrivenHedgingSuppressionDiagnostics_EmitsOnceAndReEmitsAfterFlagCycle validates the customer-visible CosmosDiagnostics.ToString() output |
| Flag false does not emit the marker | RequestInvokerHandler_FlagFalse_DoesNotEmitDiagnostics; emulator test covers the initial false/inactive state and the explicit false state after a flag cycle |
| Requests with no effective strategy or per-request DisabledStrategy() do not consume the marker | RequestInvokerHandler_FlagTrue_NoEffectiveStrategy_DoesNotConsumeDiagnosticsMarker; RequestInvokerHandler_FlagTrue_PerRequestDisabledStrategy_DoesNotConsumeDiagnosticsMarker |
| Non-hedgeable requests do not consume the marker | RequestInvokerHandler_FlagTrue_DocumentCreate_DoesNotConsumeDiagnosticsMarker; RequestInvokerHandler_FlagTrue_NonDocumentRead_DoesNotConsumeDiagnosticsMarker |
| Later suppressed requests while the flag remains true do not re-emit | RequestInvokerHandler_FlagTrue_FirstSuppressedRequest_EmitsDiagnosticsOnce; emulator test verifies the second suppressed request has no marker |
| True -> false -> true emits again | RequestInvokerHandler_FlagTrueFalseTrue_ReEmitsDiagnosticsAfterFlagCycle; emulator test verifies the second true cycle emits again |
| Request observes true, then a false transition races before diagnostic consume | DocumentClient_ObservedSuppressionThenFlagFalse_DiagnosticMarkerStillConsumable |
| Stale request from an older true-cycle cannot consume the newer true-cycle marker | DocumentClient_ObservedSuppressionThenFlagFalseTrue_DoesNotConsumeNewCycleMarker |
| Concurrent first suppressed requests emit once | RequestInvokerHandler_FlagTrue_ConcurrentFirstSuppressedRequests_EmitsDiagnosticsOnce |
| Diagnostics baseline serializes the marker as a JSON boolean and keeps raw generic bool compatibility | TraceWriterBaselineTests.Serialization baseline includes typed BooleanTraceDatum as JSON boolean and raw generic bool as string "True" |

Behavior

| State | Result |
| --- | --- |
| Gateway suppression inactive | No diagnostics marker is emitted |
| Gateway suppression active but no effective enabled availability strategy exists | No diagnostics marker is emitted and the marker is left available for the first real suppressed hedge |
| Gateway suppression active but the specific request would not hedge | No diagnostics marker is emitted and the marker is left available for the first real suppressed hedge |
| First suppressed hedgeable request after suppression becomes active | db.cosmosdb.hedging_disabled_by_gateway = true is emitted |
| Later suppressed requests while suppression remains active | Marker is not re-emitted |
| Suppression cycles inactive then active again | Marker is emitted again on the next suppressed hedgeable request for the new true-cycle |
| Request already observed suppression while the flag concurrently transitions false | The suppressed request can still consume and emit the one-shot marker for its observed generation |
| Request from an older true-cycle races with a newer true-cycle | The stale request cannot consume the newer true-cycle marker |
| Concurrent first suppressed requests | Exactly one request consumes and emits the one-shot marker for the observed generation |
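The ordering of checks implied by the behavior above can be sketched as follows. This is a single-threaded Python illustration, not the RequestInvokerHandler code: `Request`, `would_hedge`, and `diagnostics_for` are hypothetical names, and treating only document reads as hedgeable is a simplifying assumption based on the examples of non-hedgeable requests given in this description.

```python
from dataclasses import dataclass

MARKER_KEY = "db.cosmosdb.hedging_disabled_by_gateway"


@dataclass(frozen=True)
class Request:
    resource_type: str             # e.g. "Document", "Database"
    operation: str                 # e.g. "Read", "Create"
    strategy_enabled: bool = True  # False models no effective or a disabled strategy


def would_hedge(request: Request) -> bool:
    # Simplifying assumption: only document reads are hedgeable in this sketch.
    return request.resource_type == "Document" and request.operation == "Read"


def diagnostics_for(request: Request, observed_generation: int, consumed: set) -> dict:
    """Return the trace data to attach; `consumed` holds generations already reported."""
    if observed_generation == 0:
        return {}  # gateway suppression inactive
    if not request.strategy_enabled:
        return {}  # nothing to suppress, marker left available
    if not would_hedge(request):
        return {}  # request would not hedge, marker left available
    if observed_generation in consumed:
        return {}  # marker already emitted for this true-cycle
    consumed.add(observed_generation)
    return {MARKER_KEY: True}
```

The point of the ordering is that the marker is consumed only as the last step, so requests that fail an earlier check never use it up.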

Validation

The validation focused on the runtime diagnostics path, emulator compile coverage, and the trace serialization contract.

  • GatewayHedgingOverrideTests: 23/23 passed
  • Microsoft.Azure.Cosmos.EmulatorTests.csproj build: passed
  • GatewayDrivenHedgingSuppressionDiagnostics_EmitsOnceAndReEmitsAfterFlagCycle: passed in the fork emulator validation
  • git diff --check: clean

Changelog

  • Added an entry under ### Unreleased / #### Features Added for the diagnostics/supportability change.

@orange-dot orange-dot marked this pull request as ready for review May 20, 2026 19:42
@orange-dot orange-dot force-pushed the diagnostics-gateway-hedging-suppression branch from f8495b3 to bbe0a80 Compare May 20, 2026 19:53
@orange-dot orange-dot force-pushed the diagnostics-gateway-hedging-suppression branch from bbe0a80 to ce9b0ea Compare May 20, 2026 21:16