Draft
46 commits
a25ae1f
Cosmos: add Gateway 2.0 design spec
tvaron3 Apr 20, 2026
fa33adf
Cosmos: address PR deep-review findings on Gateway 2.0 spec
tvaron3 Apr 20, 2026
c732728
Cosmos: address second-pass review of Gateway 2.0 spec
tvaron3 Apr 20, 2026
a2be890
Cosmos: drop Java parity subsection from Gateway 2.0 spec
tvaron3 Apr 20, 2026
39827c9
Cosmos: resolve Q1/Q3 + clarify Q4 in Gateway 2.0 spec
tvaron3 Apr 20, 2026
369735e
Cosmos: drop Retry Decision Table from Gateway 2.0 spec
tvaron3 Apr 20, 2026
9c44af1
Cosmos: drop Gateway 2.0-specific failure fallback from spec
tvaron3 Apr 20, 2026
1fa81be
Cosmos: fix RNTBD expansion in Gateway 2.0 spec
tvaron3 Apr 20, 2026
785fab9
Cosmos: fix Gateway 2.0 spec analyze failures
tvaron3 Apr 20, 2026
cd77426
Update sdk/cosmos/azure_data_cosmos_driver/docs/GATEWAY_20_SPEC.md
tvaron3 Apr 21, 2026
6891309
Update sdk/cosmos/azure_data_cosmos_driver/docs/GATEWAY_20_SPEC.md
tvaron3 Apr 21, 2026
07535d7
Cosmos: refine Gateway 2.0 spec retry/timeout/PLF wording
tvaron3 Apr 21, 2026
b2fa3b3
Cosmos: address Gateway 2.0 spec review comments
tvaron3 Apr 21, 2026
7784f2f
Merge remote-tracking branch 'upstream/release/azure_data_cosmos-prev…
tvaron3 Apr 26, 2026
82a9ec1
Cosmos: Address PR #4223 round-5 review on Gateway 2.0 spec
tvaron3 Apr 27, 2026
5ed5246
Scrub 'footgun' from Gateway 2.0 spec
tvaron3 Apr 27, 2026
7669540
Remove internal ADO PR references from spec
tvaron3 Apr 27, 2026
c3ba856
Resolve EPK range type consolidation in spec
tvaron3 Apr 27, 2026
85a559b
Remove HPK PK-header gating prose from spec
tvaron3 Apr 27, 2026
c31bcf7
Use GlobalDatabaseAccountName RNTBD token for tenant identity
tvaron3 Apr 27, 2026
72f4ab9
Address round-7 PR feedback on Gateway 2.0 spec
tvaron3 Apr 27, 2026
b45fc05
Remove personal-name references from Gateway 2.0 spec
tvaron3 Apr 27, 2026
81f51c6
Rename prefer_gateway20 to gateway20_suppressed
tvaron3 Apr 27, 2026
8be7037
Address round 8 PR review comments on GATEWAY_20_SPEC
tvaron3 Apr 29, 2026
9c7484e
Restore prefer-remote-region wording for 404/1002 retry
tvaron3 Apr 29, 2026
8363209
Add Gateway 2.0 RNTBD wire format (Slice 1)
tvaron3 Apr 30, 2026
802e479
Add Gateway 2.0 foundation (Slice 2)
tvaron3 Apr 30, 2026
c475d87
Add Gateway 2.0 routing eligibility and endpoint key derivation (Slic…
tvaron3 Apr 30, 2026
6963626
Add Gateway 2.0 RNTBD dispatch (Slice 3b/c)
tvaron3 Apr 30, 2026
56890a7
Use AcqRel ordering for Gateway 2.0 transport request id
tvaron3 Apr 30, 2026
27218fb
Add Gateway 2.0 Phase 6 testing & infrastructure
tvaron3 Apr 30, 2026
ea8f0eb
Document Gateway 2.0 capability bitmask (Rust=9 vs Java=11)
tvaron3 Apr 30, 2026
384844d
Rename gateway20 flag to negative-term name (R15)
tvaron3 Apr 30, 2026
76f8f4f
Remove THINCLIENT_PROXY_* deprecated SDK aliases
tvaron3 Apr 30, 2026
953fabf
Add transport_kind filter to FaultInjectionCondition
tvaron3 Apr 30, 2026
07e72e1
Expose Gateway 2.0 toggle on CosmosClientBuilder
tvaron3 Apr 30, 2026
967caca
Emit Gateway 2.0 EPK range headers for HPK partial-PK dispatches
tvaron3 Apr 30, 2026
9772486
Add HPK partial-PK round-trip E2E test for Gateway 2.0
tvaron3 Apr 30, 2026
0aa52f1
Propagate x-ms-continuation into Gateway 2.0 RNTBD frame
tvaron3 Apr 30, 2026
ead35b2
Merge release/azure_data_cosmos-previews into Gateway 2.0 branch
tvaron3 Apr 30, 2026
88a55f8
Flip Gateway 2.0 default to enabled and rewrite docs
tvaron3 Apr 30, 2026
31d2e64
Consolidate Gateway 2.0 live tests into main Cosmos pipeline
tvaron3 Apr 30, 2026
d010978
Re-export driver fault-injection types from SDK
tvaron3 Apr 30, 2026
b3b5f91
Document Gateway 2.0 default and fault-injection refactor in CHANGELOGs
tvaron3 Apr 30, 2026
6c89a16
Fix rustdoc errors in fault_injection module re-exports
tvaron3 Apr 30, 2026
2780650
Rename thinclient → Gateway 2.0 per spec naming policy
tvaron3 Apr 30, 2026
4 changes: 4 additions & 0 deletions sdk/cosmos/azure_data_cosmos/CHANGELOG.md
@@ -4,8 +4,12 @@

### Features Added

- Added `CosmosClientBuilder::with_gateway20_disabled(bool)` to opt out of the new Gateway 2.0 transport, which is now enabled by default. Gateway 2.0 routes data-plane requests through a regional proxy that forwards RNTBD-over-HTTP/2 to the backend. Pass `true` to fall back to the standard gateway transport. This is useful for workloads that depend on the published gateway latency SLAs (Gateway 2.0 is not currently covered by them) or that need the standard gateway behavior for diagnostics. ([#4319](https://github.com/Azure/azure-sdk-for-rust/pull/4319))
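
The opt-out can be read as a tri-state override. The following is a minimal, std-only sketch, assuming the semantics given in this PR's doc comments (no override leaves the driver in charge; `true` forces the standard gateway). `ClientBuilderSketch`, `Transport`, and `select_transport` are hypothetical stand-ins, not SDK types; only the method name `with_gateway20_disabled` is taken from the entry above.

```rust
/// Transport a request could be routed through (hypothetical stand-in).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Transport {
    Gateway20,
    StandardGateway,
}

/// Hypothetical stand-in for the client builder; NOT the real `CosmosClientBuilder`.
#[derive(Default)]
pub struct ClientBuilderSketch {
    /// `None` means no operator override was supplied.
    gateway20_disabled: Option<bool>,
}

impl ClientBuilderSketch {
    /// Mirrors the shape of the builder method named in the changelog entry.
    pub fn with_gateway20_disabled(mut self, disabled: bool) -> Self {
        self.gateway20_disabled = Some(disabled);
        self
    }

    /// Assumed selection rule: Gateway 2.0 is used only when it has not been
    /// disabled, the account advertises an endpoint, and HTTP/2 is allowed.
    pub fn select_transport(&self, account_advertises: bool, http2_allowed: bool) -> Transport {
        let disabled = self.gateway20_disabled.unwrap_or(false);
        if !disabled && account_advertises && http2_allowed {
            Transport::Gateway20
        } else {
            Transport::StandardGateway
        }
    }
}
```

In this model `with_gateway20_disabled(false)` and leaving the builder untouched select the same transport, which matches the "opt in equals default" wording in the builder docs.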

### Breaking Changes

- Consolidated SDK fault-injection types as re-exports from `azure_data_cosmos_driver::fault_injection`. `FaultInjectionRule`, `FaultInjectionCondition`, `FaultInjectionResult`, `CustomResponse`, `FaultInjectionErrorType`, `FaultOperationType`, and the matching builders are now provided by the driver crate. Field access is via accessor methods (e.g., `rule.id()`, `condition.region()`, `response.body()`) rather than direct field reads. The SDK retains only `FaultInjectionClientBuilder` (gateway-side transport wrapper). ([#4319](https://github.com/Azure/azure-sdk-for-rust/pull/4319))

### Bugs Fixed

### Other Changes
4 changes: 3 additions & 1 deletion sdk/cosmos/azure_data_cosmos/build.rs
@@ -6,5 +6,7 @@
// unknown cfg names are warned/denied unless explicitly declared via check-cfg.
fn main() {
// Allow `#[cfg_attr(not(test_category = "..."), ignore)]` in `tests/*.rs`.
println!("cargo:rustc-check-cfg=cfg(test_category, values(\"emulator\", \"multi_write\", \"split\"))");
println!(
"cargo:rustc-check-cfg=cfg(test_category, values(\"emulator\", \"multi_write\", \"split\", \"gateway20\"))"
);
}
55 changes: 10 additions & 45 deletions sdk/cosmos/azure_data_cosmos/docs/sdk-to-driver-cutover.md
@@ -315,42 +315,24 @@ The gateway pipeline tracked this via `CosmosRequest` (which held the final URL)

## Fault Injection Wiring

When cutting `read_item` over to the driver, the SDK's fault injection tests initially failed because the two execution paths (gateway and driver) have **independent fault injection systems**. This section documents how they were connected.
The SDK no longer ships a parallel fault-injection type system. All fault-injection types — [`FaultInjectionRule`], [`FaultInjectionCondition`], [`FaultInjectionResult`], [`CustomResponse`], [`FaultInjectionErrorType`], [`FaultOperationType`], and the matching builders — are re-exported directly from the driver crate (`azure_data_cosmos_driver::fault_injection`) by `azure_data_cosmos::fault_injection`. The SDK only owns:

### Problem
- [`FaultInjectionClientBuilder`] — produces the `azure_core::http::Transport` that the SDK pipeline plugs in (i.e., a `FaultClient` HTTP client wrapper that evaluates driver rules against in-flight gateway requests).
- A small private `fault_operation_for_sdk(SdkOperationType, SdkResourceType) -> Option<FaultOperationType>` adapter so `CosmosRequest::add_fault_injection_headers` can stamp the right operation tag on the outbound headers.
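
A rough sketch of what such an adapter and its call site could look like, using only the standard library. The enum variants, the `x-example-fault-operation` header name, and the `stamp_fault_header` helper are all hypothetical and chosen for illustration; the actual mapping lives in the SDK and is not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical subset of SDK operation kinds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SdkOperationType {
    Read,
    Create,
}

/// Hypothetical subset of SDK resource kinds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SdkResourceType {
    Document,
    Database,
}

/// Hypothetical subset of fault-injection operation tags.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FaultOperationType {
    ReadItem,
    CreateItem,
}

/// Maps an (operation, resource) pair to a fault tag, or `None` when the
/// pair has no fault-injection equivalent in this illustrative table.
pub fn fault_operation_for_sdk(
    op: &SdkOperationType,
    resource: &SdkResourceType,
) -> Option<FaultOperationType> {
    match (op, resource) {
        (SdkOperationType::Read, SdkResourceType::Document) => Some(FaultOperationType::ReadItem),
        (SdkOperationType::Create, SdkResourceType::Document) => Some(FaultOperationType::CreateItem),
        _ => None,
    }
}

/// Stamps the tag on the outbound headers only when a mapping exists.
/// The header name is a placeholder, not a real Cosmos DB header.
pub fn stamp_fault_header(
    headers: &mut HashMap<String, String>,
    op: &SdkOperationType,
    resource: &SdkResourceType,
) {
    if let Some(fault_op) = fault_operation_for_sdk(op, resource) {
        headers.insert(
            "x-example-fault-operation".to_string(),
            format!("{:?}", fault_op),
        );
    }
}
```

Returning `Option` keeps unmapped pairs from being tagged at all, so rules keyed on an operation type cannot match requests the adapter does not recognize.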

The SDK and driver each have their own fault injection module (`azure_data_cosmos::fault_injection` and `azure_data_cosmos_driver::fault_injection`). They define parallel but separate types (`FaultInjectionRule`, `FaultInjectionCondition`, `FaultInjectionResult`, etc.) with identical variants but different Rust types. Prior to this work, only the gateway pipeline received fault injection rules — the driver was built without them.

### Solution: Rule Translation with Shared State

The bridge module (`driver_bridge.rs`) includes `sdk_fi_rules_to_driver_fi_rules()`, which translates SDK fault injection rules into driver fault injection rules. The translation covers:

- `FaultOperationType` — variant-by-variant match (identical variant names)
- `FaultInjectionErrorType` — variant-by-variant match
- `FaultInjectionCondition` — `RegionName` → `Region`, operation type and container ID mapped directly
- `FaultInjectionResult` — `Duration` → `Option<Duration>`, probability copied
- Timing fields — `start_time: Instant` → `Option<Instant>`, `end_time` and `hit_limit` copied

### Shared Mutable State

SDK `FaultInjectionRule` has `enabled: Arc<AtomicBool>` and `hit_count: Arc<AtomicU32>` that tests mutate at runtime (`.disable()`, `.enable()`, `.hit_count()`). The driver's `FaultInjectionRuleBuilder` accepts external `Arc`s via `with_shared_state()`, so both the SDK gateway path and the driver path reference the **same atomic state**. This means:

- Calling `.disable()` on the SDK rule also disables it in the driver
- Hit counts are shared — both paths increment the same counter
- Tests that toggle rules or assert hit counts work correctly across both paths
Because both transports (gateway and driver) now consume the **same** `Arc<FaultInjectionRule>` instances, there is no translation step and no shared-state plumbing: `enable()`/`disable()` toggles, hit-count increments, and `hit_limit` enforcement all happen against one canonical rule object.
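
To make the "one canonical rule object" point concrete, here is a small std-only sketch. `RuleSketch` and `TransportSketch` are hypothetical stand-ins with assumed fields; they are not the driver's `FaultInjectionRule` or any real transport, and the atomics are one plausible way to hold the state.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a fault-injection rule.
pub struct RuleSketch {
    enabled: AtomicBool,
    hit_count: AtomicU32,
    hit_limit: Option<u32>,
}

impl RuleSketch {
    pub fn new(hit_limit: Option<u32>) -> Self {
        Self {
            enabled: AtomicBool::new(true),
            hit_count: AtomicU32::new(0),
            hit_limit,
        }
    }

    pub fn enable(&self) {
        self.enabled.store(true, Ordering::SeqCst);
    }

    pub fn disable(&self) {
        self.enabled.store(false, Ordering::SeqCst);
    }

    pub fn hit_count(&self) -> u32 {
        self.hit_count.load(Ordering::SeqCst)
    }

    /// Records a hit and returns `true` if the rule is enabled and still
    /// under its limit; otherwise leaves the counter untouched.
    pub fn try_hit(&self) -> bool {
        if !self.enabled.load(Ordering::SeqCst) {
            return false;
        }
        let limit = self.hit_limit;
        self.hit_count
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| match limit {
                Some(max) if n >= max => None,
                _ => Some(n + 1),
            })
            .is_ok()
    }
}

/// Hypothetical transport that holds clones of the shared `Arc`s.
pub struct TransportSketch {
    rules: Vec<Arc<RuleSketch>>,
}

impl TransportSketch {
    pub fn new(rules: Vec<Arc<RuleSketch>>) -> Self {
        Self { rules }
    }

    /// Returns how many rules fired for one simulated request.
    pub fn evaluate(&self) -> usize {
        self.rules.iter().filter(|rule| rule.try_hit()).count()
    }
}
```

Two `TransportSketch` values built from clones of the same `Arc` draw down a single hit budget, and a `disable()` call through any handle is visible to both.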

### Wiring in `CosmosClientBuilder`

In `CosmosClientBuilder::build()`:

1. Before the `FaultInjectionClientBuilder` is consumed for the gateway transport, `rules()` extracts a reference to the SDK rules
2. `sdk_fi_rules_to_driver_fi_rules()` translates them to driver rules with shared state
3. The translated rules are passed to `CosmosDriverRuntimeBuilder::with_fault_injection_rules()`
4. The SDK's `fault_injection` Cargo feature now forwards to the driver's `fault_injection` feature
1. The `FaultInjectionClientBuilder::rules()` accessor returns `&[Arc<FaultInjectionRule>]` — already the driver type, so the SDK simply clones the slice (`fault_builder.rules().to_vec()`).
2. The cloned rules are passed to `CosmosDriverRuntimeBuilder::with_fault_injection_rules()` so the driver's own fault-injection HTTP client can evaluate them.
3. The `FaultInjectionClientBuilder` is then consumed to build the gateway transport, which wraps the inner `HttpClient` with a `FaultClient` that evaluates the same rules.
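
The ordering of the steps above matters because the builder is consumed last. A minimal sketch under stated assumptions: every type here (`RuleSketch`, `FaultBuilderSketch`, `GatewayTransportSketch`, `DriverRuntimeSketch`) and the `wire` function are hypothetical stand-ins; only the `rules().to_vec()` idiom is taken from the PR.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a fault-injection rule.
pub struct RuleSketch {
    pub id: String,
}

/// Hypothetical stand-in for the SDK's fault-injection client builder.
pub struct FaultBuilderSketch {
    rules: Vec<Arc<RuleSketch>>,
}

/// Hypothetical gateway-side transport that evaluates the rules.
pub struct GatewayTransportSketch {
    pub rules: Vec<Arc<RuleSketch>>,
}

/// Hypothetical driver runtime that evaluates the same rules.
pub struct DriverRuntimeSketch {
    pub rules: Vec<Arc<RuleSketch>>,
}

impl FaultBuilderSketch {
    pub fn new(rules: Vec<Arc<RuleSketch>>) -> Self {
        Self { rules }
    }

    /// Borrowing accessor, so the rules can be copied out before `build`.
    pub fn rules(&self) -> &[Arc<RuleSketch>] {
        &self.rules
    }

    /// Consumes the builder; after this call `rules()` is no longer reachable.
    pub fn build(self) -> GatewayTransportSketch {
        GatewayTransportSketch { rules: self.rules }
    }
}

pub fn wire(builder: FaultBuilderSketch) -> (GatewayTransportSketch, DriverRuntimeSketch) {
    // Step 1: clone the slice of `Arc`s. This bumps reference counts only;
    // the rule objects themselves are not duplicated.
    let driver_rules = builder.rules().to_vec();
    // Step 2: hand the cloned handles to the driver side.
    let driver = DriverRuntimeSketch { rules: driver_rules };
    // Step 3: consume the builder to produce the gateway transport.
    let gateway = builder.build();
    (gateway, driver)
}
```

`Arc::ptr_eq` on the two sides' first rule returns `true` in this sketch, which is the property that removes the need for a translation layer.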

### Test Patterns for Future Cutover

When cutting over additional operations, **no additional fault injection wiring is needed** — it's handled once at the `CosmosClientBuilder` level. However, tests need to account for two behavioral differences:
When cutting over additional operations, **no fault-injection wiring changes are needed** — it's all wired once at `CosmosClientBuilder::build()`. However, tests need to account for two behavioral differences between gateway-routed and driver-routed operations:

**`request_url()` returns `None` for driver-routed operations:**

@@ -378,24 +360,7 @@ let rule = FaultInjectionRuleBuilder::new("test", error)

This asymmetry will disappear once all operations are driver-routed, since there will be only one hit-counting path.

### `custom_response` Translation

Translation of `CustomResponse` (synthetic HTTP responses) is not yet implemented. None of the current tests use custom responses for `ReadItem` operations. When needed, the bridge function should be extended to translate `CustomResponse` fields (`status_code`, `headers`, `body`).

### Consolidating to Driver Fault Injection After Cutover

The current dual-system architecture (SDK fault injection + driver fault injection + translation bridge) exists only because the cutover is incremental — some operations still go through the gateway while others go through the driver. Once **all** operations are routed through the driver:

1. **Drop `azure_data_cosmos::fault_injection`** — the SDK's HTTP-client-level fault interception module becomes unreachable. Delete the entire `src/fault_injection/` directory.
2. **Re-export driver types** — the SDK re-exports the driver's fault injection types directly:

```rust
#[cfg(feature = "fault_injection")]
pub use azure_data_cosmos_driver::fault_injection;
```
### Final State After Cutover

3. **Remove the translation layer** — `sdk_fi_rules_to_driver_fi_rules()` in `driver_bridge.rs` and the `shared_enabled()`/`shared_hit_count()` accessors on the SDK rule are no longer needed.
4. **Simplify `CosmosClientBuilder`** — `with_fault_injection()` accepts `Vec<Arc<driver::FaultInjectionRule>>` directly and passes them to `CosmosDriverRuntimeBuilder::with_fault_injection_rules()`. No translation, no cloning, no intermediary builder.
5. **Update tests** — tests construct driver `FaultInjectionRule` directly (same builders, same API) instead of SDK rules.
Once **all** operations are routed through the driver, the SDK-side `FaultInjectionClientBuilder` and `FaultClient` HTTP wrapper become unreachable too — the driver-runtime fault-injection HTTP client is the single source of truth. At that point `azure_data_cosmos::fault_injection` collapses into a pure `pub use azure_data_cosmos_driver::fault_injection;` re-export (or is dropped entirely).
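
The "pure re-export" end state can be illustrated with two modules standing in for the two crates. This is a sketch only: `driver_sketch` and `sdk_sketch` are hypothetical module names, and the one-field `FaultInjectionRule` is a placeholder rather than the driver's real type.

```rust
/// Hypothetical stand-in for the driver crate.
pub mod driver_sketch {
    pub mod fault_injection {
        /// Placeholder rule type; the real one has more state.
        #[derive(Debug, PartialEq)]
        pub struct FaultInjectionRule {
            pub id: String,
        }

        /// A driver-side function that accepts the driver's own type.
        pub fn rule_id(rule: &FaultInjectionRule) -> &str {
            &rule.id
        }
    }
}

/// Hypothetical stand-in for the SDK crate.
pub mod sdk_sketch {
    // Re-export the whole module: paths through `sdk_sketch::fault_injection`
    // name the very same types as `driver_sketch::fault_injection`.
    pub use super::driver_sketch::fault_injection;
}
```

A value built through the `sdk_sketch::fault_injection` path can be passed straight to `driver_sketch::fault_injection::rule_id`, because a `pub use` introduces an alias rather than a new type.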

At that point the SDK has **no fault injection logic of its own** — it's a pass-through to the driver, matching the overall "SDK as thin wrapper" goal. The driver is the single source of truth for all transport-related concerns including fault injection.
@@ -93,6 +93,15 @@ pub struct CosmosClientBuilder {
fault_injection_builder: Option<crate::fault_injection::FaultInjectionClientBuilder>,
/// Fallback endpoints tried when the primary endpoint is unavailable.
backup_endpoints: Vec<azure_core::http::Url>,
/// Operator override for the Gateway 2.0 transport, set via
/// [`with_gateway20_disabled`](Self::with_gateway20_disabled).
///
/// `None` (the default) leaves the underlying driver in charge of
/// routing: Gateway 2.0 is selected automatically whenever the
/// account advertises a Gateway 2.0 endpoint and HTTP/2 is allowed.
/// `Some(true)` forces every request through the standard gateway
/// transport; `Some(false)` explicitly opts in (matching the default
/// behaviour).
gateway20_disabled: Option<bool>,
}

impl CosmosClientBuilder {
@@ -168,6 +177,41 @@ impl CosmosClientBuilder {
self
}

/// Disables the Gateway 2.0 transport for this client.
///
/// Gateway 2.0 is the next-generation Cosmos DB dataplane transport:
/// SDK connections terminate at a regional Gateway 2.0 proxy that
/// forwards RNTBD-over-HTTP/2 to the backend. **Gateway 2.0 is enabled
/// by default** — whenever the account advertises a Gateway 2.0 endpoint
/// the SDK routes eligible dataplane operations through it and falls
/// back to the standard gateway only for operations Gateway 2.0 cannot
/// serve (e.g. metadata requests or accounts that do not advertise a
/// Gateway 2.0 endpoint).
///
/// Pass `true` to opt out and force every request through the standard
/// gateway transport. The standard gateway path remains supported and
/// stable — disabling Gateway 2.0 is the recommended workaround if you
/// hit a regression on the new transport.
///
/// # Latency caveat
///
/// Gateway 2.0 traffic flows through a proxy that is
/// **not currently covered by the regional Cosmos DB latency SLA**.
/// Workloads with strict P99 latency requirements should opt out via
/// `with_gateway20_disabled(true)` until the proxy reaches general
/// availability. The extra hop also means Gateway 2.0 may add measurable
/// latency relative to the standard gateway in some regions.
///
/// # Arguments
///
/// * `disabled` - `true` to suppress Gateway 2.0 and force the standard
/// gateway transport; `false` (or leaving the builder untouched) keeps
/// the default Gateway 2.0 behaviour.
pub fn with_gateway20_disabled(mut self, disabled: bool) -> Self {
self.gateway20_disabled = Some(disabled);
self
}

/// Registers a throughput control group on the driver runtime.
///
/// Groups define throughput policies (priority level, throughput bucket) that
@@ -287,9 +331,10 @@ impl CosmosClientBuilder {
Option<azure_core::http::Transport>,
Vec<std::sync::Arc<azure_data_cosmos_driver::fault_injection::FaultInjectionRule>>,
) = if let Some(fault_builder) = self.fault_injection_builder {
// Translate rules for the driver before the builder is consumed.
let driver_rules =
crate::driver_bridge::sdk_fi_rules_to_driver_fi_rules(fault_builder.rules());
// SDK fault-injection rules are now driver `FaultInjectionRule`s
// (re-exported through `crate::fault_injection`), so the driver
// can consume them directly without a translation step.
let driver_rules = fault_builder.rules().to_vec();
let fault_builder = match base_client {
Some(client) => fault_builder.with_inner_client(client),
None => fault_builder,
@@ -425,6 +470,9 @@ impl CosmosClientBuilder {
EmulatorServerCertValidation::DangerousDisabled,
);
}
if let Some(disabled) = self.gateway20_disabled {
pool_builder = pool_builder.with_gateway20_disabled(disabled);
}
driver_runtime_builder = driver_runtime_builder.with_connection_pool(pool_builder.build()?);

#[cfg(feature = "fault_injection")]
5 changes: 2 additions & 3 deletions sdk/cosmos/azure_data_cosmos/src/constants.rs
@@ -18,6 +18,8 @@ macro_rules! cosmos_headers {
/// A list of all Cosmos DB specific headers that should be allowed in logging.
pub const COSMOS_ALLOWED_HEADERS: &[&HeaderName] = &[
$(&$name,)*
&azure_data_cosmos_driver::constants::GATEWAY20_OPERATION_TYPE,
&azure_data_cosmos_driver::constants::GATEWAY20_RESOURCE_TYPE,
];
};
}
@@ -185,9 +187,6 @@ cosmos_headers! {
COSMOS_QUORUM_ACKED_LLSN => "x-ms-cosmos-quorum-acked-llsn",
REQUEST_DURATION_MS => "x-ms-request-duration-ms",
COSMOS_INTERNAL_PARTITION_ID => "x-ms-cosmos-internal-partition-id",
// Thin Client
THINCLIENT_PROXY_OPERATION_TYPE => "x-ms-thinclient-proxy-operation-type",
THINCLIENT_PROXY_RESOURCE_TYPE => "x-ms-thinclient-proxy-resource-type",
// Client ID
CLIENT_ID => "x-ms-client-id",
// these are not actually sent but are used internally for fault injection
7 changes: 2 additions & 5 deletions sdk/cosmos/azure_data_cosmos/src/cosmos_request.rs
@@ -2,7 +2,7 @@
// Licensed under the MIT License.

#[cfg(feature = "fault_injection")]
use crate::fault_injection::FaultOperationType;
use crate::fault_injection::fault_operation_for_sdk;
use crate::operation_context::OperationType;
use crate::options::ExcludedRegions;
use crate::request_context::RequestContext;
@@ -153,10 +153,7 @@ impl CosmosRequest {

#[cfg(feature = "fault_injection")]
pub fn add_fault_injection_headers(&mut self) {
let fault_op = FaultOperationType::from_operation_and_resource(
&self.operation_type,
&self.resource_type,
);
let fault_op = fault_operation_for_sdk(&self.operation_type, &self.resource_type);

if let Some(op) = fault_op {
self.headers.insert(