Commit 6af81eb
Refactoring to use CosmosError instead of azure_core::Error (#4442)
## Summary
Replaces the SDK's reliance on `azure_core::Error` for Cosmos failure
reporting with a typed, diagnosable `Error` that's safe to construct at
high rates in production.
Every error returned from the driver or SDK now carries — without any
downcasts or string parsing — the typed `CosmosStatus` (HTTP status +
sub-status + categorical `Kind`), the parsed `CosmosResponseHeaders` (RU
charge, activity id, session token, LSNs, …), the raw service response
body, the shared `DiagnosticsContext` for the operation, and (where the
production-safety gates allow) a captured stack backtrace. The previous
`azure_core::Error` remains reachable via `std::error::Error::source()`.
## Motivation
`azure_core::Error` exposes failure data behind an opaque enum and
string messages, which forced callers into brittle pattern matches like
`e.kind() == HttpResponse { status, .. }` and silently dropped
Cosmos-specific fields (sub-status, RU charge, activity id, diagnostics)
at the boundary. That made production triage of Cosmos failures
(throttles, session retries, transport classifications, end-to-end
timeouts) much harder than it needs to be.
The Java and .NET SDKs have shown that handing back rich error objects
is a major usability win, but they also illustrate the cost of getting
this wrong: when stack-frame computation runs unbounded inside exception
construction, a transient backend error storm can become a sustained
client-side CPU drain. This PR adds the rich error surface **and** the
production-safety machinery needed to keep that surface affordable under
load.
## What changed
### New typed `Error`
- **Driver** (`azure_data_cosmos_driver::error::Error`) is the canonical
type. Single `Arc<ErrorInner>` so `Result<T, Error>` stays
pointer-sized; `Clone` is a refcount bump.
- **SDK** (`azure_data_cosmos::Error`) is a `#[repr(transparent)]`
re-export of the driver type, plus the crate-wide
`azure_data_cosmos::Result<T>` alias.
- All SDK fallible APIs (clients, query/feed iterators,
`into_model`/`single`/`items`, `FromStr` impls on
`CosmosAccountEndpoint`/`ConnectionString`/`FeedRange`) now return
`azure_data_cosmos::Result<T>` instead of `azure_core::Result<T>`.
- Accessors: `status() -> CosmosStatus`, `status_code()`,
`sub_status()`, `kind()`, `cosmos_headers() -> Option<ResponseHeaders>`,
`diagnostics() -> Option<&Arc<DiagnosticsContext>>`, `response_body() ->
Option<&[u8]>`, `backtrace() -> Option<&str>`, plus the usual `is_*`
predicates on `CosmosStatus`.
- Named `CosmosStatus` constants for well-known status/sub-status pairs
(e.g. `CosmosStatus::TRANSPORT_GENERATED_503`,
`READ_SESSION_NOT_AVAILABLE`, `RU_BUDGET_EXCEEDED`,
`CROSS_PARTITION_QUERY_NOT_SERVABLE`) so call sites read as
`assert_eq!(err.status(),
CosmosStatus::CROSS_PARTITION_QUERY_NOT_SERVABLE)`.
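As a rough illustration of the shape described above, the following sketch models an `Arc`-backed error with a typed status and one named constant. It is not the driver's implementation: `TypedError`, `ErrorInner`'s fields, `is_throttled`, and `describe` are hypothetical names, and the only value taken from this description is the 429/3200 pair shown for `RUBudgetExceeded`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the typed status (HTTP status + sub-status).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CosmosStatus {
    status_code: u16,
    sub_status: u32,
}

impl CosmosStatus {
    /// 429/3200, the pair rendered as `RUBudgetExceeded` in this description.
    pub const RU_BUDGET_EXCEEDED: CosmosStatus = CosmosStatus {
        status_code: 429,
        sub_status: 3200,
    };

    /// Assumed predicate name; the real `is_*` set is not listed here.
    pub fn is_throttled(&self) -> bool {
        self.status_code == 429
    }
}

/// Heap-allocated payload shared by every clone of the error.
#[derive(Debug)]
struct ErrorInner {
    status: CosmosStatus,
    message: String,
}

/// A single `Arc` field keeps the error one pointer wide; `Clone` only
/// bumps the reference count.
#[derive(Debug, Clone)]
pub struct TypedError {
    inner: Arc<ErrorInner>,
}

impl TypedError {
    pub fn new(status: CosmosStatus, message: impl Into<String>) -> Self {
        TypedError {
            inner: Arc::new(ErrorInner { status, message: message.into() }),
        }
    }

    pub fn status(&self) -> CosmosStatus {
        self.inner.status
    }

    pub fn status_code(&self) -> u16 {
        self.inner.status.status_code
    }

    pub fn sub_status(&self) -> u32 {
        self.inner.status.sub_status
    }

    pub fn message(&self) -> &str {
        &self.inner.message
    }
}

/// Call sites compare against named constants instead of parsing strings.
pub fn describe(err: &TypedError) -> &'static str {
    if err.status() == CosmosStatus::RU_BUDGET_EXCEEDED {
        "RU budget exceeded"
    } else if err.status().is_throttled() {
        "throttled"
    } else {
        "other failure"
    }
}
```

With this layout, `std::mem::size_of::<TypedError>()` equals the size of a pointer, which is the property the driver bullet relies on.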
### Boundary mapper
`From<azure_core::Error>` classifies into the most specific
`CosmosStatus` available — `HttpResponse` wins on its real wire status;
otherwise the `azure_core::ErrorKind` plus a downcast walk of
`.source()` (`reqwest`/`hyper`/`h2`/`io`) refines into synthetic
sub-statuses (e.g. `TRANSPORT_DNS_FAILED`,
`TRANSPORT_HTTP2_INCOMPATIBLE`,
`AUTHENTICATION_TOKEN_ACQUISITION_FAILED`). The original
`azure_core::Error` is preserved in the source chain.
The driver transport layer carries typed `Error` end-to-end; nothing
wraps a Cosmos error back into an `azure_core::Error`, so the typed
payload is never lost on the wire.
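The downcast walk can be pictured with standard-library types only. The sketch below is an assumption-laden illustration, not the mapper itself: `OuterError` stands in for a wrapping error such as `azure_core::Error`, `TransportClass` is a hypothetical classification enum, and only `std::io::Error` is inspected (the real mapper is described as also checking `reqwest`/`hyper`/`h2`).

```rust
use std::error::Error as StdError;
use std::fmt;
use std::io;

/// Hypothetical classifications; the real synthetic sub-statuses live on
/// `CosmosStatus`.
#[derive(Debug, PartialEq, Eq)]
pub enum TransportClass {
    ConnectionRefused,
    TimedOut,
    Unknown,
}

/// Hypothetical wrapper that keeps the original failure in its source chain.
#[derive(Debug)]
pub struct OuterError {
    source: Box<dyn StdError + 'static>,
}

impl OuterError {
    pub fn new<E: StdError + 'static>(source: E) -> Self {
        OuterError { source: Box::new(source) }
    }
}

impl fmt::Display for OuterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("request failed")
    }
}

impl StdError for OuterError {
    fn source(&self) -> Option<&(dyn StdError + 'static)> {
        Some(&*self.source)
    }
}

/// Walks `.source()` until an `io::Error` is found, then refines on its kind.
pub fn classify(err: &(dyn StdError + 'static)) -> TransportClass {
    let mut current: Option<&(dyn StdError + 'static)> = Some(err);
    while let Some(e) = current {
        if let Some(io_err) = e.downcast_ref::<io::Error>() {
            return match io_err.kind() {
                io::ErrorKind::ConnectionRefused => TransportClass::ConnectionRefused,
                io::ErrorKind::TimedOut => TransportClass::TimedOut,
                _ => TransportClass::Unknown,
            };
        }
        current = e.source();
    }
    TransportClass::Unknown
}
```

Because the walk only borrows the chain, the original error stays intact and can still be attached as the source of the typed error.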
### Stack backtrace capture — production-safe by construction
Two-tier cost model with two independent rolling-1-second limiters:
| Limiter | Bounds | Default | Env var / builder |
| --- | --- | --- | --- |
| **Capture throttle** | `Backtrace::capture()` (IP-only, microseconds) | `1000 / s` | `AZURE_COSMOS_BACKTRACE_CAPTURES_PER_SECOND` / `with_max_error_backtrace_captures_per_second` |
| **Resolution rate** | Symbolication on first `backtrace()` read (cache-missed frames only) | `5 / s` | `AZURE_COSMOS_BACKTRACE_RESOLUTIONS_PER_SECOND` / `with_max_error_backtrace_resolutions_per_second` |
Additional safety:
- Resolved frames are cached process-wide by instruction pointer (soft
cap 100K, swap-and-drop-outside-lock eviction). **Cache hits do not
consume budget**, so once a hot stack is known it renders at full
fidelity regardless of limiter state.
- A per-window auto-disable kicks in on resolution-limiter denial: after
one denial, no further capture in that window. The window clears on the
next grant.
- Partial backtraces are never produced — callers either get a
fully-resolved render or `None`.
- The first call's outcome is cached on the `Error` instance, so logging
+ telemetry + panic-message paths see the same answer.
- Wrapping another Cosmos `Error` (e.g. transport-layer re-wrap)
**inherits** the inner backtrace instead of capturing a fresh one, which
doubles the effective budget on retry-heavy paths.
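To make the per-second budget idea concrete, here is a minimal sketch of a limiter whose grant or denial is a single compare-and-swap on an `AtomicU64`. It is a simplification under stated assumptions: it uses a fixed one-second window rather than a rolling one, takes the current second as a parameter so it stays deterministic, and does not model the auto-disable rule. `WindowLimiter` and `try_acquire` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical fixed-window limiter. The upper 32 bits of `state` hold the
/// window (in seconds); the lower 32 bits hold the grants used in it.
pub struct WindowLimiter {
    state: AtomicU64,
    max_per_window: u32,
}

impl WindowLimiter {
    pub const fn new(max_per_window: u32) -> Self {
        WindowLimiter {
            state: AtomicU64::new(0),
            max_per_window,
        }
    }

    /// Returns `true` if a grant is available in the window `now_secs`.
    pub fn try_acquire(&self, now_secs: u32) -> bool {
        let mut current = self.state.load(Ordering::Relaxed);
        loop {
            // A stale window means no grants have been used in this one yet.
            let used = if (current >> 32) as u32 == now_secs {
                current as u32
            } else {
                0
            };
            if used >= self.max_per_window {
                return false; // budget exhausted: deny without writing
            }
            let next = ((now_secs as u64) << 32) | (used as u64 + 1);
            match self.state.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                // Another thread raced us; retry against what it stored.
                Err(observed) => current = observed,
            }
        }
    }
}
```

A caller would consult one such limiter before capturing and a second, independent one before symbolicating, skipping the work whenever `try_acquire` returns `false`.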
### `Display` / `Debug` impls
- `{e}` — bare message (matches `anyhow` / `azure_core` / `std::io`
convention).
- `{e:#}` — header (`[Kind] status/sub (name)`) + source chain (Display)
+ diagnostics block + backtrace.
- `{e:?}` — header + source chain (Debug) + diagnostics. **No
backtrace** to keep `tracing::error!(?e)` cheap.
- `{e:#?}` — full report including backtrace; alternate flag cascades to
source entries (`{src:#?}`) and to the `DiagnosticsContext`
(`{diag:#?}`), so wrapped errors and diagnostics surface their pretty
multi-line debug layout.
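The four format variants hinge on `Formatter::alternate()`. The sketch below shows that mechanism on a hypothetical `FormattedError`; the header layout and the placeholder backtrace line are assumptions for illustration and do not reproduce the driver's exact output.

```rust
use std::fmt;

/// Hypothetical error used only to demonstrate the alternate-flag split.
pub struct FormattedError {
    kind: &'static str,
    status: u16,
    sub_status: u32,
    message: String,
}

impl FormattedError {
    pub fn throttled() -> Self {
        FormattedError {
            kind: "Service",
            status: 429,
            sub_status: 3200,
            message: "request rate too large".to_string(),
        }
    }

    fn write_header(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{}] {}/{}: {}", self.kind, self.status, self.sub_status, self.message)
    }
}

impl fmt::Display for FormattedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if f.alternate() {
            // `{:#}`: header plus message (the full report in the real type).
            self.write_header(f)
        } else {
            // `{}`: bare message only.
            f.write_str(&self.message)
        }
    }
}

impl fmt::Debug for FormattedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.write_header(f)?;
        if f.alternate() {
            // `{:#?}`: only the alternate form pays for the backtrace.
            f.write_str("\nbacktrace: <not captured in this sketch>")?;
        }
        Ok(())
    }
}
```

Keeping the backtrace out of the plain `{:?}` path is what keeps a `tracing::error!(?e)` call cheap in this scheme.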
### Misc
- `CosmosStatus::Display` / `Debug` now prefix the categorical `[Kind]`
(e.g. `[Service] 429/3200 (RUBudgetExceeded)`). The `Deserialize` impl
tolerates the `[Kind] ` prefix for JSON round-trip stability.
- `Error::with_context(prefix)` for enriching mapper-classified errors
with operation-specific context (single-allocation prepend, all typed
fields preserved).
- Removed `unsafe` from the SDK `ResponseHeaders` wrapper —
`Error::cosmos_headers()` now returns an owned `Option<ResponseHeaders>`
via a cheap clone (cold path) instead of the previous
`repr(transparent)` reference transmute.
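For consumers that still parse the rendered status string, a prefix-tolerant parser might look like the sketch below. It assumes the string shape shown in the example above (`[Service] 429/3200 (RUBudgetExceeded)`) and uses plain string handling rather than the SDK's `Deserialize` impl; `parse_status` is a hypothetical helper.

```rust
/// Parses "429/3200 (Name)" with or without a leading "[Kind] " prefix and
/// returns `(status, sub_status)`. Hypothetical helper, not SDK API.
pub fn parse_status(text: &str) -> Option<(u16, u32)> {
    let text = text.trim();

    // Drop the optional "[Kind] " prefix.
    let rest = match text.strip_prefix('[') {
        Some(after_bracket) => after_bracket.split_once("] ")?.1,
        None => text,
    };

    // Keep only the numeric "status/sub" part before any " (Name)" suffix.
    let numbers = rest.split_once(' ').map_or(rest, |(head, _)| head);

    let (status, sub_status) = numbers.split_once('/')?;
    Some((status.parse().ok()?, sub_status.parse().ok()?))
}
```

Where possible, the typed accessors remain the sturdier option, since they do not depend on the display format at all.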
## Breaking changes (SDK)
- All fallible SDK APIs now return `azure_data_cosmos::Result<T>`.
Callers matching on `azure_core::ErrorKind` should switch to the typed
accessors (`e.status_code()`, `e.sub_status()`, `e.status() ==
CosmosStatus::…`, `e.cosmos_headers()`, `e.diagnostics()`). The
underlying `azure_core::Error` remains reachable via
`std::error::Error::source()`.
- `Error::cosmos_headers()` returns `Option<ResponseHeaders>` (by value,
cloned) instead of `Option<&ResponseHeaders>`.
- `CosmosStatus::Display` output now includes the `[Kind]` prefix; any
diagnostics consumers that parsed the previous bare `"429/3200 (…)"`
shape need to either accept the new format or use the typed accessors.
## Testing
- `cargo test -p azure_data_cosmos_driver --lib --all-features` —
**1670+ tests green** (added coverage for backtrace limiter behavior,
source-chain inheritance, `Display`/`Debug` format variants and
alternate-flag cascade, named-constant comparisons, sub-status
serialization with and without well-known names, JSON snapshot updates
for the new status format).
- Cross-crate build clean across `azure_data_cosmos`,
`azure_data_cosmos_driver`, `azure_data_cosmos_perf`,
`azure_data_cosmos_benchmarks`.
- Doctests updated to the new SDK `Result<()>` alias.
## Notes for reviewers
- The throttle defaults (1000 captures/s, 5 resolutions/s) are
deliberately conservative; they're tunable per-runtime and per-env. The
README in `azure_data_cosmos_driver` has a "when to adjust which"
section.
- The boundary mapper is intentionally a one-way conversion. If you see
a code path that round-trips through `azure_core::Error`, please flag it
— that's a regression that would lose the typed payload.
- `Error::client` / `Error::serialization` / `Error::configuration` are
`#[doc(hidden)] pub` so the SDK wrapper crate can construct typed
errors; they are not part of the public surface.
## Backtrace machinery benchmarks
`cargo bench -p azure_data_cosmos_benchmarks --bench backtrace_capture`
Reviewed against the production-readiness changes (opt-in capture,
two-limiter model, source-error backtrace inheritance, `OnceLock`
per-instance render cache).
### Changes to the bench harness
- Added `capture/cosmos/inherit_from_source` to cover the re-wrap path
on `CosmosErrorBuilder::with_arc_source(cosmos_err)`. This path skips a
fresh stack walk and inherits the source's `Backtrace`, which is a key
production optimization but was previously unmeasured.
- Annotated `capture/cosmos/throttle_denied` to make explicit that it
also represents the **default production state** (capture is opt-in;
with `RUST_BACKTRACE` unset the same fast-denial path runs on every
construction).
- Removed a redundant second `prime_resolution_cache()` call (left over
from when the limiter capacity was 1) — the unbounded limiter only needs
one prime pass.
- Updated the module-level docs table accordingly.
### Results (Windows, release, 100 samples per group)
| Group / variant | Mean | Notes |
|---|---|---|
| `capture/cosmos/unbounded` | **1.36 µs** | Cold capture path, throttle wide open: instruction-pointer walk only. |
| `capture/cosmos/throttle_denied` | **1.82 ns** | Single `AtomicU64` CAS denial. Also the default-off production cost when `RUST_BACKTRACE` is unset. **~750× cheaper than `unbounded`**, ~915× cheaper than `std::backtrace::Backtrace::force_capture`. |
| `capture/cosmos/inherit_from_source` | **1.68 µs** | Full `CosmosErrorBuilder` build (alloc + status + message + `Arc<source>` clone) that inherits the source's backtrace, i.e. *does not* re-capture. Effectively equal to `std::force_capture` alone, so the entire builder overhead disappears into the cost of one stack walk that we are deliberately skipping. |
| `capture/std/force_capture` | **1.68 µs** | `std::backtrace::Backtrace::force_capture()` baseline. |
| `render/cosmos/cached` | **20.6 ns** | `OnceLock` hit on the per-instance render cache: the steady-state cost of every `CosmosError::backtrace()` call after the first one. **~735× cheaper than `std::backtrace::to_string`**. |
| `render/cosmos/fresh_warm_cache` | **5.76 µs** | Fresh `Backtrace` per iter, process-global IP-keyed frame cache hot: pays cache lookup only, no symbol resolution, no budget consumption. |
| `render/cosmos/fresh_cold_resolution_denied` | **3.21 µs** | Fresh `Backtrace` per iter with the resolution limiter exhausted. Proves the denial fast-path is cheaper than even a fully cached resolution. Validates the "no partial backtraces" guarantee. |
| `render/std/to_string` | **15.2 µs** | `std::backtrace::Backtrace::to_string()` baseline. std has no per-instance cache; every call re-walks debug info. |
### Conclusions
1. **Default off is free.** The opt-in design pays for itself: 1.82 ns
per error construction when `RUST_BACKTRACE` is unset (a single CAS),
versus 1.36 µs for the captured path. Default-off error storms cost the
same as an atomic decrement.
2. **Source-error inheritance is real.** Re-wrapping a `CosmosError`
adds the cost of the builder plumbing only — no second stack walk — so
the pipeline's re-wrap sites (transport → service, status promotion,
etc.) do not multiply backtrace cost across nested errors.
3. **Per-instance render cache is the right tradeoff.** A cached
`CosmosError::backtrace()` call returns in ~21 ns, versus 15 µs for
`std::backtrace` — a structural difference of ~700×, mattering on any
path that formats the same error multiple times (e.g.
`tracing::error!("{e}")` + `Result::unwrap` panic message).
4. **Resolution-limiter denial is cheaper than a cache-hit render.** A
denied resolution (~3.2 µs) costs less than rendering the same backtrace
through the warm frame cache (~5.8 µs), so the "deny rather than
partially render" rule cannot regress under capacity pressure.
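The per-instance render cache behind conclusion 3 can be sketched with `std::sync::OnceLock`. This is an illustrative model only: `CachedRender`, its `allowed` flag (standing in for the limiter's verdict), and the placeholder frame text are assumptions, and no real stack walk or symbolication happens here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical holder for a lazily rendered backtrace.
pub struct CachedRender {
    /// Stand-in for "did the resolution limiter grant this render?".
    allowed: bool,
    /// First outcome (rendered text or `None`) is stored once and reused.
    rendered: OnceLock<Option<String>>,
    /// Counts how often the expensive path actually ran.
    render_calls: AtomicUsize,
}

impl CachedRender {
    pub fn new(allowed: bool) -> Self {
        CachedRender {
            allowed,
            rendered: OnceLock::new(),
            render_calls: AtomicUsize::new(0),
        }
    }

    /// Returns the cached render, running the expensive path at most once.
    pub fn backtrace(&self) -> Option<&str> {
        self.rendered
            .get_or_init(|| {
                self.render_calls.fetch_add(1, Ordering::Relaxed);
                if self.allowed {
                    // Placeholder for symbolication of the captured frames.
                    Some("0: demo::frame".to_string())
                } else {
                    // Denied: cache `None` rather than a partial render.
                    None
                }
            })
            .as_deref()
    }

    pub fn render_calls(&self) -> usize {
        self.render_calls.load(Ordering::Relaxed)
    }
}
```

Caching the `None` outcome as well means logging, telemetry, and panic-message paths all observe the same answer for a given error instance.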
All benches and the production code build clean (`-D warnings`, all
features).
1 parent 1999560 commit 6af81eb
146 files changed
Lines changed: 9028 additions & 2680 deletions
File tree
- sdk/cosmos
- .github/skills/cosmos-design-struct
- azure_data_cosmos_benchmarks
- benches
- src
- azure_data_cosmos_driver
- docs
- src
- diagnostics
- driver
- cache
- dataflow
- pipeline
- routing
- transport
- error
- fault_injection
- in_memory_emulator
- models
- options
- query
- eval
- plan
- tests
- system
- tests
- emulator_tests
- in_memory_emulator_tests
- azure_data_cosmos_perf
- src
- operations
- azure_data_cosmos
- docs
- examples/cosmos
- src
- clients
- models
- tests
- emulator_tests
- framework
- in_memory_emulator_tests
- multi_write_tests