[Prototype] Diagnostics capture engine (Cosmos driver)#4619
[Prototype] Diagnostics capture engine (Cosmos driver)#4619NaluTripician wants to merge 23 commits into
Conversation
Adds a parallel, opt-in diagnostics CAPTURE subsystem to the Cosmos driver under `diagnostics::capture`, coexisting with the existing DiagnosticsContext (which is left untouched). It productionizes a benchmarked "Deferred Gated Capture" design: a compact, append-only, lock-free hot path that builds an aggregatable summary (and opt-in AZD1 binary detail) only when an op-end gate decides it is worth it. Subsystem (src/diagnostics/capture/): wire (AZD1 codec, reject-unknown-version + hardened decode), bounded LogPool, lock-free Drop/panic-safe DiagnosticsRecorder (+ fan-out ChildRecord/merge_child), signals-aligned Summary, gate (Off/Threshold/Always), interned version/User-Agent preamble. Additive wiring (default Off => no behavior change, no recorder constructed): - diagnostics/mod.rs: `pub mod capture;` (existing exports preserved). - DriverOptions::with_capture_diagnostics_policy / capture_diagnostics_policy. - CosmosDriver::execute_operation_direct records the operation-level outcome when enabled and attaches it via CosmosResponse::capture_diagnostics(). - lib.rs re-exports (Capture* aliases); criterion dev-dep + bench; flate2 dep. Parallel by design: it currently builds its own Summary/Rendered. Feeding or extending DiagnosticsContext instead is the deferred open question, documented in DIAGNOSTICS-CAPTURE.md. No azure_core/typespec_client_core changes. Bench (release): dropped fast-success ~140 ns; built summary ~1.6 us; built detailed ~54 us. fmt + clippy --all-features --all-targets -D warnings + cargo test --all-features all pass (1940 lib tests incl. 28 capture). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the deferred, threshold-gated capture front-end build the canonical `DiagnosticsContext` instead of a parallel `Summary`/`Rendered`/`AZD1` model. There is now one diagnostics model: the cheap, append-only, lock-free hot-path recorder plus the op-end gate, and past the gate the captured log is replayed onto the existing `DiagnosticsContextBuilder`. - recorder.rs: new `AttemptRecord` value type (exec-context, region, endpoint, status, sub-status, service-request-id, RU, request_sent, durations) + `record_hedge_outcome` + new `record_end` signature; inlined the varint/TLV codec (wire.rs removed). - context.rs (new): `build_context` maps captured attempts to `RequestDiagnostics` and attaches `HedgeDiagnostics` (region legs, winning leg, terminal state) for hedged operations. - gate.rs: `finish` returns `Option<DiagnosticsContext>`; added an `off()` constructor for symmetry. - Retired the parallel `Summary`/`Rendered`/`AZD1` wire format and the interned version preamble; dropped the now-unused `flate2` dependency. - cosmos_response.rs: `capture_diagnostics()` returns `&DiagnosticsContext`. - cosmos_driver.rs: op-executor wiring builds and attaches the context; the error path logs the built context's JSON. - Added `tests/diagnostics_examples.rs` (real normal + hedged `DiagnosticsContext` JSON), an `Off`-mode criterion bench row, and updated DIAGNOSTICS-CAPTURE.md + CHANGELOG to the integrated framing. Full gate green on upstream/main: fmt --check, clippy --all-features --all-targets -D warnings, cargo test --all-features. No azure_core / typespec_client_core changes; capture stays opt-in (default Off). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the gated-capture module the driver's diagnostics ENGINE and the OWNER of the canonical diagnostics model, without losing rich data or breaking the public SDK boundary. This is an ownership flip scoped to the driver crate. - Re-home the rich model: move `diagnostics/diagnostics_context.rs` to `diagnostics/capture/model.rs`. `capture` now owns `DiagnosticsContext`, `DiagnosticsContextBuilder`, `RequestDiagnostics`, `ExecutionContext`, the transport-shard / fault-injection / event types, etc. `diagnostics/mod.rs` re-exports them from `capture` so the public paths (`diagnostics::DiagnosticsContext`, consumed by `azure_data_cosmos`) are unchanged. The pipeline keeps feeding the (now capture-owned) builder, so every rich field + true wall-clock timing is preserved — no data loss. - Default the gate to `Always` (was `Off`): diagnostics are produced out-of-the-box, matching the driver's historical always-on behavior; configurable to `Threshold`/`Off` via `DriverOptions`. - Route through the gate: at the operation-executor seam the recorder records the outcome + elapsed and the gate decides whether the canonical `DiagnosticsContext` is surfaced via `CosmosResponse::capture_diagnostics()` — one model, gated, not rebuilt as a parallel object. The error path logs the error's canonical diagnostics when the gate fires. - Docs: rewrite `DIAGNOSTICS-CAPTURE.md` + the CHANGELOG to the ownership-flip framing (additive / non-breaking to the public SDK; default Always). Public boundary verified intact: `azure_data_cosmos` builds + its 92 lib tests pass; `response.diagnostics()` and the public `DiagnosticsContext` type are unchanged. Full gate green on the driver crate: fmt --check, clippy --all-features --all-targets -D warnings, cargo test --all-features (1926 lib + 2 examples + doctests + 45 integration). No azure_core / typespec changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wire the capture gate so it decides BEFORE the rich build happens, instead of building then dropping. Adds an `enabled` flag to `DiagnosticsContextBuilder`: when the capture policy is `Off` the builder is created disabled, so per-request population is genuinely skipped on the hot path. - model.rs: `DiagnosticsContextBuilder` gains `enabled` + `new_disabled()`. A disabled `start_request` returns an out-of-range `RequestHandle(usize::MAX)` without pushing, so every handle method (`update_request` / `complete_request` / `add_event` / transport-shard / fault-injection) auto-no-ops via the existing `requests.get_mut(handle.0)` miss — no changes to those call sites. `set_hedge_diagnostics` + `merge_hedge_attempt` guard against re-population; `clone_for_hedge_attempt` propagates `enabled`. `complete()` then yields a minimal context (activity id + final status, empty requests). - cosmos_driver.rs: `new_diagnostics_envelope` takes `enabled` and builds a disabled builder when `capture_policy.mode == Off` (skipping the CPU-monitor / machine-id / fault-injection setup); the bootstrap account-fetch stays enabled. - gate.rs: drop `Default` from `Mode` (skeptic-review fix) so a stray `Mode::default()` can't silently return `Off` against the `Always` policy default; the meaningful default lives on `DiagnosticsPolicy`. - Tests: disabled builder records nothing + handle methods are safe no-ops; enabled builder still records fully (guards the default Always path). - Docs: `DIAGNOSTICS-CAPTURE.md` documents the Off short-circuit and the precise `Threshold` fast-success boundary (rich data is incremental; not droppable without a full collection rewrite — no data loss when the gate fires); CHANGELOG updated. Boundary (reported, not silently dropped): `Threshold` fast-success cannot avoid the pipeline's incremental build without data loss, because the slow/error verdict is only known at op-end and the always-on `response.diagnostics()` needs the context; that build is cheap (move + Arc, lazy JSON) and the standalone materialization is already gated. Adversarial skeptic review: all 8 invariants verified (no data loss when the gate fires, handle-safe, no control-flow dependency, hedging-safe, panic/Drop-safe, default Always unchanged, public boundary intact, no secrets); the one finding (Mode Default footgun) is fixed. Full gate green: fmt --check, clippy --all-features --all-targets -D warnings, cargo test --all-features (1928 lib + 2 examples + 45 integration). Public `azure_data_cosmos` builds + 92 lib tests pass. No azure_core / typespec changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Record the conclusion of the full collection-through-append-log investigation in DIAGNOSTICS-CAPTURE.md §8: it is not a net win and is rejected. Capture the three blockers concisely (full fidelity != sub-us; the builder is already lock-free append + lazy JSON with Off disabling per-request population; Instant timestamps and the non_exhaustive FaultInjectionEvaluation enum can't be losslessly byte-encoded) and state the resolution — keep the current design (full fidelity, lock-free append, lazy JSON, Off as the cheap opt-out, default Always). Doc-only; no code changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two additive, non-breaking diagnostics features on the driver crate.
FEATURE 1 — .NET-style top-level Summary block:
- Add DiagnosticsSummary (SafeDebug, Serialize) to DiagnosticsContext,
reachable via DiagnosticsContext::summary(). Modeled on .NET
CosmosDiagnostics' top-level Summary: a (status, sub-status) histogram, total
request charge, request/retry/throttle counts, regions contacted, final
status, top error, total elapsed.
- Computed ONCE at FINALIZATION (in complete() and �ggregate_sub_operations)
as a reduction over the already-collected requests — never on the hot path.
It therefore exists only when a context is built; a dropped fast success
produces no context and no summary.
- Serialized as the first top-level section of the diagnostics output (like
.NET puts Summary at the top). Parity tests for typical / retry+throttle /
error / hedged. Existing structural golden tests strip the summary key (it has
dedicated tests + timing-dependent total_duration_ms).
FEATURE 2 — diagnostics encoding as a client option:
- Add DiagnosticsEncoding { Json (pretty, default), Compact (minified),
Encoded (base64 of compact JSON) } and expose it on DriverOptions
(with_diagnostics_encoding / diagnostics_encoding).
- Add DiagnosticsContext::encode(encoding) -> String and
CosmosResponse::diagnostics_string(encoding) (pass the configured option).
Default Json keeps existing output unchanged; Encoded round-trips via base64.
Docs (DIAGNOSTICS-CAPTURE.md §5a/§5b) + CHANGELOG + examples test updated. Full
gate green: fmt --check, clippy --all-features --all-targets -D warnings, cargo
test --all-features (1934 lib + 3 examples + 45 integration). Public
azure_data_cosmos builds + 92 lib tests pass (boundary unchanged). No
azure_core / typespec changes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add serializable absolute timestamps alongside the existing `Instant`-based durations so diagnostics can be correlated against server-side and external logs. - `DiagnosticsContext` gains an operation `start_time`/`end_time` (captured at recorder creation and at `complete()`), with matching accessors. The `DiagnosticsSummary` carries the same window. - Each `RequestDiagnostics` attempt gains a `start_time` and an optional `end_time` (captured on complete/timeout/transport-failure, omitted via `skip_serializing_if` while in flight). - Captured with `azure_core::time::OffsetDateTime` and serialized as RFC 3339 via a small `rfc3339` serde helper — no new dependency and no `azure_core` change. Additive and non-breaking: the fields exist only when a context is built, durations are unchanged, and the default output is otherwise the same. Golden JSON tests strip the non-deterministic timestamps in `normalize_diagnostics_json`; a new `timestamps_present_and_rfc3339` test plus the typical-operation parity test assert presence and RFC 3339 format. Updated DIAGNOSTICS-CAPTURE.md (§5c) and the CHANGELOG. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the diagnostics-capture recorder's Vec<u8> TLV byte stream with a typed, two-list event log (Vec<Span> + Vec<Attr>). Recording an event is a single Vec::push of a small typed value -- no varints or byte encoding on the hot path. A span carries a kind, an optional parent id, and op-relative timestamps (its id is its index); attributes are typed key/value pairs tagged with their owning span. - event.rs (new): SpanKind, Span, AttrKey, AttrValue, Attr, EventLog with push/clear/reconstruction helpers and a public compact-bytes round-trip. - pool.rs: pool the two-Vec EventLog instead of Vec<u8>; bounded retention. - recorder.rs: DiagnosticsRecorder pushes typed spans/attrs; same public API (AttemptRecord/HedgeOutcome/start/record_*); Drop/panic-safe pooling. - context.rs: walk the spans/attrs tree onto DiagnosticsContextBuilder -- the typed log is the parsed form, so the byte-parse step is gone. - encode.rs (new): optional cold-path varint/TLV serialization of the two lists, with round-trip + truncation tests. - gate.rs: finish() builds from the EventLog then returns it to the pool. - mod.rs: module decls + event-type re-exports. - DIAGNOSTICS-CAPTURE.md (5d) + CHANGELOG + bench doc updated. Internal to the capture engine; the public DiagnosticsContext output and the azure_data_cosmos boundary are unchanged. fmt + clippy --all-features --all-targets -D warnings clean; capture unit tests, diagnostics_examples, full lib suite (1944), and doctests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces an opt-in diagnostics capture engine for azure_data_cosmos_driver that records a cheap, pooled, typed event log on the hot path and applies an operation-end gate (Off / Threshold / Always) to decide whether to surface diagnostics, while keeping the existing public DiagnosticsContext boundary intact via re-exports.
Changes:
- Adds
diagnostics::capture(event log + recorder + pool + gate + cold-path reconstruction/encoding) and re-homes the canonical diagnostics model under that module. - Extends driver options and response APIs to support capture gating and diagnostics string encodings (
Json/Compact/Encoded), plus adds summary + RFC3339 wall-clock timestamps to diagnostics output. - Adds unit tests, example-generating tests, a live env-gated probe, and a Criterion benchmark for capture hot/cold paths.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos_driver/tests/live_capture_diagnostics.rs | Env-gated live probe for capture diagnostics behavior. |
| sdk/cosmos/azure_data_cosmos_driver/tests/diagnostics_examples.rs | Tests that print example diagnostics outputs (for PR/docs) and validate encoding/timestamps. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/mod.rs | Re-exports DiagnosticsEncoding from options. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/driver_options.rs | Adds capture policy + diagnostics encoding to DriverOptions and builder + tests. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/diagnostics_options.rs | Introduces DiagnosticsEncoding enum and string/display helpers. |
| sdk/cosmos/azure_data_cosmos_driver/src/models/cosmos_response.rs | Adds capture_diagnostics() and diagnostics_string(encoding) on responses. |
| sdk/cosmos/azure_data_cosmos_driver/src/lib.rs | Re-exports capture policy/mode aliases and DiagnosticsEncoding at crate root. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs | Integrates capture gating into operation execution and adds a shared LogPool. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/mod.rs | Re-exports diagnostics model/types from diagnostics::capture and exposes capture module. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/mod.rs | New capture module entry point + public exports + module docs/tests. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/event.rs | New typed two-list EventLog (Span + Attr) and helpers. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs | New bounded LogPool for reusing EventLog buffers. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/recorder.rs | New per-operation DiagnosticsRecorder, attempt records, hedge outcomes. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/gate.rs | New gate policy (DiagnosticsPolicy, Mode) + finish/should_build. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/context.rs | Cold-path reconstruction from EventLog into canonical DiagnosticsContext. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/encode.rs | Cold-path compact binary encoding/decoding for the event log. |
| sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/model.rs | Capture-owned canonical diagnostics model updates: summary + timestamps + encoding. |
| sdk/cosmos/azure_data_cosmos_driver/DIAGNOSTICS-CAPTURE.md | New design/usage document for capture engine and tradeoffs. |
| sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md | Adds release notes describing the capture engine and diagnostics additions. |
| sdk/cosmos/azure_data_cosmos_driver/Cargo.toml | Adds Criterion dev-dependency and registers a new benchmark target. |
| sdk/cosmos/azure_data_cosmos_driver/benches/diagnostics_capture.rs | Adds Criterion benchmarks for capture drop/build paths. |
| Cargo.lock | Pulls in Criterion dependency for bench support. |
Comments suppressed due to low confidence (1)
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/model.rs:1200
- This region dedupes
regionswithVec::containsinside the request loop, which is O(n²) for fan-out operations. A simple alternative is to push all regions, thensort()+dedup()once after the loop (O(n log n)) while keeping deterministic output.
…to nalutripician/diagnostics-capture-redesign # Conflicts: # Cargo.lock # sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md # sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs # sdk/cosmos/azure_data_cosmos_driver/src/options/driver_options.rs
After merging main, the live capture diagnostics test used the removed runtime.get_or_create_driver(account, Some(options)) entry point. Update it to create_driver(driver_options), which now carries the account via DriverOptions::builder(account). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…to nalutripician/diagnostics-capture-redesign # Conflicts: # sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/model.rs
cSpell flagged 'verbosities' in diagnostics/capture/model.rs, failing the Analyze job's spell-check step. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
analogrelay
left a comment
There was a problem hiding this comment.
Initial thoughts. Looking like a good start!
Addresses Ashley Stanton-Nurse's design review on PR Azure#4619: - EventLog is now an RAII lease (Arc<LogPool> + inner EventLogStorage); dropping it returns the storage to the pool automatically. Removes the explicit return_buffer step and the recorder's Drop/Option plumbing. (Azure#9, Azure#10) - LogPool holds a plain Mutex<Vec<EventLogStorage>> instead of an internal Arc; consumers hold Arc<LogPool> and rent() takes &Arc<Self>, so sharing is explicit at the call site. (Azure#11) - Removed LogPool::new(); callers use LogPool::default(). (#2) - Default storage capacity (DEFAULT_SPANS=8, DEFAULT_ATTRS=32) sized for the common 1-4-attempt op; the pool only retains storages still at the default capacity and frees any that grew. (#1, #3) - SpanId(NonZeroU32) newtype (index + 1), with Option<SpanId> parents replacing the NO_PARENT sentinel; the niche makes Option<SpanId> the size of a u32. (#4) - AttrValue gains StaticStr(&'static str), SharedStr(Arc<str>), and a first-class Status(CosmosStatus) variant alongside U64/F64/Str, so the hot path can store regions/statuses without boxing. (#5, #6, Azure#7) - TimeOffset newtype for span start/end instead of _ns-suffixed u64 fields. (Azure#8) The cold-path encode/decode, context reconstruction, recorder, driver wiring, bench, and tests are updated to the new types. Public DiagnosticsContext output and the azure_data_cosmos boundary are unchanged. fmt + clippy --all-features --all-targets -D warnings, cargo doc -Dwarnings, cspell, capture unit tests (80), diagnostics_examples, full lib suite (1965), and the capture doctest all pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DiagnosticsSummary now carries the client User-Agent (SDK + runtime identity), surfaced via DiagnosticsSummary::user_agent() and serialized inside the top-level summary block. The DiagnosticsContextBuilder gains a user_agent field + set_user_agent setter (propagated across hedge-attempt clones), and the driver's new_diagnostics_envelope feeds it from the runtime's user agent. It is omitted from the JSON when not supplied, so existing output is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every pipeline event in a request's `events` list was serialized with `"duration_ms": null`. Events are recorded via `RequestEvent::new`, which leaves `duration_ms` as `None`, and the only constructor that sets it (`with_duration`) is used solely in tests -- so no production path ever populated the field and it was always null. Stamp the stage duration when an event is recorded on a request: `RequestDiagnostics::add_event` now fills `duration_ms` (when the caller did not supply one) with the time elapsed since the previous event, or since the request started for the first event. So `response_headers_received` carries the transport time-to-first-byte and `transport_failed` carries the time until the failure. A caller-provided duration is still respected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Gate the unfinished event-log capture engine behind an off-by-default
`capture_engine` feature and align all docs/claims with what actually
ships, with no fidelity regression on the default path.
- A1: feature-gate event/context/encode/pool/recorder + gate::finish and
their re-exports behind `capture_engine` (default OFF). Keep the gate
(Mode/DiagnosticsPolicy/should_build) and the canonical model
unconditional. Gate the engine-only example test + bench on the feature.
- A3/A1: the driver no longer rents the inert recorder. It computes the
exposure gate directly (outcome + elapsed) so `capture_diagnostics()`
behavior is unchanged with the feature off; the builder remains the sole
populator of the surfaced context.
- A4: drop per-op `endpoint.to_string()` / `format!("{op:?} {res:?}")`.
- A2: compute `DiagnosticsSummary` lazily via `OnceLock` on first access
instead of eagerly in `complete()`; JSON output stays byte-identical.
- A6: add symmetric `CosmosError::capture_diagnostics()` populated by the
same gate; replace the tracing::debug!-only failure handling.
- A5: hardcoded attempt-count `1` removed with the inert wiring; true-count
plumbing belongs to the flagged B-series when capture is wired.
- A9: rewrite DIAGNOSTICS-CAPTURE.md, CHANGELOG, capture/driver_options
rustdoc to the honest state (default path = builder; capture behind the
flag; reconstruction still lossy; parity gate pending).
build/test/clippy/fmt green with and without `capture_engine`.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the capture-engine reconstruction faithful instead of fabricating, add a parity harness and concurrency/codec hardening tests, and correct the docs accordingly. All behind the off-by-default capture_engine feature (exercised by CI via --all-features); the default diagnostics path is unchanged. - Capture per-attempt pipeline type, transport security, transport kind, HTTP version, and the server-reported duration into the event log (new AttrKeys + AttemptRecord::with_transport / with_server_duration_ms), and reconstruct them in context.rs instead of hardcoding DataPlane/Secure/Gateway/Http2 and the client-observed span. - Add a parity harness asserting the reconstruction carries the real facets and matches a builder-built reference field-by-field. - Cap decode() pre-allocation against the remaining input so a malformed header cannot drive a giant allocation; add a regression test. - Add a multi-threaded LogPool rent/return stress test. - Refresh DIAGNOSTICS-CAPTURE.md, CHANGELOG, and module docs to the current state (facets captured + parity-checked; remaining work is feeding the recorder from the live pipeline). build/test/clippy/doc/fmt green with and without capture_engine. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The driver evaluates the exposure gate directly; it no longer rents the event-log recorder on the default path. Fix the scope section accordingly and drop a dangling cross-reference. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The capture_engine feature gates the DiagnosticsContext import in gate.rs, so the bare intra-doc link on DiagnosticsPolicy::always broke under the docs.rs feature set (which excludes capture_engine), failing the CI 'cargo +nightly docs-rs' step. Use the fully-qualified link path so it resolves regardless of the imported-into-scope items. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Build Analyze job runs cSpell, which flagged three coined abbreviations in test/comment identifiers. Rename to dictionary words: - parity test local refr -> ref_req - encode test name overallocate -> over_allocate - model.rs Clone comment uncomputed -> unset Verified clean with eng/common/spelling/Invoke-Cspell.ps1 over all changed files; capture tests still pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cargo fmt wraps the transport_http_version assertion now that the local rename pushed it past the 100-col width. Verified the full Analyze gate locally (fmt, taplo, check, clippy, doc, cargo +nightly docs-rs, cSpell). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
|
||
| /// Base64 of the compact JSON — a single opaque token for size-sensitive logging. Decode with | ||
| /// standard base64 and parse the resulting JSON to recover the full diagnostics. | ||
| Encoded, |
There was a problem hiding this comment.
Why - this would always result in at least 33% larger output than Compact?
Section A items 1 & 2 from the diagnostics-sync review with Fabian. Item 2 - remove the `Off` diagnostics-capture policy entirely so diagnostics are always collected (a customer who silently dropped diagnostics is unsupportable). Default stays `Always`; the gate only governs *exposure*: - gate.rs: drop `Mode::Off`, `DiagnosticsPolicy::off()`, the `Mode::Off` arm in `should_build`, and the `off_never_builds` test. - model.rs: drop the `DiagnosticsContextBuilder` `enabled` flag, `new_disabled()`, `is_enabled()`, and the `if !enabled` guards in `start_request` / `set_hedge_diagnostics` / `merge_hedge_attempt`; the builder always records. - cosmos_driver.rs: `new_diagnostics_envelope` no longer takes/branches on `enabled`; drop the `diagnostics_enabled` computation and the `if diagnostics_enabled` wrappers in the op-end gate (success + error paths). - Update Off references in driver_options.rs, cosmos_response.rs, capture/mod.rs, DIAGNOSTICS-CAPTURE.md, and the unreleased CHANGELOG entry. Item 1 - clarify what the capture benchmark measures. `capture_built_context` builds the `DiagnosticsContext` struct only (no JSON). Added `capture_built_context_and_json` which also serializes the detailed JSON, so the two isolate the serialization cost. Measured (criterion, indicative): gate-drop ~0.86us, struct-build ~3.7us, struct+JSON ~10.4us - JSON serialization adds ~6.6us and is the single most expensive step, supporting keeping serialization on-demand/cached in the SDK. Updated the section 3 table and section 7 run instructions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Pushed Removed the Clarified the benchmark (struct-build vs JSON-serialize). Measured (criterion, indicative medians):
So JSON serialization adds ~6.6 µs and is the single most expensive step (it nearly triples the struct-build cost). That argues for keeping JSON serialization on-demand/cached in the SDK; the ~3.7 µs struct build is the part a driver-side threshold gate would save. Gate green: |
What ships by default
DiagnosticsContextBuilderremains the sole populator of the diagnostics thatCosmosResponse::diagnostics()returns, with full fidelity (events, transport + failed shards,session token, system usage/CPU, machine id, fault injection, RU breakdown, true timing, hedging).
response.diagnostics()stays non-optional.diagnostics::capturecompiled by default is the gate: aDiagnosticsPolicy(
Off/Threshold/Always, defaultAlways) plusshould_build. The driver evaluates it atoperation end against the outcome + elapsed time to decide whether the builder-produced context is
also exposed through the gated accessors
CosmosResponse::capture_diagnostics()and the newsymmetric
CosmosError::capture_diagnostics(). The gate never builds the surfaced context — itonly governs exposure.
Offadditionally creates the builder disabled so per-request population isskipped.
DiagnosticsSummary) is computed lazily on first access(cached via
OnceLock) instead of eagerly at finalization; JSON output is byte-identical.The
capture_enginefeature (off by default)Behind the feature, a lock-free, pooled hot-path recorder appends a typed two-list event log
(
Vec<Span>+Vec<Attr>); recording an event is a singleVec::push. ASpancarries akind,an
Option<SpanId>parent (SpanIdis aNonZeroindex+1, soOption<SpanId>uses the niche), andop-relative
TimeOffsetstart/end. AnAttris a typed key/value tagged with itsSpanId, withAttrValuekeeping the path allocation-light (U64/F64/first-classStatus(CosmosStatus)plusStaticStr(&'static str),SharedStr(Arc<str>), ownedStr(Box<str>)). Storage is leased throughan
EventLogRAII handle that returns it to the pool on drop.Past the gate the log is replayed onto a
DiagnosticsContextBuilder. The reconstruction carries thecaptured per-attempt facets (pipeline type, transport security/kind, HTTP version) and the
server-reported duration, reconstructed from what was recorded rather than hardcoded; a parity
harness validates the reconstruction against a builder-built reference. The compact binary codec
caps its allocations against the input length and is covered by round-trip, truncation, unknown-key,
and oversized-header tests, and the pool has a multi-threaded rent/return stress test.
The feature is exercised by CI (which builds, tests, clippys, and docs with
--all-features).Remaining work
The capture engine is not yet wired into the live driver pipeline — the recorder is not fed at the
transport/pipeline sites, so on a real operation it would not yet carry every field the builder
records. Feeding the recorder from the live pipeline (including
&'static/Arc<str>region names)is the prerequisite before the engine could surface
capture_diagnostics()from its ownreconstruction and, eventually, replace the builder. None of that is switched on here.
Guardrails
response.diagnostics()stays non-optional; theazure_data_cosmospublic boundary is unchanged.[Prototype]draft; the default is not flipped and the builder is not removed.cargo build/test/clippy --all-targets/doc/fmtare green with and withoutcapture_engine.See
sdk/cosmos/azure_data_cosmos_driver/DIAGNOSTICS-CAPTURE.mdfor the full design notes.