Skip to content

[Prototype] Diagnostics capture engine (Cosmos driver)#4619

Draft
NaluTripician wants to merge 23 commits into
Azure:mainfrom
NaluTripician:nalutripician/diagnostics-capture-redesign
Draft

[Prototype] Diagnostics capture engine (Cosmos driver)#4619
NaluTripician wants to merge 23 commits into
Azure:mainfrom
NaluTripician:nalutripician/diagnostics-capture-redesign

Conversation

@NaluTripician

@NaluTripician NaluTripician commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Status: prototype / draft. Opening for design review. This PR explores a deferred,
threshold-gated diagnostics capture engine for the Cosmos driver. The event-log engine is
off by default behind the capture_engine feature; the way the driver produces diagnostics by
default is unchanged.

What ships by default

  • The always-on DiagnosticsContextBuilder remains the sole populator of the diagnostics that
    CosmosResponse::diagnostics() returns, with full fidelity (events, transport + failed shards,
    session token, system usage/CPU, machine id, fault injection, RU breakdown, true timing, hedging).
    response.diagnostics() stays non-optional.
  • The only part of diagnostics::capture compiled by default is the gate: a DiagnosticsPolicy
    (Off / Threshold / Always, default Always) plus should_build. The driver evaluates it at
    operation end against the outcome + elapsed time to decide whether the builder-produced context is
    also exposed through the gated accessors CosmosResponse::capture_diagnostics() and the new
    symmetric CosmosError::capture_diagnostics(). The gate never builds the surfaced context — it
    only governs exposure. Off additionally creates the builder disabled so per-request population is
    skipped.
  • The operation summary roll-up (DiagnosticsSummary) is computed lazily on first access
    (cached via OnceLock) instead of eagerly at finalization; JSON output is byte-identical.

The capture_engine feature (off by default)

Behind the feature, a lock-free, pooled hot-path recorder appends a typed two-list event log
(Vec<Span> + Vec<Attr>); recording an event is a single Vec::push. A Span carries a kind,
an Option<SpanId> parent (SpanId is a NonZero index+1, so Option<SpanId> uses the niche), and
op-relative TimeOffset start/end. An Attr is a typed key/value tagged with its SpanId, with
AttrValue keeping the path allocation-light (U64/F64/first-class Status(CosmosStatus) plus
StaticStr(&'static str), SharedStr(Arc<str>), owned Str(Box<str>)). Storage is leased through
an EventLog RAII handle that returns it to the pool on drop.

Past the gate the log is replayed onto a DiagnosticsContextBuilder. The reconstruction carries the
captured per-attempt facets (pipeline type, transport security/kind, HTTP version) and the
server-reported duration, reconstructed from what was recorded rather than hardcoded; a parity
harness validates the reconstruction against a builder-built reference. The compact binary codec
caps its allocations against the input length and is covered by round-trip, truncation, unknown-key,
and oversized-header tests, and the pool has a multi-threaded rent/return stress test.

The feature is exercised by CI (which builds, tests, clippys, and docs with --all-features).

Remaining work

The capture engine is not yet wired into the live driver pipeline — the recorder is not fed at the
transport/pipeline sites, so on a real operation it would not yet carry every field the builder
records. Feeding the recorder from the live pipeline (including &'static / Arc<str> region names)
is the prerequisite before the engine could surface capture_diagnostics() from its own
reconstruction and, eventually, replace the builder. None of that is switched on here.

Guardrails

  • No diagnostics fidelity regression on the default path.
  • response.diagnostics() stays non-optional; the azure_data_cosmos public boundary is unchanged.
  • The PR stays a [Prototype] draft; the default is not flipped and the builder is not removed.
  • cargo build / test / clippy --all-targets / doc / fmt are green with and without
    capture_engine.

See sdk/cosmos/azure_data_cosmos_driver/DIAGNOSTICS-CAPTURE.md for the full design notes.

NaluTripician and others added 8 commits June 15, 2026 13:03
Adds a parallel, opt-in diagnostics CAPTURE subsystem to the Cosmos driver
under `diagnostics::capture`, coexisting with the existing DiagnosticsContext
(which is left untouched). It productionizes a benchmarked "Deferred Gated
Capture" design: a compact, append-only, lock-free hot path that builds an
aggregatable summary (and opt-in AZD1 binary detail) only when an op-end gate
decides it is worth it.

Subsystem (src/diagnostics/capture/): wire (AZD1 codec, reject-unknown-version
+ hardened decode), bounded LogPool, lock-free Drop/panic-safe
DiagnosticsRecorder (+ fan-out ChildRecord/merge_child), signals-aligned
Summary, gate (Off/Threshold/Always), interned version/User-Agent preamble.

Additive wiring (default Off => no behavior change, no recorder constructed):
- diagnostics/mod.rs: `pub mod capture;` (existing exports preserved).
- DriverOptions::with_capture_diagnostics_policy / capture_diagnostics_policy.
- CosmosDriver::execute_operation_direct records the operation-level outcome
  when enabled and attaches it via CosmosResponse::capture_diagnostics().
- lib.rs re-exports (Capture* aliases); criterion dev-dep + bench; flate2 dep.

Parallel by design: it currently builds its own Summary/Rendered. Feeding or
extending DiagnosticsContext instead is the deferred open question, documented
in DIAGNOSTICS-CAPTURE.md. No azure_core/typespec_client_core changes.

Bench (release): dropped fast-success ~140 ns; built summary ~1.6 us; built
detailed ~54 us. fmt + clippy --all-features --all-targets -D warnings +
cargo test --all-features all pass (1940 lib tests incl. 28 capture).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the deferred, threshold-gated capture front-end build the canonical
`DiagnosticsContext` instead of a parallel `Summary`/`Rendered`/`AZD1`
model. There is now one diagnostics model: the cheap, append-only, lock-free
hot-path recorder plus the op-end gate, and past the gate the captured log is
replayed onto the existing `DiagnosticsContextBuilder`.

- recorder.rs: new `AttemptRecord` value type (exec-context, region,
  endpoint, status, sub-status, service-request-id, RU, request_sent,
  durations) + `record_hedge_outcome` + new `record_end` signature; inlined
  the varint/TLV codec (wire.rs removed).
- context.rs (new): `build_context` maps captured attempts to
  `RequestDiagnostics` and attaches `HedgeDiagnostics` (region legs,
  winning leg, terminal state) for hedged operations.
- gate.rs: `finish` returns `Option<DiagnosticsContext>`; added an
  `off()` constructor for symmetry.
- Retired the parallel `Summary`/`Rendered`/`AZD1` wire format and the
  interned version preamble; dropped the now-unused `flate2` dependency.
- cosmos_response.rs: `capture_diagnostics()` returns `&DiagnosticsContext`.
- cosmos_driver.rs: op-executor wiring builds and attaches the context; the
  error path logs the built context's JSON.
- Added `tests/diagnostics_examples.rs` (real normal + hedged
  `DiagnosticsContext` JSON), an `Off`-mode criterion bench row, and
  updated DIAGNOSTICS-CAPTURE.md + CHANGELOG to the integrated framing.

Full gate green on upstream/main: fmt --check, clippy --all-features
--all-targets -D warnings, cargo test --all-features. No azure_core /
typespec_client_core changes; capture stays opt-in (default Off).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the gated-capture module the driver's diagnostics ENGINE and the OWNER
of the canonical diagnostics model, without losing rich data or breaking the
public SDK boundary. This is an ownership flip scoped to the driver crate.

- Re-home the rich model: move `diagnostics/diagnostics_context.rs` to
  `diagnostics/capture/model.rs`. `capture` now owns `DiagnosticsContext`,
  `DiagnosticsContextBuilder`, `RequestDiagnostics`, `ExecutionContext`,
  the transport-shard / fault-injection / event types, etc. `diagnostics/mod.rs`
  re-exports them from `capture` so the public paths
  (`diagnostics::DiagnosticsContext`, consumed by `azure_data_cosmos`) are
  unchanged. The pipeline keeps feeding the (now capture-owned) builder, so
  every rich field + true wall-clock timing is preserved — no data loss.
- Default the gate to `Always` (was `Off`): diagnostics are produced
  out-of-the-box, matching the driver's historical always-on behavior;
  configurable to `Threshold`/`Off` via `DriverOptions`.
- Route through the gate: at the operation-executor seam the recorder records
  the outcome + elapsed and the gate decides whether the canonical
  `DiagnosticsContext` is surfaced via `CosmosResponse::capture_diagnostics()`
  — one model, gated, not rebuilt as a parallel object. The error path logs the
  error's canonical diagnostics when the gate fires.
- Docs: rewrite `DIAGNOSTICS-CAPTURE.md` + the CHANGELOG to the ownership-flip
  framing (additive / non-breaking to the public SDK; default Always).

Public boundary verified intact: `azure_data_cosmos` builds + its 92 lib
tests pass; `response.diagnostics()` and the public `DiagnosticsContext` type
are unchanged. Full gate green on the driver crate: fmt --check, clippy
--all-features --all-targets -D warnings, cargo test --all-features (1926 lib +
2 examples + doctests + 45 integration). No azure_core / typespec changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wire the capture gate so it decides BEFORE the rich build happens, instead of
building then dropping. Adds an `enabled` flag to `DiagnosticsContextBuilder`:
when the capture policy is `Off` the builder is created disabled, so per-request
population is genuinely skipped on the hot path.

- model.rs: `DiagnosticsContextBuilder` gains `enabled` + `new_disabled()`.
  A disabled `start_request` returns an out-of-range `RequestHandle(usize::MAX)`
  without pushing, so every handle method (`update_request` / `complete_request`
  / `add_event` / transport-shard / fault-injection) auto-no-ops via the existing
  `requests.get_mut(handle.0)` miss — no changes to those call sites.
  `set_hedge_diagnostics` + `merge_hedge_attempt` guard against re-population;
  `clone_for_hedge_attempt` propagates `enabled`. `complete()` then yields a
  minimal context (activity id + final status, empty requests).
- cosmos_driver.rs: `new_diagnostics_envelope` takes `enabled` and builds a
  disabled builder when `capture_policy.mode == Off` (skipping the CPU-monitor /
  machine-id / fault-injection setup); the bootstrap account-fetch stays enabled.
- gate.rs: drop `Default` from `Mode` (skeptic-review fix) so a stray
  `Mode::default()` can't silently return `Off` against the `Always` policy
  default; the meaningful default lives on `DiagnosticsPolicy`.
- Tests: disabled builder records nothing + handle methods are safe no-ops;
  enabled builder still records fully (guards the default Always path).
- Docs: `DIAGNOSTICS-CAPTURE.md` documents the Off short-circuit and the precise
  `Threshold` fast-success boundary (rich data is incremental; not droppable
  without a full collection rewrite — no data loss when the gate fires); CHANGELOG
  updated.

Boundary (reported, not silently dropped): `Threshold` fast-success cannot avoid
the pipeline's incremental build without data loss, because the slow/error verdict
is only known at op-end and the always-on `response.diagnostics()` needs the
context; that build is cheap (move + Arc, lazy JSON) and the standalone
materialization is already gated.

Adversarial skeptic review: all 8 invariants verified (no data loss when the gate
fires, handle-safe, no control-flow dependency, hedging-safe, panic/Drop-safe,
default Always unchanged, public boundary intact, no secrets); the one finding
(Mode Default footgun) is fixed. Full gate green: fmt --check, clippy
--all-features --all-targets -D warnings, cargo test --all-features (1928 lib + 2
examples + 45 integration). Public `azure_data_cosmos` builds + 92 lib tests
pass. No azure_core / typespec changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Record the conclusion of the full collection-through-append-log investigation
in DIAGNOSTICS-CAPTURE.md §8: it is not a net win and is rejected. Capture the
three blockers concisely (full fidelity != sub-us; the builder is already
lock-free append + lazy JSON with Off disabling per-request population; Instant
timestamps and the non_exhaustive FaultInjectionEvaluation enum can't be
losslessly byte-encoded) and state the resolution — keep the current design
(full fidelity, lock-free append, lazy JSON, Off as the cheap opt-out, default
Always). Doc-only; no code changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two additive, non-breaking diagnostics features on the driver crate.

FEATURE 1 — .NET-style top-level Summary block:
- Add DiagnosticsSummary (SafeDebug, Serialize) to DiagnosticsContext,
  reachable via DiagnosticsContext::summary(). Modeled on .NET
  CosmosDiagnostics' top-level Summary: a (status, sub-status) histogram, total
  request charge, request/retry/throttle counts, regions contacted, final
  status, top error, total elapsed.
- Computed ONCE at FINALIZATION (in complete() and �ggregate_sub_operations)
  as a reduction over the already-collected requests — never on the hot path.
  It therefore exists only when a context is built; a dropped fast success
  produces no context and no summary.
- Serialized as the first top-level section of the diagnostics output (like
  .NET puts Summary at the top). Parity tests for typical / retry+throttle /
  error / hedged. Existing structural golden tests strip the summary key (it has
  dedicated tests + timing-dependent total_duration_ms).

FEATURE 2 — diagnostics encoding as a client option:
- Add DiagnosticsEncoding { Json (pretty, default), Compact (minified),
  Encoded (base64 of compact JSON) } and expose it on DriverOptions
  (with_diagnostics_encoding / diagnostics_encoding).
- Add DiagnosticsContext::encode(encoding) -> String and
  CosmosResponse::diagnostics_string(encoding) (pass the configured option).
  Default Json keeps existing output unchanged; Encoded round-trips via base64.

Docs (DIAGNOSTICS-CAPTURE.md §5a/§5b) + CHANGELOG + examples test updated. Full
gate green: fmt --check, clippy --all-features --all-targets -D warnings, cargo
test --all-features (1934 lib + 3 examples + 45 integration). Public
azure_data_cosmos builds + 92 lib tests pass (boundary unchanged). No
azure_core / typespec changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add serializable absolute timestamps alongside the existing
`Instant`-based durations so diagnostics can be correlated against
server-side and external logs.

- `DiagnosticsContext` gains an operation `start_time`/`end_time`
  (captured at recorder creation and at `complete()`), with matching
  accessors. The `DiagnosticsSummary` carries the same window.
- Each `RequestDiagnostics` attempt gains a `start_time` and an optional
  `end_time` (captured on complete/timeout/transport-failure, omitted
  via `skip_serializing_if` while in flight).
- Captured with `azure_core::time::OffsetDateTime` and serialized as
  RFC 3339 via a small `rfc3339` serde helper — no new dependency and no
  `azure_core` change.

Additive and non-breaking: the fields exist only when a context is built,
durations are unchanged, and the default output is otherwise the same.
Golden JSON tests strip the non-deterministic timestamps in
`normalize_diagnostics_json`; a new `timestamps_present_and_rfc3339` test
plus the typical-operation parity test assert presence and RFC 3339
format. Updated DIAGNOSTICS-CAPTURE.md (§5c) and the CHANGELOG.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the diagnostics-capture recorder's Vec<u8> TLV byte stream with a typed, two-list event log (Vec<Span> + Vec<Attr>). Recording an event is a single Vec::push of a small typed value -- no varints or byte encoding on the hot path. A span carries a kind, an optional parent id, and op-relative timestamps (its id is its index); attributes are typed key/value pairs tagged with their owning span.

- event.rs (new): SpanKind, Span, AttrKey, AttrValue, Attr, EventLog with push/clear/reconstruction helpers and a public compact-bytes round-trip.
- pool.rs: pool the two-Vec EventLog instead of Vec<u8>; bounded retention.
- recorder.rs: DiagnosticsRecorder pushes typed spans/attrs; same public API (AttemptRecord/HedgeOutcome/start/record_*); Drop/panic-safe pooling.
- context.rs: walk the spans/attrs tree onto DiagnosticsContextBuilder -- the typed log is the parsed form, so the byte-parse step is gone.
- encode.rs (new): optional cold-path varint/TLV serialization of the two lists, with round-trip + truncation tests.
- gate.rs: finish() builds from the EventLog then returns it to the pool.
- mod.rs: module decls + event-type re-exports.
- DIAGNOSTICS-CAPTURE.md (5d) + CHANGELOG + bench doc updated.

Internal to the capture engine; the public DiagnosticsContext output and the azure_data_cosmos boundary are unchanged. fmt + clippy --all-features --all-targets -D warnings clean; capture unit tests, diagnostics_examples, full lib suite (1944), and doctests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an opt-in diagnostics capture engine for azure_data_cosmos_driver that records a cheap, pooled, typed event log on the hot path and applies an operation-end gate (Off / Threshold / Always) to decide whether to surface diagnostics, while keeping the existing public DiagnosticsContext boundary intact via re-exports.

Changes:

  • Adds diagnostics::capture (event log + recorder + pool + gate + cold-path reconstruction/encoding) and re-homes the canonical diagnostics model under that module.
  • Extends driver options and response APIs to support capture gating and diagnostics string encodings (Json / Compact / Encoded), plus adds summary + RFC3339 wall-clock timestamps to diagnostics output.
  • Adds unit tests, example-generating tests, a live env-gated probe, and a Criterion benchmark for capture hot/cold paths.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
sdk/cosmos/azure_data_cosmos_driver/tests/live_capture_diagnostics.rs Env-gated live probe for capture diagnostics behavior.
sdk/cosmos/azure_data_cosmos_driver/tests/diagnostics_examples.rs Tests that print example diagnostics outputs (for PR/docs) and validate encoding/timestamps.
sdk/cosmos/azure_data_cosmos_driver/src/options/mod.rs Re-exports DiagnosticsEncoding from options.
sdk/cosmos/azure_data_cosmos_driver/src/options/driver_options.rs Adds capture policy + diagnostics encoding to DriverOptions and builder + tests.
sdk/cosmos/azure_data_cosmos_driver/src/options/diagnostics_options.rs Introduces DiagnosticsEncoding enum and string/display helpers.
sdk/cosmos/azure_data_cosmos_driver/src/models/cosmos_response.rs Adds capture_diagnostics() and diagnostics_string(encoding) on responses.
sdk/cosmos/azure_data_cosmos_driver/src/lib.rs Re-exports capture policy/mode aliases and DiagnosticsEncoding at crate root.
sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs Integrates capture gating into operation execution and adds a shared LogPool.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/mod.rs Re-exports diagnostics model/types from diagnostics::capture and exposes capture module.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/mod.rs New capture module entry point + public exports + module docs/tests.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/event.rs New typed two-list EventLog (Span + Attr) and helpers.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs New bounded LogPool for reusing EventLog buffers.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/recorder.rs New per-operation DiagnosticsRecorder, attempt records, hedge outcomes.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/gate.rs New gate policy (DiagnosticsPolicy, Mode) + finish/should_build.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/context.rs Cold-path reconstruction from EventLog into canonical DiagnosticsContext.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/encode.rs Cold-path compact binary encoding/decoding for the event log.
sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/model.rs Capture-owned canonical diagnostics model updates: summary + timestamps + encoding.
sdk/cosmos/azure_data_cosmos_driver/DIAGNOSTICS-CAPTURE.md New design/usage document for capture engine and tradeoffs.
sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md Adds release notes describing the capture engine and diagnostics additions.
sdk/cosmos/azure_data_cosmos_driver/Cargo.toml Adds Criterion dev-dependency and registers a new benchmark target.
sdk/cosmos/azure_data_cosmos_driver/benches/diagnostics_capture.rs Adds Criterion benchmarks for capture drop/build paths.
Cargo.lock Pulls in Criterion dependency for bench support.
Comments suppressed due to low confidence (1)

sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/model.rs:1200

  • This region dedupes regions with Vec::contains inside the request loop, which is O(n²) for fan-out operations. A simple alternative is to push all regions, then sort() + dedup() once after the loop (O(n log n)) while keeping deterministic output.

Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/mod.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/DIAGNOSTICS-CAPTURE.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/DIAGNOSTICS-CAPTURE.md
Comment thread sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/recorder.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs Outdated
NaluTripician and others added 2 commits June 17, 2026 14:34
…to nalutripician/diagnostics-capture-redesign

# Conflicts:
#	Cargo.lock
#	sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md
#	sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs
#	sdk/cosmos/azure_data_cosmos_driver/src/options/driver_options.rs
After merging main, the live capture diagnostics test used the removed runtime.get_or_create_driver(account, Some(options)) entry point. Update it to create_driver(driver_options), which now carries the account via DriverOptions::builder(account).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician marked this pull request as draft June 17, 2026 21:45
@NaluTripician NaluTripician changed the title Add a gated diagnostics capture engine to the Cosmos driver [Prototype] Diagnostics capture engine — the default diagnostics path (Cosmos driver) Jun 17, 2026
@NaluTripician NaluTripician changed the title [Prototype] Diagnostics capture engine — the default diagnostics path (Cosmos driver) [Prototype] Diagnostics capture engine (Cosmos driver) Jun 17, 2026
NaluTripician and others added 2 commits June 17, 2026 15:05
…to nalutripician/diagnostics-capture-redesign

# Conflicts:
#	sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/model.rs
cSpell flagged 'verbosities' in diagnostics/capture/model.rs, failing the Analyze job's spell-check step.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@analogrelay analogrelay left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial thoughts. Looking like a good start!

Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/event.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/event.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/event.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/recorder.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/event.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/diagnostics/capture/pool.rs Outdated
NaluTripician and others added 4 commits June 18, 2026 10:05
Addresses Ashley Stanton-Nurse's design review on PR Azure#4619:

- EventLog is now an RAII lease (Arc<LogPool> + inner EventLogStorage); dropping it returns the storage to the pool automatically. Removes the explicit return_buffer step and the recorder's Drop/Option plumbing. (Azure#9, Azure#10)
- LogPool holds a plain Mutex<Vec<EventLogStorage>> instead of an internal Arc; consumers hold Arc<LogPool> and rent() takes &Arc<Self>, so sharing is explicit at the call site. (Azure#11)
- Removed LogPool::new(); callers use LogPool::default(). (#2)
- Default storage capacity (DEFAULT_SPANS=8, DEFAULT_ATTRS=32) sized for the common 1-4-attempt op; the pool only retains storages still at the default capacity and frees any that grew. (#1, #3)
- SpanId(NonZeroU32) newtype (index + 1), with Option<SpanId> parents replacing the NO_PARENT sentinel; the niche makes Option<SpanId> the size of a u32. (#4)
- AttrValue gains StaticStr(&'static str), SharedStr(Arc<str>), and a first-class Status(CosmosStatus) variant alongside U64/F64/Str, so the hot path can store regions/statuses without boxing. (#5, #6, Azure#7)
- TimeOffset newtype for span start/end instead of _ns-suffixed u64 fields. (Azure#8)

The cold-path encode/decode, context reconstruction, recorder, driver wiring, bench, and tests are updated to the new types. Public DiagnosticsContext output and the azure_data_cosmos boundary are unchanged. fmt + clippy --all-features --all-targets -D warnings, cargo doc -Dwarnings, cspell, capture unit tests (80), diagnostics_examples, full lib suite (1965), and the capture doctest all pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DiagnosticsSummary now carries the client User-Agent (SDK + runtime identity), surfaced via DiagnosticsSummary::user_agent() and serialized inside the top-level summary block. The DiagnosticsContextBuilder gains a user_agent field + set_user_agent setter (propagated across hedge-attempt clones), and the driver's new_diagnostics_envelope feeds it from the runtime's user agent. It is omitted from the JSON when not supplied, so existing output is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every pipeline event in a request's `events` list was serialized with
`"duration_ms": null`. Events are recorded via `RequestEvent::new`, which
leaves `duration_ms` as `None`, and the only constructor that sets it
(`with_duration`) is used solely in tests -- so no production path ever
populated the field and it was always null.

Stamp the stage duration when an event is recorded on a request:
`RequestDiagnostics::add_event` now fills `duration_ms` (when the caller
did not supply one) with the time elapsed since the previous event, or
since the request started for the first event. So `response_headers_received`
carries the transport time-to-first-byte and `transport_failed` carries the
time until the failure. A caller-provided duration is still respected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician and others added 6 commits June 22, 2026 13:48
Gate the unfinished event-log capture engine behind an off-by-default
`capture_engine` feature and align all docs/claims with what actually
ships, with no fidelity regression on the default path.

- A1: feature-gate event/context/encode/pool/recorder + gate::finish and
  their re-exports behind `capture_engine` (default OFF). Keep the gate
  (Mode/DiagnosticsPolicy/should_build) and the canonical model
  unconditional. Gate the engine-only example test + bench on the feature.
- A3/A1: the driver no longer rents the inert recorder. It computes the
  exposure gate directly (outcome + elapsed) so `capture_diagnostics()`
  behavior is unchanged with the feature off; the builder remains the sole
  populator of the surfaced context.
- A4: drop per-op `endpoint.to_string()` / `format!("{op:?} {res:?}")`.
- A2: compute `DiagnosticsSummary` lazily via `OnceLock` on first access
  instead of eagerly in `complete()`; JSON output stays byte-identical.
- A6: add symmetric `CosmosError::capture_diagnostics()` populated by the
  same gate; replace the tracing::debug!-only failure handling.
- A5: hardcoded attempt-count `1` removed with the inert wiring; true-count
  plumbing belongs to the flagged B-series when capture is wired.
- A9: rewrite DIAGNOSTICS-CAPTURE.md, CHANGELOG, capture/driver_options
  rustdoc to the honest state (default path = builder; capture behind the
  flag; reconstruction still lossy; parity gate pending).

build/test/clippy/fmt green with and without `capture_engine`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the capture-engine reconstruction faithful instead of fabricating,
add a parity harness and concurrency/codec hardening tests, and correct
the docs accordingly. All behind the off-by-default capture_engine feature
(exercised by CI via --all-features); the default diagnostics path is
unchanged.

- Capture per-attempt pipeline type, transport security, transport kind,
  HTTP version, and the server-reported duration into the event log
  (new AttrKeys + AttemptRecord::with_transport / with_server_duration_ms),
  and reconstruct them in context.rs instead of hardcoding
  DataPlane/Secure/Gateway/Http2 and the client-observed span.
- Add a parity harness asserting the reconstruction carries the real
  facets and matches a builder-built reference field-by-field.
- Cap decode() pre-allocation against the remaining input so a malformed
  header cannot drive a giant allocation; add a regression test.
- Add a multi-threaded LogPool rent/return stress test.
- Refresh DIAGNOSTICS-CAPTURE.md, CHANGELOG, and module docs to the
  current state (facets captured + parity-checked; remaining work is
  feeding the recorder from the live pipeline).

build/test/clippy/doc/fmt green with and without capture_engine.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The driver evaluates the exposure gate directly; it no longer rents the
event-log recorder on the default path. Fix the scope section accordingly
and drop a dangling cross-reference.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The capture_engine feature gates the DiagnosticsContext import in gate.rs,
so the bare intra-doc link on DiagnosticsPolicy::always broke under the
docs.rs feature set (which excludes capture_engine), failing the CI
'cargo +nightly docs-rs' step. Use the fully-qualified link path so it
resolves regardless of the imported-into-scope items.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Build Analyze job runs cSpell, which flagged three coined abbreviations
in test/comment identifiers. Rename to dictionary words:
- parity test local refr -> ref_req
- encode test name overallocate -> over_allocate
- model.rs Clone comment uncomputed -> unset

Verified clean with eng/common/spelling/Invoke-Cspell.ps1 over all changed
files; capture tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cargo fmt wraps the transport_http_version assertion now that the local
rename pushed it past the 100-col width. Verified the full Analyze gate
locally (fmt, taplo, check, clippy, doc, cargo +nightly docs-rs, cSpell).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/// Base64 of the compact JSON — a single opaque token for size-sensitive logging. Decode with
/// standard base64 and parse the resulting JSON to recover the full diagnostics.
Encoded,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why - this would always result in at least 33% larger output than Compact?

Section A items 1 & 2 from the diagnostics-sync review with Fabian.

Item 2 - remove the `Off` diagnostics-capture policy entirely so diagnostics
are always collected (a customer who silently dropped diagnostics is
unsupportable). Default stays `Always`; the gate only governs *exposure*:
- gate.rs: drop `Mode::Off`, `DiagnosticsPolicy::off()`, the `Mode::Off` arm in
  `should_build`, and the `off_never_builds` test.
- model.rs: drop the `DiagnosticsContextBuilder` `enabled` flag, `new_disabled()`,
  `is_enabled()`, and the `if !enabled` guards in `start_request` /
  `set_hedge_diagnostics` / `merge_hedge_attempt`; the builder always records.
- cosmos_driver.rs: `new_diagnostics_envelope` no longer takes/branches on
  `enabled`; drop the `diagnostics_enabled` computation and the
  `if diagnostics_enabled` wrappers in the op-end gate (success + error paths).
- Update Off references in driver_options.rs, cosmos_response.rs, capture/mod.rs,
  DIAGNOSTICS-CAPTURE.md, and the unreleased CHANGELOG entry.

Item 1 - clarify what the capture benchmark measures. `capture_built_context`
builds the `DiagnosticsContext` struct only (no JSON). Added
`capture_built_context_and_json` which also serializes the detailed JSON, so the
two isolate the serialization cost. Measured (criterion, indicative): gate-drop
~0.86us, struct-build ~3.7us, struct+JSON ~10.4us - JSON serialization adds
~6.6us and is the single most expensive step, supporting keeping serialization
on-demand/cached in the SDK. Updated the section 3 table and section 7 run
instructions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician

Copy link
Copy Markdown
Contributor Author

Pushed 730540709 addressing two items from our diagnostics sync:

Removed the Off capture policy entirely. Diagnostics are now always collected — Mode is just Threshold/Always (default Always), and the gate governs exposure only, never whether the context is built. Dropped Mode::Off, DiagnosticsPolicy::off(), and the DiagnosticsContextBuilder enabled/new_disabled() machinery (incl. the start_request/hedge no-op guards) and the driver's diagnostics_enabled branching. Docs + CHANGELOG updated to the no-Off framing.

Clarified the benchmark (struct-build vs JSON-serialize). capture_built_context measures building the DiagnosticsContext struct only — it does not serialize JSON. Added capture_built_context_and_json which also serializes the detailed JSON, so the two isolate the cost.

Measured (criterion, indicative medians):

Path Cost
Gate drops fast success ~0.86 µs
Build struct only (no JSON) ~3.7 µs
Build struct + detailed JSON ~10.4 µs

So JSON serialization adds ~6.6 µs and is the single most expensive step (it nearly triples the struct-build cost). That argues for keeping JSON serialization on-demand/cached in the SDK; the ~3.7 µs struct build is the part a driver-side threshold gate would save.

Gate green: fmt --check, clippy --all-features --all-targets -D warnings, doc --all-features -D warnings, lib tests (1925 passed) + diagnostics integration tests, and the public azure_data_cosmos still builds unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

4 participants