feat(cosmos-perf): record server-reported request duration as backend latency#4316
Open
tvaron3 wants to merge 9 commits into Azure:release/azure_data_cosmos-previews from
… latency
Reads x-ms-request-duration-ms response header on every Cosmos request
in the perf binary and emits backend_{min,max,mean,p50,p90,p99}_ms per
operation per reporting interval. Surfaces server-side processing time
separately from the client-observed wall-clock latency so network plus
client-queue overhead can be isolated downstream.
Implementation:
- New helper extract_backend_duration in operations/mod.rs parses the
header value as milliseconds (f64) into a Duration.
- Operation::execute now returns Result<Option<Duration>> instead of
Result<()>; each per-op implementation reads the header off the
response (or sums across pages for QueryItems via into_pages()).
- Stats gains a parallel HdrHistogram for backend durations; backend
samples are recorded independently of client samples. Intervals with
no backend samples surface as None on Summary, and the corresponding
fields are omitted from the JSON via skip_serializing_if.
- PerfResult struct gains 6 Option<f64> backend_*_ms fields.
Existing fields, behaviour, and JSON keys are unchanged. Old payloads
without backend_* keys ingest cleanly into ADX (the schema mapping
treats missing keys as null).
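The header parsing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual extract_backend_duration: the function name parse_duration_ms is hypothetical, it takes the raw header string rather than an SDK response type, and rejecting negative or non-finite values is an assumption.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the PR's helper: parses a header value such as
/// "12.5" (milliseconds) into a Duration.
/// Returns None for non-numeric, negative, NaN, or infinite values.
fn parse_duration_ms(value: &str) -> Option<Duration> {
    // The header carries fractional milliseconds, so parse as f64 first.
    let ms: f64 = value.trim().parse().ok()?;
    // try_from_secs_f64 rejects negative, NaN, and out-of-range inputs
    // instead of panicking.
    Duration::try_from_secs_f64(ms / 1000.0).ok()
}
```

Returning Option rather than Result keeps a missing or malformed header from failing the operation itself; the sample is simply not recorded.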
Tests:
- backend_durations_aggregate_separately_from_client verifies the two
histograms are independent.
- backend_summary_is_none_when_no_samples verifies the all-None path
when the header is absent.
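The two properties these tests check can be illustrated with a simplified stand-in. The sketch below uses plain Vec<u64> buffers instead of HdrHistogram, and the names LatencySamples and Summary (with only min/max/mean) are hypothetical; it only shows the shape of "independent samples, None when empty".

```rust
use std::time::Duration;

/// Hypothetical stand-in for Stats: two independent sample sets (microseconds).
#[derive(Default)]
struct LatencySamples {
    client_us: Vec<u64>,
    backend_us: Vec<u64>,
}

#[derive(Debug, PartialEq)]
struct Summary {
    min_ms: f64,
    max_ms: f64,
    mean_ms: f64,
}

impl LatencySamples {
    /// Always records the client sample; records a backend sample only
    /// when the server reported one.
    fn record(&mut self, client: Duration, backend: Option<Duration>) {
        self.client_us.push(to_us(client));
        if let Some(b) = backend {
            self.backend_us.push(to_us(b));
        }
    }

    fn client_summary(&self) -> Option<Summary> {
        summarize(&self.client_us)
    }

    fn backend_summary(&self) -> Option<Summary> {
        summarize(&self.backend_us)
    }
}

fn to_us(d: Duration) -> u64 {
    // Saturate rather than truncate if the value does not fit in u64.
    u64::try_from(d.as_micros()).unwrap_or(u64::MAX)
}

/// Returns None when there are no samples, mirroring the all-None path.
fn summarize(samples: &[u64]) -> Option<Summary> {
    let min = *samples.iter().min()?;
    let max = *samples.iter().max()?;
    let sum: u128 = samples.iter().map(|&v| v as u128).sum();
    Some(Summary {
        min_ms: min as f64 / 1000.0,
        max_ms: max as f64 / 1000.0,
        mean_ms: sum as f64 / samples.len() as f64 / 1000.0,
    })
}
```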
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ws' into feat/perf-backend-latency-v2
# Conflicts:
# sdk/cosmos/azure_data_cosmos_perf/src/operations/create_item.rs
# sdk/cosmos/azure_data_cosmos_perf/src/operations/upsert_item.rs
Renamed bmean -> backend_mean_dur, bmin -> back_min, bmax -> back_max to avoid cspell 'Unknown word' errors in CI Analyze step.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting
Read cgroupv2 cpu.stat and cpu.max to compute pod-level CPU utilization that matches what kubectl top reports. Falls back to None when not running in a cgroup (e.g., local dev). Wire through PerfResult for ADX ingestion.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 'cgroupv' and 'usec' to the allowed words list to fix CI spell-check failures from the cgroup CPU metric addition.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move 'cgroupv' and 'usec' from .vscode/cspell.json to the local sdk/cosmos/.cspell.json ignoreWords list. Reverts the root config.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds backend (server-reported) latency measurement to the Cosmos perf runner by parsing x-ms-request-duration-ms, aggregating it alongside existing wall-clock latency, and emitting per-interval backend percentile/summary fields.
Changes:
- Parse `x-ms-request-duration-ms` into an optional `Duration` and plumb it through `Operation::execute`.
- Track backend-duration histograms separately from client wall-clock latency and emit backend summary stats (plus a “BackendP99” column in the console report).
- Add cgroup CPU quota utilization metric reporting (cgroupv2) and update editor spellchecker word list.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos_perf/src/operations/mod.rs | Adds extract_backend_duration() and changes Operation::execute to return Option<Duration>. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/create_item.rs | Returns backend duration extracted from response headers. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/read_item.rs | Returns backend duration extracted from response headers. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/upsert_item.rs | Returns backend duration extracted from response headers. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/query_items.rs | Iterates query by pages and sums backend duration across pages. |
| sdk/cosmos/azure_data_cosmos_perf/src/stats.rs | Adds backend histograms/summary fields and introduces cgroup CPU percent metric collection/printing. |
| sdk/cosmos/azure_data_cosmos_perf/src/runner.rs | Records backend durations into stats and serializes backend/cgroup metrics in result documents. |
| .vscode/cspell.json | Adds words related to the new cgroup metrics (and reformats the file). |
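For the query_items.rs row, summing backend duration across pages might look like the sketch below. The helper name sum_backend_durations is hypothetical, and it works on already-extracted per-page values rather than the SDK's pager. Treating pages without the header as contributing nothing, and returning None only when no page reported a duration, is an assumption about the intended behaviour.

```rust
use std::time::Duration;

/// Sums the per-page backend durations of one query.
/// Pages without a server-reported duration are skipped; the result is None
/// if no page reported one.
fn sum_backend_durations<I>(pages: I) -> Option<Duration>
where
    I: IntoIterator<Item = Option<Duration>>,
{
    pages
        .into_iter()
        .flatten() // drop pages where the header was absent
        .fold(None, |acc, d| {
            // Start from zero on the first reported value, saturating on overflow.
            Some(acc.unwrap_or(Duration::ZERO).saturating_add(d))
        })
}
```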
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Guard against u128→u64 truncation in histogram recording by clamping with .min(u64::MAX as u128) before cast
- Add division-by-zero guard for period_usec==0 and cores<=0.0 in cgroup CPU calculation
- Add 'cgroupv2' to sdk/cosmos/.cspell.json ignore list
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
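The clamping guard can be shown in isolation. Duration::as_micros returns u128, so a bare `as u64` cast would silently wrap for very large values. The function name below is hypothetical, and recording in microseconds is an assumption about the histogram's unit.

```rust
use std::time::Duration;

/// Converts a Duration to whole microseconds, saturating at u64::MAX
/// instead of wrapping when the u128 value does not fit.
fn duration_to_micros_clamped(d: Duration) -> u64 {
    d.as_micros().min(u64::MAX as u128) as u64
}
```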
Restore original 2-space indent and sort order, keeping diff to just the 3 added words (cgroupv, cgroupv2, usec).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Enhances the Cosmos DB Rust perf runner with two measurement improvements:
1. Backend (Server-Reported) Latency
Parses the `x-ms-request-duration-ms` response header and tracks it alongside the existing client-observed wall-clock latency. This separates network transit time from server processing time in performance reports.
2. Cgroup CPU Utilization (`cgroup_cpu_percent`)
Adds a new metric that reads cgroupv2 `cpu.stat` and `cpu.max` to compute CPU utilization relative to the container's allocated quota. This matches what `kubectl top pods` reports and replaces the misleading `system_cpu_percent` (which reads `/proc/stat` and shows host-level CPU, appearing artificially low in containers).
Changes
- `read_cgroup_cpu_percent()` function with delta-based measurement, division-by-zero guards, and safe u128→u64 clamping in histogram recording
- `cgroup_cpu_percent: Option<f32>` through `PerfResult` (serialized to Cosmos DB → ADX)
Testing
- `cosmos-perf-rg`: cgroup CPU reports ~78%, matching kubectl top
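A rough sketch of the delta-based cgroup calculation follows. It is not the PR's read_cgroup_cpu_percent(): the three function names are hypothetical, the functions take file contents as strings instead of reading /sys/fs/cgroup, and returning None for an unlimited quota ("max") is an assumption.

```rust
/// Parses cgroupv2 cpu.max ("<quota> <period>" or "max <period>") into the
/// number of cores the quota allows. None when unlimited or malformed.
fn parse_cpu_max_cores(contents: &str) -> Option<f64> {
    let mut parts = contents.split_whitespace();
    let quota = parts.next()?;
    let period_usec: u64 = parts.next()?.parse().ok()?;
    if quota == "max" || period_usec == 0 {
        return None; // no quota, or would divide by zero
    }
    let quota_usec: u64 = quota.parse().ok()?;
    let cores = quota_usec as f64 / period_usec as f64;
    (cores > 0.0).then_some(cores)
}

/// Extracts the cumulative usage_usec counter from cgroupv2 cpu.stat.
fn parse_usage_usec(cpu_stat: &str) -> Option<u64> {
    cpu_stat.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("usage_usec"), Some(v)) => v.parse().ok(),
            _ => None,
        }
    })
}

/// CPU time consumed between two readings, as a percentage of what the
/// quota allowed over the elapsed wall-clock interval.
fn cgroup_cpu_percent(
    prev_usage_usec: u64,
    curr_usage_usec: u64,
    elapsed_usec: u64,
    cores: f64,
) -> Option<f32> {
    if elapsed_usec == 0 || cores <= 0.0 {
        return None; // division-by-zero guards
    }
    let used = curr_usage_usec.saturating_sub(prev_usage_usec) as f64;
    Some((used / (elapsed_usec as f64 * cores) * 100.0) as f32)
}
```

Because usage_usec is a cumulative counter, the percentage only makes sense over the delta between two readings; a caller would keep the previous reading and its timestamp between reporting intervals.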