
[Feat] Add per-stage pipeline store and layerwise overlap metrics #933

Open
dante159753 wants to merge 16 commits into ModelEngine-Group:develop from dante159753:pipeline-layerwise-metrics

Conversation

dante159753 (Contributor) commented Apr 24, 2026

Purpose

Adds observability needed to diagnose pipeline store (Cache|Posix) per-tier performance and to verify that UCMLayerWiseConnector's load/forward/save overlap actually hides backend latency.

Modifications

Pipeline Store (C++ side):

  • Cache: per-task load/dump duration + bandwidth, queue wait, dispatch, backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss counters and instantaneous hit rate gauge.
  • Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure counters.

Layerwise Connector (Python side):

  • wait_blocking_ms: the primary signal for overlap health (near 0 means perfect overlap; if it tracks load_duration, the pipeline has degenerated to serial); see the measurement sketch below.
  • inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms, save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms, stalled_layers_total.
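
A minimal sketch of how these timings fit together (illustrative only: submit_load, wait_loaded, forward, and the update_stats signature are stand-ins for the connector's real hooks, and the exact metric keys may differ; only the metric names listed above come from this PR):

```python
import time

def run_layerwise_load(layers, submit_load, wait_loaded, forward, update_stats):
    """Illustrative sketch: measure how much of each layer's KV load is NOT
    hidden behind compute. The hooks are stand-ins, not UCM's actual API."""
    t0 = time.perf_counter()
    submit_load(layers[0])                      # first layer cannot overlap with anything
    update_stats({"layerwise_first_layer_submit_ms": (time.perf_counter() - t0) * 1e3})

    last_wait_end = time.perf_counter()
    stalled = 0
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            t = time.perf_counter()
            submit_load(layers[i + 1])          # prefetch the next layer before blocking
            update_stats({"layerwise_next_layer_submit_ms": (time.perf_counter() - t) * 1e3})

        t = time.perf_counter()
        inter_wait_ms = (t - last_wait_end) * 1e3   # compute time available to hide the load
        wait_loaded(layer)                          # blocks only if the load is not finished yet
        blocking_ms = (time.perf_counter() - t) * 1e3
        last_wait_end = time.perf_counter()
        stalled += blocking_ms > 1.0                # stall threshold is an assumption

        update_stats({
            "layerwise_wait_blocking_ms": blocking_ms,        # ~0: fully overlapped
            "layerwise_inter_wait_interval_ms": inter_wait_ms,
        })
        forward(layer)                              # compute overlaps the next layer's load

    update_stats({"layerwise_stalled_layers_total": stalled})
```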

Infrastructure:

  • Change the metrics library from STATIC to SHARED so cachestore.so, posixstore.so, and ucmmetrics.so share one Metrics singleton. With a STATIC metrics library, the function-local GetInstance() produced a separate instance in each .so, so all C++ UpdateStats() calls from the stores were silently dropped.
  • Set INSTALL_RPATH=$ORIGIN/../../shared/metrics on cachestore.so and posixstore.so; $ORIGIN on ucmmetrics.so.

Test

dante159753 and others added 5 commits April 27, 2026 11:18
Adds observability needed to diagnose pipeline store (Cache|Posix)
per-tier performance and to verify that UCMLayerWiseConnector's
load/forward/save overlap actually hides backend latency.

Pipeline Store (C++ side):
  - Cache: per-task load/dump duration + bandwidth, queue wait, dispatch,
    backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss
    counters and instantaneous hit rate gauge.
  - Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure
    counters.

Layerwise Connector (Python side):
  - wait_blocking_ms: the primary signal for overlap health (near 0 means
    perfect overlap; if it tracks load_duration, the pipeline has
    degenerated to serial).
  - inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms,
    save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms,
    stalled_layers_total.

Infrastructure:
  - Change the metrics library from STATIC to SHARED so cachestore.so,
    posixstore.so, and ucmmetrics.so share one Metrics singleton. With a
    STATIC metrics library, the function-local GetInstance() produced a
    separate instance in each .so, so all C++ UpdateStats() calls from
    the stores were silently dropped.
  - Set INSTALL_RPATH=\$ORIGIN/../../shared/metrics on cachestore.so and
    posixstore.so; \$ORIGIN on ucmmetrics.so.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fold a short UpdateStats call back onto one line per clang-format 20.
Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds 20 new panels to examples/metrics/grafana.json covering the metrics
introduced in the previous commit:

Pipeline / Cache stage (9 panels):
  Hit Rate (full width), Load/Dump Duration + Bandwidth, Backend Wait,
  H2D / D2H durations, Backend Submit Ratio.

Pipeline / Posix stage (4 panels):
  S2H and H2S bandwidth and duration.

Layerwise Connector (7 panels):
  Wait Blocking (full-width key metric), Inter-Wait Interval, Stalled
  Layers Rate, First Layer Submit, Save Tail, Next Layer Submit, Save
  Per-Layer Wait.

Thresholds are set on the most actionable panels: Hit Rate (red < 0.5,
green >= 0.8), Backend Submit Ratio (green < 0.3, red >= 0.7), Wait
Blocking (green 0, red >= 20 ms), Save Tail (green 0, red >= 50 ms).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds docs/source/user-guide/metrics/performance_analysis.md covering
diagnosis of the Cache|Posix pipeline store in both layerwise and
non-layerwise mode using the per-stage and layerwise metrics.

Sections:
  1. Architecture and load/dump data flow with metric annotations.
  2. Critical metrics ranked by diagnostic priority.
  3. Nine bottleneck playbooks (low hit rate, slow loads, slow Posix,
     dump back-pressure, no layerwise speedup, layerwise TTFT
     regression, layerwise save tail, non-layerwise dump-bound,
     worker pool starvation) - each with metric signature and
     concrete tunables.
  4. Layerwise vs non-layerwise diagnostic differences.
  5. PromQL recipes for hit rate, miss ratio, p99 decomposition,
     overlap loss, dump back-pressure, worker utilization.
  6. Tunables indexed by symptom.
  7. Honest list of what these metrics cannot tell you.

Wires the new page into the User Guide toctree in docs/source/index.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous ASCII flowcharts in performance_analysis.md misaligned in
the Sphinx HTML output because the box-drawing characters and CJK
punctuation have inconsistent monospace widths. Replace them with
Mermaid flowcharts:

  - Storage tier overview (vLLM Worker → CacheStore → PosixStore)
  - LOAD path with per-stage metric annotations on each node
    (queue waits, dispatch, posix S2H, backend wait, H2D, epilog)
  - Cache-hit fast path
  - DUMP path showing the user-visible chain plus the asynchronous
    BackendDumpStage / Posix H2S branch with a dashed edge

Color-codes nodes by tier (Cache blue, Posix orange, completion green)
so the tier hand-offs are visible at a glance.

Wire-up:
  - Add sphinxcontrib-mermaid to docs/requirements-docs.txt.
  - Register the extension in docs/source/conf.py.
  - Set myst_fence_as_directive = ["mermaid"] so plain ```mermaid
    fences work both on GitHub (native rendering) and on Sphinx /
    ReadTheDocs.

Drop the now-unused 'promql' language tag from PromQL examples - the
default Pygments PromQL lexer rejects the colon in 'ucm:metric_name'
and emitted highlighting warnings on every build.

Verified locally with `sphinx-build -W`: the new page builds without
warnings; mermaid blocks render as <pre class="mermaid"> for the
client-side mermaid.js to pick up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dante159753 force-pushed the pipeline-layerwise-metrics branch from a56840a to 3fe0dd4 on April 27, 2026 03:21
dante159753 and others added 11 commits April 27, 2026 11:30
… range

The previous Posix bandwidth buckets (0.05, 0.1, 0.2, 0.5, 1, 2, 4, 8,
12, 16, 24, 32) had only four sample points across the entire range
where actual production performance lives, so p50/p90/p99 collapsed to
a single bucket and changes within the band were invisible.

New layout:
  - 0.05 / 0.1 / 0.2 / 0.5  -> degraded paths
  - 1, 1.5, 2, 2.5, 3, 3.5, 4  -> 0.5 GB/s steps (slow/saturated NVMe)
  - 5, 6, 7, 8, 9, 10, 11, 12  -> 1 GB/s steps (typical)
  - 14, 16, 20, 24, 32  -> sparse headroom

24 buckets total per metric (previously 12). Applied to both
pipeline_posix_s2h_bandwidth_gbps (read) and
pipeline_posix_h2s_bandwidth_gbps (write).
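
Written out as a flat list for reference (purely illustrative: the variable name is made up and this commit does not show where the buckets are declared; the values are the ones listed above):

```python
# The 24-bucket layout described above, in GB/s.
POSIX_BANDWIDTH_BUCKETS_GBPS = [
    0.05, 0.1, 0.2, 0.5,            # degraded paths
    1, 1.5, 2, 2.5, 3, 3.5, 4,      # 0.5 GB/s steps (slow/saturated NVMe)
    5, 6, 7, 8, 9, 10, 11, 12,      # 1 GB/s steps (typical)
    14, 16, 20, 24, 32,             # sparse headroom
]
assert len(POSIX_BANDWIDTH_BUCKETS_GBPS) == 24
```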

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each of the 10 critical latency / bandwidth histograms gets:
  1. A heatmap panel showing the full distribution shape over time
     (cluster-wide aggregation, full bucket density visible).
  2. A p50 / p90 / p99 time-series panel with per-worker breakdown,
     line styles distinguishing the three quantiles (p50 solid, p90
     dashed, p99 thick dashed).

Metrics covered:
  Cache:     load_duration_ms, dump_duration_ms,
             load_backend_wait_duration_ms, load_bandwidth_gbps,
             dump_bandwidth_gbps
  Posix:     s2h_duration_ms, h2s_duration_ms,
             s2h_bandwidth_gbps, h2s_bandwidth_gbps
  Layerwise: wait_blocking_ms

Grouped into three collapsible row sections (Cache / Posix /
Layerwise), collapsed by default so the existing dashboard scrolls
unchanged. Adds 23 panels (3 rows + 20 children); existing 29 panels
untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every timeseries panel had spanNulls=false, which combined with
showPoints=auto rendered intermittent metrics (cache miss only,
layerwise-only, dump on tp_rank=0 only, NaN from rate(_sum)/rate(_count)
when count=0, histogram_quantile with no observations) as scattered
discrete points instead of continuous lines.

Set spanNulls to 60000 ms across all 39 panels: gaps under one minute
are bridged so normal "quiet window" sparseness reads as a smooth line,
while real outages longer than 60s still break the line and remain
visible.

No query, color, or layout changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cache and Posix stores can be used standalone (Posix can run without
Cache; Cache always sits on top of some backend but isn't pipeline-
specific), so the pipeline_ prefix on their metric names misrepresented
the binding. The pipeline_ framing only makes sense for the composite
PipelineStore wrapper, not for the underlying stores' own
instrumentation.

Renames (181 references across 8 files):
  pipeline_cache_*  ->  cache_*
  pipeline_posix_*  ->  posix_*

Touched:
  - examples/metrics/metrics_configs.yaml  (registration)
  - examples/metrics/grafana.json          (panel queries)
  - docs/source/user-guide/metrics/performance_analysis.md (prose & PromQL)
  - ucm/store/cache/cc/{trans_manager.h,buffer_manager.h,
                        load_queue.cc,dump_queue.cc} (UpdateStats calls)
  - ucm/store/posix/cc/trans_queue.cc      (UpdateStats calls)

Plus clang-format reflow on five C++ files where the now-shorter
metric-name string literals fit back onto one line.

layerwise_* metrics keep their prefix - they live in the connector
layer, not the store layer, and the prefix correctly identifies
the layerwise overlap mechanism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
quantiles their description promises

Panels 17/18/21/22 (Connector Load/Save Duration/Speed) all advertised
"P50, P90, P95, P99 and Average" in their description, but each ran a
single rate(_sum)/rate(_count) target that only produces the average.
The percentiles were fictional.

Replace each panel's single avg target with five targets:
  A: p50 (histogram_quantile(0.5, sum by (le, worker_id) (rate(_bucket))))
  B: p90 (histogram_quantile(0.9, ...))
  C: p95 (histogram_quantile(0.95, ...))
  D: p99 (histogram_quantile(0.99, ...))
  E: avg (the original rate(_sum)/rate(_count))

Per-worker breakdown so an outlier worker stands out as its own line.
Distinct line styles (p50 solid, p90/p95/p99 dashed at decreasing
period, p99 thicker, avg solid with 5% fill) keep the multi-series
panel readable. Legend switched to table mode so worker rows can be
scanned at a glance.

This was a pre-existing issue: the Cache|Posix metric prefix rename in
the previous commit was the only recent change near these panels, and
the description<->query mismatch was not introduced by it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 17 main-level latency and throughput panels (Cache / Posix /
Layerwise) previously rendered only the rate(_sum)/rate(_count)
average, so a single hot tail or a slow worker was invisible. Each now
exposes four lines per worker: p50 (solid), p90 (dashed 10-6), p99
(dashed 4-4, thicker), avg (solid with 5% fill).

Panels touched:
  Cache:     load_duration, load_bandwidth, dump_duration,
             dump_bandwidth, load_backend_wait, h2d_duration,
             d2h_duration  (7 panels)
  Posix:     s2h_bandwidth, h2s_bandwidth, s2h_duration,
             h2s_duration  (4 panels)
  Layerwise: wait_blocking, inter_wait_interval, first_layer_submit,
             save_tail_total, next_layer_submit,
             save_per_layer_wait  (6 panels)

Skipped (intentionally - not latency/throughput):
  - Hit-rate / counter-rate / count panels (id 14-16, 19-20, 100,
    108, 115).
  - Distribution-row heatmaps (id 230, 232, ... 248) which already
    show the full shape.
  - Distribution-row dedicated quantile panels (id 231, 233, ... 249)
    which already render p50/p90/p99 - now somewhat redundant with the
    upgraded main panels but kept for the focused deep-dive view
    inside the collapsible distribution rows.

Legend switched to table mode so worker rows can be scanned at a
glance; tooltip set to multi-series sorted desc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
prefix from titles

Two cleanups in one pass.

1) Remove redundant quantile panels from distribution rows.
   The previous commit gave every main-level latency / throughput
   panel its own p50/p90/p99 lines (commit de667fb), which made the
   dedicated p50/p90/p99 panels inside the collapsed distribution
   rows duplicate work. Drop those 10 panels (id 231, 233, 235, 237,
   239, 241, 243, 245, 247, 249) and lay out the remaining heatmaps
   2-up at w=12 (with the Cache backend_wait and Layerwise blocking
   heatmaps full-width because they are alone on their row).

   Total panels: 52 -> 42 (top-level unchanged at 32 since rows are
   counted as containers).

2) Strip leftover "Pipeline" wording from titles to match the metric
   rename in commit acd5af0:
     - "Pipeline / Cache Load Duration" -> "Cache Load Duration"
     - "Pipeline Cache -- Distributions" -> "Cache -- Distributions"
     - "Cache / Cache Load Duration (heatmap)" -> "Cache / Load Duration"
     - "(heatmap)" suffix dropped since rows now contain only heatmaps.
   The Layerwise / * panel titles are unchanged - layerwise is the
   correct prefix for those metrics.

   Queries themselves were already migrated and contain no
   pipeline_* references.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four count panels (Connector Load/Save Requests/Blocks Num) used
rate(_sum)/rate(_count) which yields 0/0 NaN whenever no batch
happened in the rate window, so the dashboard frequently showed gaps
or single isolated points even though the underlying metrics were
healthy.

Each panel is now split into two:

  Rate panel (existing id 15/16/19/20, renamed):
    expr: rate(ucm:METRIC_count[$__rate_interval])
    unit: ops (events/sec)
    Always defined when there is any activity in the window - no more
    NaN gaps. Per worker, single line.

  Size distribution panel (new id 130-133):
    p50 / p90 / p99 / avg of per-batch value (request count or block
    count). Same quantile + avg multi-line treatment as the duration
    and bandwidth panels.

Layout shift: the new size-distribution rows sit right under their
sibling rate panel. All panels with y >= 8 shifted by +8 to make room
for the load distributions; everything with y >= 24 shifted by +16 to
also accommodate the save distributions. Distribution-row sections
(id 200/210/220) re-packed so rows sit immediately after the previous
row's last child (closing a 16-grid-unit slack inherited from the
earlier cleanup commit).

Final connector section:
  Hit Rate (full width)
  Load Req Rate    | Load Blk Rate
  Load Req Size    | Load Blk Size
  Load Duration    | Load Speed
  Save Req Rate    | Save Blk Rate
  Save Req Size    | Save Blk Size
  Save Duration    | Save Speed

Top-level panel count: 32 -> 36 (4 new dist panels).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The single grafana.json had grown to 36 top-level panels + 10 nested
heatmap children (52 total, ~7000 lines), too big to edit ergonomically
and mixing concerns (overview / store-tier diagnosis / advanced
layerwise) for different audiences.

Drop 2 redundant panels:
  - id=16 Connector Load Blocks Rate
  - id=20 Connector Save Blocks Rate
Both queries were rate(_count) of metrics observed in the same
update_stats({...}) call as their Requests Rate siblings, producing
mathematically identical time series. The size-distribution panels for
the same metrics (130/131/132/133) are NOT redundant and stay.
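
A hedged sketch of why those two rate(_count) series coincide (the metric key names and the function shown are illustrative assumptions; only the update_stats({...}) call shape comes from the text above):

```python
def record_load_batch(update_stats, requests, total_blocks, elapsed_ms):
    # One update_stats({...}) call per batch: every histogram observed here
    # gets its _count incremented exactly once per batch, so
    # rate(<metric>_count[...]) is the same "batches per second" series for
    # the request-count and block-count metrics. Their observed values (and
    # therefore the per-batch size distributions) still differ, which is why
    # the size panels are kept.
    update_stats({
        "load_requests_num": len(requests),   # hypothetical metric names
        "load_blocks_num": total_blocks,
        "load_duration_ms": elapsed_ms,
    })
```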

Split the remaining ~50 panels into three module dashboards under
examples/metrics/:

  grafana_connector.json (11 panels)
    Audience: anyone running UCM. Top-level activity, hit rate,
    per-batch sizes, end-to-end load/save durations and speeds.

  grafana_pipeline_store.json (13 main + 11 in collapsible
    distribution rows = 24 total)
    Audience: people diagnosing storage tier perf. Cache hit
    rate / backend submit ratio at top, then per-stage Cache and
    Posix latency / bandwidth, then Cache + Posix distribution
    heatmaps in collapsible rows.

  grafana_layerwise.json (8 main + 1 in collapsible row = 9 total)
    Audience: layerwise mode users. Wait_blocking key signal full
    width at top, then stalls / submit costs / save tail, plus
    a layerwise wait_blocking heatmap.

Per-dashboard hygiene:
  - Fresh uid (ucm-connector-overview / ucm-pipeline-store /
    ucm-layerwise).
  - version=1, panel ids renumbered from 1, gridPos repacked from y=0.
  - Tagged ucm + <module>; each carries an "Other UCM dashboards"
    dropdown link in the header that auto-discovers siblings by tag.
  - Cache Hit Rate full-width at top of pipeline_store so the 9 Cache
    panels pair cleanly without bleeding into the Posix section.
  - templating, time, refresh copied verbatim from the original.

Documentation:
  - docs/source/user-guide/metrics/metrics.md: replace the single
    "Import Dashboard" section with a "Pick the dashboard you need"
    table.
  - docs/source/developer-guide/add_metrics.md: update the
    "add a new panel" pointer to the right module dashboard.

Verified: JSON validity, panel id uniqueness within each dashboard,
all metric refs resolve in metrics_configs.yaml, no leftover standalone
rate(_count) for the dropped duplicate metrics, original grafana.json
removed, pre-commit clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>