
[Feat] Add per-stage pipeline store and layerwise overlap metrics #933

Open
dante159753 wants to merge 16 commits into ModelEngine-Group:develop from dante159753:pipeline-layerwise-metrics

Conversation

dante159753 (Contributor) commented Apr 24, 2026

Purpose

Adds observability needed to diagnose pipeline store (Cache|Posix) per-tier performance and to verify that UCMLayerWiseConnector's load/forward/save overlap actually hides backend latency.

Modifications

Pipeline Store (C++ side):

  • Cache: per-task load/dump duration + bandwidth, queue wait, dispatch, backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss counters and instantaneous hit rate gauge.
  • Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure counters.

Layerwise Connector (Python side):

  • wait_blocking_ms: the primary signal for overlap health (near 0 means perfect overlap; if it tracks load_duration, the pipeline has degenerated to serial); see the measurement sketch below.
  • inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms, save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms, stalled_layers_total.
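
A minimal sketch of how these timings fit together (illustrative only: submit_load, wait_loaded, forward, and the update_stats signature are stand-ins for the connector's real hooks, and the exact metric keys may differ; only the metric names listed above come from this PR):

```python
import time

def run_layerwise_load(layers, submit_load, wait_loaded, forward, update_stats):
    """Illustrative sketch: measure how much of each layer's KV load is NOT
    hidden behind compute. The hooks are stand-ins, not UCM's actual API."""
    t0 = time.perf_counter()
    submit_load(layers[0])                      # first layer cannot overlap with anything
    update_stats({"layerwise_first_layer_submit_ms": (time.perf_counter() - t0) * 1e3})

    last_wait_end = time.perf_counter()
    stalled = 0
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            t = time.perf_counter()
            submit_load(layers[i + 1])          # prefetch the next layer before blocking
            update_stats({"layerwise_next_layer_submit_ms": (time.perf_counter() - t) * 1e3})

        t = time.perf_counter()
        inter_wait_ms = (t - last_wait_end) * 1e3   # compute time available to hide the load
        wait_loaded(layer)                          # blocks only if the load is not finished yet
        blocking_ms = (time.perf_counter() - t) * 1e3
        last_wait_end = time.perf_counter()
        stalled += blocking_ms > 1.0                # stall threshold is an assumption

        update_stats({
            "layerwise_wait_blocking_ms": blocking_ms,        # ~0: fully overlapped
            "layerwise_inter_wait_interval_ms": inter_wait_ms,
        })
        forward(layer)                              # compute overlaps the next layer's load

    update_stats({"layerwise_stalled_layers_total": stalled})
```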

Infrastructure:

  • Change the metrics library from STATIC to SHARED so cachestore.so, posixstore.so, and ucmmetrics.so share one Metrics singleton. With a STATIC metrics library, the function-local GetInstance() produced a separate instance in each .so, so all C++ UpdateStats() calls from the stores were silently dropped.
  • Set INSTALL_RPATH=$ORIGIN/../../shared/metrics on cachestore.so and posixstore.so; $ORIGIN on ucmmetrics.so.

Test

dante159753 and others added 5 commits April 27, 2026 11:18
Adds observability needed to diagnose pipeline store (Cache|Posix)
per-tier performance and to verify that UCMLayerWiseConnector's
load/forward/save overlap actually hides backend latency.

Pipeline Store (C++ side):
  - Cache: per-task load/dump duration + bandwidth, queue wait, dispatch,
    backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss
    counters and instantaneous hit rate gauge.
  - Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure
    counters.

Layerwise Connector (Python side):
  - wait_blocking_ms: the primary signal for overlap health (near 0 means
    perfect overlap; if it tracks load_duration, the pipeline has
    degenerated to serial).
  - inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms,
    save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms,
    stalled_layers_total.

Infrastructure:
  - Change the metrics library from STATIC to SHARED so cachestore.so,
    posixstore.so, and ucmmetrics.so share one Metrics singleton. With a
    STATIC metrics library, the function-local GetInstance() produced a
    separate instance in each .so, so all C++ UpdateStats() calls from
    the stores were silently dropped.
  - Set INSTALL_RPATH=\$ORIGIN/../../shared/metrics on cachestore.so and
    posixstore.so; \$ORIGIN on ucmmetrics.so.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fold a short UpdateStats call back onto one line per clang-format 20.
Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds 20 new panels to examples/metrics/grafana.json covering the metrics
introduced in the previous commit:

Pipeline / Cache stage (9 panels):
  Hit Rate (full width), Load/Dump Duration + Bandwidth, Backend Wait,
  H2D / D2H durations, Backend Submit Ratio.

Pipeline / Posix stage (4 panels):
  S2H and H2S bandwidth and duration.

Layerwise Connector (7 panels):
  Wait Blocking (full-width key metric), Inter-Wait Interval, Stalled
  Layers Rate, First Layer Submit, Save Tail, Next Layer Submit, Save
  Per-Layer Wait.

Thresholds are set on the most actionable panels: Hit Rate (red < 0.5,
green >= 0.8), Backend Submit Ratio (green < 0.3, red >= 0.7), Wait
Blocking (green 0, red >= 20 ms), Save Tail (green 0, red >= 50 ms).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds docs/source/user-guide/metrics/performance_analysis.md covering
diagnosis of the Cache|Posix pipeline store in both layerwise and
non-layerwise mode using the per-stage and layerwise metrics.

Sections:
  1. Architecture and load/dump data flow with metric annotations.
  2. Critical metrics ranked by diagnostic priority.
  3. Nine bottleneck playbooks (low hit rate, slow loads, slow Posix,
     dump back-pressure, no layerwise speedup, layerwise TTFT
     regression, layerwise save tail, non-layerwise dump-bound,
     worker pool starvation) - each with metric signature and
     concrete tunables.
  4. Layerwise vs non-layerwise diagnostic differences.
  5. PromQL recipes for hit rate, miss ratio, p99 decomposition,
     overlap loss, dump back-pressure, worker utilization.
  6. Tunables indexed by symptom.
  7. Honest list of what these metrics cannot tell you.

Wires the new page into the User Guide toctree in docs/source/index.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous ASCII flowcharts in performance_analysis.md misaligned in
the Sphinx HTML output because the box-drawing characters and CJK
punctuation have inconsistent monospace widths. Replace them with
Mermaid flowcharts:

  - Storage tier overview (vLLM Worker → CacheStore → PosixStore)
  - LOAD path with per-stage metric annotations on each node
    (queue waits, dispatch, posix S2H, backend wait, H2D, epilog)
  - Cache-hit fast path
  - DUMP path showing the user-visible chain plus the asynchronous
    BackendDumpStage / Posix H2S branch with a dashed edge

Color-codes nodes by tier (Cache blue, Posix orange, completion green)
so the tier hand-offs are visible at a glance.

Wire-up:
  - Add sphinxcontrib-mermaid to docs/requirements-docs.txt.
  - Register the extension in docs/source/conf.py.
  - Set myst_fence_as_directive = ["mermaid"] so plain ```mermaid
    fences work both on GitHub (native rendering) and on Sphinx /
    ReadTheDocs.

Drop the now-unused 'promql' language tag from PromQL examples - the
default Pygments PromQL lexer rejects the colon in 'ucm:metric_name'
and emitted highlighting warnings on every build.

Verified locally with `sphinx-build -W`: the new page builds without
warnings; mermaid blocks render as <pre class="mermaid"> for the
client-side mermaid.js to pick up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dante159753 force-pushed the pipeline-layerwise-metrics branch from a56840a to 3fe0dd4 on April 27, 2026 03:21
dante159753 and others added 11 commits April 27, 2026 11:30
… range

The previous Posix bandwidth buckets (0.05, 0.1, 0.2, 0.5, 1, 2, 4, 8,
12, 16, 24, 32) had only four sample points across the entire range
where actual production performance lives, so p50/p90/p99 collapsed to
a single bucket and changes within the band were invisible.

New layout:
  - 0.05 / 0.1 / 0.2 / 0.5  -> degraded paths
  - 1, 1.5, 2, 2.5, 3, 3.5, 4  -> 0.5 GB/s steps (slow/saturated NVMe)
  - 5, 6, 7, 8, 9, 10, 11, 12  -> 1 GB/s steps (typical)
  - 14, 16, 20, 24, 32  -> sparse headroom

24 buckets total per metric (previously 12). Applied to both
pipeline_posix_s2h_bandwidth_gbps (read) and
pipeline_posix_h2s_bandwidth_gbps (write).
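
Written out as a flat list for reference (purely illustrative: the variable name is made up and this commit does not show where the buckets are declared; the values are the ones listed above):

```python
# The 24-bucket layout described above, in GB/s.
POSIX_BANDWIDTH_BUCKETS_GBPS = [
    0.05, 0.1, 0.2, 0.5,            # degraded paths
    1, 1.5, 2, 2.5, 3, 3.5, 4,      # 0.5 GB/s steps (slow/saturated NVMe)
    5, 6, 7, 8, 9, 10, 11, 12,      # 1 GB/s steps (typical)
    14, 16, 20, 24, 32,             # sparse headroom
]
assert len(POSIX_BANDWIDTH_BUCKETS_GBPS) == 24
```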

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each of the 10 critical latency / bandwidth histograms gets:
  1. A heatmap panel showing the full distribution shape over time
     (cluster-wide aggregation, full bucket density visible).
  2. A p50 / p90 / p99 time-series panel with per-worker breakdown,
     line styles distinguishing the three quantiles (p50 solid, p90
     dashed, p99 thick dashed).

Metrics covered:
  Cache:     load_duration_ms, dump_duration_ms,
             load_backend_wait_duration_ms, load_bandwidth_gbps,
             dump_bandwidth_gbps
  Posix:     s2h_duration_ms, h2s_duration_ms,
             s2h_bandwidth_gbps, h2s_bandwidth_gbps
  Layerwise: wait_blocking_ms

Grouped into three collapsible row sections (Cache / Posix /
Layerwise), collapsed by default so the existing dashboard scrolls
unchanged. Adds 23 panels (3 rows + 20 children); existing 29 panels
untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every timeseries panel had spanNulls=false, which combined with
showPoints=auto rendered intermittent metrics (cache miss only,
layerwise-only, dump on tp_rank=0 only, NaN from rate(_sum)/rate(_count)
when count=0, histogram_quantile with no observations) as scattered
discrete points instead of continuous lines.

Set spanNulls to 60000 ms across all 39 panels: gaps under one minute
are bridged so normal "quiet window" sparseness reads as a smooth line,
while real outages longer than 60s still break the line and remain
visible.

No query, color, or layout changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cache and Posix stores can be used standalone (Posix can run without
Cache; Cache always sits on top of some backend but isn't pipeline-
specific), so the pipeline_ prefix on their metric names misrepresented
the binding. The pipeline_ framing only makes sense for the composite
PipelineStore wrapper, not for the underlying stores' own
instrumentation.

Renames (181 references across 8 files):
  pipeline_cache_*  ->  cache_*
  pipeline_posix_*  ->  posix_*

Touched:
  - examples/metrics/metrics_configs.yaml  (registration)
  - examples/metrics/grafana.json          (panel queries)
  - docs/source/user-guide/metrics/performance_analysis.md (prose & PromQL)
  - ucm/store/cache/cc/{trans_manager.h,buffer_manager.h,
                        load_queue.cc,dump_queue.cc} (UpdateStats calls)
  - ucm/store/posix/cc/trans_queue.cc      (UpdateStats calls)

Plus clang-format reflow on five C++ files where the now-shorter
metric-name string literals fit back onto one line.

layerwise_* metrics keep their prefix - they live in the connector
layer, not the store layer, and the prefix correctly identifies
the layerwise overlap mechanism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
quantiles their description promises

Panels 17/18/21/22 (Connector Load/Save Duration/Speed) all advertised
"P50, P90, P95, P99 and Average" in their description, but each ran a
single rate(_sum)/rate(_count) target that only produces the average.
The percentiles were fictional.

Replace each panel's single avg target with five targets:
  A: p50 (histogram_quantile(0.5, sum by (le, worker_id) (rate(_bucket))))
  B: p90 (histogram_quantile(0.9, ...))
  C: p95 (histogram_quantile(0.95, ...))
  D: p99 (histogram_quantile(0.99, ...))
  E: avg (the original rate(_sum)/rate(_count))

Per-worker breakdown so an outlier worker stands out as its own line.
Distinct line styles (p50 solid, p90/p95/p99 dashed at decreasing
period, p99 thicker, avg solid with 5% fill) keep the multi-series
panel readable. Legend switched to table mode so worker rows can be
scanned at a glance.

This was a pre-existing issue: the Cache|Posix metric prefix rename in
the previous commit was the only recent change near these panels, and
the description<->query mismatch was not introduced by it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 17 main-level latency and throughput panels (Cache / Posix /
Layerwise) previously rendered only the rate(_sum)/rate(_count)
average, so a single hot tail or a slow worker was invisible. Each now
exposes four lines per worker: p50 (solid), p90 (dashed 10-6), p99
(dashed 4-4, thicker), avg (solid with 5% fill).

Panels touched:
  Cache:     load_duration, load_bandwidth, dump_duration,
             dump_bandwidth, load_backend_wait, h2d_duration,
             d2h_duration  (7 panels)
  Posix:     s2h_bandwidth, h2s_bandwidth, s2h_duration,
             h2s_duration  (4 panels)
  Layerwise: wait_blocking, inter_wait_interval, first_layer_submit,
             save_tail_total, next_layer_submit,
             save_per_layer_wait  (6 panels)

Skipped (intentionally - not latency/throughput):
  - Hit-rate / counter-rate / count panels (id 14-16, 19-20, 100,
    108, 115).
  - Distribution-row heatmaps (id 230, 232, ... 248) which already
    show the full shape.
  - Distribution-row dedicated quantile panels (id 231, 233, ... 249)
    which already render p50/p90/p99 - now somewhat redundant with the
    upgraded main panels but kept for the focused deep-dive view
    inside the collapsible distribution rows.

Legend switched to table mode so worker rows can be scanned at a
glance; tooltip set to multi-series sorted desc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
prefix from titles

Two cleanups in one pass.

1) Remove redundant quantile panels from distribution rows.
   The previous commit gave every main-level latency / throughput
   panel its own p50/p90/p99 lines (commit de667fb), which made the
   dedicated p50/p90/p99 panels inside the collapsed distribution
   rows duplicate work. Drop those 10 panels (id 231, 233, 235, 237,
   239, 241, 243, 245, 247, 249) and lay out the remaining heatmaps
   2-up at w=12 (with the Cache backend_wait and Layerwise blocking
   heatmaps full-width because they are alone on their row).

   Total panels: 52 -> 42 (top-level unchanged at 32 since rows are
   counted as containers).

2) Strip leftover "Pipeline" wording from titles to match the metric
   rename in commit acd5af0:
     - "Pipeline / Cache Load Duration" -> "Cache Load Duration"
     - "Pipeline Cache -- Distributions" -> "Cache -- Distributions"
     - "Cache / Cache Load Duration (heatmap)" -> "Cache / Load Duration"
     - "(heatmap)" suffix dropped since rows now contain only heatmaps.
   The Layerwise / * panel titles are unchanged - layerwise is the
   correct prefix for those metrics.

   Queries themselves were already migrated and contain no
   pipeline_* references.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four count panels (Connector Load/Save Requests/Blocks Num) used
rate(_sum)/rate(_count) which yields 0/0 NaN whenever no batch
happened in the rate window, so the dashboard frequently showed gaps
or single isolated points even though the underlying metrics were
healthy.

Each panel is now split into two:

  Rate panel (existing id 15/16/19/20, renamed):
    expr: rate(ucm:METRIC_count[$__rate_interval])
    unit: ops (events/sec)
    Always defined when there is any activity in the window - no more
    NaN gaps. Per worker, single line.

  Size distribution panel (new id 130-133):
    p50 / p90 / p99 / avg of per-batch value (request count or block
    count). Same quantile + avg multi-line treatment as the duration
    and bandwidth panels.

Layout shift: the new size-distribution rows sit right under their
sibling rate panel. All panels with y >= 8 shifted by +8 to make room
for the load distributions; everything with y >= 24 shifted by +16 to
also accommodate the save distributions. Distribution-row sections
(id 200/210/220) re-packed so rows sit immediately after the previous
row's last child (closing a 16-grid-unit slack inherited from the
earlier cleanup commit).

Final connector section:
  Hit Rate (full width)
  Load Req Rate    | Load Blk Rate
  Load Req Size    | Load Blk Size
  Load Duration    | Load Speed
  Save Req Rate    | Save Blk Rate
  Save Req Size    | Save Blk Size
  Save Duration    | Save Speed

Top-level panel count: 32 -> 36 (4 new dist panels).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The single grafana.json had grown to 36 top-level panels + 10 nested
heatmap children (52 total, ~7000 lines), too big to edit ergonomically
and mixing concerns (overview / store-tier diagnosis / advanced
layerwise) for different audiences.

Drop 2 redundant panels:
  - id=16 Connector Load Blocks Rate
  - id=20 Connector Save Blocks Rate
Both queries were rate(_count) of metrics observed in the same
update_stats({...}) call as their Requests Rate siblings, producing
mathematically identical time series. The size-distribution panels for
the same metrics (130/131/132/133) are NOT redundant and stay.
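
A hedged sketch of why those two rate(_count) series coincide (the metric key names and the function shown are illustrative assumptions; only the update_stats({...}) call shape comes from the text above):

```python
def record_load_batch(update_stats, requests, total_blocks, elapsed_ms):
    # One update_stats({...}) call per batch: every histogram observed here
    # gets its _count incremented exactly once per batch, so
    # rate(<metric>_count[...]) is the same "batches per second" series for
    # the request-count and block-count metrics. Their observed values (and
    # therefore the per-batch size distributions) still differ, which is why
    # the size panels are kept.
    update_stats({
        "load_requests_num": len(requests),   # hypothetical metric names
        "load_blocks_num": total_blocks,
        "load_duration_ms": elapsed_ms,
    })
```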

Split the remaining ~50 panels into three module dashboards under
examples/metrics/:

  grafana_connector.json (11 panels)
    Audience: anyone running UCM. Top-level activity, hit rate,
    per-batch sizes, end-to-end load/save durations and speeds.

  grafana_pipeline_store.json (13 main + 11 in collapsible
    distribution rows = 24 total)
    Audience: people diagnosing storage tier perf. Cache hit
    rate / backend submit ratio at top, then per-stage Cache and
    Posix latency / bandwidth, then Cache + Posix distribution
    heatmaps in collapsible rows.

  grafana_layerwise.json (8 main + 1 in collapsible row = 9 total)
    Audience: layerwise mode users. Wait_blocking key signal full
    width at top, then stalls / submit costs / save tail, plus
    a layerwise wait_blocking heatmap.

Per-dashboard hygiene:
  - Fresh uid (ucm-connector-overview / ucm-pipeline-store /
    ucm-layerwise).
  - version=1, panel ids renumbered from 1, gridPos repacked from y=0.
  - Tagged ucm + <module>; each carries an "Other UCM dashboards"
    dropdown link in the header that auto-discovers siblings by tag.
  - Cache Hit Rate full-width at top of pipeline_store so the 9 Cache
    panels pair cleanly without bleeding into the Posix section.
  - templating, time, refresh copied verbatim from the original.

Documentation:
  - docs/source/user-guide/metrics/metrics.md: replace the single
    "Import Dashboard" section with a "Pick the dashboard you need"
    table.
  - docs/source/developer-guide/add_metrics.md: update the
    "add a new panel" pointer to the right module dashboard.

Verified: JSON validity, panel id uniqueness within each dashboard,
all metric refs resolve in metrics_configs.yaml, no leftover standalone
rate(_count) for the dropped duplicate metrics, original grafana.json
removed, pre-commit clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>