88 changes: 57 additions & 31 deletions docs/source/user-guide/metrics/metrics.md
@@ -2,21 +2,24 @@

UCM exports metrics through the vLLM `/metrics` endpoint. The metrics are
registered from `examples/metrics/metrics_configs.yaml`, accumulated inside UCM,
-and exposed through `prometheus_client` in Prometheus multiprocess mode.
+and fanned out to the enabled Python-side consumers.

## How Metrics Flow

1. `metrics_configs.yaml` defines counters, gauges, and histograms.
-2. `PrometheusStatsLogger` creates matching `prometheus_client` metrics with
-`model_name` and `worker_id` labels.
-3. Histogram bucket boundaries are taken from the Python Prometheus histogram
+2. The Python metrics dispatcher drains the C++ metrics snapshot once and fans
+it out to the enabled `multiproc` and `vllm_connector` consumers.
+3. `multiproc` creates `prometheus_client` metrics with `model_name` and
+`worker_id` labels. `vllm_connector` creates vLLM KV connector metrics with
+`model_name`, `engine`, and `worker_rank` labels.
+4. Histogram bucket boundaries are taken from the Python Prometheus histogram
and registered into the C++ metrics library.
-4. UCM code calls `UpdateStats()` on the hot path.
-5. The C++ metrics library records counter, gauge, and histogram bucket deltas in
+5. UCM code calls `UpdateStats()` on the hot path.
+6. The C++ metrics library records counter, gauge, and histogram bucket deltas in
per-thread double buffers.
-6. Every `log_interval` seconds, the observability thread calls
-`get_all_stats_and_clear()` and applies the deltas to `prometheus_client`.
-7. vLLM exposes the resulting cumulative Prometheus series through `/metrics`.
+7. The dispatcher applies deltas to each enabled Python consumer without one
+consumer clearing the other's accumulated snapshot.
+8. vLLM exposes the resulting cumulative Prometheus series through `/metrics`.
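
As a rough illustration of the drain-once, fan-out behavior described in the new steps 2 and 7, the sketch below takes one snapshot of pending deltas and hands a copy to every enabled consumer. `FakeMetricsSource`, `RecordingConsumer`, and `dispatch_once` are hypothetical names invented for this example, not UCM APIs; only the pattern itself comes from the text.

```python
from typing import Dict, List, Protocol


class Consumer(Protocol):
    def apply(self, deltas: Dict[str, float]) -> None: ...


class FakeMetricsSource:
    """Hypothetical stand-in for the C++ metrics library."""

    def __init__(self) -> None:
        self._pending: Dict[str, float] = {}

    def add(self, name: str, delta: float) -> None:
        self._pending[name] = self._pending.get(name, 0.0) + delta

    def drain(self) -> Dict[str, float]:
        # Hand back the accumulated deltas and reset, so each delta is reported once.
        snapshot, self._pending = self._pending, {}
        return snapshot


class RecordingConsumer:
    """Hypothetical consumer that accumulates the deltas it receives."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}

    def apply(self, deltas: Dict[str, float]) -> None:
        for name, delta in deltas.items():
            self.totals[name] = self.totals.get(name, 0.0) + delta


def dispatch_once(source: FakeMetricsSource, consumers: List[Consumer]) -> None:
    snapshot = source.drain()  # single drain per interval
    for consumer in consumers:
        # Each consumer gets its own copy, so none can clear or mutate
        # what another consumer sees.
        consumer.apply(dict(snapshot))
```

Draining inside each consumer instead would let the first consumer empty the snapshot before the second one reads it, which is the failure mode step 7 rules out.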

Histograms are bucketed at update time. UCM no longer stores raw histogram
sample vectors, so there is no `histogram_max_length` setting and no histogram
@@ -83,14 +86,14 @@ vllm bench serve \
--ignore-eos
```

-Check that UCM metrics are present:
+Check that UCM vLLM connector metrics are present:

```bash
-curl http://<vllm-worker-ip>:8000/metrics | grep ucm:
+curl http://<vllm-worker-ip>:8000/metrics | grep 'ucm:'
```

-Prometheus multiprocess `.db` files should also appear in
-`$PROMETHEUS_MULTIPROC_DIR`.
+If the `multiproc` consumer is enabled, Prometheus multiprocess `.db` files
+should also appear in `$PROMETHEUS_MULTIPROC_DIR`.
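
The `/metrics` check can also be scripted. The sketch below is one possible way to do it with the Python standard library; `prefixed_samples` and `fetch_metrics` are hypothetical helper names, and the sample text in the usage note is illustrative rather than captured output. Unlike `grep 'ucm:'`, it keeps only sample lines and skips `# HELP` and `# TYPE` comments.

```python
from typing import List
from urllib.request import urlopen


def prefixed_samples(exposition_text: str, prefix: str = "ucm:") -> List[str]:
    """Return sample lines whose metric name starts with the given prefix."""
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith(prefix)  # comment lines start with '#', so they are skipped
    ]


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the Prometheus text exposition from a /metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `prefixed_samples(fetch_metrics("http://<vllm-worker-ip>:8000/metrics"))` returns an empty list when no series with the `ucm:` prefix is exported.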

### 2. Start Prometheus and Grafana

@@ -158,25 +161,30 @@ dashboards while preserving the time range and `model_name` value.
Each dashboard has a `job` selector. It defaults to **All** and uses regex
matching, so dashboards also work for metrics that do not carry a `job` label.

-The UCM dashboards also have a `View` selector and a `worker_id` selector:
+The UCM dashboards also have a `View` selector and a `worker_rank` selector:

- **Aggregated**: default service-level view. Worker labels are collapsed.
-- **Per Worker**: split panels by `worker_id` for worker-specific diagnosis.
-- **worker_id**: defaults to **All**. Select a specific worker ID to filter all
+- **Per Worker**: split panels by `worker_rank` for worker-specific diagnosis.
+- **worker_rank**: defaults to **All**. Select a specific worker rank to filter all
UCM panels to that worker only.

Heatmap panels and panels grouped by another dimension may ignore the `View`
selector because their grouping is already defined by the panel. They still use
-the `worker_id` filter.
+the `worker_rank` filter.

## Metrics Configuration

Metrics are configured in `examples/metrics/metrics_configs.yaml`:

```yaml
log_interval: 5
-multiproc_dir: "/vllm-workspace"
-metric_prefix: "ucm:"
+# multiproc_dir: "/vllm-workspace"
+# multiproc_prefix: "ucm_multiproc:"
+vllm_connector_prefix: "ucm:"
+
+consumers:
+  # multiproc: true
+  vllm_connector: true

counter:
- name: "cache_load_bytes_total"
@@ -193,17 +201,22 @@ histogram:
buckets: [0.1, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
```

-Metric names are exported with the configured prefix. For example,
-`cache_load_duration_ms` becomes `ucm:cache_load_duration_ms`. Prometheus also
-exports histogram helper series such as `_bucket`, `_sum`, and `_count`.
+Metric names are exported per consumer. For example, `cache_load_duration_ms`
+is exported as `ucm:cache_load_duration_ms` by the default `vllm_connector`
+consumer. If `multiproc` is also enabled, use a separate prefix such as
+`ucm_multiproc:` so both consumers do not register the same Prometheus metric.
+Prometheus also exports histogram helper series such as `_bucket`, `_sum`,
+and `_count`.
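
The per-consumer naming rule can be sketched as follows. This is a minimal illustration that assumes each consumer reads its prefix from a `<consumer>_prefix` key, as the `multiproc_prefix` and `vllm_connector_prefix` settings suggest; `exported_names` and `has_name_collision` are hypothetical helpers, not part of UCM.

```python
from typing import Any, Dict


def exported_names(metric: str, config: Dict[str, Any]) -> Dict[str, str]:
    """Map each enabled consumer to the name it would export for `metric`."""
    names: Dict[str, str] = {}
    for consumer, enabled in config.get("consumers", {}).items():
        if enabled:
            # Assumption: the prefix key is named "<consumer>_prefix".
            names[consumer] = config[f"{consumer}_prefix"] + metric
    return names


def has_name_collision(names: Dict[str, str]) -> bool:
    """True when two enabled consumers would register the same metric name."""
    return len(set(names.values())) < len(names)
```

With both consumers enabled and both prefixes set to `ucm:`, `has_name_collision` returns `True`; giving `multiproc` the `ucm_multiproc:` prefix resolves it.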

Counter values are increments. Gauge values replace the current value.
Histogram values are observations that are immediately assigned to configured
buckets in the C++ metrics library.
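
The three update semantics can be illustrated in Python. This is a sketch, not the C++ implementation: `StatsSketch` is a hypothetical class, and it assumes the Prometheus `le` convention in which a value equal to a bucket boundary falls into that bucket. It stores per-bucket counts rather than the cumulative counts that Prometheus exposes.

```python
from bisect import bisect_left
from typing import Dict, List


class StatsSketch:
    """Hypothetical model of counter, gauge, and bucketed-histogram updates."""

    def __init__(self, histogram_buckets: Dict[str, List[float]]) -> None:
        self.counters: Dict[str, float] = {}
        self.gauges: Dict[str, float] = {}
        self.buckets = {name: sorted(b) for name, b in histogram_buckets.items()}
        # One extra slot per histogram for values above the last boundary (+Inf).
        self.bucket_counts = {
            name: [0] * (len(b) + 1) for name, b in self.buckets.items()
        }

    def inc_counter(self, name: str, delta: float) -> None:
        # Counter values are increments.
        self.counters[name] = self.counters.get(name, 0.0) + delta

    def set_gauge(self, name: str, value: float) -> None:
        # Gauge values replace the current value.
        self.gauges[name] = value

    def observe(self, name: str, value: float) -> None:
        # The observation is assigned to a bucket immediately; the raw
        # sample is not stored anywhere.
        index = bisect_left(self.buckets[name], value)
        self.bucket_counts[name][index] += 1
```

Because only bucket counts are kept, memory use per histogram is fixed by the number of configured buckets, regardless of how many observations arrive.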

## Available Metrics

-The default metrics configuration contains the following UCM metrics.
+The default metrics configuration contains the following UCM metric names. The
+table uses the default `vllm_connector_prefix`. UCM duration metrics are
+exported in milliseconds.

### Counters

@@ -240,30 +253,43 @@ The default metrics configuration contains the following UCM metrics.
| `ucm:load_speed` | Speed of loading from UCM in GB/s. |
| `ucm:save_requests_num` | Number of requests saved to UCM. |
| `ucm:save_blocks_num` | Number of blocks saved to UCM. |
| `ucm:save_duration` | Time to save to UCM in milliseconds. |
| `ucm:save_speed` | Speed of saving to UCM in GB/s. |
| `ucm:save_duration` | Time from UCM connector `wait_for_save` entry to async dump task completion in milliseconds. |
| `ucm:save_completion_wait_duration` | Time spent blocked while confirming async UCM connector dump completion in milliseconds. |
| `ucm:interval_lookup_hit_rates` | Hit rates of UCM lookup requests. |
| `ucm:cache_lookup_duration_ms` | Cache buffer lookup wall-clock time. |
| `ucm:cache_lookup_backend_duration_ms` | Backend lookup wall-clock time when descending due to no buffer or buffer miss. |
| `ucm:cache_load_duration_ms` | End-to-end Cache stage load task duration in milliseconds. |
| `ucm:cache_dump_duration_ms` | End-to-end Cache stage dump task duration in milliseconds. |
-| `ucm:cache_load_bandwidth_gbps` | Cache stage effective load bandwidth in GB/s. |
-| `ucm:cache_dump_bandwidth_gbps` | Cache stage effective dump bandwidth in GB/s. |
+| `ucm:cache_load_bandwidth_gbps` | Cache stage effective load throughput in GB/s over the whole task lifetime, including queue/backend waits. Not a DMA bandwidth (see `cache_h2d_bandwidth_gbps`). |
+| `ucm:cache_dump_bandwidth_gbps` | Cache stage effective dump throughput in GB/s over the whole task lifetime, including queue and compute-event waits. Not a DMA bandwidth (see `cache_d2h_bandwidth_gbps`). |
| `ucm:cache_load_queue_wait_duration_ms` | Time a Cache load task spent queued before dispatch worker pickup. |
| `ucm:cache_dump_queue_wait_duration_ms` | Time a Cache dump task spent queued before dispatch worker pickup. |
| `ucm:cache_load_dispatch_duration_ms` | Cache load dispatch cost: buffer allocation plus backend submission. |
| `ucm:cache_load_backend_submit_duration_ms` | Cache load backend submit duration: buffer allocation plus backend load submission. |
| `ucm:cache_shard_backend_wait_ms` | Cache load per-shard `WaitBackendTaskReady()` duration. |
| `ucm:cache_shard_h2d_ms` | Cache load per-shard H2D async submit duration. |
| `ucm:cache_dump_mkbuf_duration_ms` | Cache dump mk_buf phase duration. |
| `ucm:cache_d2h_duration_ms` | Cache dump D2H stream sync phase duration. |
| `ucm:cache_h2d_submit_ms` | Cache load per-shard H2D async submit CPU cost. Submission only, not the transfer. |
| `ucm:cache_h2d_sync_ms` | Cache load residual H2D stream drain after the last shard submit. Large values mean H2D copy is the bottleneck. |
| `ucm:cache_h2d_bandwidth_gbps` | Cache load pure H2D copy bandwidth, directly comparable to memcpy microbenchmarks. |
| `ucm:cache_dump_mkbuf_duration_ms` | Cache dump mk_buf phase duration (buffer allocation/reuse plus D2H async submit). |
| `ucm:cache_dump_prereq_wait_ms` | Cache dump wait for the prerequisite compute event before D2H can start. Large values mean dump is compute-gated. |
| `ucm:cache_d2h_duration_ms` | Cache dump pure D2H copy drain, compute-event wait excluded. |
| `ucm:cache_d2h_bandwidth_gbps` | Cache dump pure D2H copy bandwidth, directly comparable to memcpy microbenchmarks. |
| `ucm:cache_dump_backend_submit_duration_ms` | Cache dump synchronous backend submit duration. |
| `ucm:cache_dump_backend_wait_duration_ms` | Cache dump wait for the lower tier to finish writing. Large values mean storage write is the bottleneck. |
| `ucm:posix_load_task_duration_ms` | End-to-end Posix load task duration. |
| `ucm:posix_dump_task_duration_ms` | End-to-end Posix dump task duration. |
| `ucm:posix_s2h_bandwidth_gbps` | Posix stage read bandwidth per task in GB/s. |
| `ucm:posix_h2s_bandwidth_gbps` | Posix stage write bandwidth per task in GB/s. |
| `ucm:posix_load_queue_wait_duration_ms` | Time a Posix load task spent queued before first worker pickup. |
| `ucm:posix_dump_queue_wait_duration_ms` | Time a Posix dump task spent queued before first worker pickup. |
| `ucm:layerwise_batch_total_ms` | Layerwise batch wall-clock time from `start_load_kv()` entry to `wait_for_save()` return. |
| `ucm:layerwise_batch_total_load_only_ms` | Layerwise load-only batch wall-clock time. |
| `ucm:layerwise_batch_total_save_only_ms` | Layerwise save-only batch wall-clock time. |
| `ucm:layerwise_batch_total_load_save_ms` | Layerwise load-and-save batch wall-clock time. |
| `ucm:layerwise_batch_total_no_transfer_ms` | Layerwise batch wall-clock time with neither load nor save work. |
| `ucm:layerwise_batch_load_wait_total_load_only_ms` | Total `wait_for_layer_load()` blocking time accumulated within one load-only layerwise batch. |
| `ucm:layerwise_batch_load_wait_total_load_save_ms` | Total `wait_for_layer_load()` blocking time accumulated within one load-and-save layerwise batch. |
| `ucm:layerwise_batch_save_tail_save_only_ms` | `wait_for_save()` tail duration within one save-only layerwise batch. |
| `ucm:layerwise_batch_save_tail_load_save_ms` | `wait_for_save()` tail duration within one load-and-save layerwise batch. |
| `ucm:layerwise_wait_blocking_ms` | Time `wait_for_layer_load()` blocked before returning. |
| `ucm:layerwise_wait_tasks_count` | Number of per-request load tasks awaited in a single layer wait. |
| `ucm:layerwise_inter_wait_interval_ms` | Interval between consecutive `wait_for_layer_load()` calls. |