[Metric] Support custom metric labels #7865

Merged
Jiang-Jia-Jun merged 11 commits into
PaddlePaddle:develop from
liyonghua0910:develop+20260520_metric_labels
May 27, 2026
Conversation

@liyonghua0910

@liyonghua0910 liyonghua0910 commented May 20, 2026

Collaborator

Motivation

Re-implement PR #4480 on current develop branch. The original PR introduced MetricsManagerInterface to support custom labels (e.g., model_id) on Prometheus metrics, but the codebase has changed significantly since then (WorkMetricsManager removed, new v1/serving_chat.py added, internal_adapter_utils.py no longer imports metrics, etc.).

Modifications

  1. New file fastdeploy/metrics/interface.py: Define MetricsManagerInterface with 4 abstract methods: set_value, inc_value, dec_value, obs_value.

  2. fastdeploy/metrics/metrics.py:

    • MetricsManager inherits from MetricsManagerInterface
    • Parse FD_DEFAULT_METRIC_LABEL_VALUES env var; when set to a valid non-empty JSON dict, enable metric labels
    • _patch_labelnames(): add label keys from _default_labelvalues to all metrics' labelnames
    • Implement the 4 interface methods: when labels enabled, call metric.labels(**merged).set()/inc()/dec()/observe(); otherwise, call metric.set()/inc()/dec()/observe() directly
    • Handle set_cache_config_info(), record_zmq_stats(), init_zmq_metrics(), _init_speculative_metrics() with label support
  3. fastdeploy/envs.py: Add FD_DEFAULT_METRIC_LABEL_VALUES environment variable

  4. 14 call-site files: Migrate all main_process_metrics.<metric>.set()/inc()/dec()/observe() calls to set_value()/inc_value()/dec_value()/obs_value()

  5. fastdeploy/metrics/metrics_middleware.py: Migrate HTTP metric .labels().inc()/.observe() to inc_value()/obs_value() with labelvalues parameter
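
The dispatch described in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the method signatures (metric name as a string plus an optional `labelvalues` dict), the `FakeGauge` stand-in for a prometheus_client metric, and the `DemoMetricsManager` class are all assumptions made for the example. Only the four method names, `_default_labelvalues`, and the `decode_preallocated_req_num` metric name come from this page.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

Labels = Optional[Dict[str, str]]


class MetricsManagerInterface(ABC):
    """The four update operations named in the PR (signatures are assumed)."""

    @abstractmethod
    def set_value(self, name: str, value: float, labelvalues: Labels = None) -> None: ...

    @abstractmethod
    def inc_value(self, name: str, value: float = 1, labelvalues: Labels = None) -> None: ...

    @abstractmethod
    def dec_value(self, name: str, value: float = 1, labelvalues: Labels = None) -> None: ...

    @abstractmethod
    def obs_value(self, name: str, value: float, labelvalues: Labels = None) -> None: ...


class FakeGauge:
    """Hypothetical stand-in that mimics the labels()/set()/inc()/dec()/observe() shape."""

    def __init__(self) -> None:
        self.value = 0.0
        self._children: Dict[tuple, "FakeGauge"] = {}

    def labels(self, **labelvalues: str) -> "FakeGauge":
        # One child per distinct label combination.
        key = tuple(sorted(labelvalues.items()))
        return self._children.setdefault(key, FakeGauge())

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1) -> None:
        self.value += amount

    def dec(self, amount: float = 1) -> None:
        self.value -= amount

    def observe(self, amount: float) -> None:
        self.value = amount  # a real Histogram would bucket the observation


class DemoMetricsManager(MetricsManagerInterface):
    def __init__(self, default_labelvalues: Labels = None) -> None:
        self._default_labelvalues = dict(default_labelvalues or {})
        self.decode_preallocated_req_num = FakeGauge()

    def _resolve(self, name: str, labelvalues: Labels) -> FakeGauge:
        metric = getattr(self, name)
        # Assumption: call-site labels take precedence over the defaults.
        merged = {**self._default_labelvalues, **(labelvalues or {})}
        return metric.labels(**merged) if merged else metric

    def set_value(self, name, value, labelvalues=None):
        self._resolve(name, labelvalues).set(value)

    def inc_value(self, name, value=1, labelvalues=None):
        self._resolve(name, labelvalues).inc(value)

    def dec_value(self, name, value=1, labelvalues=None):
        self._resolve(name, labelvalues).dec(value)

    def obs_value(self, name, value, labelvalues=None):
        self._resolve(name, labelvalues).observe(value)
```

With this shape, a call site would write something like `manager.inc_value("decode_preallocated_req_num")` and would not need to know whether default labels are enabled.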

Usage or Command

# Enable custom labels on all metrics
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b"}'

# Or with multiple labels
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b","version":"v2"}'

When not set (default {}), behavior is identical to current code — no labels are added.
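
A minimal sketch of the "valid non-empty JSON dict" rule, assuming that any unset, malformed, non-dict, or empty value simply disables labels. The helper name `parse_default_labelvalues` is hypothetical, and the real parsing in `fastdeploy/envs.py` and `fastdeploy/metrics/metrics.py` may differ (for example, it may log a warning on invalid input).

```python
import json
import os
from typing import Dict


def parse_default_labelvalues(raw: str) -> Dict[str, str]:
    """Return the label dict, or {} when the value is unset, invalid, or empty."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict) or not parsed:
        return {}
    # Prometheus label values are strings, so coerce defensively (an assumption).
    return {str(key): str(value) for key, value in parsed.items()}


default_labelvalues = parse_default_labelvalues(
    os.getenv("FD_DEFAULT_METRIC_LABEL_VALUES", "{}")
)
labels_enabled = bool(default_labelvalues)
```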

An example of metrics text when default label values are enabled:

# HELP fastdeploy:spec_decode_draft_single_head_acceptance_rate Single head acceptance rate of speculative decoding
# TYPE fastdeploy:spec_decode_draft_single_head_acceptance_rate gauge
fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0",model_id="qwen3-30b",version="v2"} 0.9
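
To see how a sample line like the one above could arise, the sketch below merges the default labels with a metric's own `head` label and renders a single line in the style of the text exposition format. `render_sample` is a hypothetical helper written for this example, not part of FastDeploy or prometheus_client. It sorts labels by name and skips label-value escaping for brevity.

```python
from typing import Dict


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Format one sample line; label values are assumed to need no escaping."""
    if not labels:
        return f"{name} {value}"
    body = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{body}}} {value}"


# Defaults as they would come from FD_DEFAULT_METRIC_LABEL_VALUES.
default_labelvalues = {"model_id": "qwen3-30b", "version": "v2"}
# Label supplied by the call site for this particular metric.
call_site_labelvalues = {"head": "0"}

merged = {**default_labelvalues, **call_site_labelvalues}
line = render_sample(
    "fastdeploy:spec_decode_draft_single_head_acceptance_rate", merged, 0.9
)
print(line)
```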

Accuracy Tests

No model output changes. This only affects Prometheus metric formatting.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. This PR only changes Prometheus metric label routing logic with no model output changes; unit tests for the metrics interface are not included in this PR and can be added as a follow-up.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 20, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 20, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-27 08:55:25

This CI report was generated from the following code (refreshed every 30 minutes):


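1 Task overview

Two required tasks are still failing: the main coverage task is below its threshold, and the Approval task is waiting for manual approval. Merging is not recommended until both are resolved.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42 (0) | 42 | 37 | 5 | 0 | 0 | 0 |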

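2 Task status summary

Log column: failed tasks use links pre-generated by the CI tooling; running tasks link to the Job manually.

2.1 Required tasks: 8/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h25m | PR issue: diff coverage is 77%, below 80% | Add coverage tests for metrics/common_engine | Job | - |
| ❌ | Approval | 18s | Approval required | Obtain manual approval | Job | - |
| ✅ | The other 8 required tasks passed | - | - | - | - | - |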

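2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 2m12s | Job | - |
| ❌ | CI_HPU | 1h3m | Job | - |
| ❌ | Trigger Jenkins for PR | 20s | Job | - |
| ✅ | The other 29 optional tasks passed | - | - | - |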

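3 Failure details (required tasks only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage threshold not met (confidence: high)

  • Status: ❌ Failed
  • Error type: coverage threshold not met
  • Confidence: high
  • Root cause summary: diff coverage is 77%, below 80%
  • Analyzer: generic analysis (coverage)

Failed cases:

| Test | Error | Root cause |
| --- | --- | --- |
| CoverageExitCode | 9 | Unit tests passed; the coverage check failed |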

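Root cause details:
The main test log shows "All tests passed"; the failure occurs in the "Verify Code Coverage Threshold (80%)" step. total_percent_covered in diff_coverage.json is 77, below the 80% threshold: 42 of the 183 changed lines are uncovered. The main gap is in fastdeploy/metrics/metrics.py (61%), including _patch_labelnames, default label merging, the label branches of set_value/inc_value/dec_value/obs_value, ZMQ metrics, cache config info, and speculative metrics registration. In addition, the three migrated metrics calls at lines 2099, 2176, and 2191 of fastdeploy/engine/common_engine.py are uncovered.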
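
The reported figure can be checked with a short calculation. The sketch below uses only the numbers quoted above (183 changed lines, 42 uncovered, 80% threshold). It assumes the threshold is a plain percentage comparison, and that adding tests does not change the number of measured diff lines.

```python
changed_lines = 183
uncovered_lines = 42
threshold_percent = 80

covered_lines = changed_lines - uncovered_lines
percent_covered = 100 * covered_lines / changed_lines

# Smallest number of covered lines that reaches the threshold (integer ceiling).
required_covered = -(-threshold_percent * changed_lines // 100)
additional_needed = required_covered - covered_lines

print(f"covered {covered_lines}/{changed_lines} = {percent_covered:.2f}%")
print(f"at least {additional_needed} more covered lines needed for {threshold_percent}%")
```

Under those assumptions, covering a handful of the 42 uncovered lines would be enough to pass the check.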

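Code context review:

  • tests/metrics/test_metrics.py currently covers only metric output filtering and the label scenario for spec_decode_draft_single_head_acceptance_rate; it does not cover the new default labels, ZMQ, cache config, or the unified interface branches.
  • fastdeploy/metrics/metrics.py:660-905 contains the default label patching, the unified set/inc/dec/observe methods, and the ZMQ and cache_config_info support added in this PR, and accounts for most of the diff coverage gap.
  • The migrated inc_value/dec_value calls at fastdeploy/engine/common_engine.py:2099/2176/2191 sit on the splitwise decode resource preallocation/reclaim path, which current tests do not cover.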

Suggested fixes:

  1. In tests/metrics/test_metrics.py or a new test file in the same directory, add tests for fastdeploy/metrics/metrics.py:660-774: valid, invalid, and empty FD_DEFAULT_METRIC_LABEL_VALUES; _patch_labelnames not modifying the original dict; merging of default labels with caller-supplied labelvalues; and the labeled branches of set_value/inc_value/dec_value/obs_value.
  2. For fastdeploy/metrics/metrics.py:836-905, add unit tests for init_zmq_metrics(), record_zmq_stats(), and set_cache_config_info() on paths such as default labels enabled, no metrics_info, and an existing cache_config_info Gauge.
  3. In tests/engine/test_common_engine.py, use mocks to cover the inc_value/dec_value call paths for decode_preallocated_req_num and failed_recv_first_token_req_num at fastdeploy/engine/common_engine.py:2099/2176/2191.

Fix summary: add coverage tests for metrics/common_engine
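
As a rough illustration of the first suggestion, the sketch below shows one possible shape for a test that checks label patching does not mutate the original definitions. `patch_labelnames` here is a hypothetical free-function stand-in for the private `_patch_labelnames` method. The definition layout (a dict of `name`/`description`/`kwargs` entries) is also an assumption, so a real test would import and exercise the actual `MetricsManager` instead.

```python
import copy
from typing import Dict, Iterable


def patch_labelnames(metric_defs: Dict[str, dict], extra: Iterable[str]) -> Dict[str, dict]:
    """Return a patched copy whose labelnames include the extra label keys."""
    patched = copy.deepcopy(metric_defs)
    for cfg in patched.values():
        kwargs = cfg.setdefault("kwargs", {})
        names = list(kwargs.get("labelnames", ()))
        names += [name for name in extra if name not in names]
        kwargs["labelnames"] = names
    return patched


def test_patch_labelnames_does_not_mutate_input() -> None:
    original = {
        "decode_preallocated_req_num": {
            "name": "fastdeploy:decode_preallocated_req_num",
            "description": "illustrative entry",
            "kwargs": {},
        },
        "spec_decode_draft_single_head_acceptance_rate": {
            "name": "fastdeploy:spec_decode_draft_single_head_acceptance_rate",
            "description": "illustrative entry",
            "kwargs": {"labelnames": ["head"]},
        },
    }
    snapshot = copy.deepcopy(original)

    patched = patch_labelnames(original, ["model_id"])

    # The input definitions must be left untouched.
    assert original == snapshot
    # Existing label names are kept and the default label key is appended.
    spec = patched["spec_decode_draft_single_head_acceptance_rate"]
    assert spec["kwargs"]["labelnames"] == ["head", "model_id"]
    plain = patched["decode_preallocated_req_num"]
    assert plain["kwargs"]["labelnames"] == ["model_id"]


test_patch_labelnames_does_not_mutate_input()
```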

Approval: manual approval required (confidence: high)

This Job requires manual approval; CI continues only after approval is granted.

@codecov-commenter

codecov-commenter commented May 20, 2026

Codecov Report

❌ Patch coverage is 72.13115% with 51 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8a4ac65). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/metrics/metrics.py | 52.00% | 39 Missing and 9 partials ⚠️ |
| fastdeploy/engine/common_engine.py | 66.66% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7865   +/-   ##
==========================================
  Coverage           ?   64.06%           
==========================================
  Files              ?      468           
  Lines              ?    65107           
  Branches           ?     9984           
==========================================
  Hits               ?    41709           
  Misses             ?    20565           
  Partials           ?     2833           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 73.16% <72.13%> (?) |
| XPU | 7.09% <13.66%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

…e interface

Introduce MetricsManagerInterface with unified set_value/inc_value/dec_value/obs_value methods.
When FD_DEFAULT_METRIC_LABEL_VALUES is set to a valid non-empty JSON dict, metric labels
(e.g. model_id) are automatically applied. Otherwise, operations fall back to the raw
prometheus_client calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 20:59:12

📋 Review summary

PR overview: Adds a MetricsManagerInterface abstraction layer and supports attaching custom labels (e.g., model_id) to all Prometheus metrics through the FD_DEFAULT_METRIC_LABEL_VALUES environment variable.
Scope of changes: fastdeploy/metrics/, fastdeploy/envs.py, 14 call-site files, test files
Impact tags: [Feature] [Engine] [APIServer] [KVCache] [DataProcessor] [PD Disaggregation]

Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Compatibility | fastdeploy/metrics/metrics.py:819 | Breaking change to the spec_decode_draft_single_head_acceptance_rate metric name |

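📝 PR convention check

The title tag [Metric] is not in the official tag list; it is a custom tag.

Suggested title (can be copied directly):

  • [Feature] Support custom metric labels

Suggested PR description (can be copied directly):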
## Motivation

Re-implement PR #4480 on current develop branch. The original PR introduced `MetricsManagerInterface` to support custom labels (e.g., `model_id`) on Prometheus metrics, but the codebase has changed significantly since then (`WorkMetricsManager` removed, new `v1/serving_chat.py` added, `internal_adapter_utils.py` no longer imports metrics, etc.).

## Modifications

1. **New file `fastdeploy/metrics/interface.py`**: Define `MetricsManagerInterface` with 4 abstract methods: `set_value`, `inc_value`, `dec_value`, `obs_value`.

2. **`fastdeploy/metrics/metrics.py`**:
   - `MetricsManager` inherits from `MetricsManagerInterface`
   - Parse `FD_DEFAULT_METRIC_LABEL_VALUES` env var; when set to a valid non-empty JSON dict, enable metric labels
   - `_patch_labelnames()`: add label keys from `_default_labelvalues` to all metrics' `labelnames`
   - Implement the 4 interface methods: when labels enabled, call `metric.labels(**merged).set()/inc()/dec()/observe()`; otherwise, call `metric.set()/inc()/dec()/observe()` directly
   - Handle `set_cache_config_info()`, `record_zmq_stats()`, `init_zmq_metrics()`, `_init_speculative_metrics()` with label support
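   - **Breaking Change**: the `spec_decode_draft_single_head_acceptance_rate` metric changes from multiple independent Gauges (with `_0`, `_1`, etc. suffixes) to a single Gauge with a `head` label. The old metric names no longer exist, so update any affected dashboards and alert rules.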

3. **`fastdeploy/envs.py`**: Add `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable

4. **14 call-site files**: Migrate all `main_process_metrics.<metric>.set()/inc()/dec()/observe()` calls to `set_value()/inc_value()/dec_value()/obs_value()`

5. **`fastdeploy/metrics/metrics_middleware.py`**: Migrate HTTP metric `.labels().inc()/.observe()` to `inc_value()/obs_value()` with `labelvalues` parameter

## Usage or Command

```bash
# Enable custom labels on all metrics
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b"}'

# Or with multiple labels
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b","version":"v2"}'
```

When not set (default `{}`), behavior is identical to current code — no labels are added.

## Accuracy Tests

N/A — This only affects Prometheus metric formatting, no model output changes.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Unit tests for the metrics interface are not included in this PR and will be added as a follow-up.
- [ ] Provide accuracy results. N/A — no model output changes.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

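Overall assessment

The implementation approach is clear, the abstraction layer is well designed, and the default behavior is fully compatible with existing code. The main concerns are that the breaking change to the spec_decode_draft_single_head_acceptance_rate metric name should be stated explicitly in the description, and that the title tag should be replaced with an official one.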

),
)

patched_spec_metrics = self._patch_labelnames(self.SPECULATIVE_METRICS)

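🔴 Compatibility: breaking change to the spec_decode_draft_single_head_acceptance_rate metric name

The old implementation created multiple independent Gauges named fastdeploy:spec_decode_draft_single_head_acceptance_rate_0, _1, and so on; the new implementation uses a single Gauge with a head label, so the series becomes fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0"}.

Existing Prometheus alert rules and Grafana dashboards that depend on the old metric names will stop working, so this is a backward-incompatible breaking change.

It is recommended to state this breaking change explicitly in the PR description and to provide migration guidance (for example, old metric name → new PromQL query).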
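
One way to apply such a migration to existing dashboard or alert definitions is a small rewrite helper. The sketch below is hypothetical and not part of this PR: it maps an old suffixed metric name to the new label-based selector, based only on the naming described in this comment. Expressions that embed the name inside larger queries would need more careful handling.

```python
import re

# Old names carried the head index as a numeric suffix, e.g. "..._0", "..._1".
_OLD_NAME = re.compile(
    r"^(fastdeploy:spec_decode_draft_single_head_acceptance_rate)_(\d+)$"
)


def to_labeled_selector(old_name: str) -> str:
    """Rewrite an old per-head metric name as a selector on the `head` label."""
    match = _OLD_NAME.match(old_name)
    if match is None:
        return old_name  # not one of the renamed metrics; leave unchanged
    return f'{match.group(1)}{{head="{match.group(2)}"}}'


print(to_labeled_selector("fastdeploy:spec_decode_draft_single_head_acceptance_rate_0"))
```

When default labels such as `model_id` are enabled, the new series carry those labels as well, but a selector on `head` alone still matches them.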

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit a918693 into PaddlePaddle:develop May 27, 2026
38 of 43 checks passed
Jiang-Jia-Jun pushed a commit that referenced this pull request May 28, 2026
* [Metric] Support model_id as metric labels by redefining metric update interface

Introduce MetricsManagerInterface with unified set_value/inc_value/dec_value/obs_value methods.
When FD_DEFAULT_METRIC_LABEL_VALUES is set to a valid non-empty JSON dict, metric labels
(e.g. model_id) are automatically applied. Otherwise, operations fall back to the raw
prometheus_client calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [chore] add logger

* [fix] fix spec metrics and cache info

* [refactor] reimplement spec_decode_draft_single_head_acceptance_rate

* [chore] fix pre-commit

* [fix] fix spec labels

* [test] fix test

* [update] update cache_info and zmq_labels

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>