feat(metrics): add endpoint server and metrics#948

Closed
MaurUppi wants to merge 17 commits into daeuniverse:legacy from MaurUppi:feat/metrics-endpoint-clean


@MaurUppi MaurUppi commented Feb 28, 2026

Background

This PR supersedes #941 with a clean metrics-only branch based on daeuniverse/dae:main.

The previous PR mixed in many unrelated commits. This one keeps only metrics endpoint work and required dependencies.

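dae has never had metrics. Coming from clash/Surge, I was used to having a dashboard to look at, so I never got comfortable digging through journal logs. Finding information in the large volume of trace/debug logs is tedious, and the logs eat disk space quickly. The machine running dae is low-spec, so heavy log queries are slow. I also tried parsing the logs with vector, but debugging it was too much trouble, and every extra component is one more thing to maintain.
So, with free time over the holiday, I vibe-coded this PR, hoping to contribute a valuable feature to v1.1.0. How well it actually works is for the team to review.
Attached is the Grafana dashboard I built on top of these metrics; it is fun to glance at from time to time.

This PR implements metrics Phase 1 (nothing added to the hot path):

  • Add the global.endpoint_* configuration options
  • Wire in the endpoint server (metrics + pprof)
  • Add the base metrics modules: pkg/metrics/{state,registry,server,auth}
  • Add the Phase 1 gauge collectors (dialer/dns/connection)
  • Update the example.dae sample configuration <-- user-facing docs PR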

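Phase 2, built on top of Phase 1:

  • Add DNS counters and latency histogram snapshots
  • Add total TCP/UDP connection counters
  • Add dialer health check counters
  • Extend the metrics collectors to emit the Phase 2 metrics
  • Add the corresponding tests
  • Add dae Transparent Proxy-Grafana_dashboard.json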

Checklist

Full Changelogs

  1. Phase 1 Gauges
  2. Phase 2 Counters / Histograms

See #948 (comment)


Test Result

  • Local verification:
    • go build ./...
    • go test ./...
    • Runtime check: /metrics, /debug/pprof/
  • Note:
    • eBPF/Kernel CI validation has passed.

Notes

Checklist

  • Metrics-related changes are isolated from unrelated optimization/history commits
  • Required dependency entries for Prometheus are included

Final result:

[Screenshots: dae_dashboard_0, dae_dashboard_1]

@MaurUppi
Author

Code audit report

dae Metrics Endpoint — Audit Report

Date: 2026-02-23
Scope: Phase 1 + Phase 2 metrics implementation (feat/metrics-endpoint-phase1, PR#941)
Live endpoint verified: http://192.168.1.174:5556/metrics
Files reviewed: metrics-related changes in pkg/metrics/, control/, component/outbound/dialer/, cmd/run.go


Code Review Summary

Overall assessment: APPROVE (no blocking issue)

Notes:

  • Previous dae_dns_concurrency_in_use inversion issue has been fixed in metrics branch (fix(metrics): correct dns concurrency in-use gauge semantics).
  • Collector descriptor coverage has been strengthened by test (test(metrics): verify all collector descriptors are exposed).
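
To make the "in-use gauge semantics" concrete, here is a minimal sketch of a channel-based limiter whose gauges are read with `len` and `cap`. It is an assumption about the general technique, not the code from the fix: `limiter`, `tryAcquire`, `release`, and `info` are hypothetical names. With this layout, acquiring pushes a token, so `len` counts slots in use; a limiter that pre-fills the channel and pops on acquire would make `len` count free slots instead, which is one way such a gauge can end up inverted.

```go
package main

import "fmt"

// limiter is a hypothetical channel-based concurrency limiter.
type limiter struct{ slots chan struct{} }

func newLimiter(n int) *limiter { return &limiter{slots: make(chan struct{}, n)} }

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees one previously acquired slot.
func (l *limiter) release() { <-l.slots }

// info returns (inUse, limit). A nil limiter reports 0/0, mirroring the
// "no explicit limiter" fallback described in the report.
func (l *limiter) info() (inUse, limit int) {
	if l == nil {
		return 0, 0
	}
	return len(l.slots), cap(l.slots)
}

func main() {
	l := newLimiter(2)
	l.tryAcquire()
	inUse, limit := l.info()
	fmt.Println("in use:", inUse, "limit:", limit)

	var none *limiter
	inUse, limit = none.info()
	fmt.Println("fallback:", inUse, limit)
}
```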

Findings

P0 — Critical

(none)

P1 — High

(none)

P2 — Medium

(none)

P3 — Low

(none)


Endpoint Verification

All metrics from both phases are present at http://192.168.1.174:5556/metrics.

Phase 1 Gauges — All present ✅

| Metric | Type | Status |
| --- | --- | --- |
| dae_dialer_alive | gauge | ✅ Values: 0/1 per dialer×network |
| dae_dialer_latency_last_seconds | gauge | ✅ Real latency values (e.g., 0.0497s) |
| dae_dialer_latency_avg10_seconds | gauge | ✅ Real latency averages |
| dae_dialer_latency_moving_avg_seconds | gauge | ✅ EWMA values |
| dae_group_alive_dialers_total | gauge | ✅ Per group×network |
| dae_dns_cache_entries | gauge | ✅ (0; cache expired at scrape time) |
| dae_dns_concurrency_in_use | gauge | ✅ Current semantics are correct; on dae/main fallback path this is typically 0 |
| dae_dns_concurrency_limit | gauge | ✅ On dae/main fallback path this is 0 (legacy DNS controller has no explicit limiter) |
| dae_dns_forwarder_cache_entries | gauge | ✅ (0; no long-lived forwarders active) |
| dae_dns_forwarder_in_flight{upstream} | gauge | ✅ Implemented; emits only when there are in-flight upstream requests |
| dae_tcp_connections_active | gauge | |
| dae_udp_endpoints_active | gauge | |
| dae_udp_task_queues_active | gauge | |

Phase 2 Counters / Histograms — All present ✅

| Metric | Type | Status |
| --- | --- | --- |
| dae_dns_query_total | counter | 4 |
| dae_dns_cache_hit_total | counter | 1 |
| dae_dns_cache_lazy_hit_total | counter | 0 |
| dae_dns_cache_miss_total | counter | 3 |
| dae_dns_upstream_query_total{upstream} | counter | tcp://192.168.1.8:5553 |
| dae_dns_upstream_err_total{upstream} | counter | |
| dae_dns_rejected_total | counter | 0 |
| dae_dns_refused_total | counter | 0 |
| dae_dns_response_latency_seconds | histogram | ✅ 12 buckets + sum + count |
| dae_dns_upstream_latency_seconds{upstream} | histogram | ✅ per upstream |
| dae_health_check_total{group,dialer,network} | counter | |
| dae_health_check_failure_total{group,dialer,network} | counter | |
| dae_tcp_connections_total{protocol,group} | counter | tcp4/HK = 1 |
| dae_udp_connections_total{protocol,group} | counter | |

Process / Go Runtime ✅

All standard process_* and go_* metrics are present.


Architecture Assessment

| Concern | Status | Notes |
| --- | --- | --- |
| Dependency direction | | pkg/metrics/ depends on control/; no reverse dependency |
| prometheus import isolation | | Metrics dependency remains in metrics package; domain structs expose snapshots/getters |
| Hot-path safety | | Counters use atomic.Uint64; collectors scrape snapshots |
| Thread safety: gauge reads | | Guarded by mutex/sync.Map/atomic in corresponding components |
| Thread safety: histogram | | Atomic bucket/counter/sum update and snapshot |
| Nil safety in collectors | | Collectors guard state == nil, cp == nil, dc == nil |
| Reload handling | | metrics.State swaps ControlPlane atomically |
| Deterministic output | | Key sorting is used for labeled DNS upstream metrics |
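
The reload-handling and nil-safety rows can be sketched together. This is a minimal illustration under assumptions, not the PR's implementation: `state`, `controlPlane`, `setControlPlane`, and `collectActiveTCP` are hypothetical stand-ins for metrics.State, ControlPlane, and a collector, and it assumes the swap is done with `atomic.Pointer` (Go 1.19+).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// controlPlane is a hypothetical stand-in for dae's ControlPlane.
type controlPlane struct{ activeTCP atomic.Int64 }

// state holds the current control plane; a reload swaps it atomically,
// so a concurrent scrape sees either the old or the new one, never a mix.
type state struct{ cp atomic.Pointer[controlPlane] }

func (s *state) setControlPlane(cp *controlPlane) { s.cp.Store(cp) }

// collectActiveTCP returns (value, ok). ok is false when there is nothing
// to report yet, so the collector can skip the metric instead of panicking.
func collectActiveTCP(s *state) (int64, bool) {
	if s == nil {
		return 0, false
	}
	cp := s.cp.Load()
	if cp == nil {
		return 0, false
	}
	return cp.activeTCP.Load(), true
}

func main() {
	s := &state{}
	_, ok := collectActiveTCP(s)
	fmt.Println("before first load:", ok)

	first := &controlPlane{}
	first.activeTCP.Store(5)
	s.setControlPlane(first)
	v, ok := collectActiveTCP(s)
	fmt.Println("after load:", v, ok)

	// Reload: publish a fresh control plane in one atomic store.
	s.setControlPlane(&controlPlane{})
	v, ok = collectActiveTCP(s)
	fmt.Println("after reload:", v, ok)
}
```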

Required Action Before Upstream PR

No blocking change required based on current audited state.


Optional (Non-blocking)

  • Keep dashboard and changelog text aligned with dae/main fallback semantics for DNS concurrency gauges (0/0 until PR936-style limiter model is present).

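Metrics observability overhead assessment

dae metrics observability overhead assessment

Benchmark environment

  • Machine: 192.168.1.15
  • Prometheus scrape_interval: 5s
  • dae DNS QPS: a few dozen in daily use, peaking at 200 QPS under stress test
  • metrics endpoint: http://192.168.1.15:5556/metrics (response body ~224KB / 2338 lines)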

1. Measured baseline (collected from /metrics)

process_cpu_seconds_total       386.55
process_resident_memory_bytes   139,923,456  (~133 MB RSS)
go_memstats_heap_alloc_bytes    62,481,112   (~60 MB)
go_memstats_heap_inuse_bytes    73,883,648   (~70 MB)
go_memstats_heap_objects        658,288

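2. Hot-path overhead (per DNS query)

Code path (control/dns_metrics.go)

Each HandleWithResponseWriter_ call triggers:

| Operation | Implementation | Estimated cost |
| --- | --- | --- |
| dnsQueryTotal.Add(1) | atomic.Uint64 | ~5ns |
| Branch counter .Add(1) | atomic.Uint64 | ~5ns |
| Observe(seconds) | Linear scan over 12 buckets | ~5ns |
| atomic.Add × 2 | count + bucket | ~10ns |
| addAtomicFloat64 | CAS loop (uncontended path) | ~10ns |
| Total per query | | ~35–50 ns |

An upstream miss triggers one additional Observe() at the same cost.

CPU cost at 200 QPS:

200 × 50ns = 10µs/s → 0.001% of one core

Key design point: atomic.Uint64 and a custom histogram are used throughout, so the hot path takes no mutex.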
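
A lock-free histogram of the kind described above can be sketched as follows. This is a minimal illustration under assumptions: only the name addAtomicFloat64 and the 12-bucket-plus-overflow shape come from the report; the bucket bounds, the `histogram` type, and the choice to store per-bucket (non-cumulative) counts are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// bounds are hypothetical upper bounds in seconds: 12 finite buckets.
var bounds = [12]float64{0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}

type histogram struct {
	buckets [13]atomic.Uint64 // 12 finite buckets, last slot is +Inf
	count   atomic.Uint64
	sumBits atomic.Uint64 // float64 sum stored as its IEEE 754 bit pattern
}

// addAtomicFloat64 adds delta to a float64 kept in an atomic.Uint64 using a
// compare-and-swap loop; without contention the loop runs exactly once.
func addAtomicFloat64(bits *atomic.Uint64, delta float64) {
	for {
		old := bits.Load()
		next := math.Float64bits(math.Float64frombits(old) + delta)
		if bits.CompareAndSwap(old, next) {
			return
		}
	}
}

// observe records one sample: linear scan for the bucket, then three atomic updates.
func (h *histogram) observe(seconds float64) {
	idx := len(bounds) // default to the +Inf slot
	for i, ub := range bounds {
		if seconds <= ub {
			idx = i
			break
		}
	}
	h.buckets[idx].Add(1)
	h.count.Add(1)
	addAtomicFloat64(&h.sumBits, seconds)
}

// sum returns the accumulated sample sum.
func (h *histogram) sum() float64 { return math.Float64frombits(h.sumBits.Load()) }

func main() {
	var h histogram
	for _, s := range []float64{0.0497, 0.3, 10} {
		h.observe(s)
	}
	fmt.Println("count:", h.count.Load(), "sum:", h.sum())
}
```

Because bucket, count, and sum are updated by separate atomic operations, a concurrent snapshot may see them momentarily out of step by one sample; for monitoring data that trade-off is usually acceptable in exchange for a mutex-free hot path.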


3. Scrape-path overhead (Prometheus pull)

What Collect() does

DnsCountersSnapshot()      → 8 × atomic.Load + sync.Map.Range (currently 1 upstream)
DnsResponseLatencySnapshot → 14 × atomic.Load + map allocation (12 buckets)
DnsUpstreamSnapshot()      → 14 × atomic.Load + map allocation (per upstream)
ConcurrencyInfo()          → len(channel), O(1)
CacheSize()                → sync.Map traversal
ForwarderCacheInfo()       → sync.Map traversal
HTTP serialization         → ~224KB of text output

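CPU impact at a 5s scrape interval

| Parameter | Value |
| --- | --- |
| Estimated time per scrape | ~1–2ms |
| Scrapes per minute | 12 |
| CPU time per minute | 12–24ms |
| Share of one core | ~0.02–0.04% |

Compared with a 15s interval:

| scrape_interval | Scrapes per minute | CPU share |
| --- | --- | --- |
| 15s | 4 | ~0.01% |
| 5s | 12 | ~0.03% |
| 1s | 60 | ~0.15% |

A 5s interval triples the scrape frequency relative to 15s, but the absolute cost is still negligible.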

Memory allocation per scrape

DnsHistogramSnapshot.Buckets  map allocation: 12 entries × 2 = ~400B per scrape
DnsUpstreamSnapshot           map allocation: ~200B per upstream
Total: < 1KB per scrape → reclaimed within one GC cycle

4. Resident memory footprint (static structures)

dnsLatencyHistogram (per histogram):
  buckets: 13 × atomic.Uint64 = 104B
  count + sumBits              =  16B
  ─────────────────────────────── ~120B

Current instance:
  response latency histogram   = 120B
  upstream latency histogram × 1 = 120B
  atomic counters × 6          =  48B
  metrics HTTP server goroutine = ~8KB (goroutine stack)

Total static memory of the metrics module: < 10KB
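
The 104B + 16B = 120B figure can be checked with `unsafe.Sizeof`. The struct below is an assumed layout reconstructed from the breakdown above (13 bucket slots, count, sumBits), not a copy of the PR's type; `histogramSizes` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// dnsLatencyHistogram mirrors the field breakdown in the assessment
// (assumed layout; the real struct may carry additional fields).
type dnsLatencyHistogram struct {
	buckets [13]atomic.Uint64
	count   atomic.Uint64
	sumBits atomic.Uint64
}

// histogramSizes returns the size in bytes of the bucket array and of the
// whole struct.
func histogramSizes() (buckets, total uintptr) {
	var h dnsLatencyHistogram
	return unsafe.Sizeof(h.buckets), unsafe.Sizeof(h)
}

func main() {
	b, t := histogramSizes()
	fmt.Println("buckets:", b, "bytes; struct:", t, "bytes")
}
```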

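5. Conclusion

| Overhead source | Impact at 5s scrape_interval | Rating |
| --- | --- | --- |
| Hot-path atomic counters (per query) | < 0.001% CPU @ 200 QPS | Negligible |
| Histogram Observe CAS | Very low (uncontended CAS) | Negligible |
| Scrape serialization (5s period) | ~0.03% CPU | Negligible |
| Resident memory (metrics state) | < 10KB | Negligible |
| Heap allocation (per scrape) | < 1KB, reclaimed by GC | Negligible |
| metrics HTTP goroutine | 1 goroutine, ~8KB stack | Negligible |

Overall conclusion: scrape_interval: 5s triples the scrape CPU overhead relative to 15s, but the absolute value stays on the order of 0.03% and is not a performance bottleneck in any practical scenario. The lock-free design of the current implementation keeps the hot path and the scrape path from blocking each other, so no observable impact is expected even during the 200 QPS stress test.

If a more aggressive scrape_interval (e.g. 1s) is wanted, first confirm the per-scrape duration using method 2, then assess whether the impact becomes noticeable.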

@MaurUppi
Author

@cubercsl

Please take a look.

…CI workflows by olicesx (daeuniverse#970)

Co-authored-by: kix <olices@9up.in>
@MaurUppi MaurUppi force-pushed the feat/metrics-endpoint-clean branch from bae7ccb to 816aab5 Compare April 22, 2026 12:35
@MaurUppi MaurUppi requested a review from a team as a code owner April 22, 2026 13:05
@MaurUppi
Author

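It looks like this PR needs a rebase to resolve the conflicts again, and some of the runtime/latency data from PR#968 also needs to be wired into Prometheus.
I will sort this out over the next couple of days and push again.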

dae-prow Bot and others added 12 commits April 22, 2026 20:40
Co-authored-by: dae-prow-robot <dae@v2raya.org>
Co-authored-by: Sumire (菫) <151038614+sumire88@users.noreply.github.com>
Co-authored-by: dae-prow-robot <dae@v2raya.org>
Co-authored-by: Sumire (菫) <151038614+sumire88@users.noreply.github.com>
Add a dedicated runtime Prometheus collector for upload/download totals and rates, register it in the metrics registry, and cover the new series with collector tests.

The rebase onto dae/main also surfaced a stale AliveDialerSet field reference that would break downstream verification, so this commit folds in the minimal compile fix needed to keep the branch testable in CI.

Constraint: Local verification is limited to macOS code-level checks; Linux/eBPF build and test paths must run in CI
Constraint: Runtime metrics must consume the exported control.SnapshotRuntimeStats API rather than runtimeStats internals
Rejected: Modify control/runtime_stats.go directly | unnecessary coupling to upstream implementation details
Rejected: Skip the AliveDialerSet fix | leaves the rebased branch uncompilable in downstream validation
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep the metrics layer reading runtime stats through public control APIs so future upstream rebases stay small
Tested: git diff --check; gofmt on changed Go sources; attempted pkg/metrics TDD runs locally to validate red state and dependency setup
Not-tested: Full go test ./pkg/metrics/... on macOS (blocked by Linux-specific code paths); CI compile/test on origin/main
@MaurUppi MaurUppi force-pushed the feat/metrics-endpoint-clean branch from 7cdbcea to 51e374f Compare April 23, 2026 03:05
@MaurUppi MaurUppi requested review from a team as code owners April 23, 2026 03:05
Codex added 3 commits April 23, 2026 11:36
Systematic debugging showed three distinct failure classes on PR #28. The code failures (Go Test, Lint, Kernel Test) shared one root cause: DialerGroup accessors were still reading removed pre-selectionState fields, so outbound package compilation broke on Linux CI. This commit switches those accessors to currentSelectionState and relaxes AliveCount to a read lock.

The same investigation showed the document check was tripping over repo-wide markdownlint debt carried into the PR relative to origin/main, so the commit applies the minimal lint-only markdown fixes reported by CI. PR Build (Preview) was failing for a separate infrastructure reason: the fork does not provide GH_APP_ID/GH_APP_PRIVATE_KEY, so the preview workflow now skips cleanly when those secrets are absent instead of reporting a false red check.

Constraint: Local verification is limited to code-level checks on macOS; Linux/eBPF execution must stay in CI
Constraint: Preview workflow must not require unavailable GitHub App secrets on forks
Rejected: Guess at the metrics code first | the logs showed the primary breakage was in outbound accessor code, not runtime collector logic
Rejected: Fix only dialer_group.go | would leave document lint and preview workflow red on the same PR
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep PR-only workflow glue tolerant of missing fork secrets, and keep accessor helpers reading selectionState rather than stale duplicated fields
Tested: gh Actions log inspection for PR #28; git diff --check; npm run markdown-lint; npm run check-broken-link; Linux-targeted compile of ./component/outbound with dae_stub_ebpf
Not-tested: Full repository CI rerun after push; local execution of preview workflow action stack

Comparing the failing PR 28 runs with the older green runs showed the remaining Go Test, Lint, and Kernel Test failures all converged on DnsController metrics helper methods. During the rebase onto dae/main, the facade moved cache and forwarder state into the shared dnsControllerStore (sync.Map-backed), but CacheSize and ForwarderCacheInfo were still reading removed mutex-protected map fields.

This commit switches those helpers to iterate the current shared store, which matches the post-reload architecture and unblocks pkg/metrics/control compilation again.

Constraint: CI failures were compared against known-good runs 24814452464/22514228055 and 24815333258/22514228007 before changing code
Constraint: Fix must preserve the dae/main shared-store DNS architecture rather than reintroduce legacy mutex+map fields
Rejected: Re-add dnsCacheMu/dnsForwarderCacheMu to DnsController | would regress the new shared-store design and mask the real mismatch
Rejected: Change collector code to stop calling these helpers | symptom fix, not root cause
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future DnsController metrics/helper methods must read through the shared store facade, not legacy per-controller cache fields
Tested: GitHub Actions log comparison for failed vs successful runs; git diff --check
Not-tested: Full rerun of PR 28 after push (pending GitHub Actions)

Comparing the latest failing PR 28 runs with the earlier green baselines showed two more regressions after the first CI-fix pass. Kernel Test and Go Test both failed on cmd/run.go because shutdownAfterSignalWithHandoff referenced the Run-local endpointServer variable from package scope. The same Go Test run then exposed a second issue in control/dns_metrics_test.go: the tests still constructed zero-value DnsController instances, but the rebased code now requires a shared dnsControllerStore-backed test helper.

This commit removes the invalid out-of-scope endpointServer shutdown branch and migrates the DNS metrics tests to the store-aware helper used elsewhere in the control package.

Constraint: Root cause was determined by comparing failed runs 24822215119/24822215133 with the older successful run 22514228007 before changing code
Constraint: Test updates must follow the new shared-store DnsController contract rather than bypass requireStore panics
Rejected: Reintroduce a package-level endpointServer just to satisfy shutdownAfterSignalWithHandoff | wrong ownership boundary and unnecessary duplication
Rejected: Loosen requireStore for zero-value tests | would hide contract violations instead of updating tests to the supported helper
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep endpoint server lifecycle owned inside Run, and keep control tests constructing DnsController through shared-store helpers
Tested: GitHub Actions failed-vs-successful run comparison; git diff --check; local targeted go test attempt for dns_metrics tests (blocked by macOS/Linux dependency divergence, no new logic errors)
Not-tested: Full PR 28 CI rerun after push

The next failed-vs-successful CI comparison showed the remaining red signal had moved from structural compile regressions to pure static checks. The latest lint run failed on unchecked Close calls in endpoint TLS validation, deprecated Prometheus collector constructors, the deliberate use of the exported runtime snapshot API, and a dead refreshPprofServer helper left behind after the endpoint-server integration.

This commit makes the Close paths explicit, switches registry construction to prometheus/collectors, documents the intentional staticcheck exception for the public runtime snapshot API, and removes the now-unused reload manager helper.

Constraint: Changes must preserve the plan’s decision to source runtime traffic from control.SnapshotRuntimeStats instead of touching runtime internals
Constraint: Fixes should address lint/static analysis directly without changing runtime behavior
Rejected: Replace SnapshotRuntimeStats with control-plane internals | violates the integration plan and increases merge friction
Rejected: Suppress all lint at workflow level | hides real regressions instead of fixing the small concrete issues
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep lint fixes local to the actual warning site; do not broaden suppressions when a precise code change will do
Tested: GitHub Actions log comparison for the latest failing lint run; git diff --check
Not-tested: Fresh CI rerun after push (pending)
@MaurUppi
Author

MaurUppi commented Apr 24, 2026

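This is too hard...
PR#968 actually only added a very simple runtime/latency interface,
but #970 changed too much; I'm afraid this PR needs a rewrite rather than just a rebase.
And the fix in #980 changed quite a lot as well.

Never mind, I will close this PR and redo it when I have time.

For now I am running it myself, without PR#970/980 merged in.

[Screenshots: CleanShot 2026-04-24 at 17 08 10, CleanShot 2026-04-24 at 17 11 08]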

@MaurUppi MaurUppi closed this Apr 24, 2026