feat(metrics): add endpoint server and metrics by MaurUppi · Pull Request #948 · daeuniverse/dae

MaurUppi · 2026-02-28T14:25:54Z

Background

This PR supersedes #941 with a clean metrics-only branch based on daeuniverse/dae:main.

The previous PR mixed in many unrelated commits. This one keeps only metrics endpoint work and required dependencies.

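dae has never had metrics. Coming from clash/Surge, I am used to having a dashboard to look at, so I am not keen on constantly digging through journal logs. Finding anything in the large volume of trace/debug logs is tedious, and the logs eat disk space quickly. The machine running dae is low-spec, so heavy log queries are a strain. I also tried parsing the logs with vector, but debugging it was too painful, and every extra component is one more thing that can go wrong and has to be maintained.
So I vibe-coded this PR over the holidays, hoping to contribute a valuable feature to v1.1.0. How well it works in practice is for the team to review.
I have also included the Grafana dashboard I built on top of these metrics; it is fun to glance at from time to time.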

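This PR implements metrics Phase 1 (does not touch the hot path):

Add the global.endpoint_* config options
Wire in the endpoint server (metrics + pprof)
Add the base metrics modules: pkg/metrics/{state,registry,server,auth}
Add Phase 1 gauge collectors (dialer/dns/connection)
Update the example.dae sample config <-- user-facing docs PR

On top of Phase 1, it implements Phase 2:

Add DNS counters and latency histogram snapshots
Add TCP/UDP total connection counters
Add dialer health check counters
Extend the metrics collectors to emit Phase 2 metrics
Add the corresponding tests
Add dae Transparent Proxy-Grafana_dashboard.json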

Checklist

[ X ] The Pull Request has been fully tested
[ X ] There's an entry in the CHANGELOGS
[ X ] There is a user-facing docs PR against https://github.com/daeuniverse/dae

Full Changelogs

Phase 1 Gauges
Phase 2 Counters / Histograms

See #948 (comment)

Test Result

Local verification:
- go build ./...
- go test ./...
- Runtime check: /metrics, /debug/pprof/
Note:
- eBPF/Kernel CI validation has passed.

Notes

Branch base: latest daeuniverse/dae:main at split time
This PR is intended to replace feat(metrics): add endpoint server and metrics #941 for review/merge

Checklist

Metrics-related changes are isolated from unrelated optimization/history commits
Required dependency entries for Prometheus are included

Final result:

MaurUppi · 2026-02-28T14:30:39Z

Code audit report

dae Metrics Endpoint — Audit Report

Date: 2026-02-23
Scope: Phase 1 + Phase 2 metrics implementation (feat/metrics-endpoint-phase1, PR#941)
Live endpoint verified: http://192.168.1.174:5556/metrics
Files reviewed: metrics-related changes in pkg/metrics/, control/, component/outbound/dialer/, cmd/run.go

Code Review Summary

Overall assessment: APPROVE (no blocking issue)

Notes:

Previous dae_dns_concurrency_in_use inversion issue has been fixed in metrics branch (fix(metrics): correct dns concurrency in-use gauge semantics).
Collector descriptor coverage has been strengthened by test (test(metrics): verify all collector descriptors are exposed).

Findings

P0 — Critical

(none)

P1 — High

(none)

P2 — Medium

(none)

P3 — Low

(none)

Endpoint Verification

All metrics from both phases are present at http://192.168.1.174:5556/metrics.

Phase 1 Gauges — All present ✅

| Metric | Type | Status |
| --- | --- | --- |
| `dae_dialer_alive` | gauge | ✅ Values: 0/1 per dialer×network |
| `dae_dialer_latency_last_seconds` | gauge | ✅ Real latency values (e.g., 0.0497s) |
| `dae_dialer_latency_avg10_seconds` | gauge | ✅ Real latency averages |
| `dae_dialer_latency_moving_avg_seconds` | gauge | ✅ EWMA values |
| `dae_group_alive_dialers_total` | gauge | ✅ Per group×network |
| `dae_dns_cache_entries` | gauge | ✅ (`0`; cache expired at scrape time) |
| `dae_dns_concurrency_in_use` | gauge | ✅ Current semantics are correct; on `dae/main` fallback path this is typically `0` |
| `dae_dns_concurrency_limit` | gauge | ✅ On `dae/main` fallback path this is `0` (legacy DNS controller has no explicit limiter) |
| `dae_dns_forwarder_cache_entries` | gauge | ✅ (`0`; no long-lived forwarders active) |
| `dae_dns_forwarder_in_flight{upstream}` | gauge | ✅ Implemented; emits only when there are in-flight upstream requests |
| `dae_tcp_connections_active` | gauge | ✅ |
| `dae_udp_endpoints_active` | gauge | ✅ |
| `dae_udp_task_queues_active` | gauge | ✅ |

Phase 2 Counters / Histograms — All present ✅

| Metric | Type | Status |
| --- | --- | --- |
| `dae_dns_query_total` | counter | ✅ `4` |
| `dae_dns_cache_hit_total` | counter | ✅ `1` |
| `dae_dns_cache_lazy_hit_total` | counter | ✅ `0` |
| `dae_dns_cache_miss_total` | counter | ✅ `3` |
| `dae_dns_upstream_query_total{upstream}` | counter | ✅ `tcp://192.168.1.8:5553` |
| `dae_dns_upstream_err_total{upstream}` | counter | ✅ |
| `dae_dns_rejected_total` | counter | ✅ `0` |
| `dae_dns_refused_total` | counter | ✅ `0` |
| `dae_dns_response_latency_seconds` | histogram | ✅ 12 buckets + sum + count |
| `dae_dns_upstream_latency_seconds{upstream}` | histogram | ✅ per upstream |
| `dae_health_check_total{group,dialer,network}` | counter | ✅ |
| `dae_health_check_failure_total{group,dialer,network}` | counter | ✅ |
| `dae_tcp_connections_total{protocol,group}` | counter | ✅ `tcp4/HK = 1` |
| `dae_udp_connections_total{protocol,group}` | counter | ✅ |

Process / Go Runtime ✅

All standard process_* and go_* metrics are present.

Architecture Assessment

| Concern | Status | Notes |
| --- | --- | --- |
| Dependency direction | ✅ | `pkg/metrics/` depends on `control/`; no reverse dependency |
| prometheus import isolation | ✅ | Metrics dependency remains in metrics package; domain structs expose snapshots/getters |
| Hot-path safety | ✅ | Counters use `atomic.Uint64`; collectors scrape snapshots |
| Thread safety: gauge reads | ✅ | Guarded by mutex/sync.Map/atomic in corresponding components |
| Thread safety: histogram | ✅ | Atomic bucket/counter/sum update and snapshot |
| Nil safety in collectors | ✅ | Collectors guard `state == nil`, `cp == nil`, `dc == nil` |
| Reload handling | ✅ | `metrics.State` swaps `ControlPlane` atomically |
| Deterministic output | ✅ | Key sorting is used for labeled DNS upstream metrics |
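The "Reload handling" and "Nil safety in collectors" rows can be illustrated with a minimal Go sketch. It assumes an `atomic.Pointer`-based holder; the `ControlPlane` stand-in, its `ActiveTCPConnections` accessor, and `collectActiveTCP` are hypothetical names for illustration, not the PR's actual types or API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ControlPlane is a hypothetical stand-in for dae's control plane; only one
// accessor that a collector might read is modeled here.
type ControlPlane struct {
	activeTCP atomic.Int64
}

// ActiveTCPConnections returns the current value backing a gauge.
func (cp *ControlPlane) ActiveTCPConnections() int64 { return cp.activeTCP.Load() }

// State keeps the current control plane behind an atomic pointer so a reload
// can replace it without taking a lock that scrapes would contend on.
type State struct {
	cp atomic.Pointer[ControlPlane]
}

// SetControlPlane is called on initial load and on every reload.
func (s *State) SetControlPlane(cp *ControlPlane) { s.cp.Store(cp) }

// collectActiveTCP mimics a collector: it guards against a nil state and a
// nil control plane and reports ok=false instead of panicking.
func collectActiveTCP(s *State) (value int64, ok bool) {
	if s == nil {
		return 0, false
	}
	cp := s.cp.Load()
	if cp == nil {
		return 0, false
	}
	return cp.ActiveTCPConnections(), true
}

func main() {
	s := &State{}
	fmt.Println(collectActiveTCP(s)) // no control plane registered yet

	first := &ControlPlane{}
	first.activeTCP.Store(3)
	s.SetControlPlane(first)
	fmt.Println(collectActiveTCP(s))

	s.SetControlPlane(&ControlPlane{}) // a reload swaps in a fresh instance
	fmt.Println(collectActiveTCP(s))
}
```

A scrape that runs concurrently with a reload sees either the old or the new control plane, never a partially initialized one, which is the property the table attributes to `metrics.State`.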

Required Action Before Upstream PR

No blocking change required based on current audited state.

Optional (Non-blocking)

Keep dashboard and changelog text aligned with dae/main fallback semantics for DNS concurrency gauges (0/0 until PR936-style limiter model is present).

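Metrics observation overhead assessment

dae metrics observation overhead assessment

Benchmark environment

Machine: 192.168.1.15
Prometheus scrape_interval: 5s
dae DNS QPS: a few dozen in normal use, peaking at 200 QPS under load testing
metrics endpoint: http://192.168.1.15:5556/metrics (response body ~224KB / 2338 lines)

1. Measured baseline (collected from `/metrics`)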

process_cpu_seconds_total       386.55
process_resident_memory_bytes   139,923,456  (~133 MB RSS)
go_memstats_heap_alloc_bytes    62,481,112   (~60 MB)
go_memstats_heap_inuse_bytes    73,883,648   (~70 MB)
go_memstats_heap_objects        658,288

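2. Hot-path overhead (per DNS query)

Code path (`control/dns_metrics.go`)

Each call to HandleWithResponseWriter_ triggers:

| Operation | Implementation | Estimated cost |
| --- | --- | --- |
| `dnsQueryTotal.Add(1)` | `atomic.Uint64` | ~5ns |
| Branch counter `.Add(1)` | `atomic.Uint64` | ~5ns |
| `Observe(seconds)` | Linear scan over 12 buckets | ~5ns |
| `atomic.Add` × 2 | count + bucket | ~10ns |
| `addAtomicFloat64` | CAS loop (uncontended path) | ~10ns |
| Total per query | | ~35–50 ns |

On an upstream miss, one additional Observe() is triggered at the same cost.

CPU consumption at 200 QPS:

200 × 50ns = 10µs/s → 0.001% of one core

Key design: atomic.Uint64 plus a custom histogram are used throughout; there is no mutex anywhere on the hot path.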
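The lock-free histogram described above can be sketched in Go roughly as follows. This is an illustration under assumptions, not the code in `control/dns_metrics.go`: the bucket boundaries, the `latencyHistogram` type, and the `Sum` helper are hypothetical; only the shape (12 finite buckets plus overflow, `atomic.Uint64` counters, a CAS loop for the float64 sum) follows the description in this comment.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// bounds are illustrative upper bounds in seconds (12 finite buckets); the
// PR's real bucket boundaries are not stated in this thread.
var bounds = [12]float64{0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}

// latencyHistogram is a hypothetical lock-free histogram: 12 finite buckets
// plus one overflow bucket, a sample count, and a float64 sum stored as bits.
type latencyHistogram struct {
	buckets [13]atomic.Uint64
	count   atomic.Uint64
	sumBits atomic.Uint64
}

// addAtomicFloat64 adds delta to the float64 encoded in bits using a CAS
// loop; without contention the loop body runs exactly once.
func addAtomicFloat64(bits *atomic.Uint64, delta float64) {
	for {
		old := bits.Load()
		next := math.Float64bits(math.Float64frombits(old) + delta)
		if bits.CompareAndSwap(old, next) {
			return
		}
	}
}

// Observe records one sample: a linear scan picks the bucket, then the
// bucket, the count, and the sum are each updated atomically.
func (h *latencyHistogram) Observe(seconds float64) {
	idx := len(bounds) // overflow bucket unless a finite bound matches
	for i, ub := range bounds {
		if seconds <= ub {
			idx = i
			break
		}
	}
	h.buckets[idx].Add(1)
	h.count.Add(1)
	addAtomicFloat64(&h.sumBits, seconds)
}

// Sum returns the accumulated sum of observed values in seconds.
func (h *latencyHistogram) Sum() float64 { return math.Float64frombits(h.sumBits.Load()) }

func main() {
	var h latencyHistogram
	for _, s := range []float64{0.0004, 0.03, 0.03, 9} {
		h.Observe(s)
	}
	fmt.Println(h.count.Load(), h.buckets[0].Load(), h.buckets[5].Load(), h.buckets[12].Load(), h.Sum())
}
```

The three updates in `Observe` are individually atomic but not atomic as a group, so a scrape that lands between them can see a count that is one ahead of the sum. For monitoring purposes that skew is usually acceptable and is the price of keeping the hot path free of locks.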

3. Scrape-path overhead (Prometheus pull)

What Collect() does

DnsCountersSnapshot()      → 8 × atomic.Load + sync.Map.Range (currently 1 upstream)
DnsResponseLatencySnapshot → 14 × atomic.Load + map allocation (12 buckets)
DnsUpstreamSnapshot()      → 14 × atomic.Load + map allocation (per upstream)
ConcurrencyInfo()          → len(channel), O(1)
CacheSize()                → sync.Map traversal
ForwarderCacheInfo()       → sync.Map traversal
HTTP serialization         → ~224KB of text output
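The snapshot steps listed above (atomic loads plus a `sync.Map.Range`, with sorted keys for deterministic output) might look like the following Go sketch. The `upstreamCounters` type, its methods, and the sample upstream addresses are hypothetical; the PR's real snapshot functions are only named in this thread, not shown.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// upstreamCounters is a hypothetical per-upstream counter set: a sync.Map
// from upstream address to *atomic.Uint64, incremented on the hot path.
type upstreamCounters struct {
	m sync.Map
}

// Inc increments the counter for upstream, allocating it only on first use.
func (c *upstreamCounters) Inc(upstream string) {
	if v, ok := c.m.Load(upstream); ok {
		v.(*atomic.Uint64).Add(1)
		return
	}
	v, _ := c.m.LoadOrStore(upstream, new(atomic.Uint64))
	v.(*atomic.Uint64).Add(1)
}

// Snapshot copies the current values into a plain map, so the scrape path
// works on its own copy and never blocks writers.
func (c *upstreamCounters) Snapshot() map[string]uint64 {
	out := make(map[string]uint64)
	c.m.Range(func(k, v any) bool {
		out[k.(string)] = v.(*atomic.Uint64).Load()
		return true
	})
	return out
}

// sortedKeys returns the map keys in lexical order so that labeled series
// are emitted in the same order on every scrape.
func sortedKeys(m map[string]uint64) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	var c upstreamCounters
	c.Inc("udp://192.0.2.53:53") // made-up documentation addresses
	c.Inc("tcp://192.0.2.8:5553")
	c.Inc("tcp://192.0.2.8:5553")

	snap := c.Snapshot()
	for _, k := range sortedKeys(snap) {
		fmt.Printf("%s %d\n", k, snap[k])
	}
}
```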

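CPU impact at a 5s scrape interval

| Parameter | Value |
| --- | --- |
| Estimated time per scrape | ~1–2ms |
| Scrapes per minute | 12 |
| CPU time used per minute | 12–24ms |
| Share of one core | ~0.02–0.04% |

Comparison with a 15s interval:

| scrape_interval | Scrapes per minute | CPU share |
| --- | --- | --- |
| 15s | 4 | ~0.01% |
| 5s | 12 | ~0.03% |
| 1s | 60 | ~0.15% |

A 5s interval triples the scrape frequency relative to 15s, but the absolute cost is still negligible.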

Memory allocation per scrape

DnsHistogramSnapshot.Buckets  map allocation: 12 entries × 2 = ~400B per scrape
DnsUpstreamSnapshot           map allocation: ~200B per upstream
Total: < 1KB per scrape → reclaimed within one GC cycle

4. Resident memory footprint (static structures)

dnsLatencyHistogram (per histogram):
  buckets: 13 × atomic.Uint64 = 104B
  count + sumBits              =  16B
  ─────────────────────────────── ~120B

Current instance:
  response latency histogram   = 120B
  upstream latency histogram × 1 = 120B
  atomic counters × 6          =  48B
  metrics HTTP server goroutine = ~8KB (goroutine stack)

Total static memory of the metrics module: < 10KB

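5. Conclusion

| Overhead source | Impact at 5s scrape_interval | Rating |
| --- | --- | --- |
| Hot-path atomic counters (per query) | < 0.001% CPU @ 200 QPS | Negligible |
| Histogram Observe CAS | Very low (uncontended CAS) | Negligible |
| Scrape serialization (5s period) | ~0.03% CPU | Negligible |
| Resident memory (metrics state) | < 10KB | Negligible |
| Heap allocation (per scrape) | < 1KB, reclaimed by GC | Negligible |
| metrics HTTP goroutine | 1 goroutine, ~8KB stack | Negligible |

Overall conclusion: scrape_interval: 5s triples the scrape CPU overhead compared with 15s, but the absolute value stays on the order of 0.03% and is not a performance bottleneck in any realistic scenario. The lock-free design of the current implementation keeps the hot path and the scrape path from blocking each other, so there is no observable impact even during the 200 QPS load test.

If you want a more aggressive scrape_interval (e.g. 1s), first confirm the duration of a single scrape via method 2, then assess whether the impact becomes noticeable.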

MaurUppi · 2026-02-28T14:38:25Z

@cubercsl

Please take a look.

MaurUppi · 2026-04-22T13:41:17Z

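It looks like this PR needs a rebase to resolve the conflicts again, and part of the runtime/latency data from PR#968 also needs to be wired into Prometheus.
I will sort this out in the next couple of days and push again.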

Add a dedicated runtime Prometheus collector for upload/download totals and rates, register it in the metrics registry, and cover the new series with collector tests.

The rebase onto dae/main also surfaced a stale AliveDialerSet field reference that would break downstream verification, so this commit folds in the minimal compile fix needed to keep the branch testable in CI.

Constraint: Local verification is limited to macOS code-level checks; Linux/eBPF build and test paths must run in CI
Constraint: Runtime metrics must consume the exported control.SnapshotRuntimeStats API rather than runtimeStats internals
Rejected: Modify control/runtime_stats.go directly | unnecessary coupling to upstream implementation details
Rejected: Skip the AliveDialerSet fix | leaves the rebased branch uncompilable in downstream validation
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep the metrics layer reading runtime stats through public control APIs so future upstream rebases stay small
Tested: git diff --check; gofmt on changed Go sources; attempted pkg/metrics TDD runs locally to validate red state and dependency setup
Not-tested: Full go test ./pkg/metrics/... on macOS (blocked by Linux-specific code paths); CI compile/test on origin/main

Systematic debugging showed three distinct failure classes on PR #28. The code failures (Go Test, Lint, Kernel Test) shared one root cause: DialerGroup accessors were still reading removed pre-selectionState fields, so outbound package compilation broke on Linux CI. This commit switches those accessors to currentSelectionState and relaxes AliveCount to a read lock.

The same investigation showed the document check was tripping over repo-wide markdownlint debt carried into the PR relative to origin/main, so the commit applies the minimal lint-only markdown fixes reported by CI. PR Build (Preview) was failing for a separate infrastructure reason: the fork does not provide GH_APP_ID/GH_APP_PRIVATE_KEY, so the preview workflow now skips cleanly when those secrets are absent instead of reporting a false red check.

Constraint: Local verification is limited to code-level checks on macOS; Linux/eBPF execution must stay in CI
Constraint: Preview workflow must not require unavailable GitHub App secrets on forks
Rejected: Guess at the metrics code first | the logs showed the primary breakage was in outbound accessor code, not runtime collector logic
Rejected: Fix only dialer_group.go | would leave document lint and preview workflow red on the same PR
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep PR-only workflow glue tolerant of missing fork secrets, and keep accessor helpers reading selectionState rather than stale duplicated fields
Tested: gh Actions log inspection for PR #28; git diff --check; npm run markdown-lint; npm run check-broken-link; Linux-targeted compile of ./component/outbound with dae_stub_ebpf
Not-tested: Full repository CI rerun after push; local execution of preview workflow action stack

Comparing the failing PR 28 runs with the older green runs showed the remaining Go Test, Lint, and Kernel Test failures all converged on DnsController metrics helper methods. During the rebase onto dae/main, the facade moved cache and forwarder state into the shared dnsControllerStore (sync.Map-backed), but CacheSize and ForwarderCacheInfo were still reading removed mutex-protected map fields.

This commit switches those helpers to iterate the current shared store, which matches the post-reload architecture and unblocks pkg/metrics/control compilation again.

Constraint: CI failures were compared against known-good runs 24814452464/22514228055 and 24815333258/22514228007 before changing code
Constraint: Fix must preserve the dae/main shared-store DNS architecture rather than reintroduce legacy mutex+map fields
Rejected: Re-add dnsCacheMu/dnsForwarderCacheMu to DnsController | would regress the new shared-store design and mask the real mismatch
Rejected: Change collector code to stop calling these helpers | symptom fix, not root cause
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future DnsController metrics/helper methods must read through the shared store facade, not legacy per-controller cache fields
Tested: GitHub Actions log comparison for failed vs successful runs; git diff --check
Not-tested: Full rerun of PR 28 after push (pending GitHub Actions)

Comparing the latest failing PR 28 runs with the earlier green baselines showed two more regressions after the first CI-fix pass. Kernel Test and Go Test both failed on cmd/run.go because shutdownAfterSignalWithHandoff referenced the Run-local endpointServer variable from package scope. The same Go Test run then exposed a second issue in control/dns_metrics_test.go: the tests still constructed zero-value DnsController instances, but the rebased code now requires a shared dnsControllerStore-backed test helper.

This commit removes the invalid out-of-scope endpointServer shutdown branch and migrates the DNS metrics tests to the store-aware helper used elsewhere in the control package.

Constraint: Root cause was determined by comparing failed runs 24822215119/24822215133 with the older successful run 22514228007 before changing code
Constraint: Test updates must follow the new shared-store DnsController contract rather than bypass requireStore panics
Rejected: Reintroduce a package-level endpointServer just to satisfy shutdownAfterSignalWithHandoff | wrong ownership boundary and unnecessary duplication
Rejected: Loosen requireStore for zero-value tests | would hide contract violations instead of updating tests to the supported helper
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep endpoint server lifecycle owned inside Run, and keep control tests constructing DnsController through shared-store helpers
Tested: GitHub Actions failed-vs-successful run comparison; git diff --check; local targeted go test attempt for dns_metrics tests (blocked by macOS/Linux dependency divergence, no new logic errors)
Not-tested: Full PR 28 CI rerun after push

The next failed-vs-successful CI comparison showed the remaining red signal had moved from structural compile regressions to pure static checks. The latest lint run failed on unchecked Close calls in endpoint TLS validation, deprecated Prometheus collector constructors, the deliberate use of the exported runtime snapshot API, and a dead refreshPprofServer helper left behind after the endpoint-server integration.

This commit makes the Close paths explicit, switches registry construction to prometheus/collectors, documents the intentional staticcheck exception for the public runtime snapshot API, and removes the now-unused reload manager helper.

Constraint: Changes must preserve the plan’s decision to source runtime traffic from control.SnapshotRuntimeStats instead of touching runtime internals
Constraint: Fixes should address lint/static analysis directly without changing runtime behavior
Rejected: Replace SnapshotRuntimeStats with control-plane internals | violates the integration plan and increases merge friction
Rejected: Suppress all lint at workflow level | hides real regressions instead of fixing the small concrete issues
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep lint fixes local to the actual warning site; do not broaden suppressions when a precise code change will do
Tested: GitHub Actions log comparison for the latest failing lint run; git diff --check
Not-tested: Fresh CI rerun after push (pending)

MaurUppi · 2026-04-24T09:03:20Z

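This is too hard...
PR#968 actually only added a fairly simple runtime/latency interface,
but #970 changed so much that I am afraid this PR needs a rewrite, not just a rebase.
And the fix in #980 changed quite a bit more.

Never mind, I will close this PR and redo it when I have time.

For now I am running my own build, without PR#970/980 merged in.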

MaurUppi requested a review from a team as a code owner February 28, 2026 14:25

dae-prow Bot assigned MaurUppi Feb 28, 2026

MaurUppi mentioned this pull request Feb 28, 2026

feat(metrics): add endpoint server and metrics #941

Closed

dae-prow Bot added feature not-yet-tested labels Feb 28, 2026

MaurUppi mentioned this pull request Apr 22, 2026

feat(control): add runtime traffic metrics and node latency probing #968

Merged

5 tasks

ci/docs/optimize/feature: Enhance control plane features and improve …

85a1fc3

…CI workflows by olicesx (daeuniverse#970) Co-authored-by: kix <olices@9up.in>

MaurUppi force-pushed the feat/metrics-endpoint-clean branch from bae7ccb to 816aab5 Compare April 22, 2026 12:35

MaurUppi requested a review from a team as a code owner April 22, 2026 13:05

dae-prow Bot and others added 12 commits April 22, 2026 20:40

ci(release): draft release v1.1.0 (daeuniverse#976)

654c7eb

Co-authored-by: dae-prow-robot <dae@v2raya.org> Co-authored-by: Sumire (菫) <151038614+sumire88@users.noreply.github.com>

ci(release): draft release v2.0.0rc1 (daeuniverse#978)

e239c3b

Co-authored-by: dae-prow-robot <dae@v2raya.org> Co-authored-by: Sumire (菫) <151038614+sumire88@users.noreply.github.com>

feat(metrics): add endpoint server and phase1 gauge collectors

75b0617

docs(config): quote endpoint listen address example

319d1dd

feat(metrics): implement phase2 counters and histograms

042dd9e

fix(metrics): correct dns concurrency in-use gauge semantics

d4fe752

docs(metrics): add Grafana dashboard template JSON

7787170

test(metrics): verify all collector descriptors are exposed

29690e8

feat(metrics): validate endpoint TLS file permissions

f1a771b

fix(metrics): validate endpoint TLS files before control plane init

81a1c33

chore(metrics): add prometheus dependencies for endpoint metrics

84e0e5b

MaurUppi force-pushed the feat/metrics-endpoint-clean branch from 7cdbcea to 51e374f Compare April 23, 2026 03:05

MaurUppi requested review from a team as code owners April 23, 2026 03:05

Codex added 3 commits April 23, 2026 11:36

MaurUppi closed this Apr 24, 2026
