feat(metrics): add endpoint server and metrics #948
Code Audit Report
dae Metrics Endpoint - Audit Report
Date: 2026-02-23

Code Review Summary
Overall assessment: APPROVE (no blocking issue)
Notes:

Findings
- P0 (Critical): (none)
- P1 (High): (none)
- P2 (Medium): (none)
- P3 (Low): (none)

Endpoint Verification
All metrics from both phases are present at
- Phase 1 Gauges: all present ✅
- Phase 2 Counters / Histograms: all present ✅
- Process / Go Runtime ✅ All standard

Architecture Assessment

Required Action Before Upstream PR
No blocking change required based on current audited state.

Optional (Non-blocking)
Metrics observation overhead assessment
dae metrics observation overhead assessment
Baseline environment
1. Measured baseline (collected from
| Operation | Implementation | Estimated cost |
|---|---|---|
| dnsQueryTotal.Add(1) | atomic.Uint64 | ~5 ns |
| Branch counter .Add(1) | atomic.Uint64 | ~5 ns |
| Observe(seconds) | Linear scan of 12 buckets | ~5 ns |
| atomic.Add × 2 | count + bucket | ~10 ns |
| addAtomicFloat64 | CAS loop (uncontended path) | ~10 ns |
| Total per query | | ~35–50 ns |
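An upstream miss triggers one additional Observe() call, at the same cost.
CPU consumption at 200 QPS:
200 × 50 ns = 10 µs/s → 0.001% of a single core
Key design: atomic.Uint64 plus a custom histogram are used throughout, so the hot path takes no mutex at all.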
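The lock-free hot path described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the type `latencyHistogram`, its fields, and the bucket bounds are assumptions chosen for the example; only the names `Observe` and `addAtomicFloat64` come from the table above, and their bodies here are guesses at the technique (atomic bucket counters plus a CAS loop over the float64 bit pattern).

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// latencyHistogram is a hypothetical lock-free histogram: one atomic counter
// per finite bucket plus a trailing overflow bucket.
type latencyHistogram struct {
	bounds  []float64       // upper bounds in seconds, ascending
	buckets []atomic.Uint64 // len(bounds)+1; the last slot is the overflow bucket
	count   atomic.Uint64   // total number of observations
	sumBits atomic.Uint64   // float64 sum stored as its IEEE-754 bit pattern
}

func newLatencyHistogram(bounds []float64) *latencyHistogram {
	return &latencyHistogram{
		bounds:  bounds,
		buckets: make([]atomic.Uint64, len(bounds)+1),
	}
}

// addAtomicFloat64 adds delta to the float64 encoded in bits using a CAS loop.
// Without contention the loop body runs exactly once.
func addAtomicFloat64(bits *atomic.Uint64, delta float64) {
	for {
		old := bits.Load()
		next := math.Float64bits(math.Float64frombits(old) + delta)
		if bits.CompareAndSwap(old, next) {
			return
		}
	}
}

// Observe records one sample: a linear scan picks the bucket, then three
// atomic updates (bucket, count, sum) are applied without taking a mutex.
func (h *latencyHistogram) Observe(seconds float64) {
	idx := len(h.bounds) // default to the overflow bucket
	for i, ub := range h.bounds {
		if seconds <= ub {
			idx = i
			break
		}
	}
	h.buckets[idx].Add(1)
	h.count.Add(1)
	addAtomicFloat64(&h.sumBits, seconds)
}

// Sum decodes the accumulated sum of all observed values.
func (h *latencyHistogram) Sum() float64 {
	return math.Float64frombits(h.sumBits.Load())
}

func main() {
	h := newLatencyHistogram([]float64{0.25, 0.5, 1, 2})
	h.Observe(0.5)
	h.Observe(2.5)
	fmt.Println("count:", h.count.Load(), "sum:", h.Sum())
}
```

Because the three updates are independent atomics, a concurrent reader may briefly see a bucket incremented before the count; that trade-off is what keeps the write side free of locks.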
3. Scrape path overhead (Prometheus pull)
What Collect() does
DnsCountersSnapshot() → 8 × atomic.Load + sync.Map.Range (currently 1 upstream)
DnsResponseLatencySnapshot → 14 × atomic.Load + map allocation (12 buckets)
DnsUpstreamSnapshot() → 14 × atomic.Load + map allocation (per upstream)
ConcurrencyInfo() → len(channel), O(1)
CacheSize() → sync.Map traversal
ForwarderCacheInfo() → sync.Map traversal
HTTP serialization → ~224KB of text output
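The snapshot steps listed above combine plain atomic loads with a sync.Map traversal. A minimal sketch of that pattern follows; the types `dnsCounters` and `countersSnapshot`, their fields, and the helper `syncMapLen` are hypothetical names for illustration and are not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// dnsCounters is a hypothetical counter set: fixed atomics plus a sync.Map
// holding one counter per upstream.
type dnsCounters struct {
	queryTotal  atomic.Uint64
	cacheHit    atomic.Uint64
	perUpstream sync.Map // upstream name (string) -> *atomic.Uint64
}

// countersSnapshot is a plain-value copy handed to the scrape path.
type countersSnapshot struct {
	QueryTotal  uint64
	CacheHit    uint64
	PerUpstream map[string]uint64
}

// addUpstream increments the counter for one upstream, creating it on first use.
// A real hot path would likely try Load first to avoid the allocation.
func (c *dnsCounters) addUpstream(name string) {
	v, _ := c.perUpstream.LoadOrStore(name, new(atomic.Uint64))
	v.(*atomic.Uint64).Add(1)
}

// Snapshot reads every counter with atomic.Load and walks the sync.Map once.
// It never blocks writers; the only allocation is the result map.
func (c *dnsCounters) Snapshot() countersSnapshot {
	s := countersSnapshot{
		QueryTotal:  c.queryTotal.Load(),
		CacheHit:    c.cacheHit.Load(),
		PerUpstream: make(map[string]uint64),
	}
	c.perUpstream.Range(func(k, v any) bool {
		s.PerUpstream[k.(string)] = v.(*atomic.Uint64).Load()
		return true
	})
	return s
}

// syncMapLen counts entries by traversal, since sync.Map has no Len method.
// This is the O(n) cost that a cache-size gauge pays on each scrape.
func syncMapLen(m *sync.Map) int {
	n := 0
	m.Range(func(_, _ any) bool {
		n++
		return true
	})
	return n
}

func main() {
	var c dnsCounters
	c.queryTotal.Add(3)
	c.addUpstream("upstream-a")
	fmt.Println(c.Snapshot(), syncMapLen(&c.perUpstream))
}
```

The traversal cost grows with the number of entries, so for a large cache the sync.Map walks are likely to dominate the snapshot time rather than the fixed atomic loads.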
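CPU impact at a 5s scrape interval
| Parameter | Value |
|---|---|
| Estimated duration of one scrape | ~1–2 ms |
| Scrapes per minute | 12 |
| CPU time used per minute | 12–24 ms |
| Share of a single core | ~0.02–0.04% |
Comparison with a 15s interval:
| scrape_interval | Scrapes per minute | CPU share |
|---|---|---|
| 15s | 4 | ~0.01% |
| 5s | 12 | ~0.03% |
| 1s | 60 | ~0.15% |
A 5s interval triples the scrape frequency compared with 15s, but the absolute cost remains negligible.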
Memory allocation per scrape
DnsHistogramSnapshot.Buckets map allocation: 12 entries × 2 = ~400B per scrape
DnsUpstreamSnapshot map allocation: ~200B per upstream
Total: < 1KB per scrape → reclaimed within one GC cycle
4. Resident memory footprint (static structures)
dnsLatencyHistogram (per histogram):
buckets: 13 × atomic.Uint64 = 104B
count + sumBits = 16B
─────────────────────────────── ~120B
Current instance:
response latency histogram = 120B
upstream latency histogram × 1 = 120B
atomic counters × 6 = 48B
metrics HTTP server goroutine = ~8KB (goroutine stack)
Total static memory of the metrics module: < 10KB
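5. Conclusion
| Overhead source | Impact at scrape_interval 5s | Rating |
|---|---|---|
| Hot-path atomic counters (per query) | ≤ 0.001% CPU @ 200 QPS | Negligible |
| Histogram Observe CAS | Very low (uncontended CAS) | Negligible |
| Scrape serialization (5s period) | ~0.03% CPU | Negligible |
| Resident memory (metrics state) | < 10KB | Negligible |
| Heap allocation (per scrape) | < 1KB, reclaimed by GC | Negligible |
| metrics HTTP goroutine | 1 goroutine, ~8KB stack | Negligible |
Overall conclusion: scrape_interval: 5s triples the scrape CPU overhead compared with 15s, but the absolute value stays on the order of 0.03% and does not constitute a performance bottleneck in any practical scenario. The lock-free design of the current implementation ensures that the hot path and the scrape path do not block each other, so no observable impact arises even during a 200 QPS load test.
To set a more aggressive scrape_interval (such as 1s), first confirm the duration of a single scrape using Method 2, then evaluate whether it produces a perceptible impact.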
Please check
…CI workflows by olicesx (daeuniverse#970) Co-authored-by: kix <olices@9up.in>
bae7ccb to 816aab5
It looks like this PR needs to be rebased to resolve the conflicts again, and it also needs to
Co-authored-by: dae-prow-robot <dae@v2raya.org> Co-authored-by: Sumire (菫) <151038614+sumire88@users.noreply.github.com>
Add a dedicated runtime Prometheus collector for upload/download totals and rates, register it in the metrics registry, and cover the new series with collector tests.

The rebase onto dae/main also surfaced a stale AliveDialerSet field reference that would break downstream verification, so this commit folds in the minimal compile fix needed to keep the branch testable in CI.

Constraint: Local verification is limited to macOS code-level checks; Linux/eBPF build and test paths must run in CI
Constraint: Runtime metrics must consume the exported control.SnapshotRuntimeStats API rather than runtimeStats internals
Rejected: Modify control/runtime_stats.go directly | unnecessary coupling to upstream implementation details
Rejected: Skip the AliveDialerSet fix | leaves the rebased branch uncompilable in downstream validation
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep the metrics layer reading runtime stats through public control APIs so future upstream rebases stay small
Tested: git diff --check; gofmt on changed Go sources; attempted pkg/metrics TDD runs locally to validate red state and dependency setup
Not-tested: Full go test ./pkg/metrics/... on macOS (blocked by Linux-specific code paths); CI compile/test on origin/main
7cdbcea to 51e374f
Systematic debugging showed three distinct failure classes on PR #28. The code failures (Go Test, Lint, Kernel Test) shared one root cause: DialerGroup accessors were still reading removed pre-selectionState fields, so outbound package compilation broke on Linux CI. This commit switches those accessors to currentSelectionState and relaxes AliveCount to a read lock.

The same investigation showed the document check was tripping over repo-wide markdownlint debt carried into the PR relative to origin/main, so the commit applies the minimal lint-only markdown fixes reported by CI. PR Build (Preview) was failing for a separate infrastructure reason: the fork does not provide GH_APP_ID/GH_APP_PRIVATE_KEY, so the preview workflow now skips cleanly when those secrets are absent instead of reporting a false red check.

Constraint: Local verification is limited to code-level checks on macOS; Linux/eBPF execution must stay in CI
Constraint: Preview workflow must not require unavailable GitHub App secrets on forks
Rejected: Guess at the metrics code first | the logs showed the primary breakage was in outbound accessor code, not runtime collector logic
Rejected: Fix only dialer_group.go | would leave document lint and preview workflow red on the same PR
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Keep PR-only workflow glue tolerant of missing fork secrets, and keep accessor helpers reading selectionState rather than stale duplicated fields
Tested: gh Actions log inspection for PR #28; git diff --check; npm run markdown-lint; npm run check-broken-link; Linux-targeted compile of ./component/outbound with dae_stub_ebpf
Not-tested: Full repository CI rerun after push; local execution of preview workflow action stack
Comparing the failing PR 28 runs with the older green runs showed the remaining Go Test, Lint, and Kernel Test failures all converged on DnsController metrics helper methods. During the rebase onto dae/main, the facade moved cache and forwarder state into the shared dnsControllerStore (sync.Map-backed), but CacheSize and ForwarderCacheInfo were still reading removed mutex-protected map fields.

This commit switches those helpers to iterate the current shared store, which matches the post-reload architecture and unblocks pkg/metrics/control compilation again.

Constraint: CI failures were compared against known-good runs 24814452464/22514228055 and 24815333258/22514228007 before changing code
Constraint: Fix must preserve the dae/main shared-store DNS architecture rather than reintroduce legacy mutex+map fields
Rejected: Re-add dnsCacheMu/dnsForwarderCacheMu to DnsController | would regress the new shared-store design and mask the real mismatch
Rejected: Change collector code to stop calling these helpers | symptom fix, not root cause
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future DnsController metrics/helper methods must read through the shared store facade, not legacy per-controller cache fields
Tested: GitHub Actions log comparison for failed vs successful runs; git diff --check
Not-tested: Full rerun of PR 28 after push (pending GitHub Actions)
Comparing the latest failing PR 28 runs with the earlier green baselines showed two more regressions after the first CI-fix pass. Kernel Test and Go Test both failed on cmd/run.go because shutdownAfterSignalWithHandoff referenced the Run-local endpointServer variable from package scope. The same Go Test run then exposed a second issue in control/dns_metrics_test.go: the tests still constructed zero-value DnsController instances, but the rebased code now requires a shared dnsControllerStore-backed test helper.

This commit removes the invalid out-of-scope endpointServer shutdown branch and migrates the DNS metrics tests to the store-aware helper used elsewhere in the control package.

Constraint: Root cause was determined by comparing failed runs 24822215119/24822215133 with the older successful run 22514228007 before changing code
Constraint: Test updates must follow the new shared-store DnsController contract rather than bypass requireStore panics
Rejected: Reintroduce a package-level endpointServer just to satisfy shutdownAfterSignalWithHandoff | wrong ownership boundary and unnecessary duplication
Rejected: Loosen requireStore for zero-value tests | would hide contract violations instead of updating tests to the supported helper
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep endpoint server lifecycle owned inside Run, and keep control tests constructing DnsController through shared-store helpers
Tested: GitHub Actions failed-vs-successful run comparison; git diff --check; local targeted go test attempt for dns_metrics tests (blocked by macOS/Linux dependency divergence, no new logic errors)
Not-tested: Full PR 28 CI rerun after push
The next failed-vs-successful CI comparison showed the remaining red signal had moved from structural compile regressions to pure static checks. The latest lint run failed on unchecked Close calls in endpoint TLS validation, deprecated Prometheus collector constructors, the deliberate use of the exported runtime snapshot API, and a dead refreshPprofServer helper left behind after the endpoint-server integration.

This commit makes the Close paths explicit, switches registry construction to prometheus/collectors, documents the intentional staticcheck exception for the public runtime snapshot API, and removes the now-unused reload manager helper.

Constraint: Changes must preserve the plan’s decision to source runtime traffic from control.SnapshotRuntimeStats instead of touching runtime internals
Constraint: Fixes should address lint/static analysis directly without changing runtime behavior
Rejected: Replace SnapshotRuntimeStats with control-plane internals | violates the integration plan and increases merge friction
Rejected: Suppress all lint at workflow level | hides real regressions instead of fixing the small concrete issues
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep lint fixes local to the actual warning site; do not broaden suppressions when a precise code change will do
Tested: GitHub Actions log comparison for the latest failing lint run; git diff --check
Not-tested: Fresh CI rerun after push (pending)


Background
This PR supersedes #941 with a clean metrics-only branch based on daeuniverse/dae:main.
The previous PR mixed in many unrelated commits. This one keeps only metrics endpoint work and required dependencies.
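This PR implements metrics Phase 1 (which stays out of the hot path):
- global.endpoint_* configuration options
- pkg/metrics/{state,registry,server,auth}
- example.dae sample configuration <-- user-facing docs PR

Phase 2 is implemented on top of Phase 1:
- dae Transparent Proxy-Grafana_dashboard.json

Checklist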
Full Changelogs
See #948 (comment)
Test Result
- go build ./...
- go test ./...
- /metrics, /debug/pprof/

Notes
- daeuniverse/dae:main at split time

Checklist
Final result: