feat(metrics): add endpoint server and metrics #941

Closed
MaurUppi wants to merge 107 commits into daeuniverse:main from MaurUppi:feat/metrics-endpoint-phase1

Conversation

@MaurUppi

@MaurUppi MaurUppi commented Feb 23, 2026

Background

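dae has never implemented metrics. Coming from clash/Surge, I have always had a dashboard to look at, so I am not used to digging through journal logs all the time. Finding information in the large volume of trace/debug logs is tedious, and the logs consume disk space quickly. The machine running dae is low-spec, so heavy querying is a struggle. Oh, I forgot to mention: I tried parsing the logs with vector, but debugging it was too much trouble, and each extra component is one more thing to maintain.
So, with nothing else to do over the holiday, I vibe-coded this PR, hoping to contribute a valuable feature to v1.1.0. How well it actually works is for the team to review.
Also included is a Grafana dashboard I put together on top of the metrics; it is quite fun to look at from time to time.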

This PR implements metrics Phase 1 (it does not touch the hot path):

  • Add the global.endpoint_* configuration options
  • Wire in the endpoint server (metrics + pprof)
  • Add the base metrics modules: pkg/metrics/{state,registry,server,auth}
  • Add the Phase 1 gauge collectors (dialer/dns/connection)
  • Update the example.dae sample configuration <-- user-facing docs PR
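
For readers unfamiliar with the shape of such an endpoint, here is a minimal sketch of one HTTP handler tree serving both `/metrics` and `/debug/pprof/` behind basic auth, using only the Go standard library. It is not the PR's `pkg/metrics` code: `newEndpointMux`, `basicAuth`, `statusFor`, the `dae_example_up` metric name and the credentials are all hypothetical, and the real implementation may well use a Prometheus client library rather than hand-written exposition text.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// basicAuth (hypothetical helper) rejects requests whose credentials do not
// match. ConstantTimeCompare avoids leaking how many leading bytes matched.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		userOK := subtle.ConstantTimeCompare([]byte(u), []byte(user)) == 1
		passOK := subtle.ConstantTimeCompare([]byte(p), []byte(pass)) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="endpoint"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newEndpointMux (hypothetical) mounts a metrics handler and the standard
// pprof handlers on one mux, then wraps the whole mux in basic auth.
func newEndpointMux(user, pass string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Hand-written Prometheus text exposition; metric name is made up.
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, "# TYPE dae_example_up gauge\ndae_example_up 1\n")
	})
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return basicAuth(user, pass, mux)
}

// statusFor issues an in-memory GET and returns the HTTP status code.
// An empty user means "send no credentials".
func statusFor(h http.Handler, path, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newEndpointMux("admin", "secret")
	fmt.Println("no credentials:", statusFor(h, "/metrics", "", ""))
	fmt.Println("with credentials:", statusFor(h, "/metrics", "admin", "secret"))
	fmt.Println("pprof index:", statusFor(h, "/debug/pprof/", "admin", "secret"))
}
```

Serving metrics and pprof from a separate mux, rather than `http.DefaultServeMux`, keeps the debug surface off any other listener the process may open.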

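Phase 2 is implemented on top of Phase 1:

  • Add DNS counters and latency histogram snapshots
  • Add TCP/UDP total connection counters
  • Add dialer health-check counters
  • Extend the metrics collectors to export the Phase 2 metrics
  • Add the corresponding tests
  • Add dae Transparent Proxy-Grafana_dashboard.json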
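
As an illustration of what a "latency histogram snapshot" can look like, the sketch below records observations into atomic bucket counters and produces a cumulative, Prometheus-style view on demand. The type `latencyHistogram`, its methods and the bucket bounds are assumptions made for this example; the thread does not show the PR's actual types or buckets.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// bucketBounds are hypothetical upper bounds (inclusive) for each bucket.
var bucketBounds = [...]time.Duration{
	5 * time.Millisecond,
	20 * time.Millisecond,
	100 * time.Millisecond,
	500 * time.Millisecond,
}

// latencyHistogram keeps one atomic counter per bucket plus an overflow
// (+Inf) bucket, so Observe never takes a lock.
type latencyHistogram struct {
	buckets  [len(bucketBounds) + 1]atomic.Uint64
	count    atomic.Uint64
	sumNanos atomic.Int64
}

// Observe adds one latency sample to the first bucket whose bound is >= d.
func (h *latencyHistogram) Observe(d time.Duration) {
	i := 0
	for i < len(bucketBounds) && d > bucketBounds[i] {
		i++
	}
	h.buckets[i].Add(1)
	h.count.Add(1)
	h.sumNanos.Add(int64(d))
}

// snapshot is a point-in-time copy with cumulative bucket counts.
type snapshot struct {
	Cumulative []uint64
	Count      uint64
	Sum        time.Duration
}

// Snapshot reads each counter once. The fields are loaded independently, so
// under concurrent writes Count may differ slightly from the last bucket.
func (h *latencyHistogram) Snapshot() snapshot {
	s := snapshot{Cumulative: make([]uint64, len(h.buckets))}
	var running uint64
	for i := range h.buckets {
		running += h.buckets[i].Load()
		s.Cumulative[i] = running
	}
	s.Count = h.count.Load()
	s.Sum = time.Duration(h.sumNanos.Load())
	return s
}

func main() {
	var h latencyHistogram
	for _, d := range []time.Duration{3 * time.Millisecond, 50 * time.Millisecond, 2 * time.Second} {
		h.Observe(d)
	}
	s := h.Snapshot()
	fmt.Println(s.Cumulative, s.Count, s.Sum)
}
```

Keeping the write side to a few atomic adds is one way to record data-path events cheaply while leaving the formatting work to the collector at scrape time.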

Checklist

Full Changelogs

  1. Phase 1 Gauges
  2. Phase 2 Counters / Histograms

See #941 (comment)


Test Result

  • Local verification:
    • go build ./...
    • go test ./...
    • Runtime check: /metrics, /debug/pprof/
  • Note:
    • eBPF/Kernel CI validation has passed.

Final result:

[Screenshots: dae_dashboard_0, dae_dashboard_1]

kix and others added 30 commits February 3, 2026 22:28
- Implemented a concurrency limit in DnsController to manage simultaneous DNS queries.
- Added a pipelined connection mechanism to optimize DNS request handling.
- Introduced tests for concurrency limits and race conditions in DNS processing.
- Enhanced error handling and logging in DNS listener and TCP relay functions.
- Refactored DNS handling methods to support singleflight for duplicate requests.
- Added benchmarks for pipelined connections and singleflight performance.
- Improved resource management with context cancellation in TCP relay operations.
… packet detection

- Implemented IsLikelyQuicInitialPacket to perform a fast header check on incoming UDP packets to filter out non-QUIC datagrams.
- Updated Sniffer to utilize this function for early rejection of irrelevant packets.
- Enhanced tests for IsLikelyQuicInitialPacket to ensure correct identification of QUIC initial packets.
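
The body of `IsLikelyQuicInitialPacket` is not shown in this thread. As a rough idea of what a fast header check can involve, the sketch below tests a datagram against the QUIC version 1 long-header layout from RFC 9000. The function name `looksLikeQUICv1Initial` is hypothetical, and restricting the check to version 1 is an assumption; the real function may accept other versions or apply different criteria.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// looksLikeQUICv1Initial reports whether b plausibly starts with a QUIC v1
// Initial packet. It inspects only the unprotected header fields and does not
// attempt decryption, so it can return true for non-QUIC data by coincidence.
func looksLikeQUICv1Initial(b []byte) bool {
	// Minimum: flags byte + 4-byte version + DCID length + SCID length.
	if len(b) < 7 {
		return false
	}
	// 0x80 = long header form, 0x40 = fixed bit, 0x30 = packet type.
	// An Initial packet in v1 has type bits 00, so the high nibble is 0xC.
	if b[0]&0xF0 != 0xC0 {
		return false
	}
	// Version field; 0x00000001 is QUIC version 1.
	if binary.BigEndian.Uint32(b[1:5]) != 1 {
		return false
	}
	// Connection IDs are at most 20 bytes in v1.
	dcidLen := int(b[5])
	if dcidLen > 20 {
		return false
	}
	// The DCID and the SCID length byte that follows it must fit.
	return len(b) >= 6+dcidLen+1
}

func main() {
	quicLike := []byte{0xC3, 0, 0, 0, 1, 8, 1, 2, 3, 4, 5, 6, 7, 8, 0}
	dnsQuery := []byte{0x12, 0x34, 0x01, 0x00, 0, 1, 0, 0, 0, 0, 0, 0}
	fmt.Println(looksLikeQUICv1Initial(quicLike))
	fmt.Println(looksLikeQUICv1Initial(dnsQuery))
}
```

A check like this costs a handful of byte comparisons, which is why it suits early rejection before any heavier parsing. It does not verify the 1200-byte minimum datagram size that RFC 9000 requires of client Initial packets; that could be added as a further filter.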

refactor(control): optimize DNS connection handling and routing cache

- Improved connection pooling logic to prevent blocking on slow dials.
- Replaced sync.Map with atomic operations for pending request slots in pipelined connections.
- Added caching mechanism for UDP routing results with TTL to reduce redundant lookups.
- Updated DNS controller to use sync.Map for forwarder cache, enhancing concurrency.
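
A routing-result cache with a TTL, as mentioned in the list above, can be sketched as follows. This is a generic illustration, not the commit's code: `ttlCache`, `Put`, `Get` and the 300 ms TTL are hypothetical. The caller passes the current timestamp in explicitly, which is one way to let a hot path read the clock once and reuse the value.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
	"time"
)

// cacheEntry stores a value together with its expiry deadline (UnixNano).
type cacheEntry[V any] struct {
	val       V
	expiresAt int64
}

// ttlCache is a small mutex-guarded map whose entries expire after ttl.
type ttlCache[K comparable, V any] struct {
	mu  sync.Mutex
	ttl int64
	m   map[K]cacheEntry[V]
}

func newTTLCache[K comparable, V any](ttl time.Duration) *ttlCache[K, V] {
	return &ttlCache[K, V]{ttl: int64(ttl), m: make(map[K]cacheEntry[V])}
}

// Put stores val under key, valid until now+ttl.
func (c *ttlCache[K, V]) Put(key K, val V, now int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = cacheEntry[V]{val: val, expiresAt: now + c.ttl}
}

// Get returns the cached value if present and not expired at time now.
// Expired entries are deleted lazily on lookup.
func (c *ttlCache[K, V]) Get(key K, now int64) (V, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	var zero V
	e, ok := c.m[key]
	if !ok {
		return zero, false
	}
	if now >= e.expiresAt {
		delete(c.m, key)
		return zero, false
	}
	return e.val, true
}

func main() {
	cache := newTTLCache[netip.AddrPort, string](300 * time.Millisecond)
	dst := netip.MustParseAddrPort("1.1.1.1:53")
	now := time.Now().UnixNano()
	cache.Put(dst, "direct", now)
	fmt.Println(cache.Get(dst, now))                    // within TTL
	fmt.Println(cache.Get(dst, now+int64(time.Second))) // after TTL
}
```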

test(control): add comprehensive tests for connection pool and routing cache

- Introduced tests for connection pool to ensure non-blocking behavior during slow dials.
- Added tests for response slot lifecycle to verify proper reuse and error handling.
- Implemented tests for UDP endpoint routing cache to validate hit and expiration behavior.
- guard DNS resolve against nil dialer to avoid panic paths in tests

- initialize direct dialers in netutils tests and skip when network is unavailable

- skip domain matcher geosite-dependent test when geosite.dat is absent

- gate eBPF kernel tests behind explicit dae_bpf_tests build tag

- remove fragile bitlist capacity assertions and validate tighten semantics

- enhance config marshaller for repeatable function filters and int/uint values

- make marshal test use secure temp files and assert round-trip idempotent output
- discard stale/mismatched UDP DNS responses and keep reading

- close connection only after stale/malformed response threshold

- add DoUDP regression tests for stale-discard and threshold-close
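
The stale-discard behaviour described in the three items above can be sketched like this: keep reading until a response with the expected DNS transaction ID arrives, and give up once too many stale or malformed datagrams have been seen. `readMatching`, `fakeResponse`, `errTooManyStale` and the threshold value are hypothetical names and numbers for this example, not the commit's actual code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// errTooManyStale signals that the caller should close the connection.
var errTooManyStale = errors.New("too many stale or malformed responses")

// readMatching pulls datagrams from next until one is a DNS response whose
// transaction ID equals wantID. Anything else is discarded; after maxStale
// discards it returns errTooManyStale.
func readMatching(next func() ([]byte, error), wantID uint16, maxStale int) ([]byte, error) {
	stale := 0
	for {
		pkt, err := next()
		if err != nil {
			return nil, err
		}
		// A DNS header is 12 bytes: ID in bytes 0-1, QR flag in bit 7 of byte 2.
		if len(pkt) >= 12 &&
			binary.BigEndian.Uint16(pkt[:2]) == wantID &&
			pkt[2]&0x80 != 0 {
			return pkt, nil
		}
		stale++
		if stale >= maxStale {
			return nil, errTooManyStale
		}
	}
}

// fakeResponse builds a bare 12-byte DNS response header with the given ID.
func fakeResponse(id uint16) []byte {
	pkt := make([]byte, 12)
	binary.BigEndian.PutUint16(pkt[:2], id)
	pkt[2] = 0x80 // QR = 1 (response)
	return pkt
}

func main() {
	queue := [][]byte{fakeResponse(7), {0xde, 0xad}, fakeResponse(42)}
	i := 0
	next := func() ([]byte, error) {
		if i >= len(queue) {
			return nil, errors.New("no more packets")
		}
		p := queue[i]
		i++
		return p, nil
	}
	pkt, err := readMatching(next, 42, 3)
	fmt.Println(len(pkt), err)
}
```

Discarding instead of failing on the first mismatch keeps a reused UDP socket usable when a late answer to an earlier query shows up, while the threshold bounds the work spent on a socket that only delivers junk.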
Revert DNS(53) goroutine fast-path introduced after run daeuniverse#697.

This aligns packet handling semantics with the last known-good run and avoids kernel-test WAN IPv6 UDP instability.
Drop pre-singleflight cache short-circuit introduced at run daeuniverse#698 boundary.

Restore the previous DNS handling flow to avoid WAN IPv6 UDP kernel-test regression.
- remove redundant EmitTask retry loop while preserving ordering semantics

- simplify queue recycle path after idle GC

- keep API and behavior unchanged
- add IPv4 fast path in hashAddrPort for sharded pools

- reuse single timestamp in LookupDnsRespCache to reduce hot-path overhead

- no API/behavior changes
Avoid waiting for secondary A/AAAA lookup when current query type is already preferred.

Keep response semantics unchanged; secondary lookup still runs for cache warming.
- allocate/wait secondary-lookup done channel only when needed

- early-return on canceled context in pipelined RoundTrip before write wait

- no API or protocol semantics changes
Problem:
- When DNS check option parsing fails or IP version is unavailable,
  CheckFunc returns (false, nil) to indicate 'skip check'
- But Check() treated this as failure, marking Alive=false and
  adding Timeout latency
- This caused all dialers to be marked unavailable when DNS check
  prerequisites weren't met, resulting in 'no alive dialer' errors

Root Cause:
Check() didn't distinguish between:
1. (true, nil) - success
2. (false, nil) - skip (should preserve state)
3. (false, err) - failure (should mark unavailable)

Solution:
Only update alive state on success (ok=true) or actual failure (err!=nil).
When (ok=false, err=nil), preserve existing alive state instead of
incorrectly marking as unavailable.

This allows dialers to remain alive when certain check types are
skipped due to configuration or network conditions.
Add regression tests for Dialer.Check state machine:
- repeated (ok=false, err=nil) skip checks must not mark dialer unavailable
- real failures (ok=false, err!=nil) must still mark dialer unavailable

This guards against cascading no-alive-dialer collapse when a check path
is temporarily skipped (e.g. DNS IP-version not available), while
preserving existing failure semantics.
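
The three-way distinction described in these two commit messages can be expressed compactly. The sketch below is a reduced model, not dae's `Dialer.Check`: `dialerState`, `apply` and `errProbe` are hypothetical, and treating a non-nil error as failure regardless of `ok` is an assumption, since the messages only enumerate three combinations.

```go
package main

import (
	"errors"
	"fmt"
)

// errProbe stands in for any real health-check failure.
var errProbe = errors.New("probe failed")

// dialerState holds only the liveness flag relevant to this example.
type dialerState struct {
	Alive bool
}

// apply folds one check result into the state:
//
//	(true,  nil) success: mark alive
//	(false, err) failure: mark unavailable
//	(false, nil) skipped: leave the state untouched
func (s *dialerState) apply(ok bool, err error) {
	switch {
	case err != nil:
		s.Alive = false
	case ok:
		s.Alive = true
	default:
		// Check was skipped (e.g. IP version unavailable); preserve state.
	}
}

func main() {
	s := dialerState{Alive: true}
	s.apply(false, nil) // skipped check
	fmt.Println("after skip:", s.Alive)
	s.apply(false, errProbe) // real failure
	fmt.Println("after failure:", s.Alive)
	s.apply(true, nil) // recovery
	fmt.Println("after success:", s.Alive)
}
```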
@MaurUppi
Author

Kernel Test / Test (6.1-20250527.055456) (pull_request)
This is probably just a timeout in the "Check WAN IPv6 UDP" step rather than a build failure; a re-run should tell.
docker exec dae dig @2606:4700:4700::1111 one.one.one.one

kix and others added 19 commits February 24, 2026 11:39
- Add blank line after variable declaration (LINE_SPACING)
- Remove unnecessary braces for single-statement if/else (BRACES)
- Implement tests for DNS port detection, concurrent DNS queries, and non-DNS traffic order preservation.
- Include memory profiling tests to compare direct execution with UdpTaskPool.
- Benchmark various execution paths for DNS queries and validate packet handling for valid and invalid DNS packets.
- Ensure mixed traffic is handled correctly and that non-DNS traffic uses the appropriate endpoints.
- Cover edge cases for DNS queries, including multiple questions and long domain names.
… optimize task submission

fix(dns): enhance HTTP transport settings for better connection management
chore(deps): add ants v2.11.5 for improved concurrency handling
@MaurUppi MaurUppi force-pushed the feat/metrics-endpoint-phase1 branch from bbc7b85 to 186599a Compare February 25, 2026 12:04
@MaurUppi MaurUppi requested review from a team as code owners February 25, 2026 12:04

@cubercsl cubercsl left a comment

Please clean up the unrelated changes.

@MaurUppi
Author

Superseded by #948 (clean metrics-only branch based on main). Closing this PR to avoid mixed unrelated history in review.

@MaurUppi MaurUppi closed this Feb 28, 2026
@MaurUppi MaurUppi deleted the feat/metrics-endpoint-phase1 branch February 28, 2026 14:26

4 participants