feat(metrics): add endpoint server and metrics #941
Closed
MaurUppi wants to merge 107 commits into
- Implemented a concurrency limit in DnsController to manage simultaneous DNS queries.
- Added a pipelined connection mechanism to optimize DNS request handling.
- Introduced tests for concurrency limits and race conditions in DNS processing.
- Enhanced error handling and logging in DNS listener and TCP relay functions.
- Refactored DNS handling methods to support singleflight for duplicate requests.
- Added benchmarks for pipelined connections and singleflight performance.
- Improved resource management with context cancellation in TCP relay operations.
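A concurrency limit of the kind mentioned in the first bullet is often built on a buffered channel used as a counting semaphore. The following is a minimal sketch under that assumption; the `limiter` type and its methods are hypothetical names for illustration and are not taken from DnsController, which may block or queue instead of rejecting.

```go
package main

import "fmt"

// limiter caps the number of in-flight queries.
// Hypothetical type for illustration; not dae's actual implementation.
type limiter struct {
	slots chan struct{}
}

// newLimiter allows at most n queries to hold a slot at the same time.
func newLimiter(n int) *limiter {
	return &limiter{slots: make(chan struct{}, n)}
}

// TryAcquire takes a slot without blocking and reports whether it succeeded.
func (l *limiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a slot taken by a successful TryAcquire.
func (l *limiter) Release() {
	<-l.slots
}

func main() {
	l := newLimiter(2)
	// Two slots are granted, the third attempt is rejected.
	fmt.Println(l.TryAcquire(), l.TryAcquire(), l.TryAcquire())
	// Releasing one slot makes room for another query.
	l.Release()
	fmt.Println(l.TryAcquire())
}
```

Rejecting instead of blocking keeps the caller in control of what to do under load (drop, retry, or answer from cache); a blocking variant would select on a context's Done channel instead of using a default case.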
… packet detection
- Implemented IsLikelyQuicInitialPacket to perform a fast header check on incoming UDP packets to filter out non-QUIC datagrams.
- Updated Sniffer to utilize this function for early rejection of irrelevant packets.
- Enhanced tests for IsLikelyQuicInitialPacket to ensure correct identification of QUIC initial packets.

refactor(control): optimize DNS connection handling and routing cache
- Improved connection pooling logic to prevent blocking on slow dials.
- Replaced sync.Map with atomic operations for pending request slots in pipelined connections.
- Added caching mechanism for UDP routing results with TTL to reduce redundant lookups.
- Updated DNS controller to use sync.Map for forwarder cache, enhancing concurrency.

test(control): add comprehensive tests for connection pool and routing cache
- Introduced tests for connection pool to ensure non-blocking behavior during slow dials.
- Added tests for response slot lifecycle to verify proper reuse and error handling.
- Implemented tests for UDP endpoint routing cache to validate hit and expiration behavior.
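To illustrate the fast header check described for IsLikelyQuicInitialPacket, here is a minimal sketch that tests only the fields of a QUIC version 1 long header (RFC 9000). The function name `looksLikeQuicV1Initial` is hypothetical, and the exact set of checks in the PR is not shown here; this sketch also assumes the fixed bit is set, so it would reject peers that grease that bit.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// looksLikeQuicV1Initial is a hypothetical stand-in for a cheap pre-filter.
// It inspects only the first six bytes and never parses the payload.
func looksLikeQuicV1Initial(b []byte) bool {
	// First byte, 4-byte version, and the DCID length byte must be present.
	if len(b) < 6 {
		return false
	}
	first := b[0]
	if first&0x80 == 0 { // header form bit: 1 means long header
		return false
	}
	if first&0x40 == 0 { // fixed bit (assumed set; greasing is ignored here)
		return false
	}
	if first&0x30 != 0 { // long packet type: 0b00 is Initial in version 1
		return false
	}
	if binary.BigEndian.Uint32(b[1:5]) != 1 { // version field must be 1
		return false
	}
	// Connection IDs are at most 20 bytes in version 1.
	return b[5] <= 20
}

func main() {
	quic := []byte{0xC0, 0x00, 0x00, 0x00, 0x01, 0x08}
	dns := []byte{0x12, 0x34, 0x01, 0x00, 0x00, 0x01}
	fmt.Println(looksLikeQuicV1Initial(quic), looksLikeQuicV1Initial(dns))
}
```

A check like this can only say "possibly QUIC"; a datagram that passes still needs full parsing, while one that fails can be rejected before any decryption work is attempted.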
…or failure scenarios
- guard DNS resolve against nil dialer to avoid panic paths in tests
- initialize direct dialers in netutils tests and skip when network is unavailable
- skip domain matcher geosite-dependent test when geosite.dat is absent
- gate eBPF kernel tests behind explicit dae_bpf_tests build tag
- remove fragile bitlist capacity assertions and validate tighten semantics
- enhance config marshaller for repeatable function filters and int/uint values
- make marshal test use secure temp files and assert round-trip idempotent output
- discard stale/mismatched UDP DNS responses and keep reading
- close connection only after stale/malformed response threshold
- add DoUDP regression tests for stale-discard and threshold-close
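The stale-discard behaviour above can be sketched as a read loop that compares the 16-bit DNS transaction ID of each datagram against the expected one and gives up only after a threshold. This is an illustration under assumptions: `readMatching`, `feed`, `header`, and the error values are hypothetical names, and the real DoUDP code may match on more than the ID.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var (
	// errTooManyStale tells the caller to close the connection.
	errTooManyStale = errors.New("too many stale or malformed responses")
	// errDrained marks the end of the in-memory feed used in this sketch.
	errDrained = errors.New("no more datagrams")
)

// readMatching pulls datagrams from next until one carries wantID.
// Mismatched or truncated datagrams are discarded and counted; once
// maxStale of them have been seen, errTooManyStale is returned.
func readMatching(next func() ([]byte, error), wantID uint16, maxStale int) ([]byte, error) {
	stale := 0
	for {
		msg, err := next()
		if err != nil {
			return nil, err
		}
		// A DNS header is 12 bytes; the transaction ID is the first two.
		if len(msg) >= 12 && binary.BigEndian.Uint16(msg[:2]) == wantID {
			return msg, nil
		}
		stale++
		if stale >= maxStale {
			return nil, errTooManyStale
		}
	}
}

// feed replays fixed datagrams in order, standing in for a UDP socket.
func feed(msgs ...[]byte) func() ([]byte, error) {
	i := 0
	return func() ([]byte, error) {
		if i >= len(msgs) {
			return nil, errDrained
		}
		m := msgs[i]
		i++
		return m, nil
	}
}

// header builds a bare 12-byte DNS header with the given transaction ID.
func header(id uint16) []byte {
	m := make([]byte, 12)
	binary.BigEndian.PutUint16(m[:2], id)
	return m
}

func main() {
	msg, err := readMatching(feed(header(1), header(2), header(7)), 7, 3)
	fmt.Println(binary.BigEndian.Uint16(msg[:2]), err)
}
```

Keeping the connection open across a few stale replies avoids tearing down a healthy socket just because a late answer to an earlier, timed-out query arrived.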
Revert DNS(53) goroutine fast-path introduced after run daeuniverse#697. This aligns packet handling semantics with the last known-good run and avoids kernel-test WAN IPv6 UDP instability.
Drop pre-singleflight cache short-circuit introduced at run daeuniverse#698 boundary. Restore the previous DNS handling flow to avoid WAN IPv6 UDP kernel-test regression.
- remove redundant EmitTask retry loop while preserving ordering semantics
- simplify queue recycle path after idle GC
- keep API and behavior unchanged
- add IPv4 fast path in hashAddrPort for sharded pools
- reuse single timestamp in LookupDnsRespCache to reduce hot-path overhead
- no API/behavior changes
Avoid waiting for secondary A/AAAA lookup when current query type is already preferred. Keep response semantics unchanged; secondary lookup still runs for cache warming.
- allocate/wait secondary-lookup done channel only when needed
- early-return on canceled context in pipelined RoundTrip before write wait
- no API or protocol semantics changes
Problem:
- When DNS check option parsing fails or IP version is unavailable, CheckFunc returns (false, nil) to indicate 'skip check'
- But Check() treated this as failure, marking Alive=false and adding Timeout latency
- This caused all dialers to be marked unavailable when DNS check prerequisites weren't met, resulting in 'no alive dialer' errors

Root Cause: Check() didn't distinguish between:
1. (true, nil) - success
2. (false, nil) - skip (should preserve state)
3. (false, err) - failure (should mark unavailable)

Solution: Only update alive state on success (ok=true) or actual failure (err!=nil). When (ok=false, err=nil), preserve existing alive state instead of incorrectly marking as unavailable. This allows dialers to remain alive when certain check types are skipped due to configuration or network conditions.
Add regression tests for Dialer.Check state machine:
- repeated (ok=false, err=nil) skip checks must not mark dialer unavailable
- real failures (ok=false, err!=nil) must still mark dialer unavailable

This guards against cascading no-alive-dialer collapse when a check path is temporarily skipped (e.g. DNS IP-version not available), while preserving existing failure semantics.
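The three-way distinction in the Check() fix can be sketched as a small state update. The `dialerState` type and `applyCheckResult` method below are hypothetical and reduced to the alive flag only; the real Dialer also tracks latency, which is left out here. Treating any non-nil error as a failure, regardless of ok, is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errProbeFailed is a placeholder error for the failure case.
var errProbeFailed = errors.New("probe failed")

// dialerState is a hypothetical, reduced view of a dialer's health.
type dialerState struct {
	Alive bool
}

// applyCheckResult maps a check function's (ok, err) pair onto the state:
//   - err != nil          -> failure, mark unavailable
//   - ok == true, no err  -> success, mark alive
//   - ok == false, no err -> skipped, leave the state untouched
func (d *dialerState) applyCheckResult(ok bool, err error) {
	switch {
	case err != nil:
		d.Alive = false
	case ok:
		d.Alive = true
	default:
		// Skipped check: preserve whatever state the dialer already had.
	}
}

func main() {
	d := &dialerState{Alive: true}
	d.applyCheckResult(false, nil) // skip
	fmt.Println(d.Alive)
	d.applyCheckResult(false, errProbeFailed) // real failure
	fmt.Println(d.Alive)
}
```

The regression tests described above correspond to the two calls in main: repeated skips must leave Alive as it was, and a real error must still flip it to false.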
Author
Kernel Test / Test (6.1-20250527.055456) (pull_request)
- Add blank line after variable declaration (LINE_SPACING)
- Remove unnecessary braces for single-statement if/else (BRACES)
- Implement tests for DNS port detection, concurrent DNS queries, and non-DNS traffic order preservation.
- Include memory profiling tests to compare direct execution with UdpTaskPool.
- Benchmark various execution paths for DNS queries and validate packet handling for valid and invalid DNS packets.
- Ensure mixed traffic is handled correctly and that non-DNS traffic uses the appropriate endpoints.
- Cover edge cases for DNS queries, including multiple questions and long domain names.
…r short-lived protocols
… task pool management
… optimize task submission
fix(dns): enhance HTTP transport settings for better connection management
chore(deps): add ants v2.11.5 for improved concurrency handling
bbc7b85 to 186599a
2 tasks
Author
Superseded by #948 (clean metrics-only branch based on main). Closing this PR to avoid mixed unrelated history in review.
Background
This PR implements metrics Phase 1 (kept off the hot path):
- global.endpoint_* configuration options
- pkg/metrics/{state,registry,server,auth}
- example.dae sample configuration <-- user-facing docs PR

Phase 2 is implemented on top of Phase 1:
- dae Transparent Proxy-Grafana_dashboard.json

Checklist
Full Changelogs
See #941 (comment)
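For orientation, the kind of endpoint server this PR is about (an HTTP listener exposing /metrics and /debug/pprof/ behind authentication) can be sketched with the standard library alone. Everything below is an assumption for illustration: HTTP basic auth as the scheme, the `example_dns_queries_total` metric, and the function names are hypothetical and are not taken from pkg/metrics, which may use a different auth mechanism and a proper metrics registry.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"sync/atomic"
)

// queries is a hypothetical counter standing in for a metrics registry.
var queries atomic.Uint64

// basicAuth rejects requests whose credentials do not match user/pass.
// Comparisons are constant-time to avoid leaking how much matched.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		userOK := subtle.ConstantTimeCompare([]byte(u), []byte(user)) == 1
		passOK := subtle.ConstantTimeCompare([]byte(p), []byte(pass)) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newEndpoint wires /metrics and /debug/pprof/ onto a private mux so that
// nothing is exposed through http.DefaultServeMux by accident.
func newEndpoint(user, pass string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Prometheus text exposition format, written by hand in this sketch.
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintf(w, "# TYPE example_dns_queries_total counter\nexample_dns_queries_total %d\n", queries.Load())
	})
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	return basicAuth(user, pass, mux)
}

// statusFor issues an in-memory GET and returns the response status code.
func statusFor(h http.Handler, path, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newEndpoint("admin", "secret")
	queries.Add(3)
	fmt.Println(statusFor(h, "/metrics", "", ""), statusFor(h, "/metrics", "admin", "secret"))
}
```

Because counters are only read when /metrics is scraped, a design like this keeps the HTTP side entirely off the packet-processing path; the cost on the hot path is limited to whatever atomic increments the instrumented code performs.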
Test Result
- go build ./...
- go test ./...
- /metrics, /debug/pprof/

Final result: