fix(ipc): replace retry workaround with WaitNamedPipe + readiness signal (APMSP-2356)#1848
fix(ipc): replace retry workaround with WaitNamedPipe + readiness signal (APMSP-2356)#1848
Conversation
…ipe IPC On Windows, a named-pipe client's CreateFileA fails immediately if the server has not yet called ConnectNamedPipe — there is no kernel-level connection backlog as Unix domain sockets provide. After daemonize() returns the child process exists but its Tokio runtime may not have entered the accept loop yet, causing test_ddog_sidecar_register_app to fail intermittently. Add connect_to_sidecar_with_retry() that retries the connection with exponential back-off (1 ms → 100 ms cap, up to 20 attempts ≈ 1.9 s total) when just_spawned is true. On non-Windows platforms is_pipe_not_ready() always returns false so the helper is a single-attempt no-op there. Remove the APMSP-2356 ignore annotation from test_ddog_sidecar_register_app so the fix is validated in CI.
6bd5773 to
ba5ba4b
Compare
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
Fix clippy::precedence warning in libdd-trace-utils/src/otlp_encoder/mapper.rs: the shift operator binds tighter than bitwise OR, but explicit parentheses make the intent unambiguous.
📚 Documentation Check Results📦
|
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: 4d43334 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
🔒 Cargo Deny Results✅ No issues found! 📦
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1848 +/- ##
==========================================
- Coverage 71.52% 71.50% -0.02%
==========================================
Files 426 426
Lines 67369 67370 +1
==========================================
- Hits 48186 48176 -10
- Misses 19183 19194 +11
🚀 New features to boost your workflow:
|
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
- Remove WouldBlock from is_pipe_not_ready: CreateFileA uses synchronous mode and only returns ERROR_FILE_NOT_FOUND (NotFound) or ERROR_PIPE_BUSY (ConnectionRefused); WouldBlock can never appear on this path. Add comments mapping Windows error codes to io::ErrorKind variants. - Replace unreachable post-loop connect_to_server() call with unreachable!(): every loop iteration already returns via Ok, retry, or Err arm, making the trailing call dead code. - Add "Safe to block" comment on std::thread::sleep to clarify it is called from synchronous FFI context, not from within a Tokio task.
Replace the exponential-backoff retry loop in start_or_connect_to_sidecar() with two proper Windows-native fixes that address the root cause: 1. WaitNamedPipeA in SeqpacketConn::connect(): instead of immediately returning ConnectionRefused on ERROR_PIPE_BUSY, block up to 5 s for a pipe instance to become available, then retry CreateFileA. 2. Readiness signal via named Windows event: setup_daemon_process() creates a named manual-reset event (Local\dd-sidecar-ready-<PID>), passes its name to the child via DD_SIDECAR_READY_EVENT, and returns a ReadinessWaiter that blocks on WaitForSingleObject(30 s). The child calls signal_sidecar_ready() inside acquire_listener (Tokio runtime running, listener ready) before entering the accept loop. daemonize() calls the waiter after wait_spawn() so the parent cannot connect before the pipe is guaranteed to exist. On Unix, domain sockets have a kernel-level connection backlog so neither fix is needed; setup_daemon_process returns a no-op ReadinessWaiter.
Problem
test_ddog_sidecar_register_appwas flaky on Windows and suppressed with#[ignore](APMSP-2356).Root cause:
daemonize()returns as soon as the child process exists (wait_spawn()), but the child's Tokio runtime may not have started yet. On Windows, named pipes have no kernel-level connection backlog —CreateFileAfails immediately if the server hasn't yet calledConnectNamedPipe. Unix domain sockets queueconnect()calls untilaccept()is called, but Windows does not.Fix
Two complementary Windows-native fixes that address the root cause rather than masking it with retries:
1 —
WaitNamedPipeAinSeqpacketConn::connect()(datadog-ipc)When
CreateFileAreturnsERROR_PIPE_BUSY(all pipe instances occupied), callWaitNamedPipeA(name, 5000)to block up to 5 s for an instance to become available, then retry — instead of immediately returningConnectionRefused.2 — Readiness signal via named Windows event (
datadog-sidecar)setup_daemon_process()creates a named manual-reset eventLocal\dd-sidecar-ready-<PID>, passes its name to the child viaDD_SIDECAR_READY_EVENT, and returns aReadinessWaiterclosure that blocks onWaitForSingleObject(30 s).signal_sidecar_ready()insideacquire_listener(Tokio runtime running, listener ready to accept) before entering the accept loop.daemonize()calls the waiter afterwait_spawn(), so the parent cannot attemptconnect_to_server()before the pipe is guaranteed to exist.Unix
setup_daemon_processreturns a no-opReadinessWaiter. Unix domain sockets have a kernel-level connection backlog — clients can connect as soon asbind()+listen()have been called, even before the firstaccept(), so neither fix is needed.The
#[ignore]annotation ontest_ddog_sidecar_register_appis removed so the fix is validated in CI.