Skip to content

fix(ipc): replace retry workaround with WaitNamedPipe + readiness signal (APMSP-2356)#1848

Draft
Leiyks wants to merge 7 commits intomainfrom
leiyks/fix-apmsp-2356-windows-pipe-race
Draft

fix(ipc): replace retry workaround with WaitNamedPipe + readiness signal (APMSP-2356)#1848
Leiyks wants to merge 7 commits intomainfrom
leiyks/fix-apmsp-2356-windows-pipe-race

Conversation

@Leiyks
Copy link
Copy Markdown
Contributor

@Leiyks Leiyks commented Apr 7, 2026

Problem

test_ddog_sidecar_register_app was flaky on Windows and suppressed with #[ignore] (APMSP-2356).

Root cause: daemonize() returns as soon as the child process exists (wait_spawn()), but the child's Tokio runtime may not have started yet. On Windows, named pipes have no kernel-level connection backlog — CreateFileA fails immediately if the server hasn't yet called ConnectNamedPipe. Unix domain sockets queue connect() calls until accept() is called, but Windows does not.

Fix

Two complementary Windows-native fixes that address the root cause rather than masking it with retries:

1 — WaitNamedPipeA in SeqpacketConn::connect() (datadog-ipc)

When CreateFileA returns ERROR_PIPE_BUSY (all pipe instances occupied), call WaitNamedPipeA(name, 5000) to block up to 5 s for an instance to become available, then retry — instead of immediately returning ConnectionRefused.

2 — Readiness signal via named Windows event (datadog-sidecar)

  • setup_daemon_process() creates a named manual-reset event Local\dd-sidecar-ready-<PID>, passes its name to the child via DD_SIDECAR_READY_EVENT, and returns a ReadinessWaiter closure that blocks on WaitForSingleObject(30 s).
  • The child sidecar calls signal_sidecar_ready() inside acquire_listener (Tokio runtime running, listener ready to accept) before entering the accept loop.
  • daemonize() calls the waiter after wait_spawn(), so the parent cannot attempt connect_to_server() before the pipe is guaranteed to exist.

Unix

setup_daemon_process returns a no-op ReadinessWaiter. Unix domain sockets have a kernel-level connection backlog — clients can connect as soon as bind() + listen() have been called, even before the first accept(), so neither fix is needed.

The #[ignore] annotation on test_ddog_sidecar_register_app is removed so the fix is validated in CI.

…ipe IPC

On Windows, a named-pipe client's CreateFileA fails immediately if the
server has not yet called ConnectNamedPipe — there is no kernel-level
connection backlog as Unix domain sockets provide.  After daemonize()
returns the child process exists but its Tokio runtime may not have
entered the accept loop yet, causing test_ddog_sidecar_register_app to
fail intermittently.

Add connect_to_sidecar_with_retry() that retries the connection with
exponential back-off (1 ms → 100 ms cap, up to 20 attempts ≈ 1.9 s
total) when just_spawned is true.  On non-Windows platforms is_pipe_not_ready()
always returns false so the helper is a single-attempt no-op there.

Remove the APMSP-2356 ignore annotation from test_ddog_sidecar_register_app
so the fix is validated in CI.
@Leiyks Leiyks force-pushed the leiyks/fix-apmsp-2356-windows-pipe-race branch from 6bd5773 to ba5ba4b Compare April 7, 2026 15:16
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/leiyks/fix-apmsp-2356-windows-pipe-race

Summary by Rule

Rule Base Branch PR Branch Change
unwrap_used 4 4 No change (0%)
Total 4 4 No change (0%)

Annotation Counts by File

File Base Branch PR Branch Change
datadog-sidecar/src/unix.rs 4 4 No change (0%)

Annotation Stats by Crate

Crate Base Branch PR Branch Change
clippy-annotation-reporter 5 5 No change (0%)
datadog-ffe-ffi 1 1 No change (0%)
datadog-ipc 21 21 No change (0%)
datadog-live-debugger 6 6 No change (0%)
datadog-live-debugger-ffi 10 10 No change (0%)
datadog-profiling-replayer 4 4 No change (0%)
datadog-remote-config 3 3 No change (0%)
datadog-sidecar 55 55 No change (0%)
libdd-common 10 10 No change (0%)
libdd-common-ffi 12 12 No change (0%)
libdd-data-pipeline 5 5 No change (0%)
libdd-ddsketch 2 2 No change (0%)
libdd-dogstatsd-client 1 1 No change (0%)
libdd-profiling 13 13 No change (0%)
libdd-telemetry 19 19 No change (0%)
libdd-tinybytes 4 4 No change (0%)
libdd-trace-normalization 2 2 No change (0%)
libdd-trace-obfuscation 8 8 No change (0%)
libdd-trace-utils 15 15 No change (0%)
Total 196 196 No change (0%)

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

Fix clippy::precedence warning in libdd-trace-utils/src/otlp_encoder/mapper.rs:
the shift operator binds tighter than bitwise OR, but explicit parentheses
make the intent unambiguous.
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

📚 Documentation Check Results

⚠️ 562 documentation warning(s) found

📦 libdd-trace-utils - 562 warning(s)


Updated: 2026-04-08 13:48:30 UTC | Commit: 5a6a500 | missing-docs job results

@datadog-prod-us1-4
Copy link
Copy Markdown

datadog-prod-us1-4 bot commented Apr 7, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 88.89%
Overall Coverage: 71.51% (-0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 4d43334 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

🔒 Cargo Deny Results

No issues found!

📦 libdd-trace-utils - ✅ No issues


Updated: 2026-04-08 13:51:53 UTC | Commit: 5a6a500 | dependency-check job results

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 71.50%. Comparing base (2cdcdcb) to head (4d43334).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1848      +/-   ##
==========================================
- Coverage   71.52%   71.50%   -0.02%     
==========================================
  Files         426      426              
  Lines       67369    67370       +1     
==========================================
- Hits        48186    48176      -10     
- Misses      19183    19194      +11     
Components Coverage Δ
libdd-crashtracker 66.00% <ø> (+0.01%) ⬆️
libdd-crashtracker-ffi 34.09% <ø> (ø)
libdd-alloc 98.77% <ø> (ø)
libdd-data-pipeline 86.65% <ø> (ø)
libdd-data-pipeline-ffi 74.85% <ø> (ø)
libdd-common 79.16% <ø> (ø)
libdd-common-ffi 73.87% <ø> (ø)
libdd-telemetry 62.80% <ø> (+0.03%) ⬆️
libdd-telemetry-ffi 16.75% <ø> (ø)
libdd-dogstatsd-client 82.64% <ø> (ø)
datadog-ipc 73.10% <ø> (ø)
libdd-profiling 82.14% <ø> (+<0.01%) ⬆️
libdd-profiling-ffi 64.94% <ø> (ø)
datadog-sidecar 30.25% <100.00%> (+0.01%) ⬆️
datdog-sidecar-ffi 7.42% <ø> (ø)
spawn-worker 54.69% <ø> (ø)
libdd-tinybytes 93.16% <ø> (ø)
libdd-trace-normalization 81.71% <ø> (ø)
libdd-trace-obfuscation 87.24% <ø> (ø)
libdd-trace-protobuf 68.25% <ø> (ø)
libdd-trace-utils 88.73% <0.00%> (ø)
datadog-tracer-flare 86.88% <ø> (ø)
libdd-log 74.69% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts bot commented Apr 7, 2026

Artifact Size Benchmark Report

aarch64-alpine-linux-musl
Artifact Baseline Commit Change
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.a 82.89 MB 82.89 MB +0% (+8 B) 👌
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.so 7.63 MB 7.63 MB 0% (0 B) 👌
aarch64-unknown-linux-gnu
Artifact Baseline Commit Change
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.08 MB 10.08 MB 0% (0 B) 👌
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a 98.86 MB 98.86 MB +0% (+8 B) 👌
libdatadog-x64-windows
Artifact Baseline Commit Change
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll 24.98 MB 24.98 MB 0% (0 B) 👌
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib 77.85 KB 77.85 KB 0% (0 B) 👌
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb 182.37 MB 182.37 MB -0% (-8.00 KB) 👌
/libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib 904.68 MB 904.68 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll 7.85 MB 7.85 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib 77.85 KB 77.85 KB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb 23.51 MB 23.51 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib 45.85 MB 45.85 MB 0% (0 B) 👌
libdatadog-x86-windows
Artifact Baseline Commit Change
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll 21.48 MB 21.48 MB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib 79.06 KB 79.06 KB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb 186.39 MB 186.38 MB -0% (-8.00 KB) 👌
/libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib 889.47 MB 889.47 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll 6.09 MB 6.09 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib 79.06 KB 79.06 KB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb 25.16 MB 25.16 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib 43.33 MB 43.33 MB 0% (0 B) 👌
x86_64-alpine-linux-musl
Artifact Baseline Commit Change
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.a 74.07 MB 74.07 MB 0% (0 B) 👌
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.so 8.49 MB 8.49 MB 0% (0 B) 👌
x86_64-unknown-linux-gnu
Artifact Baseline Commit Change
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a 91.32 MB 91.32 MB 0% (0 B) 👌
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.13 MB 10.13 MB 0% (0 B) 👌

Leiyks added 3 commits April 8, 2026 11:13
- Remove WouldBlock from is_pipe_not_ready: CreateFileA uses synchronous
  mode and only returns ERROR_FILE_NOT_FOUND (NotFound) or ERROR_PIPE_BUSY
  (ConnectionRefused); WouldBlock can never appear on this path. Add
  comments mapping Windows error codes to io::ErrorKind variants.
- Replace unreachable post-loop connect_to_server() call with
  unreachable!(): every loop iteration already returns via Ok, retry,
  or Err arm, making the trailing call dead code.
- Add "Safe to block" comment on std::thread::sleep to clarify it is
  called from synchronous FFI context, not from within a Tokio task.
Replace the exponential-backoff retry loop in start_or_connect_to_sidecar()
with two proper Windows-native fixes that address the root cause:

1. WaitNamedPipeA in SeqpacketConn::connect(): instead of immediately
   returning ConnectionRefused on ERROR_PIPE_BUSY, block up to 5 s for a
   pipe instance to become available, then retry CreateFileA.

2. Readiness signal via named Windows event: setup_daemon_process() creates
   a named manual-reset event (Local\dd-sidecar-ready-<PID>), passes its
   name to the child via DD_SIDECAR_READY_EVENT, and returns a ReadinessWaiter
   that blocks on WaitForSingleObject(30 s). The child calls
   signal_sidecar_ready() inside acquire_listener (Tokio runtime running,
   listener ready) before entering the accept loop. daemonize() calls the
   waiter after wait_spawn() so the parent cannot connect before the pipe
   is guaranteed to exist.

On Unix, domain sockets have a kernel-level connection backlog so neither
fix is needed; setup_daemon_process returns a no-op ReadinessWaiter.
@Leiyks Leiyks changed the title fix(sidecar): retry connect on Windows to fix named-pipe startup race (APMSP-2356) fix(ipc): replace retry workaround with WaitNamedPipe + readiness signal (APMSP-2356) Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants