Fit-for-purpose goal: integrate `plugin-ipc` into `~/src/netdata/netdata/` so Netdata can immediately replace the current Linux `cgroups.plugin` -> `ebpf.plugin` custom metadata transport with typed IPC that is reliable, maintainable, testable, and ready for guarded production rollout.
- Analyze how `plugin-ipc` should be integrated into the Netdata repo and build.
- Before any Netdata integration, implement transparent SHM resizing in `plugin-ipc` itself.
- Validate that feature thoroughly first, including full C/Rust/Go interop matrices on Unix and Windows.
- Use it first to replace the current `cgroups.plugin` -> `ebpf.plugin` metadata channel on Linux.
- Make the library available to C, Rust, and Go code inside Netdata.
- Record integration design decisions before implementation.
- User-approved local workspace cleanup in this slice:
- remove the generated Go test / helper binaries after the push
- affected files:
    - `src/go/cgroups.test.exe`
    - `src/go/main`
    - `src/go/raw.test.exe`
    - `src/go/windows.test.exe`
- User-directed benchmark follow-up now in scope:
  - treat the Linux `shm-batch-ping-pong` C/Rust spread as two independent problems:
    - Rust server penalty versus C server with the same C client
    - Rust client penalty versus C client with the same C server
    - worst-case `rust -> rust` is the compounded result of both penalties
  - objective:
- identify the exact Rust-side hot paths responsible for the server-side and client-side losses
- fix Rust until the Linux C/Rust SHM batch path is materially closer to the C baseline
- scope expansion approved by the user:
- do the same benchmark-delta investigation across all material language/client/server combinations
- identify every real implementation issue behind the benchmark gaps
- fix the implementation issues, not just explain them
- keep benchmark artifacts and benchmark-derived docs in sync after each validated fix
- first verified benchmark-delta findings:
  - POSIX `shm-batch-ping-pong` with `client ∈ {c,rust}` and `server ∈ {c,rust}` still has a real Rust penalty on both sides:
    - `c -> c = 64,148,960`
    - `c -> rust = 58,334,803`
    - `rust -> c = 52,277,542`
    - `rust -> rust = 48,220,338`
  - implication:
    - Rust server penalty is real
    - Rust client penalty is larger
    - `rust -> rust` is the compounded case
- benchmark-driver distortion is also real and must be fixed before deeper transport conclusions:
  - Go `lookup` benchmark does a synthetic linear scan instead of using the actual O(1) cache structure: `bench/drivers/go/main.go`
  - Rust `lookup` benchmark also does a synthetic linear scan: `bench/drivers/rust/src/main.rs`
  - Rust actual cache lookup currently allocates `name.to_string()` on every lookup: `src/crates/netipc/src/service/raw.rs`
  - Go and Rust batch / pipeline clients still do avoidable hot-loop allocations that C avoids or minimizes:
    - Go: `bench/drivers/go/main.go`
    - Rust: `bench/drivers/rust/src/main.rs`
- Current execution scope:
- remove the multi-method service drift from docs, code, tests, and public APIs
- align the implementation to one-service-kind-per-endpoint
- implement the accepted SHM resize / renegotiation behavior
- eliminate contradictory wording and examples across the repository
- refresh the Linux and Windows benchmark matrices on the current tree
- update benchmark artifacts and all benchmark-derived docs so everything is in sync
- investigate the remaining benchmark spreads and identify whether they reflect real transport/runtime inefficiency, measurement distortion, or pair-specific implementation overhead
- correct the benchmark build path so C benchmark results are generated from optimized C libraries, not from a local Debug CMake tree
- Current implementation status:
- docs/specs/TODOs now explicitly state service-oriented discovery and one request kind per endpoint
- Go public cgroups APIs and Go raw service/tests were rewritten to the single-kind model
  - `cd src/go && go test -count=1 ./pkg/netipc/service/raw` now passes after aligning the raw client/server with learned SHM req/resp capacities and transparent overflow-driven reconnect/retry
  - `cd src/go && go test -count=1 ./pkg/netipc/service/cgroups` now passes
  - Rust public cgroups facade now uses the single-kind raw server constructor instead of the old multi-handler bundle
- targeted Rust verification now passes:
    - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::cgroups:: -- --test-threads=1`
  - Rust raw Unix tests no longer use the old mixed `pingpong_handlers()` helper
  - the Rust raw service subset now passes after binding increment-only and string-reverse-only endpoints explicitly and teaching the raw client/server the learned SHM req/resp resize path:
    - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1`
  - Go raw L2 now tracks learned request/response capacities, treats `STATUS_LIMIT_EXCEEDED` as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
  - Rust raw L2 now tracks learned request/response capacities, treats `STATUS_LIMIT_EXCEEDED` as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
  - Go and Rust transport listeners now expose payload-limit setters so the server can advertise learned capacities to later clients before `accept()`:
    - Go POSIX: `src/go/pkg/netipc/transport/posix/uds.go`
    - Go Windows: `src/go/pkg/netipc/transport/windows/pipe.go`
    - Rust POSIX: `src/crates/netipc/src/transport/posix.rs`
    - Rust Windows: `src/crates/netipc/src/transport/windows.rs`
  - `src/crates/netipc/src/service/raw.rs` no longer exposes the generic `Handlers` bundle or the transitional `new_single_kind` / `with_workers_single_kind` constructors
  - `src/crates/netipc/src/service/raw.rs` now models managed servers as single-kind endpoints directly:
    - `ManagedServer::new(..., expected_method_code, handler)`
    - `ManagedServer::with_workers(..., expected_method_code, handler, worker_count)`
  - Rust POSIX and Windows benchmark drivers now use the single-kind raw service surface instead of the deleted multi-handler `Handlers` bundle:
    - `bench/drivers/rust/src/main.rs`
    - `bench/drivers/rust/src/bench_windows.rs`
  - `src/crates/netipc/src/service/raw_unix_tests.rs` and `src/crates/netipc/src/service/raw_windows_tests.rs` now use that single-kind raw service surface directly instead of feeding a generic handler bundle into the raw server
  - verified source-level residue scan for `src/crates/netipc/src/service/raw_windows_tests.rs` is now clean:
    - no remaining `Handlers`
    - no remaining `test_cgroups_handlers()`
    - no remaining `increment_handlers()`
  - verified source-level residue scan for `src/crates/netipc/src/service/raw.rs` and `src/crates/netipc/src/service/raw_unix_tests.rs` is now clean:
    - no remaining `Handlers`
    - no remaining `new_single_kind`
    - no remaining `with_workers_single_kind`
- C public naming drift was reduced from plural handler bundles to singular service-handler naming
  - `tests/fixtures/c/test_win_service.c` is now snapshot-only; it no longer starts a typed snapshot service and then exercises increment / string-reverse / batch calls against it
  - source-level cleanup of the remaining Windows C fixtures is only partial so far:
    - the obvious typed snapshot `.on_increment` / `.on_string_reverse` bundle drift was removed from:
      - `tests/fixtures/c/test_win_service_extra.c`
      - `tests/fixtures/c/test_win_stress.c`
      - `tests/fixtures/c/test_win_service_guards.c`
      - `tests/fixtures/c/test_win_service_guards_extra.c`
    - but real `win11` compilation later proved these files still contain stale calls to removed C APIs and stale raw-server assumptions
- verified source-level residue scan across the touched Windows C fixtures is therefore not enough on its own:
- it proves only that the obvious typed-handler bundle names were removed
- it does not prove runtime or even compile-time correctness on Windows
- verified source-level residue scan for the touched Windows Go raw helpers/tests is now clean:
    - no remaining `Handlers{...}` bundle initializers
    - no remaining `winTestHandlers()` / `winFailingHandlers()` helpers
    - no remaining `server.handlers` references in the Windows raw tests
- Windows Go package cross-compile proof now passes from this Linux host:
    - `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/raw`
    - `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/cgroups`
- the Unix interop/service/cache matrix now passes end-to-end after the resize rewrite:
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop)$'`
- the broader Unix shm/service/cache slice across C, Rust, and Go now also passes:
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go)$'`
- the previously exposed POSIX UDS mismatch is now resolved:
  - Rust `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1` now passes `299/299`
  - the stale transport tests were rewritten to match the accepted directional negotiation semantics:
    - requests are sender-driven
    - responses are server-driven
  - C `test_uds` now proves directional negotiation explicitly and keeps direct receive-limit coverage through a raw malformed-response path
  - the broader non-fuzz Unix CTest sweep now passes end-to-end:
    - `/usr/bin/ctest --test-dir build --output-on-failure -E '^(fuzz_protocol_30s|go_FuzzDecodeHeader|go_FuzzDecodeChunkHeader|go_FuzzDecodeHello|go_FuzzDecodeHelloAck|go_FuzzDecodeCgroupsRequest|go_FuzzDecodeCgroupsResponse|go_FuzzBatchDirDecode|go_FuzzBatchItemGet)$'`
    - result: `28/28` passed
- the public docs now match the accepted directional handshake semantics:
    - `docs/level1-wire-envelope.md` explicitly says request limits are sender-driven and response limits are server-driven
    - `docs/getting-started.md` no longer documents the deleted Rust `CgroupsHandlers` / `CgroupsServer` surface
- Windows transport test sources were aligned to the same directional contract:
    - Go `src/go/pkg/netipc/transport/windows/pipe_integration_test.go` no longer expects the old min-style negotiation
    - Rust `src/crates/netipc/src/transport/windows.rs` now contains a matching directional negotiation test
  - Go Windows transport tests still have cross-compile proof from this Linux host:
    - `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows`
- local source checks are clean for the touched Windows C files:
  - `git diff --check -- tests/fixtures/c/test_win_stress.c tests/fixtures/c/test_win_service_guards.c tests/fixtures/c/test_win_service_guards_extra.c TODO-netdata-plugin-ipc-integration.md`
- local source checks are also clean for the touched Go/Rust raw files:
  - `git diff --check -- src/crates/netipc/src/service/raw.rs src/crates/netipc/src/service/raw_unix_tests.rs src/go/pkg/netipc/service/raw/client.go src/go/pkg/netipc/service/raw/client_windows.go src/go/pkg/netipc/service/raw/shm_unix_test.go src/go/pkg/netipc/service/raw/helpers_windows_test.go src/go/pkg/netipc/service/raw/more_windows_test.go src/go/pkg/netipc/service/raw/shm_windows_test.go TODO-netdata-plugin-ipc-integration.md`
- limitation:
  - this Linux host does not have `x86_64-w64-mingw32-gcc`
  - so local source cleanup alone is not enough for the edited Windows C fixtures
  - the same host limitation means the `raw_windows_tests.rs` source cleanup is not backed by a real Windows Rust compile/run proof from this environment either
  - the touched Windows Go packages now have cross-compile proof, but still do not have a real Windows runtime proof from this environment
- current verified Windows runtime status from the real `win11` workflow:
  - the documented `ssh win11` + `MSYSTEM=MINGW64` toolchain path works and has been used for real validation
  - after syncing the local tree, `cmake --build build -j4` on `win11` exposed real stale C fixture/API mismatches that were not visible from Linux source scans alone
  - the first verified `win11` failure classes were:
    - stale removed client helpers:
      - `nipc_client_call_increment`
      - `nipc_client_call_increment_batch`
      - `nipc_client_call_string_reverse`
    - stale internal error enum usage: `NIPC_ERR_INTERNAL_ERROR`
    - stale raw-server handler signature assumptions:
      - old `bool` raw handlers instead of `nipc_error_t (*)(..., const nipc_header_t *, ...)`
    - stale `nipc_server_init(...)` argument ordering under the internal test macro path
    - stale client struct field assumptions such as `client.request_buf_size`
  - those compile-time failures have now been corrected locally and revalidated on `win11`:
    - `test_win_service_extra.exe` now builds and passes on `win11`
- the remaining active Windows C problem is now narrower and runtime-only:
  - after correcting the stale Windows C fixture/API mismatches and the baseline request-overflow signaling gap, `test_win_service_guards.exe` now passes on `win11`: `=== Results: 141 passed, 0 failed ===`
  - the previous apparent timeout was not a persistent runtime hang:
- later reruns completed normally once the stale one-item batch test drift was removed
- the last real guard-binary contradiction was:
- a one-item increment "batch" test still expecting reconnect/growth
- that expectation was wrong under the accepted semantics:
- one-item increment batches are normalized to the plain increment path
- the guard was rewritten to use a real 2-item batch for baseline request-resize coverage
- the rest of the edited Windows C runtime slice has now been validated on `win11` too:
  - `test_win_service.exe`: `=== Results: 80 passed, 0 failed ===`
  - `test_win_service_extra.exe`: `=== Results: 82 passed, 0 failed ===`
  - `test_win_service_guards_extra.exe`: `=== Results: 93 passed, 0 failed ===`
  - `test_win_stress.exe`: `=== Results: 1 passed, 0 failed ===`
  - a combined rerun of all edited Windows C binaries also passed cleanly on `win11`
- the earlier `test_win_service.exe` timeout is not currently reproducible as a deterministic bug:
  - it timed out once in a combined slice and once in an early soak run
  - after the stale guard/test contradictions were removed, a focused rerun passed
  - a subsequent combined rerun passed
  - a targeted 3-run `win11` soak of `test_win_service.exe` also passed `3/3`
  - working theory:
- that earlier timeout was a transient host/process stall, not a currently reproducible library correctness bug
- a real L2 behavior gap was exposed and fixed during this `win11` investigation:
  - on baseline request overflow, the server session loop now emits a zero-payload `LIMIT_EXCEEDED` response before disconnecting, instead of silently breaking the session
  - this fix was needed for transparent request-side resize/reconnect to work on Windows baseline transport at all
- current remaining Windows Rust runtime blocker:
  - focused `win11` run:
    - `timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1`
  - current observed behavior:
    - build completes
    - test process prints:
      - `running 1 test`
      - `test service::cgroups::windows_tests::test_cache_round_trip_windows ...`
    - then stalls without completing
  - strongest current evidence:
    - Rust raw Windows tests already implement reliable Windows shutdown by:
      - storing the service name + wake client config
      - setting `running_flag = false`
      - issuing a dummy `NpSession::connect(...)` to wake the blocking `ConnectNamedPipe()`
    - cgroups Windows tests and Rust Windows interop binaries still use the weaker pattern:
      - only `running_flag = false`
      - no wake connection
    - the Windows accept loop in `src/crates/netipc/src/service/raw.rs` blocks in `listener.accept()`, which ultimately blocks in `ConnectNamedPipe()`, so `running_flag = false` alone is not sufficient to stop the server reliably on Windows
  - working theory:
    - the cache test body may already be completing
    - the stall is very likely in Windows server shutdown/join, not in snapshot/cache decoding itself
- that Rust Windows blocker is now verified fixed on `win11`:
  - fix:
    - cgroups Windows tests and Rust Windows interop binaries now use the same reliable Windows stop pattern already used by the Rust raw Windows tests:
      - set `running_flag = false`
      - then issue a wake connection so the blocking `ConnectNamedPipe()` returns and the accept loop can observe shutdown
  - focused proof:
    - `timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1`
    - result: `test service::cgroups::windows_tests::test_cache_round_trip_windows ... ok`
  - full Rust Windows lib proof:
    - `timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
    - result: `176 passed`, `0 failed`, `1 ignored`
  - factual conclusion:
    - the live bug was stale Windows shutdown/test-fixture behavior, not a current Rust cache decode/refresh correctness issue
- broader real Windows interop/service/cache proof is now also green on `win11`:
  - command:
    - `timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"`
  - result:
    - `test_named_pipe_interop`: passed
    - `test_win_shm_interop`: passed
    - `test_service_win_interop`: passed
    - `test_service_win_shm_interop`: passed
    - `test_cache_win_interop`: passed
    - `test_cache_win_shm_interop`: passed
  - summary: `100% tests passed, 0 tests failed out of 6`
- targeted C rebuild and runtime verification now passes:
  - `cmake --build build --target test_service test_hardening test_ping_pong`
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_hardening|test_ping_pong)$'`
- the latest naming / contract cleanup slice is now backed by both local Linux and real `win11` proof:
  - local Linux rerun:
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_hardening|test_ping_pong)$'`
    - result: `100% tests passed, 0 failed`
  - after syncing this slice's edited files to `win11`, targeted rebuild passed:
    - `cmake --build build -j4 --target test_win_service test_win_service_extra test_win_service_guards test_win_service_guards_extra`
  - direct `win11` runtime proof for the edited guard binaries passed:
    - `./test_win_service_guards.exe`
    - result: `=== Results: 141 passed, 0 failed ===`
    - `./test_win_service_guards_extra.exe`
    - result: `=== Results: 93 passed, 0 failed ===`
  - direct `win11` runtime proof for the edited service binaries also passed via CTest:
    - `ctest --test-dir build --output-on-failure -R "^(test_win_service|test_win_service_extra)$"`
    - result: `test_win_service`: passed, `test_win_service_extra`: passed
- benchmark refresh on the current tree is now complete and synced:
- factual root cause of the benchmark blocker:
    - the C and Rust batch benchmark clients still generated random batch sizes in the range `1..1000`
    - the actual batch protocol normalizes `item_count == 1` to the non-batch path
    - Go was already correct and generated `2..1000`, which is why the same C batch server still interoperated with the Go client
  - fixed in:
    - `bench/drivers/c/bench_posix.c`
    - `bench/drivers/c/bench_windows.c`
    - `bench/drivers/rust/src/main.rs`
    - `bench/drivers/rust/src/bench_windows.rs`
    - `bench/drivers/go/main.go`
    - `tests/run-posix-bench.sh`
    - `tests/run-windows-bench.sh`
  - specific fixes:
    - batch benchmark generators now use `2..1000` items for real batch scenarios
    - Windows benchmark failure reporting now defines `server_out` before calling `dump_server_output`
  - targeted proof after the fix:
    - the previously failing pairs now succeed locally and on `win11`:
      - `uds-batch-ping-pong c->c`
      - `uds-batch-ping-pong rust->c`
      - `shm-batch-ping-pong c->c`
      - `shm-batch-ping-pong rust->c`
      - `np-batch-ping-pong c->c`
      - `np-batch-ping-pong rust->c`
  - clean official reruns:
    - Linux:
      - `bash tests/run-posix-bench.sh benchmarks-posix.csv 5`
      - result: `Total measurements: 201`
    - Windows:
      - `ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/run-windows-bench.sh benchmarks-windows.csv 5'`
      - result: `Total measurements: 201`
  - clean generated artifacts:
    - `bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md`
    - result: `All performance floors met`
    - `ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md'`
    - result: `All performance floors met`
- the follow-up benchmark spread investigation has now established a real benchmark-build bug on POSIX:
  - the local benchmark runner used:
    - C from `build/bin/bench_posix_c`
    - Rust from `src/crates/netipc/target/release/bench_posix`
    - Go from `build/bin/bench_posix_go`
  - the local CMake tree used for the C benchmark was configured as:
    - `build/CMakeCache.txt`: `CMAKE_BUILD_TYPE:STRING=Debug`
  - the benchmark target itself added `-O2`, but the C libraries it linked against were still unoptimized:
    - `build/CMakeFiles/bench_posix_c.dir/flags.make`: `C_FLAGS = -g -std=gnu11 -O2`
    - `build/CMakeFiles/netipc_protocol.dir/flags.make`: `C_FLAGS = -g -std=gnu11`
    - `build/CMakeFiles/netipc_service.dir/flags.make`: `C_FLAGS = -g -std=gnu11`
  - a dedicated optimized benchmark tree proved this materially changes the published POSIX rows:
    - release build setup:
      - `cmake -S . -B build-release -DCMAKE_BUILD_TYPE=Release`
      - `cmake --build build-release --target bench_posix_c bench_posix_go -j8`
    - direct targeted reruns:
      - published `shm-batch-ping-pong c->c`: `25,947,290`
      - optimized C libs `shm-batch-ping-pong c(rel)->c(rel)`: `63,699,472`
      - published `uds-pipeline-batch-d16 c->c`: `49,512,090`
      - optimized C libs `uds-pipeline-batch-d16 c(rel)->c(rel)`: `103,212,623`
    - mixed-language targeted reruns also moved sharply upward when the C side used optimized libraries:
      - intended `shm-batch-ping-pong c(rel)->rust`: `57,122,454`
      - intended `shm-batch-ping-pong rust->c(rel)`: `52,041,263`
      - intended `uds-pipeline-batch-d16 c(rel)->rust`: `91,093,895`
      - intended `uds-pipeline-batch-d16 rust->c(rel)`: `101,978,294`
  - implemented fix:
    - `tests/run-posix-bench.sh` now configures and uses a dedicated optimized benchmark tree:
      - default: `build-bench-posix`
      - build type: `Release`
    - `tests/run-windows-bench.sh` now configures and uses a dedicated optimized benchmark tree:
      - default: `build-bench-windows`
      - build type: `Release`
      - explicit MinGW toolchain export on `win11`
  - factual conclusion:
    - the old checked-in POSIX benchmark report was distorted by linking the C benchmark binary against Debug-built C libraries
    - the current checked-in POSIX and Windows benchmark artifacts now come from the corrected dedicated benchmark build paths
- the Windows benchmark tree is not affected by the same local Debug-build distortion:
  - `ssh win11 '... grep CMAKE_BUILD_TYPE build/CMakeCache.txt'`
  - `CMAKE_BUILD_TYPE:STRING=RelWithDebInfo`
- the previously suspicious Windows SHM batch outlier did not survive the corrected rerun:
  - old checked-in row: `shm-batch-ping-pong c->rust = 9,282,667`
  - corrected clean rerun row: `shm-batch-ping-pong c->rust = 55,868,058`
- final artifact sanity checks:
  - `benchmarks-posix.csv`:
    - rows: `201`
    - duplicate keys: `0`
    - zero-throughput rows: `0`
  - `benchmarks-windows.csv`:
    - rows: `201`
    - duplicate keys: `0`
    - zero-throughput rows: `0`
- checked-in benchmark docs are now synced to the refreshed artifacts:
  - `benchmarks-posix.csv`
  - `benchmarks-posix.md`
  - `benchmarks-windows.csv`
  - `benchmarks-windows.md`
  - `README.md`
- corrected max-throughput ranges from the current checked-in artifacts:
- POSIX:
    - `uds-ping-pong`: `182,963` to `231,160`
    - `shm-ping-pong`: `2,460,317` to `3,450,961`
    - `uds-batch-ping-pong`: `27,182,404` to `40,240,940`
    - `shm-batch-ping-pong`: `31,250,784` to `64,148,960`
    - `uds-pipeline-d16`: `568,373` to `735,829`
    - `uds-pipeline-batch-d16`: `51,960,946` to `102,954,841`
    - `snapshot-baseline`: `158,948` to `205,624`
    - `snapshot-shm`: `1,006,053` to `1,738,616`
    - `lookup`: `114,556,227` to `203,279,430`
- Windows:
    - `np-ping-pong`: `18,241` to `21,039`
    - `shm-ping-pong`: `2,099,392` to `2,715,487`
    - `np-batch-ping-pong`: `7,013,700` to `8,550,220`
    - `shm-batch-ping-pong`: `36,494,096` to `58,768,397`
    - `np-pipeline-d16`: `245,420` to `270,488`
    - `np-pipeline-batch-d16`: `28,977,365` to `41,270,903`
    - `snapshot-baseline`: `16,090` to `20,967`
    - `snapshot-shm`: `857,823` to `1,262,493`
    - `lookup`: `107,472,315` to `164,305,717`
- current remaining raw Rust drift is now narrower and well-scoped:
  - the raw managed server already enforces one `expected_method_code`
  - the raw client surface still exposes a generic constructor and mixed call surface under the stale internal name `CgroupsClient`
  - the next cleanup slice is to bind the raw Rust client constructors to one service kind and migrate the raw Rust tests to those constructors, matching the already-correct Go raw design
- raw Rust client drift is now removed from the active service surface:
  - `src/crates/netipc/src/service/raw.rs` now exposes `RawClient` instead of the stale internal multi-kind name `CgroupsClient`
  - the raw client is now created only through service-kind-specific constructors:
    - `RawClient::new_snapshot(...)`
    - `RawClient::new_increment(...)`
    - `RawClient::new_string_reverse(...)`
- request kind remains only as envelope validation on the raw client
- the raw Rust Unix/Windows tests now create snapshot, increment, and string-reverse clients explicitly instead of reusing one generic constructor across service kinds
- local Linux Rust proof for that slice is now green:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1`
  - result: `75 passed`, `0 failed`
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
  - result: `299 passed`, `0 failed`
- real `win11` Rust proof for that slice is now green too:
  - `timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
  - result: `176 passed`, `0 failed`, `1 ignored`
- the broader `win11` interop/service/cache matrix initially exposed two more stale constructor residues outside the Rust raw tests:
  - Rust benchmark drivers still imported the deleted raw `CgroupsClient` instead of using the public snapshot facade
    - fixed in:
      - `bench/drivers/rust/src/main.rs`
      - `bench/drivers/rust/src/bench_windows.rs`
  - Go public cgroups wrappers still called the deleted generic raw constructor `raw.NewClient(...)`
    - fixed in:
      - `src/go/pkg/netipc/service/cgroups/client.go`
      - `src/go/pkg/netipc/service/cgroups/client_windows.go`
  - Go benchmark drivers still hand-rolled the stale raw dispatch signature instead of using the single-kind increment adapter
    - fixed in:
      - `bench/drivers/go/main.go`
- the next verified contradiction slice was documentation-heavy and is now resolved:
- low-level SHM / handshake docs now describe the accepted directional negotiation model and the current session-scoped SHM lifecycle:
- request limits are sender-driven
- response limits are server-driven
- SHM capacities are fixed per session
- larger learned capacities require a reconnect and a new session, not in-place SHM resize
    - `docs/level1-wire-envelope.md` no longer says handshake rule 6 takes the minimum of client and server values
    - `docs/level1-windows-np.md` now documents per-session Windows SHM object names with `session_id`, aligned with both code and `docs/level1-windows-shm.md`
    - public L2 comments/docs no longer claim a blanket "retry ONCE":
- ordinary failures still retry once
- overflow-driven resize recovery may reconnect more than once while capacities grow
  - Unix test/script cleanup helpers no longer remove the stale pre-session path `{service}.ipcshm`; they now use per-session cleanup that matches `{service}-{session_id}.ipcshm`
  - validation for this slice is green:
    - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1`
    - result: `75 passed`, `0 failed`
    - `cd src/go && go test -count=1 ./pkg/netipc/service/raw`
    - result: `ok`
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service_interop|test_cache_interop|test_shm_interop)$'`
    - result: `100% tests passed`, `0 failed`
- the next verified residue slice is narrower and fixture-focused:
  - several Unix C/Go fixture cleanup helpers still unlink the dead pre-session path `{service}.ipcshm` instead of using per-session cleanup
  - current proven hits:
    - `tests/fixtures/c/test_service.c`
    - `tests/fixtures/c/test_cache.c`
    - `tests/fixtures/c/test_hardening.c`
    - `tests/fixtures/c/test_chaos.c`
    - `tests/fixtures/c/test_multi_server.c`
    - `tests/fixtures/c/test_stress.c`
    - `src/go/pkg/netipc/service/cgroups/cgroups_unix_test.go`
- that Unix fixture-cleanup residue slice is now resolved:
  - the touched Unix C fixtures now use `nipc_shm_cleanup_stale(TEST_RUN_DIR, service)` instead of unlinking the dead `{service}.ipcshm` path
  - the touched Go public cgroups Unix tests now use `posix.ShmCleanupStale(testRunDirUnix, service)` instead of removing the dead `{service}.ipcshm` path
  - validation for this slice is green:
    - `cd src/go && go test -count=1 ./pkg/netipc/service/cgroups`
    - result: `ok`
    - `cmake --build build --target test_service test_cache test_hardening test_multi_server test_chaos test_stress`
    - result: rebuild passed
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_cache|test_hardening|test_multi_server|test_chaos|test_stress)$'`
    - result: `100% tests passed`, `0 failed`
- one more live Unix fixture contradiction remains after that cleanup pass:
  - `tests/fixtures/c/test_chaos.c:test_shm_chaos()` still opens the dead pre-session SHM path `{run_dir}/{service}.ipcshm`
  - this is not just stale cleanup text; it likely means the SHM-chaos path is not actually targeting the live per-session SHM file today
- that live SHM-chaos contradiction is now resolved:
  - `tests/fixtures/c/test_chaos.c:test_shm_chaos()` now captures the live `session_id` from the ready client session and opens `{run_dir}/{service}-{session_id}.ipcshm`
  - the test no longer treats "SHM file not found" as an acceptable skip on this path
  - validation:
    - `cmake --build build --target test_chaos`
    - result: rebuild passed
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^test_chaos$'`
    - result: `100% tests passed`, `0 failed`
- current residue scan excluding this TODO file is now clean for the main drift markers:
  - no remaining old `{service}.ipcshm` path literals
  - no remaining deleted `CgroupsHandlers` / `CgroupsServer` API references
  - no remaining deleted `raw.NewClient(...)` / `service::raw::CgroupsClient` references
  - no remaining deleted `new_single_kind` / `with_workers_single_kind` references
- broader Unix validation after these cleanup passes is also green:
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop|test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go|test_hardening|test_ping_pong|test_multi_server|test_chaos|test_stress)$'`
  - result: `100% tests passed`, `0 failed`, `19/19` passed
- local Go proof for the wrapper/benchmark cleanup is now green:
  - `cd src/go && go test -count=1 ./pkg/netipc/service/cgroups`
  - result: `ok`
  - `cd bench/drivers/go && go test -run '^$' ./...`
  - result: compile-only pass
- real `win11` build + matrix proof after those residue fixes is now green:
  - `cmake --build build -j4`
  - result: build succeeds again after the Rust/Go constructor cleanup
  - `timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"`
  - result:
    - `test_named_pipe_interop`: passed
    - `test_win_shm_interop`: passed
    - `test_service_win_interop`: passed
    - `test_service_win_shm_interop`: passed
    - `test_cache_win_interop`: passed
    - `test_cache_win_shm_interop`: passed
  - summary: `100% tests passed, 0 tests failed out of 6`
- verified residue scan for the stale constructor names used in this slice is now clean:
  - no remaining `raw.NewClient`
  - no remaining `service::raw::CgroupsClient`
  - no remaining `RawClient::new(`
- a smaller cross-platform residue cleanup is now also complete:
  - the test-only Rust helper `dispatch_single()` in `src/crates/netipc/src/service/raw.rs` is now explicitly marked as dead-code-tolerant under test builds, so Windows lib-test builds no longer emit the stale unused-function warning
  - the remaining public docs/spec wording in this slice was normalized away from the older "method-specific" phrasing where it described the public L2 service surface or service contracts:
    - `docs/level1-transport.md`
    - `docs/codec.md`
    - `docs/level2-typed-api.md`
    - `docs/code-organization.md`
    - `docs/codec-cgroups-snapshot.md`
- local Linux validation after that wording/test-helper cleanup is still green:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
    - result: 299 passed, 0 failed
- real `win11` validation after that cleanup is also still green:
  - `timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
    - result: 176 passed, 0 failed, 1 ignored
  - factual note:
    - the previous Windows-only `dispatch_single` unused-function warning is no longer present in this run
  - the Windows guard output still shows the accepted request-resize behavior:
    - transparent recovery
    - exactly one reconnect
    - negotiated request-size growth
- new verified internal raw-client alignment:
  - fact:
    - the raw managed servers in Go and Rust were already bound to one `expected_method_code`
    - the remaining client-side drift was that one long-lived raw client context still exposed multiple service-kind calls
  - implementation slice now completed in Go:
    - raw Go clients are now created per service kind: `NewSnapshotClient(...)`, `NewIncrementClient(...)`, `NewStringReverseClient(...)`
    - each client now stores one expected request code and rejects wrong-kind calls as validation failures instead of pretending one client can legitimately serve multiple service kinds
    - the cache helpers now bind explicitly to `cgroups-snapshot`
- exact local Unix proof:
  - `cd src/go && go test -count=1 ./pkg/netipc/service/raw`
    - result: ok
- exact real Windows proof on `win11`:
  - `cd ~/src/plugin-ipc.git/src/go && go test -count=1 ./pkg/netipc/service/raw`
  - first rerun exposed one Windows-only missed constructor site:
    - `pkg/netipc/service/raw/shm_windows_test.go:334` still used the stale `NewClient(...)`
  - after correcting that last Windows-only leftover and resyncing:
    - result: ok
- factual conclusion:
  - the Go raw helper layer is now materially aligned with the accepted single-service-kind design on both Unix and Windows
  - remaining work is to carry the same invariant through the remaining Rust raw helper surface
- fact:
  - a full Rust `cargo test --lib` run is still blocked by one unrelated transport failure outside this rewrite slice: `transport::posix::tests::test_receive_batch_count_exceeds_limit`
- remaining heavy work is now concentrated in:
- proving the accepted resize behavior with the full interop/service/cache matrices on Unix and Windows, not just the targeted raw suites
- getting real Windows compile/run proof for the edited Rust/Go/C Windows test surfaces
- reconciling the current C path with the final single-kind + learned-size design language everywhere, then validating all 3 languages together
- `cgroups.plugin` is not an external executable. It runs inside the Netdata daemon: `cgroups_main()` is started from `src/daemon/static_threads_linux.c`.
- `ebpf.plugin` is a separate external executable:
  - built by `add_executable(ebpf.plugin ...)` in `CMakeLists.txt`.
- Current `cgroups.plugin` -> `ebpf.plugin` integration is a custom SHM + semaphore contract:
  - producer: `src/collectors/cgroups.plugin/cgroup-discovery.c`
  - shared structs: `src/collectors/cgroups.plugin/sys_fs_cgroup.h`
  - consumer: `src/collectors/ebpf.plugin/ebpf_cgroup.c`
- The shared payload currently transports cgroup metadata, not PID membership:
  - fields: `name`, `hash`, `options`, `enabled`, `path`
  - `ebpf.plugin` still reads each `cgroup.procs` file itself.
- Netdata already has a stable per-run invocation identifier: `src/libnetdata/log/nd_log-init.c`
  - Netdata reads `NETDATA_INVOCATION_ID`, else `INVOCATION_ID`, else generates a UUID and exports `NETDATA_INVOCATION_ID`.
- External plugins are documented to receive `NETDATA_INVOCATION_ID`: `src/plugins.d/README.md`
- Netdata already exposes plugin environment variables centrally: `src/daemon/environment.c`
- Netdata already has the right build roots for all 3 languages:
  - C via the top-level `CMakeLists.txt`
  - Rust workspace in `src/crates/Cargo.toml`
  - Go module in `src/go/go.mod`
- `plugin-ipc` already has the exact L3 cgroups snapshot API for this use case: `docs/level3-snapshot-api.md`
- The typed snapshot schema closely matches Netdata's current SHM payload: `src/libnetdata/netipc/include/netipc/netipc_protocol.h`
- The C API already supports:
  - managed server lifecycle
  - typed cgroups client/cache
  - POSIX transport with negotiated SHM fast path
- Authentication in `plugin-ipc` is a `uint64_t auth_token`:
  - `src/libnetdata/netipc/include/netipc/netipc_service.h`
  - `src/libnetdata/netipc/include/netipc/netipc_uds.h`
  - Rust/Go implementations use the same concept.
- Phase 1 can replace the metadata transport only.
- Phase 1 will not remove `ebpf.plugin` reads of `cgroup.procs`.
- The default `plugin-ipc` response size is too small for real Netdata snapshots on large hosts, so Linux integration must use an explicit large response limit.
- The best build/distribution model is in-tree vendoring inside Netdata, not an external system dependency.
- Current Netdata payload sizing evidence already proves this:
  - `cgroup_root_max` default is `1000` in `src/collectors/cgroups.plugin/sys_fs_cgroup.c`
  - current per-item SHM body carries `name[256]` and `path[FILENAME_MAX + 1]` in `src/collectors/cgroups.plugin/sys_fs_cgroup.h`
  - `FILENAME_MAX` on this Linux build environment is `4096`
  - this means the current per-item shape is already about `4.3 KiB` before protocol framing/alignment
- The original written phase plan did not describe a multi-method server.
  - Evidence:
    - `TODO-plugin-ipc.history.md`
    - historical phase plan still says: `Define and freeze a minimal v1 typed schema for one RPC method ('increment')`
- The first generated L2 spec also did not need a multi-method server model.
  - Evidence:
    - initial `docs/level2-typed-api.md` from commit `1722f95`
    - handler contract was framed as one typed request view + one response builder per handler callback
    - no raw transport-level switch over multiple method codes in that initial text
- The history TODO already contained the correct service-oriented discovery model.
  - Evidence:
    - `TODO-plugin-ipc.history.md`
    - explicit historical decisions already said:
      - discovery is service-oriented, not plugin-oriented
      - service names are the stable public contract
      - one endpoint per service
      - one persistent client context per service
      - startup order can remain random
      - caller owns reconnect cadence via `refresh(ctx)`
  - Implication:
    - the later multi-method server model was not a missing discussion
    - it was drift away from an already-decided service model
- The first explicit spec drift appears in commit `53b5e5a` on `2026-03-16`.
  - Evidence:
    - `docs/level2-typed-api.md` in commit `53b5e5a`
    - handler contract changed to a raw-byte transport handler with `switch(method_code)` over `INCREMENT`, `STRING_REVERSE`, and `CGROUPS`
    - this is the first clear documentation model where one server endpoint dispatches multiple request kinds
- The first strong implementation-level generalization appears the same day in commit `69bb794`.
  - Evidence:
    - commit message explicitly says: `Add dispatch_increment(), dispatch_string_reverse(), dispatch_cgroups_snapshot()`
    - `docs/getting-started.md` in that commit adds typed helper examples for more than one method family
    - this widened the implementation and examples toward a generic multi-method dispatch surface
- The drift was then reinforced in public examples in commit `6014b0e` on `2026-03-17`.
  - Evidence:
    - `docs/getting-started.md`
    - C example registers: `.on_increment`, `.on_cgroups`
    - Rust example registers: `on_increment`, `on_cgroups`
    - Go example registers: `OnIncrement`, `OnSnapshot`
    - text says: `You register typed callbacks for the supported methods`
- The drift became operationally entrenched in interop in commit `099945b` on `2026-03-16`.
  - Evidence:
    - commit message explicitly says: `Cross-language interop now tests all method types`
    - interop fixtures for C, Rust, and Go on POSIX and Windows all dispatch: `INCREMENT`, `CGROUPS_SNAPSHOT`, `STRING_REVERSE`
- The drift later propagated into current coverage/TODO planning and the repository README.
  - Evidence:
    - `TODO-pending-from-rewrite.md` planned: `snapshot / increment / string-reverse / batch over SHM`
    - `README.md` now says: `servers register typed handlers`
- There is currently no evidence in the TODO history that the original direction from the user was:
- one server should serve multiple request kinds
- The strongest historical evidence points the other way:
- the original phase plan explicitly named one RPC method only
- Working theory:
  - the drift started when the typed API was generalized from one typed request kind per server to one generic server dispatching multiple method codes
  - then examples, interop fixtures, tests, coverage plans, and README text copied that model until it felt normal
- Windows runtime validation host
  - User decision: use `win11` over SSH for real Windows proof instead of stopping at source cleanup or cross-compilation from Linux.
  - Constraint:
    - prefer the already-documented `win11` workflow from this repository's TODOs/docs
    - do not guess the Windows execution flow when the repo already documents it
  - Implication:
    - touched Windows Rust/Go/C transport/service/interop/cache surfaces should now be proven on a real Windows runtime, not just by static review or Linux-hosted cross-compilation
    - the next implementation slice should follow the existing `win11` operational guidance already captured in the repo
- Authentication source
  - User decision: use `NETDATA_INVOCATION_ID` for authentication.
  - Meaning:
    - the auth value changes on every Netdata run
    - only plugins launched under the same Netdata instance can authenticate
  - Evidence:
    - `src/libnetdata/log/nd_log-init.c` creates/exports `NETDATA_INVOCATION_ID`
    - `src/plugins.d/README.md` documents it for external plugins
  - Implication:
    - this is stronger than a machine-stable token for local plugin-to-plugin IPC
    - restarts invalidate old clients automatically
- Source layout in Netdata
  - User decision: native Netdata layout.
  - Layout:
    - C in `src/libnetdata/netipc/`
    - Rust in `src/crates/netipc/`
    - Go in `src/go/pkg/netipc/`
  - Implication:
    - the library becomes a first-class internal Netdata component in all 3 languages
    - future sync from `plugin-ipc` upstream will be manual/curated, not subtree-based
- Invocation ID to auth-token mapping
  - User decision: derive the `plugin-ipc` `uint64_t auth_token` from `NETDATA_INVOCATION_ID` using a deterministic hash.
  - Constraint:
    - the mapping must be identical in C, Rust, and Go
  - Implication:
    - only processes launched under the same Netdata run can authenticate
    - Netdata restart rotates auth automatically
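One way to satisfy the "identical in C, Rust, and Go" constraint is a fixed, well-known 64-bit hash. The sketch below uses 64-bit FNV-1a as an assumed placeholder (this document does not decide the actual hash), and `invocation_id_to_auth_token` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: deterministic NETDATA_INVOCATION_ID -> uint64_t auth_token
 * mapping. FNV-1a 64-bit is an assumed placeholder algorithm; whatever
 * hash is chosen must be reproduced byte-for-byte in Rust and Go. */
static uint64_t invocation_id_to_auth_token(const char *invocation_id) {
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a 64-bit offset basis */
    for (const unsigned char *p = (const unsigned char *)invocation_id; *p; p++) {
        h ^= (uint64_t)*p;
        h *= 0x00000100000001B3ULL;      /* FNV-1a 64-bit prime */
    }
    return h;
}
```

Because the input rotates on every Netdata restart, the derived token rotates with it, which is exactly the same-run-only authentication property this decision asks for.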
- Rollout mode
  - User decision: big-bang switch.
  - Implication:
    - there will be no legacy custom-SHM fallback path for this metadata channel
  - Risk:
    - any bug in the new path blocks `ebpf.plugin` cgroup metadata integration immediately
- Linux response size policy
  - User concern/decision direction:
    - do not accept a large fixed memory cost such as `16 MiB` just for this IPC path
    - prefer dynamic behavior that adapts to actual payload size
    - allocation should happen only when needed
  - Implication:
    - the current `plugin-ipc` response budgeting model needs review before integration
    - response sizing / negotiation may need design changes, not just configuration
- Snapshot overflow handling direction
  - User decision direction:
    - reconnect is acceptable for snapshot overflow handling
    - growth policy should be power-of-two
    - SHM L2 should transparently handle overflow-driven resizing, hidden from both L2 clients and L2 servers
  - User design intent:
    - the server should not need to know the final safe snapshot size before the first request
    - the first real overflow during response preparation should trigger the resize path
    - once the server has learned a larger size from a real snapshot, later clients should negotiate into that larger size automatically
  - Implication:
    - current fixed per-session SHM sizing and current HELLO/HELLO_ACK limit semantics are not sufficient as-is for this Netdata use case
    - the growth mechanism likely needs new L2 protocol behavior, not only implementation tweaks
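The power-of-two growth policy can be sketched as a pure helper: given the current capacity and the size that overflowed, keep doubling until the payload fits. The function name and the ceiling value are illustrative, not part of the accepted design:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the power-of-two growth rule: double the current SHM
 * capacity until `needed` fits. The hard ceiling is only an
 * illustrative guard against absurd requests, not a decided limit. */
static size_t next_shm_capacity(size_t current, size_t needed) {
    const size_t hard_ceiling = (size_t)1 << 30;   /* illustrative 1 GiB guard */
    size_t cap = current ? current : 4096;
    while (cap < needed && cap < hard_ceiling)
        cap *= 2;
    return cap < needed ? 0 : cap;                 /* 0 = refuse to grow */
}
```

Keeping the helper pure makes the doubling rule trivially unit-testable independent of the reconnect mechanics around it.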
- Pre-integration gating
  - User decision:
    - implement this transparent SHM resize behavior in `plugin-ipc` first
    - do not start Netdata integration before it is done
    - require thorough validation first, including full interop matrices across C/Rust/Go on Unix and Windows
  - Verified evidence that the repo already has the right validation scaffolding:
    - POSIX interop tests in `CMakeLists.txt`: `test_uds_interop`, `test_shm_interop`, `test_service_interop`, `test_service_shm_interop`, `test_cache_interop`, `test_cache_shm_interop`
    - Windows interop tests in `CMakeLists.txt`: `test_named_pipe_interop`, `test_win_shm_interop`, `test_service_win_interop`, `test_cache_win_interop`
    - Existing transport-specific integration tests already exist:
      - POSIX SHM: `tests/fixtures/c/test_shm.c`, Rust `src/crates/netipc/src/transport/shm_tests.rs`
      - Windows SHM: `tests/fixtures/c/test_win_shm.c`, Rust `src/crates/netipc/src/transport/win_shm.rs`, Go `src/go/pkg/netipc/transport/windows/shm_test.go`
  - Implication:
    - the resize feature must be proven at:
      - L1 transport level
      - L2 service/client level
      - cross-language interop level
      - both POSIX and Windows implementations
- Design priorities for the resize rewrite
  - User decision:
    - optimize for long-term correctness, reliability, robustness, and performance
    - backward compatibility is not required
    - do not optimize for minimizing work now
    - prefer the right design even if that means a substantial rewrite
  - Implication:
    - decisions should favor clean semantics and maintainability over preserving current handshake/transport structure
    - a third rewrite is acceptable if it produces a better architecture
- User design constraints from follow-up discussion
  - IPC servers should service a single request kind.
  - Sessions should be assumed long-lived:
    - connect once
    - serve many requests
    - disconnect on shutdown or exceptional recovery
- Benchmark refresh slice disposition
  - User decision:
    - commit and push the refreshed benchmark slice now
    - then investigate the remaining benchmark spreads separately
  - Implication:
    - commit only the benchmark-fix, benchmark-artifact, and benchmark-doc sync files from this slice
    - do not mix this commit with unrelated cleanup or integration work
- Current commit scope
  - User decision:
    - commit and push the full remaining work from this task now
  - Implication:
    - stage the remaining drift-removal, SHM-resize, service-kind alignment, test, and doc changes that belong to this task
    - avoid unrelated local or user-owned changes outside this task
- Steady-state fast path matters far more than the rare resize path.
- Learned transport sizes are important:
- adapt automatically
- stabilize quickly
- then remain fixed for the lifetime of the process
- reset on restart
- Separate request and response sizing should exist.
- Variable sizing pressure is expected mainly on responses, not requests.
- Artificial hard caps are not acceptable as a design crutch.
- Disconnect-based recovery is acceptable if it is reliable and the system stabilizes.
- Accepted architecture decisions for the SHM resize rewrite
  - User accepted:
    - L2 service model: single-method-per-server
    - Resize signaling path: explicit `LIMIT_EXCEEDED` signal, then disconnect/reconnect
    - Auto-resize scope: separate learned request and response sizing, both supported
    - Initial size policy: per-server-kind compile-time defaults
    - Learned-size lifetime: in-memory only for the current process lifetime, reset on restart
  - Implication:
    - the current generic multi-method service abstraction is now known design drift
    - the rewrite should simplify transport/service code around one request kind per server
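Under these accepted decisions, the client-side shape of the resize path can be sketched as a bounded retry loop, modeled here against a fake in-process server. All names and structures are hypothetical, not the real netipc API; only the control flow mirrors the accepted model: on `LIMIT_EXCEEDED`, disconnect, double the learned size, reconnect, retry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes mirroring the accepted signaling model. */
typedef enum { ST_OK, ST_LIMIT_EXCEEDED, ST_ERROR } status_t;

/* Fake single-kind server: it needs resp_needed bytes of response
 * capacity and signals LIMIT_EXCEEDED when the session is too small. */
typedef struct { size_t resp_needed; } fake_server_t;

typedef struct {
    size_t learned_resp_cap;   /* learned size, process-lifetime only */
    int    reconnects;         /* how many resize reconnects happened */
} client_t;

static status_t fake_request(const fake_server_t *srv, size_t session_cap) {
    return session_cap >= srv->resp_needed ? ST_OK : ST_LIMIT_EXCEEDED;
}

/* Transparent resize: on LIMIT_EXCEEDED, drop the session, double the
 * learned response capacity, reconnect, retry. Bounded to avoid loops. */
static status_t client_call(client_t *c, const fake_server_t *srv) {
    for (int attempt = 0; attempt < 16; attempt++) {
        status_t st = fake_request(srv, c->learned_resp_cap);
        if (st != ST_LIMIT_EXCEEDED)
            return st;
        c->learned_resp_cap *= 2;   /* power-of-two growth */
        c->reconnects++;            /* disconnect + reconnect happens here */
    }
    return ST_ERROR;
}
```

A second call through the same client reuses the learned size, so the steady-state fast path pays no resize cost, matching the priority that steady state matters far more than the rare resize.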
- Service discovery and availability model
- User clarified the intended service model explicitly:
- clients connect to a service kind, not to a specific plugin implementation
- each service endpoint serves one request kind only
    - example service kinds include: `cgroups-snapshot`, `ip-to-asn`, `pid-traffic`
- the serving plugin is intentionally abstracted away from clients
- User clarified the intended runtime model explicitly:
- plugins are asynchronous peers
- startup order is not guaranteed
- enrichments from other plugins/services are optional
- a client plugin may start before the service it needs exists
- a service may disappear and reappear during runtime
- clients must reconnect periodically and tolerate service absence
- Implication:
- repository docs/specs/TODOs must describe:
- service-name-based discovery
- service-type ownership independent from plugin identity
- optional dependency semantics
- reconnect / retry behavior for not-yet-available services
- repository docs/specs/TODOs must describe:
- Execution mandate for this phase
- User decision:
- proceed autonomously to remove the drift from implementation and docs
- align code, tests, and examples to the single-service-kind model
- implement the accepted SHM size renegotiation / resize behavior
- remove contradictory wording and stale examples that preserve the wrong model
- Implication:
- this is now a repository-wide consistency and implementation task
- active docs, public APIs, interop fixtures, and validation must converge on the same model before Netdata integration
- Request-kind field semantics
- User clarification:
- request type / method code may remain in wire structures and headers
- its role is validation, not public multi-method dispatch
- a service endpoint expects exactly one request kind
- any other request kind must be rejected
- Implication:
- we can keep method codes in the protocol
- service implementations must bind one endpoint to one expected request kind
- public APIs/tests/docs must not imply that one service endpoint accepts multiple unrelated request kinds
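The "validation, not dispatch" role of the method code can be sketched as a tiny guard. Names here are hypothetical and the real netipc handler signature differs; the only real part is the invariant that an endpoint binds exactly one expected request kind.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical method codes and endpoint shape for illustration. */
enum { METHOD_INCREMENT = 1, METHOD_CGROUPS_SNAPSHOT = 3 };

typedef struct {
    uint32_t expected_method_code;  /* bound once at endpoint creation */
    int      handled;               /* requests that reached the handler */
} endpoint_t;

/* Returns 0 on success, -1 when the request kind is rejected. The
 * method code stays in the wire header, but it is only validated;
 * there is no switch over multiple kinds. */
static int endpoint_accept(endpoint_t *ep, uint32_t wire_method_code) {
    if (wire_method_code != ep->expected_method_code)
        return -1;                  /* validation failure, not dispatch */
    ep->handled++;                  /* the single typed handler runs here */
    return 0;
}
```

This keeps wrong-kind requests as hard validation failures, the same behavior the Go raw clients now enforce on the client side.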
- Payload-vs-service boundary
- User clarification:
- if a service needs arrays of things, batching belongs to that service payload/codec
- batching is not a reason for one L2 endpoint to expose multiple public request kinds
- Implication:
- the public L2 service layer should not keep generic multi-method or generic batch dispatch as part of its contract
    - `INCREMENT`, `STRING_REVERSE`, and batch ping-pong traffic can remain at protocol / transport / benchmark level
    - the public cgroups snapshot service should be snapshot-only
- Service naming and endpoint placement
  - Context:
    - POSIX transport needs a service name and run-dir placement.
    - Netdata already has `os_run_dir(true)`.
  - Open question:
    - exact service name/versioning strategy for the cgroups snapshot endpoint
- Exact Linux response-size budget
  - Context:
    - user rejected a large fixed per-connection budget as bad for footprint
    - dynamic/adaptive options must be evaluated against the current `plugin-ipc` design
  - Current hard payload evidence:
    - `1000` cgroups at roughly `4.3 KiB` each already implies multi-megabyte worst-case snapshots
  - Open question:
    - what protocol / implementation change best preserves low idle footprint while still supporting large snapshots
- Dynamic response sizing model
  - Context:
    - current `plugin-ipc` session handshake negotiates `agreed_max_response_payload_bytes` once
    - current implementations then size buffers against that session-wide maximum
  - Verified evidence:
    - handshake uses `min(client, server)` in `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
    - C client allocates request/response/send buffers eagerly in `src/libnetdata/netipc/src/service/netipc_service.c`
    - C server allocates a per-session response buffer sized to the full negotiated maximum in `src/libnetdata/netipc/src/service/netipc_service.c`
    - Linux SHM region size is fixed from negotiated request/response capacities in `src/libnetdata/netipc/src/transport/posix/netipc_shm.c`
    - UDS chunked receive is already dynamically grown with `realloc` in `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
    - Rust and Go clients are already more dynamic and grow buffers lazily in:
      - `src/crates/netipc/src/service/cgroups.rs`
      - `src/go/pkg/netipc/service/cgroups/client.go`
    - Netdata `ebpf.plugin` refreshes cgroup metadata every 30 seconds:
      - `src/collectors/ebpf.plugin/ebpf_process.h`
      - `src/collectors/ebpf.plugin/ebpf_cgroup.c`
  - Decision needed:
    - choose whether to keep the current protocol and improve allocation policy only, or evolve the protocol to support truly dynamic large snapshots
  - Options:
    - A. Keep protocol, make implementation adaptive, and use baseline-only transport for the cgroups snapshot service in phase 1
    - B. Add paginated snapshot requests/responses
    - C. Add out-of-band exact-sized bulk snapshot transfer for large responses
    - D. Keep the current fixed session-wide max model and just configure a large cap
    - E. Keep SHM for data, but negotiate/create SHM capacity per request instead of per session
    - F. Split transport into a tiny control channel plus ephemeral payload channel/object
    - G. Add a small size-probe step before fetching the full snapshot
    - H. Add true server-streamed snapshot responses (multi-message response sequence)
    - I. Allow snapshot responses to return "resize to X bytes and retry", so the client grows once on demand and reuses that larger buffer from then on
    - J. Make SHM L2 transparently reconnect and double capacities on overflow, so resizing is hidden from both clients and servers and the server retains the learned larger size for future sessions
  - Current preferred direction under discussion:
    - J, but it still needs stress-testing against the current HELLO/HELLO_ACK semantics, SHM lifecycle, and L2 retry behavior
- Transparent SHM resize semantics
  - Context:
    - user direction is to make SHM L2 resizing automatic and transparent to both clients and servers
    - reconnect is acceptable and growth should be power-of-two on overflow
  - Verified evidence:
    - current server sends `NIPC_STATUS_INTERNAL_ERROR` on handler/batch failure in `src/libnetdata/netipc/src/service/netipc_service.c`
    - current C/Go/Rust clients treat any non-`OK` response transport status as bad layout / failure:
      - `src/libnetdata/netipc/src/service/netipc_service.c`
      - `src/go/pkg/netipc/service/cgroups/client.go`
      - `src/crates/netipc/src/service/cgroups.rs`
    - `NIPC_STATUS_LIMIT_EXCEEDED` already exists in `src/libnetdata/netipc/include/netipc/netipc_protocol.h`
  - Corrected layering rule from user discussion:
    - transport/L2 may handle overflow signaling, reconnect, and shared-memory remap mechanics
    - replay detection for mutating RPCs belongs to the request payload and the server business logic, not to transport-level semantic dedupe
  - Clarified implication:
    - transport should not try to "understand" whether a mutation was already applied
    - if a mutating method cares about replay safety, it must carry a request identity / idempotency token in its own payload, and the server method must enforce it
  - For the Netdata cgroups snapshot use case:
    - this is not a blocker, because snapshot is read-only
  - Open question:
    - whether transparent reconnect-and-retry should be generic transport behavior for all methods, or exposed as a capability that higher layers opt into when their payload semantics make replay safe
- Negotiation semantics for learned SHM size
  - Context:
    - user correctly rejected the current `min(client, server)` rule for learned snapshot sizing
    - current handshake stores only one scalar per direction, so it cannot distinguish:
      - client hard cap
      - client initial size
      - server learned target size
  - Verified evidence:
    - current HELLO/HELLO_ACK uses fixed `agreed_max_*` fields in:
      - `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
      - `src/crates/netipc/src/transport/posix.rs`
      - `src/crates/netipc/src/transport/windows.rs`
  - Open question:
    - should the protocol split "current operational size" from "hard ceiling", so the server can advertise a learned larger target without losing the client's ability to refuse absurd allocations
- Request-side vs response-side SHM growth asymmetry
  - Verified evidence:
    - POSIX SHM send rejects oversize messages locally before the peer can react: `src/libnetdata/netipc/src/transport/posix/netipc_shm.c`
    - existing tests already cover this class of failure:
      - `tests/fixtures/c/test_shm.c`
      - `tests/fixtures/c/test_service.c` (`test_shm_batch_send_overflow_on_negotiated_limit`)
      - `tests/fixtures/c/test_win_shm.c`
      - `tests/fixtures/c/test_win_service_guards.c`
  - Implication:
    - response-capacity growth can be learned by the server while building a response
    - request-capacity growth cannot be learned the same way, because an oversize request fails client-side before the server sees it
  - Open question:
    - should the first implementation cover:
      - response-side transparent resize only
      - or symmetric request+response resize with separate client-learned request sizing semantics
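Because an oversize request is rejected locally, request-side learning would live in the client rather than the server. A minimal sketch (hypothetical names) where the client grows its own request capacity after a local send rejection and keeps the learned value for the rest of the process:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: request-side growth is client-learned, because the local SHM
 * send path rejects an oversize request before the server sees it. */
typedef struct {
    size_t request_cap;   /* current learned request capacity */
} req_sizer_t;

/* Returns the capacity to reconnect with after a local oversize
 * rejection for a request of `request_len` bytes; power-of-two growth,
 * and the learned value is remembered for subsequent sessions. */
static size_t on_local_send_rejected(req_sizer_t *s, size_t request_len) {
    while (s->request_cap < request_len)
        s->request_cap *= 2;
    return s->request_cap;
}
```

This is the piece that would make the symmetric request+response option workable: the client, not the server, owns the learned request size and renegotiates it into new sessions.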
- Netdata lifecycle ownership details
  - Context:
    - `cgroups.plugin` runs in-daemon
    - `ebpf.plugin` is external
  - Open question:
    - exact daemon init/shutdown points for starting/stopping the `plugin-ipc` cgroups server and for initializing the `ebpf.plugin` client cache
- Audit the current implementation surfaces that still encode multi-method service behavior.
- Define the replacement public model in code terms:
- one service module per service kind
- one endpoint per request kind
- service-specific typed clients/servers/cache helpers
- Redesign SHM resize semantics in implementation terms:
  - explicit `LIMIT_EXCEEDED`
  - disconnect/reconnect recovery
  - separate learned request/response sizes
  - process-lifetime learned sizing
- Rewrite interop/service fixtures and validation scripts to test one service kind per server.
- Rewrite public docs/examples/specs to remove contradictory multi-method wording.
- Run targeted tests first, then the full relevant Unix/Windows matrices required to trust the rewrite.
- Summarize any residual risk or remaining ambiguity before starting Netdata integration work.
- Rerun the current Linux and Windows benchmark matrices on the aligned tree.
- Regenerate benchmark artifacts and update all benchmark-derived docs/README summaries.
- Preserve Level 1 transport interoperability work where still valid.
- Preserve codec/message-family work where it remains useful under a service-oriented split.
- Prefer removal/rename of drifted APIs over keeping compatibility shims, because backward compatibility is not required.
- Keep request-kind and outer-envelope metadata available to single-kind handlers only for:
- validating that the endpoint received the expected request kind
- reading transport batch metadata when a single service kind supports batched payloads
- Do not use that metadata to reintroduce generic multi-method dispatch at the public Level 2 surface.
- If a generic Level 2 helper remains for tests/benchmarks, keep it internal and single-kind:
- one expected request kind per endpoint
- no public multi-method callback surface
- no docs/examples presenting it as a production service model
- C, Rust, and Go unit tests for the rewritten service APIs
- POSIX interop matrix for corrected service identities and SHM resize behavior
- Windows interop matrix for corrected service identities and SHM resize behavior
- Explicit tests for:
- late provider startup
- reconnect after provider restart
- service absence as a tolerated state
- SHM resize on response overflow
- learned-size reuse after reconnect
- request-side and response-side learned sizing behavior
- Keep README, docs specs, and active TODOs aligned with:
- service-oriented discovery
- one request kind per endpoint
- optional asynchronous enrichments
- reconnect-driven recovery
- SHM resize / renegotiation behavior
- Finalize remaining design details above.
- Vendor `plugin-ipc` into Netdata in the chosen native layout.
- Add a Linux `cgroups` typed server inside the Netdata daemon lifecycle.
- Replace the `ebpf.plugin` shared-memory metadata reader with the `plugin-ipc` cgroups cache client.
- Keep existing PID membership logic in `ebpf.plugin` unchanged in phase 1.
- Remove the old custom SHM metadata path as part of the big-bang switch.
- Add tests for:
  - normal metadata refresh
  - stale/restarted Netdata invalidating old clients
  - large snapshots
  - `ebpf.plugin` recovery on server restart
- Phase 1 is Linux-only.
- Phase 1 targets `cgroups.plugin` -> `ebpf.plugin` metadata only.
- Current `collectors-ipc/ebpf-ipc.*` apps/pid SHM remains untouched.
- `NETDATA_INVOCATION_ID` must be available to the `ebpf.plugin` launcher path and any future external clients.
- A deterministic invocation-id hashing helper will be needed in C, Rust, and Go.
- Unit tests for invocation-id to auth-token derivation in C, Rust, and Go.
- Integration test proving only same-run plugins can connect.
- Integration test proving restart rotates auth and old clients fail cleanly.
- Snapshot scale test with high cgroup counts and long names/paths.
- `ebpf.plugin` regression test for existing cgroup discovery semantics.
- Netdata integration design note for the new cgroups metadata transport.
- Developer docs for the new in-tree `netipc` layout and per-language use.
- `ebpf.plugin` and `cgroups.plugin` internal docs describing the new IPC path.
- Rollout/kill-switch documentation if dual-path rollout is selected.
- Verified benchmark-distortion findings before changing code:
  - POSIX `shm-batch-ping-pong` for `c/rust` exceeds the `1.2x` threshold:
    - `c->c = 64,148,960`
    - `c->rust = 58,334,803`
    - `rust->c = 52,277,542`
    - `rust->rust = 48,220,338`
  - The full corrected Linux and Windows matrices also showed broader benchmark-driver artifacts:
    - Go `lookup` benchmark used a synthetic linear scan instead of the actual cache-style hash lookup.
    - Rust `lookup` benchmark used a synthetic linear scan too.
    - Rust cache lookup allocated `name.to_string()` on every lookup.
    - Go and Rust benchmark clients still had hot-loop buffer allocations in batch, pipeline, and ping-pong paths.
- Implemented first remediation pass:
  - `src/crates/netipc/src/service/raw.rs`
    - replaced the flat `(hash, String)` lookup key with nested per-hash maps so Rust cache lookups stop allocating per call
  - `bench/drivers/rust/src/main.rs`
    - removed hot-loop allocations from SHM batch client
    - removed hot-loop allocations from ping-pong client
    - moved pipeline-batch receive buffer allocation out of the outer loop
    - replaced lookup linear scan with hash-map lookup
  - `bench/drivers/rust/src/bench_windows.rs`
    - removed the same hot-loop allocations on Windows
    - replaced lookup linear scan with hash-map lookup
  - `bench/drivers/go/main.go`
    - removed hot-loop allocations from batch, pipeline, pipeline-batch, and ping-pong clients
    - replaced lookup linear scan with hash-map lookup
  - `bench/drivers/go/main_windows.go`
    - removed the same hot-loop allocations on Windows
    - replaced lookup linear scan with hash-map lookup
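The hot-loop allocation fixes above all follow one pattern: hoist the buffer out of the per-message loop and reuse it. A minimal sketch of the pattern with hypothetical names (the real clients do this with their receive/scratch buffers):

```go
package main

import "fmt"

// sumAllocEveryIter models the pre-fix clients: a fresh buffer on every
// iteration keeps allocation and GC pressure on the benchmark hot path.
func sumAllocEveryIter(iters, size int) int {
	total := 0
	for i := 0; i < iters; i++ {
		buf := make([]byte, size) // hot-loop allocation (the bug)
		buf[0] = byte(i)
		total += int(buf[0])
	}
	return total
}

// sumReusedBuf models the fix: allocate once outside the loop and reuse.
// The results are identical; only the allocation count changes.
func sumReusedBuf(iters, size int) int {
	buf := make([]byte, size) // hoisted out of the hot loop
	total := 0
	for i := 0; i < iters; i++ {
		buf[0] = byte(i)
		total += int(buf[0])
	}
	return total
}

func main() {
	fmt.Println(sumAllocEveryIter(10, 64) == sumReusedBuf(10, 64)) // true
}
```

Reuse is safe here only because each iteration fully overwrites the bytes it reads back; buffers that escape the loop (for example, retained by the session) cannot be hoisted this way.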
- Validation after the first remediation pass:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
    - `299 passed, 0 failed`
  - `cd bench/drivers/go && go test -run '^$' ./...`
    - compile-only pass
  - `cd src/go && go test -count=1 ./pkg/netipc/service/raw ./pkg/netipc/service/cgroups`
    - both packages passed
- Targeted Linux rerun after the first remediation pass:
  - `lookup`
    - `c = 173,132,146`
    - `rust = 45,886,102`
    - `go = 47,703,281`
    - fact: the fake benchmark scans are gone; the remaining gap is now in the actual lookup data structures
  - `shm-batch-ping-pong`, target `0`
    - `c->c = 62,314,895`
    - `c->rust = 57,112,806`
    - `rust->c = 51,620,887`
    - `rust->rust = 47,356,599`
    - fact: the Rust client and Rust server penalties are both still real
  - `uds-pipeline-d16`, target `0`
    - `c->c = 721,232`
    - `c->rust = 717,024`
    - `c->go = 572,552`
    - `rust->c = 719,458`
    - `rust->rust = 727,197`
    - `rust->go = 576,525`
    - fact: the remaining delta is mostly a Go server issue, not a client issue
  - `uds-pipeline-batch-d16`, target `0`
    - `c->c = 103,250,763`
    - `c->rust = 91,495,522`
    - `c->go = 51,623,524`
    - `rust->c = 102,367,177`
    - `rust->rust = 89,465,821`
    - `rust->go = 52,915,850`
    - fact: the earlier client-side benchmark distortion is gone; the remaining large delta is mainly the Go server path
- Next concrete fixes identified from code + rerun evidence:
  - Go and Rust cache lookup should mirror the C open-addressing hash table:
    - evidence:
      - C uses `hash ^ djb2(name)` with open addressing in `src/libnetdata/netipc/src/service/netipc_service.c`
      - Go still uses a composite `map[{hash,name}]` in `src/go/pkg/netipc/service/raw/cache.go`
      - Rust still uses nested `HashMap<u32, HashMap<String, usize>>` in `src/crates/netipc/src/service/raw.rs`
    - implication:
      - Go and Rust still pay full runtime string hashing on every lookup while C does not
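The C scheme (`hash ^ djb2(name)` as an integer probe key, open addressing) can be sketched in Go. Table size, linear probing, and field names below are illustrative assumptions, not the actual `netipc_service.c` layout; the point is that probes compare a precomputed integer and only fall back to one string compare on a key hit:

```go
package main

import "fmt"

// djb2 is the classic Bernstein string hash the C implementation combines
// with the caller-supplied hash.
func djb2(s string) uint32 {
	var h uint32 = 5381
	for i := 0; i < len(s); i++ {
		h = h*33 + uint32(s[i])
	}
	return h
}

const tableSize = 1024 // power of two so masking replaces modulo (sizing is an assumption)

type slot struct {
	used  bool
	key   uint32 // precomputed hash ^ djb2(name)
	name  string
	index int
}

type lookupIndex struct{ slots [tableSize]slot }

// put inserts with linear probing; the sketch assumes the table never fills.
func (ix *lookupIndex) put(hash uint32, name string, itemIndex int) {
	key := hash ^ djb2(name)
	for i := key & (tableSize - 1); ; i = (i + 1) & (tableSize - 1) {
		if !ix.slots[i].used {
			ix.slots[i] = slot{true, key, name, itemIndex}
			return
		}
	}
}

// get probes on the integer key; the string compare runs only when the
// integer keys already match, unlike a map[string] lookup which re-hashes
// the string through the runtime hasher on every call.
func (ix *lookupIndex) get(hash uint32, name string) (int, bool) {
	key := hash ^ djb2(name)
	for i := key & (tableSize - 1); ix.slots[i].used; i = (i + 1) & (tableSize - 1) {
		if ix.slots[i].key == key && ix.slots[i].name == name {
			return ix.slots[i].index, true
		}
	}
	return 0, false
}

func main() {
	var ix lookupIndex
	ix.put(7, "cgroup/a", 42)
	fmt.Println(ix.get(7, "cgroup/a")) // 42 true
	fmt.Println(ix.get(7, "cgroup/b")) // 0 false
}
```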
  - Go POSIX UDS transport should mirror the C/Rust vectored send path:
    - evidence:
      - C uses `sendmsg` + two `iovec`s in `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
      - Rust uses `raw_send_iov()` in `src/crates/netipc/src/transport/posix.rs`
      - Go still copies header + payload into a merged scratch buffer in `src/go/pkg/netipc/transport/posix/uds.go`
    - implication:
      - Go server responses on UDS still pay an extra memcpy per message on the hot path
- Next measurement step:
  - apply the lookup-index and Go UDS send fixes
  - rerun only the affected slices first:
    - Linux: `lookup`, `shm-batch-ping-pong`, `uds-pipeline-d16`, `uds-pipeline-batch-d16`
    - Windows: `lookup`, `shm-batch-ping-pong`, `np-pipeline-d16`, `np-pipeline-batch-d16`
  - only after the slice reruns are understood should the full matrices and docs be refreshed again.
- Second targeted Linux rerun after rebuilding the Rust release benchmark:
  - `lookup`
    - `c = 170,976,986`
    - `rust = 150,660,413`
    - `go = 121,278,244`
    - fact:
      - Rust lookup is now near C after mirroring the C open-addressing structure
      - Go lookup improved materially too, but it is still above the `1.2x` threshold versus C
  - `shm-batch-ping-pong`, target `0`
    - `c->c = 60,929,552`
    - `c->rust = 55,151,867`
    - `rust->c = 49,426,036`
    - `rust->rust = 45,104,001`
    - fact:
      - Rust still has a real server-side penalty on this path
      - Rust still has a larger real client-side penalty on this path
  - `uds-pipeline-d16`, target `0`
    - `c->c = 713,563`
    - `c->rust = 720,602`
    - `rust->c = 722,202`
    - `rust->rust = 712,371`
    - `c->go = 548,145`
    - `rust->go = 563,484`
    - fact:
      - Rust is now aligned with C on the non-batch UDS pipeline path
      - the remaining delta is almost entirely the Go server path
  - `uds-pipeline-batch-d16`, target `0`
    - `c->c = 101,588,680`
    - `c->rust = 83,396,588`
    - `rust->c = 99,570,528`
    - `rust->rust = 86,762,291`
    - `c->go = 52,899,078`
    - `rust->go = 51,902,022`
    - fact:
      - Rust client-side is now close to C on this path
      - Rust server-side still shows a real batch-path penalty
      - Go server-side is still the dominant outlier
- Structural batch-path asymmetry verified from code:
  - C managed server exposes a whole-request callback:
    - `src/libnetdata/netipc/include/netipc/netipc_service.h:187-192`
    - callback receives `request_hdr`, full `request_payload`, and whole `response_buf`
  - C benchmark server uses that whole-request callback to batch-specialize increment in one loop:
    - `bench/drivers/c/bench_posix.c:164-216`
    - the callback sees `NIPC_FLAG_BATCH`, loops all items itself, and emits the whole batch response directly
  - Rust managed server exposes only per-item raw dispatch:
    - `src/crates/netipc/src/service/raw.rs:1285-1297`
    - batch handling is then forced through the managed-server loop:
      - `src/crates/netipc/src/service/raw.rs:2002-2047`
      - per item: `batch_item_get()` -> `dispatch_single_internal()` -> `bb.add()`
  - Go managed server exposes the same per-item dispatch shape:
    - `src/go/pkg/netipc/service/raw/types.go:57-59`
    - batch handling is forced through:
      - `src/go/pkg/netipc/service/raw/client.go:903-946`
      - per item: `BatchItemGet()` -> `dispatchSingle()` -> `bb.Add()`
  - fact:
    - the remaining Rust and Go batch server gaps are not just transport issues
    - C can specialize whole-batch increment handling at the callback boundary; Rust and Go cannot
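The asymmetry reduces to two callback shapes. An illustrative Go sketch with hypothetical names (not the real managed-server API): the per-item shape pays one indirect call per batch item inside the server loop, while the whole-request shape pays one call per request and lets the handler own the loop:

```go
package main

import "fmt"

// ItemHandler is the per-item shape (current Go/Rust managed servers):
// the server loop owns batch iteration and calls the handler per item.
type ItemHandler func(item uint64) uint64

func servePerItem(batch []uint64, h ItemHandler) []uint64 {
	out := make([]uint64, len(batch))
	for i, it := range batch { // one indirect call per item on the hot path
		out[i] = h(it)
	}
	return out
}

// BatchHandler is the whole-request shape (the C callback): the handler
// sees the entire batch and can specialize the loop, e.g. batch increment.
type BatchHandler func(batch []uint64, out []uint64)

func serveWholeBatch(batch []uint64, h BatchHandler) []uint64 {
	out := make([]uint64, len(batch))
	h(batch, out) // one call per request; the loop lives in the handler
	return out
}

func main() {
	batch := []uint64{1, 2, 3}
	inc := func(v uint64) uint64 { return v + 1 }
	fmt.Println(servePerItem(batch, inc)) // [2 3 4]
	fmt.Println(serveWholeBatch(batch, func(in, out []uint64) {
		for i, v := range in {
			out[i] = v + 1
		}
	})) // [2 3 4]
}
```

Both shapes produce identical results; the difference is purely where the per-item dispatch cost lands, which is why the gap only shows up on batch-heavy scenarios.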
- Working theory for the remaining Linux gaps:
  - `shm-batch-ping-pong`
    - Rust still has both client-side and server-side cost versus C
    - the server-side part aligns with the batch callback asymmetry above
  - `uds-pipeline-batch-d16`
    - Rust client-side is now nearly aligned with C
    - the remaining Rust delta is mainly server-side batch handling overhead
    - the much larger Go delta is likely server-side too, with the same structural asymmetry plus extra Go dispatch/runtime overhead
- Decision required before the next implementation step:
  - Background:
    - The remaining batch-path gap is now tied to the managed-server design.
    - Any serious fix must choose whether to optimize only the benchmarks or to change the service/server implementation model.
  - Batch server optimization strategy
    - Evidence:
      - C whole-request callback:
        - `src/libnetdata/netipc/include/netipc/netipc_service.h:187-192`
        - `bench/drivers/c/bench_posix.c:164-216`
      - Rust per-item batch loop:
        - `src/crates/netipc/src/service/raw.rs:1285-1297`
        - `src/crates/netipc/src/service/raw.rs:2002-2047`
      - Go per-item batch loop:
        - `src/go/pkg/netipc/service/raw/types.go:57-59`
        - `src/go/pkg/netipc/service/raw/client.go:903-946`
    - A. Benchmark-only fast path
      - Implement dedicated Rust/Go benchmark servers that bypass the managed server for increment batch.
      - Pros:
        - fastest way to measure the upper bound
        - smallest code change
      - Implications:
        - benchmark numbers improve, but the library/server path stays asymmetric
      - Risks:
        - hides a real product/library performance issue
        - docs and benchmarks stop representing real library behavior
    - B. Internal managed-server specialization
      - Keep the external single-kind API shape, but add internal fast paths for known service kinds such as increment batch.
      - Pros:
        - fixes real library behavior
        - avoids large public API churn
        - aligned with one-service-kind servers
      - Implications:
        - managed-server internals become aware of service-kind-specific fast paths
      - Risks:
        - hidden complexity if done ad hoc
        - may still leave the public abstraction less explicit than the implementation
    - C. Explicit service-kind-specific server APIs
      - Redesign Rust/Go managed servers so each service kind gets its own whole-request server callback surface, matching the accepted single-kind architecture.
      - Pros:
        - cleanest long-term design
        - makes the fast path explicit instead of hidden
        - best fit for maintainability and performance
      - Implications:
        - broader API/implementation/test/doc rewrite in Rust and Go
      - Risks:
        - largest scope before the next measurement
    - Recommendation: option C
      - Reason:
        - the evidence shows a real API/implementation asymmetry, not just a hot-loop bug
        - your accepted single-kind-service design already points in this direction
- Priority check raised by Costa:
  - Background:
    - Current benchmark results are already very high in absolute terms.
    - The remaining gaps are real, but fixing them now would require a broader Rust/Go managed-server redesign for batch-heavy paths.
  - Facts:
    - Clean Linux rerun:
      - `lookup`
        - `c = 170,976,986`
        - `rust = 150,660,413`
        - `go = 121,278,244`
      - `shm-batch-ping-pong`
        - `c->c = 60,929,552`
        - `rust->rust = 45,104,001`
      - `uds-pipeline-batch-d16`
        - `c->c = 101,588,680`
        - `rust->rust = 86,762,291`
        - `go->go = 51,355,370`
    - Fact:
      - these are already very high throughputs in absolute terms
      - the remaining work is now mainly about closing relative efficiency gaps, not about making the library viable
  - Working theory:
    - Deferring the remaining batch-path optimization is reasonable if there are more fundamental correctness, architecture, or product-fit issues still open.
    - The benchmark investigation has already done its job by identifying the structural asymmetry and proving where it lives.
- Updated decision from Costa:
  - continue the benchmark investigation for trust in the framework
  - investigate all remaining `>1.20x` differences
  - treat the Rust/Go batch-path asymmetry as already identified, and focus next on the remaining unexplained gaps
- Remaining unexplained Linux gaps after excluding the known batch-path issue:
  - `lookup`
    - `c = 170,976,986`
    - `rust = 150,660,413`
    - `go = 121,278,244`
  - `uds-pipeline-d16`
    - `c->c = 713,563`
    - `c->go = 548,145`
    - `rust->go = 563,484`
    - fact:
      - the Go server remains the unexplained outlier on the non-batch pipeline path
- New concrete finding: Go lookup still pays a by-value item copy on every successful bucket probe
  - Evidence:
    - actual cache lookup:
      - `src/go/pkg/netipc/service/raw/cache.go:122-130`
      - `item := c.items[c.buckets[slot].index]` copies the whole `CacheItem`
    - Go lookup benchmark mirrors the same behavior:
      - `bench/drivers/go/main.go:1133-1136`
      - `bucketItem := cacheItems[lookupIndex[slot].index]` copies the whole struct
    - Rust uses a reference:
      - `src/crates/netipc/src/service/raw.rs:2376-2379`
    - C returns a pointer:
      - `src/libnetdata/netipc/src/service/netipc_service.c:1492-1497`
  - Implication:
    - the current Go lookup gap is still at least partly a real Go implementation issue, not just a benchmark artifact
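The attempted fix is the standard Go distinction between indexing a slice by value and taking the element's address. `CacheItem` below is a stand-in struct, not the real one; a large value type makes the per-probe copy cost visible:

```go
package main

import "fmt"

// CacheItem stands in for the real cache entry; the fixed-size Name field
// makes the struct large enough that copying it per probe is measurable.
type CacheItem struct {
	Hash  uint32
	Name  [64]byte
	Value uint64
}

func main() {
	items := make([]CacheItem, 4)
	items[2].Value = 99

	// Pre-fix probe shape: indexing copies the whole struct out of the slice.
	byValue := items[2]
	byValue.Value = 0            // mutates only the copy
	fmt.Println(items[2].Value)  // 99

	// Post-fix probe shape: take the address, copy only an 8-byte pointer.
	byRef := &items[2]
	byRef.Value = 100            // mutates the slice element in place
	fmt.Println(items[2].Value)  // 100
}
```

As the follow-up measurement below shows, this copy turned out not to be the dominant cost on this path, but the pointer form is still the idiomatic shape for read-mostly probes over large structs.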
- Follow-up measurement on the Go lookup bucket-copy fix:
  - Applied:
    - `src/go/pkg/netipc/service/raw/cache.go`
    - `bench/drivers/go/main.go`
    - changed bucket probes from by-value `CacheItem` copies to pointer/reference access
  - Rerun results:
    - `c = 172,638,775`
    - `rust = 153,518,048`
    - `go = 115,783,444`
  - Fact:
    - the fix had no material positive effect on Go lookup throughput
    - therefore the by-value bucket copy was not the dominant cause of the remaining Go lookup gap
- Go lookup profile after the bucket-copy fix:
  - Evidence:
    - live perf profile of `bench_posix_go lookup-bench`
    - output row: `lookup,go,go,127,972,744`
    - visible hot frames from `/tmp/nipc-go-lookup-perf.data`:
      - `main.runLookupBench` almost all samples
      - `time.runtimeNano` about `8%`
      - `runtime.memequal` about `2%`
  - Fact:
    - no single framework/library helper stands out as the dominant hotspot
    - the operation is so small that benchmark loop overhead and inlining dominate the profile
  - Working theory:
    - the remaining Go lookup gap is not currently a strong signal about the IPC framework itself
    - it is at least partly a benchmark-methodology issue for a tiny in-memory operation
- Go non-batch pipeline server profile:
  - Evidence:
    - live perf profile of `bench_posix_go uds-ping-pong-server` under `uds-pipeline-d16` load from a C client
    - client result during profile: `uds-pipeline-d16,c,c,567,061,...`
    - hot frames from `/tmp/nipc-go-server-perf.data`:
      - `Session.Send` about `39.5%`
      - `Session.Receive` about `33.8%`
      - `raw.pollFd` about `23.1%`
      - increment dispatch does not materially appear
  - Fact:
    - the remaining Go server gap on `uds-pipeline-d16` is not in increment handler logic
    - it is dominated by the Go UDS server transport/poll path
  - Supporting fact:
    - Go as a client on the same scenario is only slightly slower than C/Rust:
      - `go->c = 699,976` vs `c->c = 713,563`
      - `go->rust = 685,614` vs `c->rust = 720,602`
    - implication:
      - the big remaining gap is mainly server-side, and `pollFd` is the strongest server-only suspect
- New concrete finding: Go non-batch server gap is transport/poll dominated, not dispatch dominated
  - Evidence:
    - live perf profile of `bench_posix_go uds-ping-pong-server` under `uds-pipeline-d16` load
    - hot path breakdown from `/tmp/nipc-go-server-perf.data`:
      - `Session.Send` about `39.5%`
      - `Session.Receive` about `33.8%`
      - `raw.pollFd` about `23.1%`
      - increment dispatch does not materially appear in the hot path
  - Working theory:
    - the remaining Go server delta on `uds-pipeline-d16` is in the Go UDS server transport/wrapper path, especially `poll + recvmsg + sendmsg`, not in the increment handler logic
- TL;DR:
  - rerun the full official benchmark suites on the current worktree for both Linux and Windows
  - regenerate the checked-in benchmark artifacts from those reruns
  - compare the refreshed Linux and Windows matrices and flag any materially strange language deltas
  - review and follow the existing repo TODO guidance for the real Windows `win11` benchmark workflow
- Analysis:
  - current checked-in benchmark artifacts are from `2026-03-25`:
    - `benchmarks-posix.md`
    - `benchmarks-windows.md`
    - `README.md`
  - the official full-matrix runners are:
    - Linux: `tests/run-posix-bench.sh` and `tests/generate-benchmarks-posix.sh`
    - Windows: `tests/run-windows-bench.sh` and `tests/generate-benchmarks-windows.sh`
  - the verified Windows execution guidance already exists in repo TODOs and README:
    - `README.md:342-365`
    - `TODO-pending-from-rewrite.md:2754-2849`
  - current runner/generator methodology facts for Windows trustworthiness:
    - `tests/run-windows-bench.sh` currently writes exactly one CSV row per benchmark cell:
      - `run_pair()` parses one client result and immediately appends it to `OUTPUT_CSV`
      - there is no built-in repetition, aggregation, or instability gate
    - `tests/generate-benchmarks-windows.sh` validates completeness and floors, but it trusts each CSV row as final truth:
      - it has no notion of repeated samples, medians, spread, or outlier detection
    - implication:
      - a single noisy Windows measurement can currently become the published benchmark artifact if it still parses and keeps throughput above zero
  - benchmark methodology references gathered before changing the Windows workflow:
    - Google Benchmark user guide:
      - repeated benchmarks exist because a single result may not be representative when benchmarks are noisy
      - when repetitions are used, mean / median / standard deviation are reported
      - source examined: `/tmp/google-benchmark-20260326/docs/user_guide.md`
    - Criterion.rs analysis and user guide:
      - noisy runs should be treated skeptically
      - longer measurement time reduces the influence of outliers
      - outlier classification is a first-class part of reliable benchmark analysis
      - sources examined:
        - `/tmp/criterion-rs-20260326/book/src/user_guide/command_line_output.md`
        - `/tmp/criterion-rs-20260326/book/src/analysis.md`
  - verified workflow facts from the repo guidance:
    - real Windows benchmark proof is expected on `win11`, not via Linux cross-compilation
    - login shell may start as `MSYSTEM=MSYS`; benchmark runs should set:
      - `PATH="/c/Users/costa/.cargo/bin:/c/Program Files/Go/bin:/mingw64/bin:$PATH"`
      - `MSYSTEM=MINGW64`
      - `CC=/mingw64/bin/gcc`
      - `CXX=/mingw64/bin/g++`
    - official Windows benchmark commands are:
      - `bash tests/run-windows-bench.sh benchmarks-windows.csv 5`
      - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
  - the current local worktree is not clean and includes benchmark-related source edits:
    - `bench/drivers/go/main.go`
    - `bench/drivers/go/main_windows.go`
    - `bench/drivers/rust/src/main.rs`
    - `bench/drivers/rust/src/bench_windows.rs`
    - plus service/transport files that can affect benchmark behavior
  - implication:
    - the refreshed artifacts must reflect this exact current tree
    - benchmark interpretation must distinguish:
      - real implementation/runtime asymmetry
      - normal platform differences
      - measurement distortion or stale artifact drift
- Decisions:
  - no new user decision required before execution
  - using the existing official full-suite runners is the correct path
  - using the existing real `win11` workflow is the correct Windows path
- Plan:
  - run the full Linux benchmark suite locally on the current tree
  - regenerate `benchmarks-posix.md`
  - run the full Windows benchmark suite on `win11` using the documented native-toolchain environment
  - regenerate `benchmarks-windows.md`
  - compare refreshed CSVs and summarize the largest cross-language spreads by scenario
  - classify strange deltas as:
    - expected platform/runtime behavior
    - suspicious and possibly measurement-related
    - suspicious and likely implementation-related
  - update benchmark-derived docs if the refreshed artifacts materially change the published snapshot
  - for the Windows trustworthiness fix:
    - change the Windows runner to collect multiple measured repetitions per benchmark cell instead of trusting a single sample
    - aggregate repeated samples into one publication row using a robust statistic instead of one lucky or unlucky run
    - preserve a fail-closed path:
      - if repeated Windows samples for a cell diverge beyond a configured spread threshold, fail the run instead of publishing that cell
    - keep the published CSV shape stable if possible, so the existing generator/report consumers do not need a schema rewrite just to gain trustworthiness
- Implied decisions:
  - benchmark duration remains the documented default `5` seconds unless the runner fails and forces a diagnostic rerun
  - the first full pass should use the official artifact filenames:
    - `benchmarks-posix.csv`
    - `benchmarks-posix.md`
    - `benchmarks-windows.csv`
    - `benchmarks-windows.md`
  - if Windows artifacts are produced remotely, copy them back into this repo without resetting unrelated local files
- Testing requirements:
  - Linux benchmark CSV must contain `201` data rows and pass the generator validation
  - Windows benchmark CSV must contain `201` data rows and pass the generator validation
  - refreshed artifacts must have no duplicate scenario keys and no zero-throughput rows
- Documentation updates required:
  - update the checked-in benchmark markdown files to match the refreshed CSVs
  - update `README.md` only if the published generated dates, machine snapshot, or headline benchmark ranges are no longer true after the refresh
- Execution results:
  - reviewed Windows benchmark handoff guidance before execution:
    - `README.md:342-365`
    - `TODO-pending-from-rewrite.md:2754-2849`
  - Linux benchmark refresh completed successfully on the current worktree:
    - commands:
      - `cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_posix`
      - `bash tests/run-posix-bench.sh benchmarks-posix.csv 5`
      - `bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md`
    - result:
      - `201` rows
      - generator passed
      - all configured POSIX floors passed
  - Windows benchmark refresh completed on the `win11` native MSYS/MinGW toolchain path:
    - disposable synced tree: `/tmp/plugin-ipc-bench-20260326`
    - commands:
      - `cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows`
      - `bash tests/run-windows-bench.sh benchmarks-windows.csv 5`
      - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
    - factual result:
      - benchmark runner completed `201` rows
      - generator wrote `benchmarks-windows.md`
      - generator exited non-zero because of one floor violation:
        - `shm-ping-pong rust->c @ max = 850,994`
        - configured floor: `1,000,000`
- new user requirement after the unstable Windows reruns:
  - make Windows benchmarks trustworthy instead of relying on single noisy runs
  - allowed direction from user:
    - increase duration
    - run multiple repetitions
    - use any stronger methodology needed, as long as the published Windows benchmark artifacts become trustworthy
  - fit-for-purpose clarification:
    - Windows benchmark artifacts must be publication-grade on `win11`
    - single-run outliers must not be able to define the checked-in benchmark matrix
- Windows trustworthiness implementation now applied locally:
  - `tests/run-windows-bench.sh`
    - new default: `5` measured samples per Windows benchmark cell
    - each published CSV row is now the median aggregate of those samples
    - the runner now persists per-cell repeated samples in `RUN_DIR` during execution
    - initial implementation used a blunt raw spread gate:
      - fail if `max(sample_throughput) / min(sample_throughput) > 1.35`
  - `tests/generate-benchmarks-windows.sh`
    - markdown output now states that the current Windows report is based on repeated aggregated measurements instead of one single sample
- targeted proof of the new Windows trust method on `win11`:
  - synced the updated Windows runner/generator into the same disposable proof tree:
    - `/tmp/plugin-ipc-bench-20260326`
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-trust.csv 5`
  - factual result:
    - completed successfully with the new 5-sample median path
    - no stability-gate failure
    - the previously suspicious rows are now stable:
      - `shm-ping-pong rust->c @ max = 2,527,551`
      - `shm-ping-pong rust->rust @ 10000 = 9,999`
    - all reported SHM sample ratios observed during that proof stayed well below the `1.35` gate
  - implication:
    - the old single-shot Windows SHM collapses were publication-methodology failures
    - with repeated measurement + median aggregation + spread gating, the same `win11` host now produces a stable SHM matrix
- first stability-gate refinement after proof runs on `win11`:
  - fact:
    - the initial raw `max/min` gate was too blunt for legitimate runs with one obvious transient outlier
  - evidence:
    - repeated sample file from the first full repeated run:
      - `/tmp/netipc-bench-300472/samples-np-ping-pong-c-go-100000.csv`
    - measured throughputs:
      - `17,798` `19,059` `15,586` `6,741` `18,303`
  - implication:
    - one bad transient sample should not discard the whole row if the remaining samples agree tightly
  - attempted follow-up:
    - a Tukey-style outlier fence was tested next
    - fact:
      - with only `5` samples, that approach was too aggressive and incorrectly marked normal edge values as outliers
    - evidence:
      - repeated sample file:
        - `/tmp/netipc-bench-287769/samples-np-ping-pong-go-c-0.csv`
      - measured throughputs:
        - `17,419` `18,049` `18,078` `18,229` `18,533`
    - implication:
      - the real spread there is only about `1.06x`, so that row is stable and should be published
- final trust method now applied locally after those proof runs:
  - `tests/run-windows-bench.sh`
    - keep `5` measured samples per published row
    - publish medians for throughput and latency/CPU columns
    - when there are at least `5` samples, drop exactly one lowest and one highest throughput sample before the stability check
    - require the remaining stable core to contain at least `3` samples
    - require stable-core throughput spread: `stable_max / stable_min <= 1.35`
    - if the raw extremes are noisy but the stable core is good:
      - publish the row
      - print a warning that records both raw and stable spreads
  - `tests/generate-benchmarks-windows.sh`
    - methodology text updated to describe the stable-core rule instead of the original raw-spread wording
- second stability-gate refinement after full-suite evidence on `win11`:
  - fact:
    - the first repeated full-suite rerun still found a real unstable case at `5s` max-throughput duration:
      - `snapshot-shm rust->go @ max`
  - evidence:
    - repeated sample file:
      - `/tmp/netipc-bench-300472/samples-snapshot-shm-rust-go-0.csv`
    - measured throughputs:
      - `1,042,824` `977,680` `648,337` `367,491` `1,027,273`
    - stable core after dropping one low and one high sample:
      - `648,337` `977,680` `1,027,273`
    - stable-core ratio: `1.584474`
  - implication:
    - repeated measurement alone was not enough for all Windows max-throughput rows
    - some max rows needed a longer measurement window, not just more samples
  - max-throughput duration refinement now applied locally:
    - `tests/run-windows-bench.sh`
      - fixed-rate rows still use the CLI duration default: `5s`
      - max-throughput rows now use a separate default duration: `NIPC_BENCH_MAX_DURATION=10`
      - the runner logs both durations at startup
- targeted proof on `win11` for the previously failing case:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=4 NIPC_BENCH_LAST_BLOCK=4 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/snapshot-shm-10s.csv 10`
  - factual result:
    - previously failing `snapshot-shm rust->go @ max` became stable:
      - median throughput `1,053,376`
      - stable-core ratio `1.018280`
    - another noisy row also stabilized after trimming one low and one high sample:
      - `snapshot-shm rust->c @ max`
      - raw range: `460,343 .. 1,167,598`
      - stable-core range: `1,109,218 .. 1,133,875`
      - stable-core ratio: `1.022229`
  - implication:
    - the final trustworthy Windows method is now:
      - repeated measurement
      - median publication
      - stable-core gating
      - longer max-throughput samples
- final proof run status after the trust-method changes:
  - full-suite rerun now in progress on `win11` with the final method:
    - fixed-rate rows: `5 samples x 5s`
    - max-throughput rows: `5 samples x 10s`
    - stability rule:
      - publish only if the trimmed stable core stays within `1.35x`
  - live confirmed progress:
    - `np-ping-pong` block completed cleanly under the final method
    - `shm-ping-pong` block started cleanly under the final method
- first full repeated rerun with the `10s` max default found one remaining unstable row late in the suite:
  - scenario: `np-pipeline-batch-d16 rust->rust @ max`
  - preserved sample file:
    - `/tmp/netipc-bench-331471/samples-np-pipeline-batch-d16-rust-rust-0.csv`
  - measured throughputs:
    - `37,400,757` `31,635,302` `26,609,207` `39,324,202` `24,312,207`
  - trimmed stable core: `26,609,207 .. 37,400,757`
  - stable-core ratio: `1.405557`
  - implication:
    - the runner correctly failed closed
    - the remaining instability was no longer global Windows SHM noise
    - it was narrowed to `np-pipeline-batch @ max` on `win11`
- targeted proof for the remaining pipeline-batch max instability:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=9 NIPC_BENCH_LAST_BLOCK=9 NIPC_BENCH_MAX_DURATION=20 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv 5`
  - factual result:
    - the full `np-pipeline-batch-d16` matrix passed cleanly at `20s`
    - previously failing row became stable:
      - `rust->rust @ max = 34,184,748`
      - stable-core ratio `1.064913`
    - previously noisy `go->c @ max` also tightened materially:
      - `38,364,026`
      - stable-core ratio `1.024521`
  - implication:
    - the remaining issue was short-window measurement noise for `np-pipeline-batch @ max`
    - a longer max window fixes it without relaxing the trust gate
- final Windows trust method now applied locally:
  - `tests/run-windows-bench.sh`
    - fixed-rate rows: `5s`
    - most max-throughput rows: `10s`
    - `np-pipeline-batch-d16 @ max`: `20s`
    - runner knobs now include:
      - `NIPC_BENCH_MAX_DURATION`
      - `NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION`
  - `tests/generate-benchmarks-windows.sh`
    - methodology section now documents the `20s` pipeline-batch max window explicitly
- final published Windows artifact assembly:
  - full repeated rerun output from:
    - `/tmp/plugin-ipc-bench-20260326/benchmarks-windows.csv`
    - used for all stable rows outside `np-pipeline-batch-d16`
    - notable publishable warning retained from that full rerun:
      - `np-pipeline-d16 go->c @ max`
      - raw range: `111,201 .. 255,780`, raw ratio `2.300159`
      - trimmed stable core: `234,582 .. 241,982`, stable ratio `1.031545`
      - implication:
        - the outlier-handling path is doing real work on `win11`
        - the published median row is still trustworthy because the stable core stayed tight
  - targeted validated `20s` rerun output from:
    - `/tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv`
    - used to replace the incomplete/unstable `np-pipeline-batch-d16` block
  - locally assembled final CSV:
    - `202` lines total
    - `201` data rows
    - scenario counts all correct
  - local validation:
    - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
    - result:
      - all configured Windows floors pass
      - report generation passes cleanly
- follow-up approved by Costa after the first trustworthy publish:
  - run one fresh full Windows suite on `win11` with the current default methodology
  - objective:
    - remove the remaining "assembled artifact" caveat if the one-shot full run now passes end to end
  - execution rule:
    - sync the current local benchmark-related sources to the disposable `win11` proof tree first
    - only replace the checked-in Windows CSV/MD if that single fresh rerun passes with all floors green
- current fresh-proof-tree rerun on `win11` uses a new disposable tree based on `origin/main` plus the current local benchmark-related worktree files overlaid onto it:
  - fresh tree: `/tmp/plugin-ipc-bench-20260327-fullrun-150313`
  - factual setup issue discovered before the real rerun:
    - `tests/run-windows-bench.sh` builds the C and Go benchmark binaries itself, but it only consumes an already-built Rust benchmark binary
    - on a fresh disposable tree, the first launch printed:
      - `Rust benchmark binary not found: .../src/crates/netipc/target/release/bench_windows.exe (Rust tests will be skipped)`
    - implication:
      - a fresh tree needs an explicit Rust build before the full Windows benchmark suite, or the run degrades to a 2-language matrix and is not publishable
  - corrective action applied on `win11` before the real rerun:
    - `cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows`
  - real rerun then restarted from the same fresh tree with diagnostics enabled:
    - `NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv 5`
- live evidence from the ongoing one-shot full rerun:
  - no new diagnostics summary file has appeared so far
  - block `1` (np-ping-pong) is already materially clean end to end:
    - `np-ping-pong c->c @ max = 19,627`, `stable_ratio=1.018133`
    - `np-ping-pong rust->c @ max = 19,880`, `stable_ratio=1.045638`
    - `np-ping-pong go->go @ max = 19,195`, with one low and one high outlier trimmed, `stable_ratio=1.098122`
  - all published `10000/s` rows reached target cleanly; examples:
    - `rust->c = 9,999`, `stable_ratio=1.000000`
    - `rust->rust = 9,999`, `stable_ratio=1.000000`
    - `go->go = 10,000`, `stable_ratio=1.000000`
  - the first published `1000/s` rows are also landing at target:
    - `go->c = 1,000`, `stable_ratio=1.000000`
    - `go->go = 1,000`, `stable_ratio=1.000000`
  - the rerun has already crossed into the historically suspicious SHM block without reproducing the old collapse:
    - `shm-ping-pong c->c @ max = 2,565,990`, `stable_ratio=1.042022`
    - `shm-ping-pong rust->c @ max = 2,443,021`, `stable_ratio=1.089130`
    - `shm-ping-pong c->rust @ max = 2,611,306`, `stable_ratio=1.071212`
    - `shm-ping-pong rust->rust @ max = 2,617,581`, `stable_ratio=1.027963`
    - `shm-ping-pong go->rust @ max = 2,327,904`, `stable_ratio=1.012447`
- factual interim conclusion:
  - the current one-shot full rerun is already materially stronger evidence than the older failing full runs
  - the earlier full-suite `shm-ping-pong rust->c` collapse is not reproducing on the same `win11` host after the current lifecycle and Windows SHM fixes
- live continuation coordinates for the long one-shot rerun:
  - `win11` source tree: `/tmp/plugin-ipc-bench-20260327-fullrun-150313`
  - live output files:
    - CSV: `/tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv`
    - log: `/tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.log`
- last verified progress in this session:
  - `75` lines in the CSV (`74` data rows)
  - blocks `1` and `2` completed cleanly
  - block `3` (snapshot-baseline) had started and was publishing stable `@ max` rows:
    - `c->c = 19,872`, `stable_ratio=1.029521`
    - `rust->c = 19,291`, `stable_ratio=1.043116`
  - no new diagnostics summary file had appeared yet
- later checkpoint from the same still-running one-shot rerun:
  - `121` lines in the CSV (`120` data rows)
  - blocks `1` through `4` had already cleared cleanly and the run had advanced deep into block `5` (np-batch-ping-pong)
  - live batch evidence:
    - `np-batch-ping-pong c->go @ max = 7,699,399`, `stable_ratio=1.045676`
    - `np-batch-ping-pong rust->go @ max = 7,532,805`, `stable_ratio=1.018880`
    - `np-batch-ping-pong go->go @ max = 7,152,856`, `stable_ratio=1.030591`
    - `np-batch-ping-pong c->c @ 100000/s = 7,693,465`, `stable_ratio=1.011300`
    - `np-batch-ping-pong rust->c @ 100000/s = 7,497,010`, `stable_ratio=1.015083`
  - no new diagnostics summary file had appeared yet at this checkpoint either
- completed outcome of the clean one-shot Windows rerun:
  - the long `win11` one-shot rerun finished cleanly
  - final CSV size: `202` logical lines (`201` data rows)
  - no new diagnostics summary file was produced during this rerun
  - `tests/generate-benchmarks-windows.sh` passed on `win11` against the final CSV: `All performance floors met`
  - the final generated report was copied back into the repo as:
    - `benchmarks-windows.csv`
    - `benchmarks-windows.md`
  - the same generator also passed locally after copying the artifacts back:
    - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
    - result: `All performance floors met`
- user-approved follow-up after the successful one-shot rerun:
  - commit the Windows artifact refresh and the TODO update as a separate git commit
  - do not include unrelated dirty files from the broader worktree
- user-approved follow-up after the local commit:
  - push commit `768cca3` to `origin/main`
  - do not include any of the remaining unrelated dirty files
- implication:
  - the remaining "assembled artifact" caveat is now removed
  - the checked-in Windows artifacts now come from a single clean one-shot full rerun on `win11`
- stable final Windows max-throughput spreads from that clean one-shot artifact:
  - `shm-ping-pong`:
    - best: `rust->rust = 2,617,581`
    - worst: `go->go = 2,113,834`
    - spread: `1.238x`
    - conclusion: no strange SHM collapse remains in the final clean artifact
  - `lookup`:
    - best: `rust = 176,259,707`
    - worst: `go = 98,385,649`
    - spread: `1.792x`
  - `np-pipeline-d16`:
    - best: `go->rust = 240,205`
    - worst: `c->go = 216,940`
    - spread: `1.107x`
  - `np-pipeline-batch-d16`:
    - best: `go->c = 39,065,948`
    - worst: `c->go = 27,896,181`
    - spread: `1.400x`
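The spread figures above are simply the best/worst max-throughput ratio inside one benchmark's language matrix. A one-line derivation using the `shm-ping-pong` rows from the clean one-shot artifact:

```shell
# spread = best / worst max throughput within a benchmark's language matrix,
# here the Windows shm-ping-pong rows from the clean one-shot artifact
best=2617581    # rust->rust
worst=2113834   # go->go
spread=$(awk -v b="$best" -v w="$worst" 'BEGIN { printf "%.3fx", b / w }')
echo "spread = $spread"
```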
- first one-shot full rerun attempt with the current defaults did not produce a clean replacement artifact:
  - partial output path: `/tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot.csv`
  - factual failure observed during block `1`:
    - `np-ping-pong rust->rust @ 1000/s`
    - Rust client exited non-zero
    - streamed client output reported: `client: 4207 errors`
    - partial line: `np-ping-pong,rust,rust,159,75.500,177.400,177.400,5.6,0.0,5.6`
  - implication:
    - the one-shot rerun cannot replace the current published Windows artifact
    - before attempting another full rerun, the new failure should be isolated on block `1` to determine whether it is reproducible or a one-off transport/runtime glitch
- isolated recheck of block `1` completed cleanly on the same `win11` proof tree:
  - command: `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/block1-recheck.csv 5`
  - output path: `/tmp/plugin-ipc-bench-20260326/block1-recheck.csv`
  - factual result:
    - all `36` block-1 measurements completed with exit code `0`
    - the previously failing row completed cleanly: `np-ping-pong rust->rust @ 1000/s = 1000`, `p50=66.200us`, `p95=248.200us`, `p99=369.500us`, `stable_ratio=1.000000`
  - implication:
    - the first one-shot block-1 failure is not immediately reproducible
    - this currently looks like a transient host/runtime glitch, not established deterministic instability in the `rust->rust @ 1000/s` pair
    - the next valid check is another clean one-shot full Windows rerun with the same default methodology
- second one-shot full rerun with the current defaults also failed to produce a clean replacement artifact:
  - partial output path: `/tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot-2.csv`
  - factual failure observed during block `2`:
    - `shm-ping-pong rust->c @ max`
    - repeated-sample file: `/tmp/netipc-bench-410987/samples-shm-ping-pong-rust-c-0.csv`
    - repeated throughputs: `618,076`, `618,160`, `1,951,036`, `2,303,714`, `2,476,081`
    - stable-core gate result: `stable_min=618,160`, `stable_max=2,303,714`, `stable_ratio=3.726728`
    - configured max: `1.35`
  - implication:
    - the current default methodology still does not guarantee a clean one-shot full Windows run on `win11`
    - the blocker has moved from a random-looking block-1 client failure to a concrete SHM max-throughput instability event
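The gate decision for this row can be reproduced from the five repeated throughputs. The sketch below assumes the "stable core" is the middle three of five sorted samples, which is an assumption, but one that reproduces the logged `stable_min`/`stable_max` exactly (the real logic lives in `tests/run-windows-bench.sh`):

```shell
# Sketch of the stable-core gate (core = middle 3 of 5 sorted repeats is an
# assumption; it matches the logged stable_min/stable_max for this row).
samples="618076 618160 1951036 2303714 2476081"
core=$(printf '%s\n' $samples | sort -n | sed -n '2,4p')
stable_min=$(printf '%s\n' $core | head -n 1)
stable_max=$(printf '%s\n' $core | tail -n 1)
stable_ratio=$(awk -v mx="$stable_max" -v mn="$stable_min" 'BEGIN { printf "%.6f", mx / mn }')
# configured max ratio is 1.35, so this row fails the gate
awk -v r="$stable_ratio" 'BEGIN { exit (r <= 1.35) }' \
    && echo "UNSTABLE stable_ratio=$stable_ratio" \
    || echo "stable stable_ratio=$stable_ratio"
```

With the two slow repeats near `618k`, even the trimmed core spans `618,160 .. 2,303,714`, so no amount of outlier trimming rescues the row.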
- focused reproduction of the same SHM pair in isolation did not reproduce the collapse:
  - direct pair under the same synced `win11` tree:
    - C server: `bench_windows_c.exe shm-ping-pong-server`
    - Rust client: `bench_windows.exe shm-ping-pong-client`
  - isolated `rust -> c @ max` repeated `10` times with `10s` samples:
    - throughput range: `2,446,407 .. 2,578,450`
    - all `10` runs stayed in the fast band
  - isolated `rust -> c @ max` repeated `10` times with `20s` samples:
    - throughput range: `2,363,335 .. 2,589,588`
    - all `10` runs stayed in the fast band
  - implication:
    - the SHM collapse is not a simple deterministic `rust client -> c server` bug
    - longer isolated samples are stable, but that alone does not explain the one-shot full-run failure
- sequence test also failed to reproduce the SHM collapse:
  - setup:
    - one `c -> c @ max` SHM prime run
    - followed immediately by `5` direct `rust -> c @ max` SHM runs
    - repeated for `5` cycles on the same `RUN_DIR`
  - factual result:
    - all `25` post-prime `rust -> c` runs stayed in the fast band: `2,357,337 .. 2,664,284`
  - implication:
    - the failure is not explained by a simple "previous `c -> c` SHM row poisons the next `rust -> c` row" theory
    - current best description:
      - rare transient host/runtime glitch during full-matrix execution on `win11`
      - not immediately reproducible in dedicated pair or simple sequence tests
- pending user decision before more Windows runner code changes:
  - context:
    - Costa asked for trustworthy Windows benchmarks
    - current state is better than before, but a clean one-shot full run is still not guaranteed
  - user constraint raised during decision review:
    - automatic retries must not hide real failures or real bugs
    - if retries are ever used, first-attempt failures must remain visible and reportable
  - user decision:
    - keep the main Windows benchmark publication path fail-closed
    - do not add silent self-healing retries to publish mode
    - add a separate diagnostic mode that can rerun failed rows in isolation
    - diagnostic mode must preserve and report the original first-attempt failure evidence side by side with any diagnostic rerun evidence
- option A:
  - add automatic per-row retry on Windows when a row fails because of a client error or a stability-gate failure
  - keep the current `5`-sample median + `1.35` stable-core gate inside each attempt
  - implications:
    - one transient bad row no longer destroys a 2-hour full run
    - a row is still published only if a full fresh attempt passes the same gate
  - risks:
    - published rows may come from retry attempt `2` or `3`, not from the first pass
    - the report and logs must say that retries happened, or the methodology becomes misleading
- option B:
  - keep fail-closed behavior, but increase Windows SHM max collection further:
    - for example `20s` per sample and/or `7`-`9` repeats
  - implications:
    - simpler story than retries
    - every accepted row is still strictly one attempt
  - risks:
    - much longer full-suite runtime
    - evidence so far does not prove that longer duration alone fixes the rare full-run glitch
- option C:
  - keep the current runner and accept targeted reruns / assembled Windows artifacts when one-shot full runs glitch
  - implications:
    - fastest operationally
    - still produces trustworthy rows when each replacement row is validated carefully
  - risks:
    - no clean single-command reproduction
    - more manual work and more caveats around publication
- accepted direction:
  - strict publish mode plus separate diagnostic reruns
  - rationale:
    - failures stay visible
    - diagnostic reruns can still accelerate root-cause work without turning the publication path into silent self-healing
- implemented Windows diagnostic mode for failed rows:
  - file: `tests/run-windows-bench.sh`
  - new behavior:
    - publish mode remains fail-closed by default
    - opt-in diagnostics via `NIPC_BENCH_DIAGNOSE_FAILURES=1`
    - when a row fails in publish mode:
      - the original failure remains authoritative
      - the original `RUN_DIR` and first-attempt sample file remain preserved
      - the same row is rerun in an isolated diagnostic subdirectory under the preserved `RUN_DIR`
      - diagnostic rerun output is recorded in `${RUN_DIR}/diagnostics-summary.txt`
      - diagnostic reruns never write rows into the publish CSV
  - implementation details:
    - row-level measurement state is now tracked explicitly:
      - failure reason
      - sample-file path
      - aggregate throughput/latency/CPU values
      - stability metrics
    - diagnostic reruns restore the original first-failure state after logging the isolated rerun evidence
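The on-disk contract of that mode can be sketched as follows. This is illustrative layout, not the runner's code; the `diagnostics/001-lookup-c-c-0` naming follows the pattern seen in the validation run, and the CSV columns are assumed:

```shell
# Illustrative sketch of the fail-closed publish + diagnostics contract:
# a failed row contributes evidence under the preserved RUN_DIR but never
# a row in the publish CSV.
RUN_DIR=$(mktemp -d)
publish_csv="$RUN_DIR/publish.csv"
echo "test,client,server,target_rps,throughput" > "$publish_csv"   # header only

# simulate one failed row: record first-attempt failure + isolated rerun
echo "row lookup c c 0: first attempt FAILED (stability gate)" >> "$RUN_DIR/diagnostics-summary.txt"
mkdir -p "$RUN_DIR/diagnostics/001-lookup-c-c-0"
echo "isolated rerun evidence" > "$RUN_DIR/diagnostics/001-lookup-c-c-0/rerun.log"

rows=$(($(wc -l < "$publish_csv") - 1))
echo "published data rows: $rows"   # stays 0: diagnostics never heal the CSV
```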
- forced validation of the new diagnostic mode on `win11`:
  - purpose:
    - prove that publish mode still fails closed
    - prove that diagnostic reruns preserve the original evidence and create side-by-side isolated rerun evidence
  - command: `NIPC_BENCH_FIRST_BLOCK=7 NIPC_BENCH_LAST_BLOCK=7 NIPC_BENCH_DIAGNOSE_FAILURES=1 NIPC_BENCH_REPETITIONS=3 NIPC_BENCH_MAX_DURATION=1 NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION=1 NIPC_BENCH_MAX_THROUGHPUT_RATIO=0.9 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv 1`
  - factual result:
    - runner exited non-zero as expected
    - publish CSV remained header-only: `/tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv`
    - preserved original run dir: `/tmp/netipc-bench-425494`
    - diagnostic summary created: `/tmp/netipc-bench-425494/diagnostics-summary.txt`
    - distinct diagnostic rerun dirs created per failed row:
      - `/tmp/netipc-bench-425494/diagnostics/001-lookup-c-c-0`
      - `/tmp/netipc-bench-425494/diagnostics/002-lookup-rust-rust-0`
      - `/tmp/netipc-bench-425494/diagnostics/003-lookup-go-go-0`
  - implication:
    - the new mode preserves truth in publish mode
    - it also gives immediate isolated rerun evidence for investigation without silently healing the benchmark artifact
- next-step approval from Costa:
  - commit and push the strict publish + diagnostic-mode runner changes
  - then proceed immediately to the real Windows SHM investigation using the new diagnostic mode on the actual failing slice
- commit / push completed for the diagnostic-mode runner change:
  - commit: `870fc93`
  - subject: `bench: add Windows diagnostic reruns`
  - pushed to: `origin/main`
- real Windows SHM investigation with the new diagnostic mode:
  - command: `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5`
  - factual result:
    - block `2` completed successfully with exit code `0`
    - no diagnostic rerun triggered for any SHM row
    - the previously suspicious row completed cleanly: `shm-ping-pong rust->c @ max = 2,465,857`, `stable_ratio=1.021516`
    - the full SHM max matrix stayed stable:
      - `c->c = 2,461,053`, `rust->c = 2,465,857`, `go->c = 2,162,135`
      - `c->rust = 2,597,936`, `rust->rust = 2,530,435`, `go->rust = 2,065,765`
      - `c->go = 2,570,619`, `rust->go = 2,254,772`, `go->go = 2,079,323`
    - all `100000/s`, `10000/s`, and `1000/s` SHM rows also completed stably in the same block run
  - implication:
    - the Windows SHM instability still does not reproduce when block `2` runs in isolation under the real runner
    - current strongest working theory:
      - the failure depends on broader full-suite context on `win11`
      - not on the standalone SHM block itself
- targeted confirmation of the Windows SHM anomaly:
  - command: `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-confirm.csv 5`
  - confirmed max-throughput rerun on the same `win11` tree:
    - `c->c = 2,396,963`, `rust->c = 1,708,649`, `go->c = 886,451`
    - `c->rust = 2,566,391`, `rust->rust = 2,563,582`, `go->rust = 2,053,507`
    - `c->go = 2,539,899`, `rust->go = 2,215,733`, `go->go = 2,047,115`
  - factual conclusion:
    - the original `rust->c` full-suite collapse is not stable
    - max-throughput Windows SHM rows can swing materially between reruns on `win11`
    - target-rate Windows SHM rows remain stable near their requested rates
  - implication:
    - the strange Windows SHM max delta is currently a measurement-stability / host-noise issue, not a proven deterministic language regression
- reviewed Windows benchmark handoff guidance before execution
- Refreshed max-throughput spread summary:
  - Linux:
    - `lookup`:
      - fastest `c->c = 167,974,040`
      - slowest `go->go = 127,908,975`
      - spread: `1.31x`
      - improvement versus checked-in previous artifact: `1.77x -> 1.31x`
    - `shm-ping-pong`:
      - fastest `rust->rust = 3,486,454`
      - slowest `go->go = 1,725,340`
      - spread: `2.02x`
      - note: this widened versus the previous checked-in artifact because `go->go` max throughput dropped materially
    - `shm-batch-ping-pong`:
      - fastest `c->c = 61,778,266`
      - slowest `go->go = 31,810,209`
      - spread: `1.94x`
    - `uds-pipeline-d16`:
      - fastest `rust->c = 712,544`
      - slowest `rust->go = 550,630`
      - spread: `1.29x`
    - `uds-pipeline-batch-d16`:
      - fastest `c->c = 99,746,787`
      - slowest `go->go = 50,690,629`
      - spread: `1.97x`
  - Windows:
    - `lookup`:
      - fastest `rust->rust = 178,835,588`
      - slowest `go->go = 97,109,788`
      - spread: `1.84x`
    - `shm-ping-pong` full suite:
      - fastest `c->rust = 2,650,754`
      - slowest `rust->c = 850,994`
      - spread: `3.11x`
      - but targeted confirmation disproved `rust->c` as a stable deterministic outlier
    - `shm-batch-ping-pong`:
      - fastest `c->c = 52,520,469`
      - slowest `go->go = 34,390,650`
      - spread: `1.53x`
    - `np-pipeline-batch-d16`:
      - fastest `go->rust = 38,249,582`
      - slowest `go->go = 24,333,588`
      - spread: `1.57x`
- Strange delta findings that remain real after the refresh:
  - Linux `uds-pipeline-d16`:
    - Go server remains the clear slow case across clients:
      - `c->go = 559,691`, `rust->go = 550,630`, `go->go = 553,858`
      - versus C/Rust servers near `686k-713k`
    - implication: this is a stable Go-server transport/runtime cost, not client-specific noise
  - Linux `uds-pipeline-batch-d16`:
    - server choice dominates:
      - C server: `96.2M-99.7M`
      - Rust server: `84.1M-86.3M`
      - Go server: `50.7M-51.3M`
    - implication: the known batch-path server asymmetry is still real
  - Linux `shm-batch-ping-pong`:
    - C server stays strongest
    - Rust server is mid-band
    - Go server is slowest
    - implication: still consistent with real server-side implementation overhead, not runner corruption
  - Linux / Windows `lookup`:
    - Linux: `c = 167.97M`, `rust = 146.15M`, `go = 127.91M`
    - Windows: `rust = 178.84M`, `c = 125.60M`, `go = 97.11M`
    - implication:
      - lookup is now measuring runtime/data-structure efficiency more than IPC transport behavior
      - the previous fake linear-scan distortion is gone, but cross-language runtime overhead remains visible
- Strange delta finding that is currently suspicious but not yet proven real:
  - Windows `shm-ping-pong @ max`:
    - the full-suite run made `rust->c` miss the floor
    - an immediate confirmation run moved the collapse to `go->c` instead
    - conclusion:
      - this is currently a max-throughput measurement-stability issue on `win11`
      - do not interpret a single bad max row there as a stable language-specific regression without targeted rerun confirmation
- second isolated Windows SHM rerun on the same `win11` tree reinforced the same conclusion:
  - command: `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-rerun.csv 5`
  - `@ max` rows:
    - `c->c = 2,516,450`, `rust->c = 2,430,413`, `go->c = 2,179,591`
    - `c->rust = 2,497,180`, `rust->rust = 2,473,159`, `go->rust = 2,114,944`
    - `c->go = 2,571,394`, `rust->go = 2,282,433`, `go->go = 2,100,658`
  - implication:
    - the full-suite `rust->c` collapse to `850,994` is definitely not stable
- additional warning sign from the same isolated rerun:
  - some `target_rps=10000` rows also became unstable:
    - `c->rust = 5,073`, `rust->rust = 4,098`
    - while other rows in the same block stayed near `10,000`
  - implication:
    - the Windows SHM benchmark instability is not limited to one language pair or only to the first full-suite run
- Post-commit diagnostic runner work (`870fc93 bench: add Windows diagnostic reruns`):
  - committed and pushed:
    - commit: `870fc93`
    - pushed to `origin/main`
  - immediate next investigation on `win11`:
    - goal: identify the smallest Windows benchmark context that reproduces the earlier full-suite SHM collapse
    - standalone SHM block with diagnostics enabled:
      - command: `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5`
      - result:
        - exited `0`
        - no diagnostics triggered
      - key `shm-ping-pong @ max` rows:
        - `c->c = 2,461,053` with `stable_ratio=1.018190`
        - `rust->c = 2,465,857` with `stable_ratio=1.021516`
        - `go->c = 2,162,135` with `stable_ratio=1.017540`
        - `c->rust = 2,597,936` with `stable_ratio=1.016334`
        - `rust->rust = 2,530,435` with `stable_ratio=1.020250`
        - `go->rust = 2,065,765` with `stable_ratio=1.029206`
        - `c->go = 2,571,619` with `stable_ratio=1.013998`
        - `rust->go = 2,254,772` with `stable_ratio=1.022145`
        - `go->go = 2,079,323` with `stable_ratio=1.010925`
      - factual conclusion:
        - block `2` alone is stable under the real repeated-median runner
        - the earlier full-suite `rust->c` collapse is not a standalone SHM bug
    - combined `NP -> SHM` prefix with diagnostics enabled:
      - command: `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-diagnose.csv 5`
      - result:
        - exited `0`
        - no diagnostics triggered
        - total measurements: `72`
      - key `np-ping-pong @ max` rows:
        - `c->c = 19,411`, `rust->c = 19,735`, `go->c = 18,744`
        - `c->rust = 20,188`, `rust->rust = 20,301`, `go->rust = 19,277`
        - `c->go = 19,383`, `rust->go = 18,558`, `go->go = 19,241`
      - key `shm-ping-pong @ max` rows:
        - `c->c = 2,522,584`, `rust->c = 2,522,004`, `go->c = 2,071,095`
        - `c->rust = 2,580,971`, `rust->rust = 2,511,775`, `go->rust = 2,308,182`
        - `c->go = 2,657,019`, `rust->go = 2,273,563`, `go->go = 2,109,132`
      - factual conclusion:
        - the failure does not reproduce with blocks `1-2`
        - the earlier bad `rust->c` full-suite row requires broader full-suite context than just the `NP -> SHM` transition
    - updated working theory:
      - speculation:
        - a later block, or cumulative state from multiple later blocks, is needed to trigger the rare full-suite Windows instability
      - not supported by evidence anymore:
        - a standalone SHM bug
        - a simple `NP -> SHM` transition bug
    - next diagnostic step:
      - extend the prefix to block `3` and repeat: `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-batch-diagnose.csv 5`
- Current deep-dive findings after extending the prefix to block `3`:
  - factual setup:
    - command executed on `win11`: `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-shmbatch-diagnose.csv 5`
    - result:
      - exited non-zero
      - publish CSV was partial, not empty
      - remote CSV line count: `89` total lines (`88` data rows)
      - expected for blocks `1-3`: `90` data rows
      - missing rows:
        - `shm-ping-pong,rust,go,0`
        - `shm-ping-pong,c,c,10000`
  - factual evidence that the run continued after failures:
    - the partial CSV still contains later rows after the first failed `shm-ping-pong rust->go @ max`
    - it also contains all `snapshot-baseline` rows for block `3`
    - implication: the runner is correctly fail-recording rows while continuing the remaining matrix
  - factual evidence that the first failure was an intermittent runtime failure, not a stable throughput regression:
    - the preserved first-attempt sample file for `shm-ping-pong rust->go @ max` has only `4` completed repeats in `/tmp/netipc-bench-410987/samples-shm-ping-pong-rust-go-0.csv`
    - those four repeats were all healthy: `2,183,340`, `2,290,965`, `2,295,026`, `2,240,149`
    - implication: the row failed because one repeat died mid-row, not because all repeats drifted slow
  - factual evidence of a runner/server lifecycle bug:
    - the runner hard-kills every benchmark server after each sample in `tests/run-windows-bench.sh:247` to `tests/run-windows-bench.sh:263`
    - the Windows benchmark servers are implemented to stop themselves after `duration+3` seconds and then run normal teardown / CPU reporting
    - implication:
      - the runner is violating the server lifecycle contract on Windows
      - a hard kill can bypass `nipc_server_destroy()` / `server.Stop()` cleanup
      - this directly explains:
        - transient client timeouts
        - `"in use by live server"` collisions on the next repeat
        - immediate success when the same row is rerun in isolation
  - factual evidence that Windows SHM naming is sensitive to leaked sessions:
    - Windows server session IDs restart from `1` for every server process in `src/libnetdata/netipc/src/service/netipc_service_win.c:933`
    - new sessions increment from that counter in `src/libnetdata/netipc/src/service/netipc_service_win.c:1008`
    - Windows SHM object names include `run_dir + service_name + auth_token + session_id` in `src/libnetdata/netipc/include/netipc/netipc_win_shm.h:8` to `netipc_win_shm.h:11`
    - stale cleanup is intentionally a no-op on Windows in `netipc_win_shm.h:215` to `netipc_win_shm.h:220` and `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:781`
    - implication: if a previous sample's server/session stays alive briefly, the next sample for the same service can collide on named-pipe and/or SHM object names
  - factual evidence of a separate diagnostic bookkeeping bug:
    - the root run dir found on disk for the latest run was `/tmp/netipc-bench-410987`
    - it contains files only up to the SHM `@ max` rows
    - later successful rows from the same run are present in the output CSV, but their sample files are not present under that root
    - the runner warning printed a different root path (`/tmp/netipc-bench-456611`) that does not exist on disk
    - implication:
      - diagnostic mode currently preserves the truth of the first failure in the terminal output
      - but it does not yet preserve the filesystem evidence reliably enough
  - factual evidence of a Windows SHM transport hardening gap:
    - the C Windows SHM create path does not check `GetLastError() == ERROR_ALREADY_EXISTS` after `CreateFileMappingW` / `CreateEventW` in `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:198` to `netipc_win_shm.c:266`
    - the Rust and Go Windows SHM server-create paths also do not appear to check for existing named objects
    - implication:
      - a leaked Windows SHM object may be treated as a successful create instead of an explicit collision
      - this can turn cleanup problems into nondeterministic runtime behavior
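That hardening gap is the classic create-vs-open ambiguity for named objects: on Windows, `CreateFileMappingW` succeeds even when the name already exists, and only `GetLastError() == ERROR_ALREADY_EXISTS` reveals the collision. A portable filesystem analog of the intended fix (mkdir is an atomic exclusive create; the object name here is purely illustrative):

```shell
# Filesystem analog of "reject existing named objects on server create":
# mkdir fails atomically when the name exists, which is the behavior the
# Windows SHM create paths need instead of silently adopting a leaked object.
obj="/tmp/netipc-demo-obj-$$"
mkdir "$obj"                       # first create of the name: succeeds
dup_rejected=0
if mkdir "$obj" 2>/dev/null; then
    echo "BUG: duplicate create looked like a successful create"
else
    dup_rejected=1                 # analog of an explicit ADDR_IN_USE error
    echo "duplicate create rejected explicitly"
fi
rmdir "$obj"
```

Surfacing the collision as a hard error turns a leaked object into a diagnosable failure instead of nondeterministic runtime behavior.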
- Decision needed before code:
  - `1. A` Fix the Windows benchmark runner only.
    - scope:
      - replace hard-kill shutdown with graceful server stop / wait and a hard-kill fallback only on timeout
      - make per-repeat server/client output files unique
      - fix diagnostic bookkeeping so preserved run dirs and summaries always match the actual run
    - benefits:
      - directly targets the strongest evidence
      - smallest code-change surface
      - most likely enough to make the benchmark harness trustworthy
    - implications:
      - benchmark methodology changes only, not transport semantics
      - if Windows SHM object-collision handling is also weak, the benchmark harness may become stable while the product bug remains latent
    - risks:
      - could leave a real Windows transport bug hidden until another scenario hits it outside the benchmark harness
  - `1. B` Fix the Windows benchmark runner and harden Windows SHM object creation in C, Rust, and Go.
    - scope:
      - everything in `1. A`
      - plus explicit `ERROR_ALREADY_EXISTS` handling for Windows SHM mappings/events and clearer collision errors
    - benefits:
      - addresses both the likely benchmark root cause and a real transport safety gap
      - makes leaked object collisions explicit instead of nondeterministic
    - implications:
      - larger change across multiple language implementations
      - requires more testing
    - risks:
      - broader patch, more review surface, more chance of side effects if the three implementations are not kept perfectly aligned
  - `1. C` Continue diagnosis without code changes.
    - scope:
      - more targeted reruns and more artifact collection
    - benefits:
      - lowest code risk
    - implications:
      - more benchmark time burned with a runner we already know is violating the server lifecycle on Windows
    - risks:
      - low leverage
      - likely delays the obvious fix
  - recommendation: `1. B`
    - reasoning:
      - the hard-kill runner behavior is the strongest causal explanation for the benchmark instability
      - but the Windows SHM create path also has a real hardening gap
      - if the goal is "Windows benchmarks trustworthy", fixing only the runner is probably enough for the harness, but not enough for the underlying transport robustness
  - user decision: `1. B`
    - accepted scope:
      - fix the Windows benchmark runner lifecycle and diagnostics bookkeeping
      - harden Windows SHM object creation in C, Rust, and Go to detect existing named objects explicitly
- implementation and verification after `1. B`:
  - local code changes completed:
    - runner: `tests/run-windows-bench.sh`
    - Windows SHM hardening:
      - `src/libnetdata/netipc/include/netipc/netipc_win_shm.h`
      - `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c`
      - `src/crates/netipc/src/transport/win_shm.rs`
      - `src/go/pkg/netipc/transport/windows/shm.go`
    - regression coverage:
      - `tests/fixtures/c/test_win_shm.c`
      - `src/crates/netipc/src/transport/win_shm.rs`
      - `src/go/pkg/netipc/transport/windows/shm_test.go`
- factual runner behavior after the patch:
  - the Windows runner now:
    - uses a unique per-repeat runtime/artifact directory instead of reusing the same `RUN_DIR` for every repeat
    - waits for benchmark servers to stop themselves before killing them
    - preserves the root run dir on any measurement-command failure, not only on stability-gate failures
    - records the first-attempt artifact directory in the diagnostics summary
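The graceful-stop change can be sketched like this (illustrative only, not the actual `tests/run-windows-bench.sh` code): poll for voluntary exit within a deadline so the server's own teardown and CPU reporting can run, and only force-kill as a fallback:

```shell
# Graceful stop with hard-kill fallback (sketch; the real runner waits for the
# server's own duration+3 self-stop window before killing anything).
sleep 30 & server_pid=$!            # stand-in for a server that will NOT stop in time
tries=0
while kill -0 "$server_pid" 2>/dev/null && [ "$tries" -lt 3 ]; do
    sleep 1                         # poll for voluntary exit
    tries=$((tries + 1))
done
if kill -0 "$server_pid" 2>/dev/null; then
    echo "server did not exit cleanly; forcing kill"
    kill -9 "$server_pid" 2>/dev/null
    wait "$server_pid" 2>/dev/null || true
    forced=1                        # forced kill bypassed normal teardown
else
    forced=0                        # normal teardown ran; CPU report is valid
fi
echo "forced=$forced"
```

The key property is that `forced=1` is now an observable, reportable event instead of the silent default for every sample.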
- factual transport behavior after the patch:
  - C, Rust, and Go Windows SHM server-create paths now reject existing named mappings/events explicitly instead of treating them as successful creates
  - new error surface:
    - C: `NIPC_WIN_SHM_ERR_ADDR_IN_USE`
    - Rust: `WinShmError::AddrInUse`
    - Go: `ErrWinShmAddrInUse`
- first verification on `win11`:
  - focused Windows SHM duplicate-create coverage now passes in all three implementations:
    - Go: `cd src/go && GOOS=windows GOARCH=amd64 go test -run TestWinShmServerCreateRejectsExistingObjects -count=1 ./pkg/netipc/transport/windows`
    - Rust: `cargo test --manifest-path src/crates/netipc/Cargo.toml test_server_create_rejects_existing_objects_windows -- --test-threads=1`
    - C: `cmake --build build -j4 --target test_win_shm` and `ctest --test-dir build --output-on-failure -R '^test_win_shm$'`
  - result: all passed
- factual new issue exposed by the stricter runner:
  - extending the real benchmark rerun to `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3` no longer reproduced the old random SHM collapse first
  - instead, it exposed a deterministic Rust benchmark-driver shutdown bug:
    - every row using a Rust Windows benchmark server failed with: `Server rust (...) did not exit cleanly within 10s; forcing kill`
    - preserved server output contained only: `READY`
    - implication: the stricter runner removed the old hard-kill masking and surfaced a real Rust benchmark-driver lifecycle bug
  - root cause:
    - `bench/drivers/rust/src/bench_windows.rs` still used the old Windows stop pattern:
      - only `running_flag.store(false, ...)`
      - no wake connection
    - this is the same Windows accept-loop issue already fixed earlier in the Rust Windows tests: `ConnectNamedPipe()` stays blocked until a connection wakes it
  - fix:
    - `bench/drivers/rust/src/bench_windows.rs` now mirrors the tested shutdown pattern:
      - after `duration+3`, set `running_flag = false`
      - then issue a dummy `NpSession::connect(...)` so the blocked accept loop can observe shutdown and exit cleanly
  - direct proof on `win11`:
    - command: `timeout 20 src/crates/netipc/target/release/bench_windows.exe np-ping-pong-server /tmp/plugin-ipc-bench-20260327 rust-stop-check 1`
    - result: `READY`, then `SERVER_CPU_SEC=0.000000`
    - implication: the Rust Windows benchmark server now exits on its own instead of hanging until killed
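The wake-on-stop pattern generalizes beyond `ConnectNamedPipe()`: any loop blocked in accept cannot observe a stop flag until something connects. A POSIX sketch of the same idea, with a blocked FIFO read standing in for the blocked accept (illustrative analog; the real fix issues a dummy `NpSession::connect(...)`):

```shell
# A blocked FIFO read stands in for ConnectNamedPipe() waiting in the accept
# loop; the "dummy connect" is simply opening the other end so the blocked
# side wakes, observes the stop condition, and exits cleanly.
fifo="/tmp/netipc-demo-fifo-$$"
woke_file="/tmp/netipc-demo-woke-$$"
mkfifo "$fifo"
( cat "$fifo" > /dev/null; echo woke > "$woke_file" ) &   # "accept loop"
accept_pid=$!
sleep 1                        # the accept loop is now blocked on the FIFO
: > "$fifo"                    # dummy connect: open + close the write end
wait "$accept_pid"             # the loop exited on its own, no kill needed
woke=$(cat "$woke_file")
rm -f "$fifo" "$woke_file"
echo "accept loop result: $woke"
```

Without the dummy connect, setting the flag alone leaves the loop blocked forever, which is exactly the `did not exit cleanly within 10s` symptom above.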
- focused real benchmark proof after all fixes:
  - command: `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/block-2-after-fix.csv 5`
  - result:
    - exited `0`
    - no diagnostic reruns were needed
    - all `36` `shm-ping-pong` rows published
  - key evidence:
    - the previously suspicious Windows row is now stable: `shm-ping-pong rust->c @ max = 2,458,786`, `stable_ratio=1.009920`
    - all SHM `@ max` rows completed inside the stability gate:
      - `c->c = 2,505,981` with `stable_ratio=1.038817`
      - `rust->c = 2,458,786` with `stable_ratio=1.009920`
      - `c->rust = 2,588,642` with `stable_ratio=1.028021`
      - `rust->rust = 2,649,571` with `stable_ratio=1.018367`
      - `rust->go = 2,242,750` with `stable_ratio=1.045399`
    - the previously suspicious fixed-rate rows are now also stable:
      - `rust->c @ 100000/s = 99,997` with `stable_ratio=1.000010`
      - `rust->c @ 10000/s = 9,999` with `stable_ratio=1.000000`
      - `rust->rust @ 10000/s = 9,999` with `stable_ratio=1.000000`
  - factual conclusion from the focused SHM rerun:
    - the Windows SHM benchmark instability is materially reduced after:
      - the runner lifecycle fixes
      - per-repeat runtime isolation
      - explicit Windows SHM collision detection
      - the Rust benchmark-server wake-on-stop fix
    - the earlier `rust->c` SHM collapse no longer reproduces in the real benchmark block that used to be suspicious
- partial full-suite proof after the focused fixes:
  - command started on `win11`: `NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/full-after-fix.csv 5`
  - factual behavior before manual interruption:
    - no diagnostics were emitted
    - no forced-kill Rust benchmark-server failures reappeared
    - the run cleared the exact NP area where the stricter runner had previously exposed the Rust benchmark-server shutdown bug:
      - `np-ping-pong @ max` rows for Rust servers completed cleanly
      - `np-ping-pong @ 100000/s` rows for Rust servers completed cleanly
      - `np-ping-pong @ 10000/s` rows were still running cleanly when the run was stopped intentionally for time
  - reason for interruption:
    - no new technical blocker remained
    - the rest of the work was wall-clock runtime only