Purpose

Fit-for-purpose goal: integrate plugin-ipc into ~/src/netdata/netdata/ so Netdata can immediately replace the current Linux cgroups.plugin -> ebpf.plugin custom metadata transport with typed IPC that is reliable, maintainable, testable, and ready for guarded production rollout.

TL;DR

  • Analyze how plugin-ipc should be integrated into the Netdata repo and build.
  • Before any Netdata integration, implement transparent SHM resizing in plugin-ipc itself.
  • Validate that feature thoroughly first, including full C/Rust/Go interop matrices on Unix and Windows.
  • Use it first to replace the current cgroups.plugin -> ebpf.plugin metadata channel on Linux.
  • Make the library available to C, Rust, and Go code inside Netdata.
  • Record integration design decisions before implementation.
  • User-approved local workspace cleanup in this slice:
    • remove the generated Go test / helper binaries after the push
    • affected files:
      • src/go/cgroups.test.exe
      • src/go/main
      • src/go/raw.test.exe
      • src/go/windows.test.exe
  • User-directed benchmark follow-up now in scope:
    • treat the Linux shm-batch-ping-pong C/Rust spread as two independent problems:
      • Rust server penalty versus C server with the same C client
      • Rust client penalty versus C client with the same C server
    • worst-case rust -> rust is the compounded result of both penalties
    • objective:
      • identify the exact Rust-side hot paths responsible for the server-side and client-side losses
      • fix Rust until the Linux C/Rust SHM batch path is materially closer to the C baseline
    • scope expansion approved by the user:
      • do the same benchmark-delta investigation across all material language/client/server combinations
      • identify every real implementation issue behind the benchmark gaps
      • fix the implementation issues, not just explain them
      • keep benchmark artifacts and benchmark-derived docs in sync after each validated fix
    • first verified benchmark-delta findings:
      • POSIX shm-batch-ping-pong with client ∈ {c,rust} and server ∈ {c,rust} still has a real Rust penalty on both sides:
        • c -> c = 64,148,960
        • c -> rust = 58,334,803
        • rust -> c = 52,277,542
        • rust -> rust = 48,220,338
        • implication:
          • Rust server penalty is real
          • Rust client penalty is larger
          • rust -> rust is the compounded case
      • benchmark-driver distortion is also real and must be fixed before drawing deeper transport conclusions:
        • Go lookup benchmark does a synthetic linear scan instead of using the actual O(1) cache structure:
          • bench/drivers/go/main.go
        • Rust lookup benchmark also does a synthetic linear scan:
          • bench/drivers/rust/src/main.rs
        • the real Rust cache lookup path currently allocates via name.to_string() on every call:
          • src/crates/netipc/src/service/raw.rs
        • Go and Rust batch / pipeline clients still do avoidable hot-loop allocations that C avoids or minimizes:
          • Go:
            • bench/drivers/go/main.go
          • Rust:
            • bench/drivers/rust/src/main.rs
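The lookup-driver distortion above can be made concrete with a minimal sketch. This is a hypothetical cache, not the real netipc structures: it only illustrates why a driver that linear-scans a slice (what the stale Go/Rust benchmarks measured) reports something different from the actual O(1) map path, and why the map path needs no per-call allocation.

```go
package main

import "fmt"

// Hypothetical cgroup cache used only to illustrate the benchmark issue;
// the real structures live under src/go and src/crates/netipc.
type entry struct {
	name string
	id   uint64
}

type cache struct {
	byName map[string]uint64 // O(1) path the driver should exercise
	items  []entry           // linear-scan path the synthetic driver measured
}

func (c *cache) lookupMap(name string) (uint64, bool) {
	id, ok := c.byName[name] // no allocation: the string key is used as-is
	return id, ok
}

func (c *cache) lookupScan(name string) (uint64, bool) {
	for _, e := range c.items { // O(n): what the synthetic benchmark did
		if e.name == name {
			return e.id, true
		}
	}
	return 0, false
}

func newCache(n int) *cache {
	c := &cache{byName: make(map[string]uint64, n)}
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("cgroup-%d", i)
		c.byName[name] = uint64(i)
		c.items = append(c.items, entry{name: name, id: uint64(i)})
	}
	return c
}

func main() {
	c := newCache(1000)
	a, _ := c.lookupMap("cgroup-500")
	b, _ := c.lookupScan("cgroup-500")
	fmt.Println(a == b) // both paths agree; only their cost differs
}
```

Both paths return identical results, which is exactly why the distortion went unnoticed: correctness checks pass either way, and only the measured throughput differs.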
  • Current execution scope:
    • remove the multi-method service drift from docs, code, tests, and public APIs
    • align the implementation to one-service-kind-per-endpoint
    • implement the accepted SHM resize / renegotiation behavior
    • eliminate contradictory wording and examples across the repository
    • refresh the Linux and Windows benchmark matrices on the current tree
    • update benchmark artifacts and all benchmark-derived docs so everything is in sync
    • investigate the remaining benchmark spreads and identify whether they reflect real transport/runtime inefficiency, measurement distortion, or pair-specific implementation overhead
    • correct the benchmark build path so C benchmark results are generated from optimized C libraries, not from a local Debug CMake tree
  • Current implementation status:
    • docs/specs/TODOs now explicitly state service-oriented discovery and one request kind per endpoint
    • Go public cgroups APIs and Go raw service/tests were rewritten to the single-kind model
    • cd src/go && go test -count=1 ./pkg/netipc/service/raw now passes after aligning the raw client/server with learned SHM req/resp capacities and transparent overflow-driven reconnect/retry
    • cd src/go && go test -count=1 ./pkg/netipc/service/cgroups now passes
    • Rust public cgroups facade now uses the single-kind raw server constructor instead of the old multi-handler bundle
    • targeted Rust verification now passes:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::cgroups:: -- --test-threads=1
    • Rust raw Unix tests no longer use the old mixed pingpong_handlers() helper
    • the Rust raw service subset now passes after binding increment-only and string-reverse-only endpoints explicitly and teaching the raw client/server the learned SHM req/resp resize path:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
    • Go raw L2 now tracks learned request/response capacities, treats STATUS_LIMIT_EXCEEDED as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
    • Rust raw L2 now tracks learned request/response capacities, treats STATUS_LIMIT_EXCEEDED as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
    • Go and Rust transport listeners now expose payload-limit setters so the server can advertise learned capacities to later clients before accept():
      • Go POSIX: src/go/pkg/netipc/transport/posix/uds.go
      • Go Windows: src/go/pkg/netipc/transport/windows/pipe.go
      • Rust POSIX: src/crates/netipc/src/transport/posix.rs
      • Rust Windows: src/crates/netipc/src/transport/windows.rs
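The overflow-driven recovery loop described above can be sketched minimally. Everything here is illustrative (the dial/call names, the doubling policy, the in-memory "transport"), not the actual netipc client API; it only shows the shape of the logic: treat a limit-exceeded status as an overflow signal, learn a larger capacity, reconnect, and retry transparently.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative status; the real constant is STATUS_LIMIT_EXCEEDED on the wire.
var errLimitExceeded = errors.New("limit exceeded")

type conn struct{ reqCap, respCap int }

// dial simulates reconnecting with renegotiated (learned) capacities.
func dial(reqCap, respCap int) *conn { return &conn{reqCap, respCap} }

func (c *conn) call(req []byte) ([]byte, error) {
	if len(req) > c.reqCap {
		return nil, errLimitExceeded // server signals overflow, session ends
	}
	return req, nil // echo, for the sketch
}

// callWithResize mirrors the transparent recovery described above: on
// overflow, grow the learned request capacity, reconnect, and retry.
func callWithResize(req []byte, reqCap, respCap int) ([]byte, int, error) {
	c := dial(reqCap, respCap)
	for attempt := 0; attempt < 8; attempt++ {
		resp, err := c.call(req)
		if err == nil {
			return resp, reqCap, nil
		}
		if !errors.Is(err, errLimitExceeded) {
			return nil, reqCap, err // only overflow is retried this way
		}
		for reqCap < len(req) {
			reqCap *= 2 // learn a larger capacity
		}
		c = dial(reqCap, respCap) // reconnect and renegotiate
	}
	return nil, reqCap, errors.New("gave up")
}

func main() {
	resp, learned, err := callWithResize(make([]byte, 3000), 1024, 1024)
	fmt.Println(len(resp), learned, err) // 3000 4096 <nil>
}
```

Note the loop may reconnect more than once while capacities grow, which is why a blanket "retry ONCE" claim is wrong for this path.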
    • src/crates/netipc/src/service/raw.rs no longer exposes the generic Handlers bundle or the transitional new_single_kind / with_workers_single_kind constructors
    • src/crates/netipc/src/service/raw.rs now models managed servers as single-kind endpoints directly:
      • ManagedServer::new(..., expected_method_code, handler)
      • ManagedServer::with_workers(..., expected_method_code, handler, worker_count)
    • Rust POSIX and Windows benchmark drivers now use the single-kind raw service surface instead of the deleted multi-handler Handlers bundle:
      • bench/drivers/rust/src/main.rs
      • bench/drivers/rust/src/bench_windows.rs
    • src/crates/netipc/src/service/raw_unix_tests.rs and src/crates/netipc/src/service/raw_windows_tests.rs now use that single-kind raw service surface directly instead of feeding a generic handler bundle into the raw server
    • verified source-level residue scan for src/crates/netipc/src/service/raw_windows_tests.rs is now clean:
      • no remaining Handlers
      • no remaining test_cgroups_handlers()
      • no remaining increment_handlers()
    • verified source-level residue scan for src/crates/netipc/src/service/raw.rs and src/crates/netipc/src/service/raw_unix_tests.rs is now clean:
      • no remaining Handlers
      • no remaining new_single_kind
      • no remaining with_workers_single_kind
    • C public naming drift was reduced from plural handler bundles to singular service-handler naming
    • tests/fixtures/c/test_win_service.c is now snapshot-only; it no longer starts a typed snapshot service and then exercises increment / string-reverse / batch calls against it
    • source-level cleanup of the remaining Windows C fixtures is only partial so far:
      • the obvious typed snapshot .on_increment / .on_string_reverse bundle drift was removed from:
        • tests/fixtures/c/test_win_service_extra.c
        • tests/fixtures/c/test_win_stress.c
        • tests/fixtures/c/test_win_service_guards.c
        • tests/fixtures/c/test_win_service_guards_extra.c
      • but real win11 compilation later proved these files still contain stale calls to removed C APIs and stale raw-server assumptions
    • verified source-level residue scan across the touched Windows C fixtures is therefore not enough on its own:
      • it proves only that the obvious typed-handler bundle names were removed
      • it does not prove runtime or even compile-time correctness on Windows
    • verified source-level residue scan for the touched Windows Go raw helpers/tests is now clean:
      • no remaining Handlers{...} bundle initializers
      • no remaining winTestHandlers() / winFailingHandlers() helpers
      • no remaining server.handlers references in the Windows raw tests
    • Windows Go package cross-compile proof now passes from this Linux host:
      • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/raw
      • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/cgroups
    • the Unix interop/service/cache matrix now passes end-to-end after the resize rewrite:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop)$'
    • the broader Unix shm/service/cache slice across C, Rust, and Go now also passes:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go)$'
    • the previously exposed POSIX UDS mismatch is now resolved:
      • Rust cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1 now passes 299/299
      • the stale transport tests were rewritten to match the accepted directional negotiation semantics:
        • requests are sender-driven
        • responses are server-driven
      • C test_uds now proves directional negotiation explicitly and keeps direct receive-limit coverage through a raw malformed-response path
      • the broader non-fuzz Unix CTest sweep now passes end-to-end:
        • /usr/bin/ctest --test-dir build --output-on-failure -E '^(fuzz_protocol_30s|go_FuzzDecodeHeader|go_FuzzDecodeChunkHeader|go_FuzzDecodeHello|go_FuzzDecodeHelloAck|go_FuzzDecodeCgroupsRequest|go_FuzzDecodeCgroupsResponse|go_FuzzBatchDirDecode|go_FuzzBatchItemGet)$'
        • result: 28/28 passed
    • the public docs now match the accepted directional handshake semantics:
      • docs/level1-wire-envelope.md explicitly says request limits are sender-driven and response limits are server-driven
      • docs/getting-started.md no longer documents the deleted Rust CgroupsHandlers / CgroupsServer surface
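The accepted directional handshake rule can be contrasted with the deleted min-style rule in a few lines. The hello struct and function names below are illustrative, not the real wire format: the point is that each payload limit is owned by the side that sends that payload, instead of taking the minimum of both sides' values.

```go
package main

import "fmt"

// Hypothetical hello values exchanged at handshake time; the real wire
// format is specified in docs/level1-wire-envelope.md.
type hello struct {
	maxRequest  int // limit this side is prepared to use for requests
	maxResponse int // limit this side is prepared to use for responses
}

// negotiate applies the accepted directional rule: the request limit is
// sender-driven (the client sends requests), and the response limit is
// server-driven (the server sends responses).
func negotiate(client, server hello) (reqLimit, respLimit int) {
	return client.maxRequest, server.maxResponse
}

// minStyle is the deleted rule the stale transport tests still assumed.
func minStyle(client, server hello) (int, int) {
	return minInt(client.maxRequest, server.maxRequest),
		minInt(client.maxResponse, server.maxResponse)
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	c := hello{maxRequest: 8192, maxResponse: 1024}
	s := hello{maxRequest: 4096, maxResponse: 65536}
	r1, p1 := negotiate(c, s)
	r2, p2 := minStyle(c, s)
	fmt.Println(r1, p1, r2, p2) // 8192 65536 vs 4096 1024
}
```

The example shows why the two rules diverge as soon as the two sides advertise asymmetric limits, which is exactly the case the rewritten tests now cover.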
    • Windows transport test sources were aligned to the same directional contract:
      • Go src/go/pkg/netipc/transport/windows/pipe_integration_test.go no longer expects the old min-style negotiation
      • Rust src/crates/netipc/src/transport/windows.rs now contains a matching directional negotiation test
      • Go Windows transport tests still have cross-compile proof from this Linux host:
        • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows
    • local source checks are clean for the touched Windows C files:
      • git diff --check -- tests/fixtures/c/test_win_stress.c tests/fixtures/c/test_win_service_guards.c tests/fixtures/c/test_win_service_guards_extra.c TODO-netdata-plugin-ipc-integration.md
    • local source checks are also clean for the touched Go/Rust raw files:
      • git diff --check -- src/crates/netipc/src/service/raw.rs src/crates/netipc/src/service/raw_unix_tests.rs src/go/pkg/netipc/service/raw/client.go src/go/pkg/netipc/service/raw/client_windows.go src/go/pkg/netipc/service/raw/shm_unix_test.go src/go/pkg/netipc/service/raw/helpers_windows_test.go src/go/pkg/netipc/service/raw/more_windows_test.go src/go/pkg/netipc/service/raw/shm_windows_test.go TODO-netdata-plugin-ipc-integration.md
    • limitation:
      • this Linux host does not have x86_64-w64-mingw32-gcc
      • so local source cleanup alone is not enough for the edited Windows C fixtures
      • the same host limitation means the raw_windows_tests.rs source cleanup is not backed by a real Windows Rust compile/run proof from this environment either
      • the touched Windows Go packages now have cross-compile proof, but still do not have a real Windows runtime proof from this environment
    • current verified Windows runtime status from the real win11 workflow:
      • the documented ssh win11 + MSYSTEM=MINGW64 toolchain path works and has been used for real validation
      • after syncing the local tree, cmake --build build -j4 on win11 exposed real stale C fixture/API mismatches that were not visible from Linux source scans alone
      • the first verified win11 failure classes were:
        • stale removed client helpers:
          • nipc_client_call_increment
          • nipc_client_call_increment_batch
          • nipc_client_call_string_reverse
        • stale internal error enum usage:
          • NIPC_ERR_INTERNAL_ERROR
        • stale raw-server handler signature assumptions:
          • old bool raw handlers instead of nipc_error_t (*)(..., const nipc_header_t *, ...)
        • stale nipc_server_init(...) argument ordering under the internal test macro path
        • stale client struct field assumptions such as client.request_buf_size
      • those compile-time failures have now been corrected locally and revalidated on win11:
        • test_win_service_extra.exe now builds and passes on win11
      • the remaining active Windows C problem was then narrower and runtime-only
      • after correcting the stale Windows C fixture/API mismatches and the baseline request-overflow signaling gap, test_win_service_guards.exe now passes on win11:
        • === Results: 141 passed, 0 failed ===
        • the previous apparent timeout was not a persistent runtime hang:
          • later reruns completed normally once the stale one-item batch test drift was removed
        • the last real guard-binary contradiction was:
          • a one-item increment "batch" test still expecting reconnect/growth
        • that expectation was wrong under the accepted semantics:
          • one-item increment batches are normalized to the plain increment path
          • the guard was rewritten to use a real 2-item batch for baseline request-resize coverage
      • the rest of the edited Windows C runtime slice has now been validated on win11 too:
        • test_win_service.exe:
          • === Results: 80 passed, 0 failed ===
        • test_win_service_extra.exe:
          • === Results: 82 passed, 0 failed ===
        • test_win_service_guards_extra.exe:
          • === Results: 93 passed, 0 failed ===
        • test_win_stress.exe:
          • === Results: 1 passed, 0 failed ===
        • a combined rerun of all edited Windows C binaries also passed cleanly on win11
      • the earlier test_win_service.exe timeout is not currently reproducible as a deterministic bug:
        • it timed out once in a combined slice and once in an early soak run
        • after the stale guard/test contradictions were removed, a focused rerun passed
        • a subsequent combined rerun passed
        • a targeted 3-run win11 soak of test_win_service.exe also passed 3/3
        • working theory:
          • that earlier timeout was a transient host/process stall, not a currently reproducible library correctness bug
      • a real L2 behavior gap was exposed and fixed during this win11 investigation:
        • on baseline request overflow, the server session loop now emits a zero-payload LIMIT_EXCEEDED response before disconnecting, instead of silently breaking the session
        • this fix was needed for transparent request-side resize/reconnect to work on Windows baseline transport at all
      • current remaining Windows Rust runtime blocker:
        • focused win11 run:
          • timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1
        • current observed behavior:
          • build completes
          • test process prints:
            • running 1 test
            • test service::cgroups::windows_tests::test_cache_round_trip_windows ...
          • then stalls without completing
        • strongest current evidence:
          • Rust raw Windows tests already implement reliable Windows shutdown by:
            • storing the service name + wake client config
            • setting running_flag = false
            • issuing a dummy NpSession::connect(...) to wake the blocking ConnectNamedPipe()
          • cgroups Windows tests and Rust Windows interop binaries still use the weaker pattern:
            • only running_flag = false
            • no wake connection
          • the Windows accept loop in src/crates/netipc/src/service/raw.rs blocks in listener.accept(), which ultimately blocks in ConnectNamedPipe(), so running_flag = false alone is not sufficient to stop the server reliably on Windows
        • working theory:
          • the cache test body may already be completing
          • the stall is very likely in Windows server shutdown/join, not in snapshot/cache decoding itself
      • that Rust Windows blocker is now verified fixed on win11:
        • fix:
          • cgroups Windows tests and Rust Windows interop binaries now use the same reliable Windows stop pattern already used by the Rust raw Windows tests:
            • set running_flag = false
            • then issue a wake connection so the blocking ConnectNamedPipe() returns and the accept loop can observe shutdown
        • focused proof:
          • timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1
          • result:
            • test service::cgroups::windows_tests::test_cache_round_trip_windows ... ok
        • full Rust Windows lib proof:
          • timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
          • result:
            • 176 passed
            • 0 failed
            • 1 ignored
        • factual conclusion:
          • the live bug was stale Windows shutdown/test-fixture behavior, not a current Rust cache decode/refresh correctness issue
      • broader real Windows interop/service/cache proof is now also green on win11:
        • command:
          • timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"
        • result:
          • test_named_pipe_interop: passed
          • test_win_shm_interop: passed
          • test_service_win_interop: passed
          • test_service_win_shm_interop: passed
          • test_cache_win_interop: passed
          • test_cache_win_shm_interop: passed
          • summary:
            • 100% tests passed, 0 tests failed out of 6
    • targeted C rebuild and runtime verification now passes:
      • cmake --build build --target test_service test_hardening test_ping_pong
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_hardening|test_ping_pong)$'
    • the latest naming / contract cleanup slice is now backed by both local Linux and real win11 proof:
      • local Linux rerun:
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_hardening|test_ping_pong)$'
        • result:
          • 100% tests passed, 0 failed
      • after syncing this slice's edited files to win11, targeted rebuild passed:
        • cmake --build build -j4 --target test_win_service test_win_service_extra test_win_service_guards test_win_service_guards_extra
      • direct win11 runtime proof for the edited guard binaries passed:
        • ./test_win_service_guards.exe
          • result:
            • === Results: 141 passed, 0 failed ===
        • ./test_win_service_guards_extra.exe
          • result:
            • === Results: 93 passed, 0 failed ===
      • direct win11 runtime proof for the edited service binaries also passed via CTest:
        • ctest --test-dir build --output-on-failure -R "^(test_win_service|test_win_service_extra)$"
        • result:
          • test_win_service: passed
          • test_win_service_extra: passed
    • benchmark refresh on the current tree is now complete and synced:
      • factual root cause of the benchmark blocker:
        • the C and Rust batch benchmark clients still generated random batch sizes in the range 1..1000
        • the actual batch protocol normalizes item_count == 1 to the non-batch path
        • Go was already correct and generated 2..1000, which is why the same C batch server still interoperated with the Go client
      • fixed in:
        • bench/drivers/c/bench_posix.c
        • bench/drivers/c/bench_windows.c
        • bench/drivers/rust/src/main.rs
        • bench/drivers/rust/src/bench_windows.rs
        • bench/drivers/go/main.go
        • tests/run-posix-bench.sh
        • tests/run-windows-bench.sh
      • specific fixes:
        • batch benchmark generators now use 2..1000 items for real batch scenarios
        • Windows benchmark failure reporting now defines server_out before calling dump_server_output
      • targeted proof after the fix:
        • the previously failing pairs now succeed locally and on win11:
          • uds-batch-ping-pong c->c
          • uds-batch-ping-pong rust->c
          • shm-batch-ping-pong c->c
          • shm-batch-ping-pong rust->c
          • np-batch-ping-pong c->c
          • np-batch-ping-pong rust->c
      • clean official reruns:
        • Linux:
          • bash tests/run-posix-bench.sh benchmarks-posix.csv 5
          • result:
            • Total measurements: 201
        • Windows:
          • ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/run-windows-bench.sh benchmarks-windows.csv 5'
          • result:
            • Total measurements: 201
      • clean generated artifacts:
        • bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md
          • result:
            • All performance floors met
        • ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md'
          • result:
            • All performance floors met
    • the follow-up benchmark spread investigation has now established a real benchmark-build bug on POSIX:
      • the local benchmark runner used:
        • C from build/bin/bench_posix_c
        • Rust from src/crates/netipc/target/release/bench_posix
        • Go from build/bin/bench_posix_go
      • the local CMake tree used for the C benchmark was configured as:
        • build/CMakeCache.txt:
          • CMAKE_BUILD_TYPE:STRING=Debug
      • the benchmark target itself added -O2, but the C libraries it linked against were still unoptimized:
        • build/CMakeFiles/bench_posix_c.dir/flags.make:
          • C_FLAGS = -g -std=gnu11 -O2
        • build/CMakeFiles/netipc_protocol.dir/flags.make:
          • C_FLAGS = -g -std=gnu11
        • build/CMakeFiles/netipc_service.dir/flags.make:
          • C_FLAGS = -g -std=gnu11
      • a dedicated optimized benchmark tree proved this materially changes the published POSIX rows:
        • release build setup:
          • cmake -S . -B build-release -DCMAKE_BUILD_TYPE=Release
          • cmake --build build-release --target bench_posix_c bench_posix_go -j8
        • direct targeted reruns:
          • published shm-batch-ping-pong c->c:
            • 25,947,290
          • optimized C libs shm-batch-ping-pong c(rel)->c(rel):
            • 63,699,472
          • published uds-pipeline-batch-d16 c->c:
            • 49,512,090
          • optimized C libs uds-pipeline-batch-d16 c(rel)->c(rel):
            • 103,212,623
        • mixed-language targeted reruns also moved sharply upward when the C side used optimized libraries:
          • intended shm-batch-ping-pong c(rel)->rust:
            • 57,122,454
          • intended shm-batch-ping-pong rust->c(rel):
            • 52,041,263
          • intended uds-pipeline-batch-d16 c(rel)->rust:
            • 91,093,895
          • intended uds-pipeline-batch-d16 rust->c(rel):
            • 101,978,294
      • implemented fix:
        • tests/run-posix-bench.sh now configures and uses a dedicated optimized benchmark tree:
          • default: build-bench-posix
          • build type: Release
        • tests/run-windows-bench.sh now configures and uses a dedicated optimized benchmark tree:
          • default: build-bench-windows
          • build type: Release
          • explicit MinGW toolchain export on win11
      • factual conclusion:
        • the old checked-in POSIX benchmark report was distorted by linking the C benchmark binary against Debug-built C libraries
        • the current checked-in POSIX and Windows benchmark artifacts now come from the corrected dedicated benchmark build paths
    • the Windows benchmark tree is not affected by the same local Debug-build distortion:
      • ssh win11 '... grep CMAKE_BUILD_TYPE build/CMakeCache.txt'
        • CMAKE_BUILD_TYPE:STRING=RelWithDebInfo
      • the previously suspicious Windows SHM batch outlier did not survive the corrected rerun:
        • old checked-in row:
          • shm-batch-ping-pong c->rust = 9,282,667
        • corrected clean rerun row:
          • shm-batch-ping-pong c->rust = 55,868,058
      • final artifact sanity checks:
        • benchmarks-posix.csv
          • rows: 201
          • duplicate keys: 0
          • zero-throughput rows: 0
        • benchmarks-windows.csv
          • rows: 201
          • duplicate keys: 0
          • zero-throughput rows: 0
      • checked-in benchmark docs are now synced to the refreshed artifacts:
        • benchmarks-posix.csv
        • benchmarks-posix.md
        • benchmarks-windows.csv
        • benchmarks-windows.md
        • README.md
      • corrected max-throughput ranges from the current checked-in artifacts:
        • POSIX:
          • uds-ping-pong: 182,963 to 231,160
          • shm-ping-pong: 2,460,317 to 3,450,961
          • uds-batch-ping-pong: 27,182,404 to 40,240,940
          • shm-batch-ping-pong: 31,250,784 to 64,148,960
          • uds-pipeline-d16: 568,373 to 735,829
          • uds-pipeline-batch-d16: 51,960,946 to 102,954,841
          • snapshot-baseline: 158,948 to 205,624
          • snapshot-shm: 1,006,053 to 1,738,616
          • lookup: 114,556,227 to 203,279,430
        • Windows:
          • np-ping-pong: 18,241 to 21,039
          • shm-ping-pong: 2,099,392 to 2,715,487
          • np-batch-ping-pong: 7,013,700 to 8,550,220
          • shm-batch-ping-pong: 36,494,096 to 58,768,397
          • np-pipeline-d16: 245,420 to 270,488
          • np-pipeline-batch-d16: 28,977,365 to 41,270,903
          • snapshot-baseline: 16,090 to 20,967
          • snapshot-shm: 857,823 to 1,262,493
          • lookup: 107,472,315 to 164,305,717
    • current remaining raw Rust drift is now narrower and well-scoped:
      • the raw managed server already enforces one expected_method_code
      • the raw client surface still exposes a generic constructor and mixed call surface under the stale internal name CgroupsClient
      • the next cleanup slice is to bind the raw Rust client constructors to one service kind and migrate the raw Rust tests to those constructors, matching the already-correct Go raw design
    • raw Rust client drift is now removed from the active service surface:
      • src/crates/netipc/src/service/raw.rs now exposes RawClient instead of the stale internal multi-kind name CgroupsClient
      • the raw client is now created only through service-kind-specific constructors:
        • RawClient::new_snapshot(...)
        • RawClient::new_increment(...)
        • RawClient::new_string_reverse(...)
      • request kind remains only as envelope validation on the raw client
      • the raw Rust Unix/Windows tests now create snapshot, increment, and string-reverse clients explicitly instead of reusing one generic constructor across service kinds
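The single-kind client model described above can be sketched in a few lines. The names and method codes below are illustrative Go analogues, not the actual Rust RawClient API: the service kind is fixed at construction time through kind-specific constructors, and the request kind survives only as envelope validation on each call.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative method codes; the real values live in the wire protocol.
const (
	methodSnapshot      uint16 = 1
	methodIncrement     uint16 = 2
	methodStringReverse uint16 = 3
)

// rawClient fixes its service kind at construction; a mixed-kind call is
// rejected locally before anything touches the transport.
type rawClient struct {
	methodCode uint16
}

func newSnapshotClient() *rawClient      { return &rawClient{methodSnapshot} }
func newIncrementClient() *rawClient     { return &rawClient{methodIncrement} }
func newStringReverseClient() *rawClient { return &rawClient{methodStringReverse} }

var errWrongKind = errors.New("request kind does not match this endpoint")

// call keeps the request kind only as envelope validation, matching the
// one-service-kind-per-endpoint model.
func (c *rawClient) call(methodCode uint16, payload []byte) error {
	if methodCode != c.methodCode {
		return errWrongKind
	}
	return nil // the sketch stops before the transport layer
}

func main() {
	inc := newIncrementClient()
	fmt.Println(inc.call(methodIncrement, nil), inc.call(methodSnapshot, nil))
}
```

This is the same shape the Go raw design already had, which is why the Rust cleanup could converge on it directly.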
    • local Linux Rust proof for that slice is now green:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
        • result:
          • 75 passed
          • 0 failed
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 299 passed
          • 0 failed
    • real win11 Rust proof for that slice is now green too:
      • timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 176 passed
          • 0 failed
          • 1 ignored
    • the broader win11 interop/service/cache matrix initially exposed two more stale constructor residues outside the Rust raw tests:
      • Rust benchmark drivers still imported the deleted raw CgroupsClient instead of using the public snapshot facade
        • fixed in:
          • bench/drivers/rust/src/main.rs
          • bench/drivers/rust/src/bench_windows.rs
      • Go public cgroups wrappers still called the deleted generic raw constructor:
        • raw.NewClient(...)
        • fixed in:
          • src/go/pkg/netipc/service/cgroups/client.go
          • src/go/pkg/netipc/service/cgroups/client_windows.go
      • Go benchmark drivers still hand-rolled the stale raw dispatch signature instead of using the single-kind increment adapter
        • fixed in:
          • bench/drivers/go/main.go
    • the next verified contradiction slice was documentation-heavy and is now resolved:
      • low-level SHM / handshake docs now describe the accepted directional negotiation model and the current session-scoped SHM lifecycle:
        • request limits are sender-driven
        • response limits are server-driven
        • SHM capacities are fixed per session
        • larger learned capacities require a reconnect and a new session, not in-place SHM resize
      • docs/level1-wire-envelope.md no longer says handshake rule 6 takes the minimum of client and server values
      • docs/level1-windows-np.md now documents per-session Windows SHM object names with session_id, aligned with both code and docs/level1-windows-shm.md
      • public L2 comments/docs no longer claim a blanket "retry ONCE":
        • ordinary failures still retry once
        • overflow-driven resize recovery may reconnect more than once while capacities grow
      • Unix test/script cleanup helpers no longer remove the stale pre-session path {service}.ipcshm; they now use per-session cleanup that matches {service}-{session_id}.ipcshm
      • validation for this slice is green:
        • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
          • result:
            • 75 passed
            • 0 failed
        • cd src/go && go test -count=1 ./pkg/netipc/service/raw
          • result:
            • ok
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service_interop|test_cache_interop|test_shm_interop)$'
          • result:
            • 100% tests passed
            • 0 failed
    • the next verified residue slice is narrower and fixture-focused:
      • several Unix C/Go fixture cleanup helpers still unlink the dead pre-session path {service}.ipcshm instead of using per-session cleanup
      • current proven hits:
        • tests/fixtures/c/test_service.c
        • tests/fixtures/c/test_cache.c
        • tests/fixtures/c/test_hardening.c
        • tests/fixtures/c/test_chaos.c
        • tests/fixtures/c/test_multi_server.c
        • tests/fixtures/c/test_stress.c
        • src/go/pkg/netipc/service/cgroups/cgroups_unix_test.go
    • that Unix fixture-cleanup residue slice is now resolved:
      • the touched Unix C fixtures now use nipc_shm_cleanup_stale(TEST_RUN_DIR, service) instead of unlinking the dead {service}.ipcshm path
      • the touched Go public cgroups Unix tests now use posix.ShmCleanupStale(testRunDirUnix, service) instead of removing the dead {service}.ipcshm path
      • validation for this slice is green:
        • cd src/go && go test -count=1 ./pkg/netipc/service/cgroups
          • result:
            • ok
        • cmake --build build --target test_service test_cache test_hardening test_multi_server test_chaos test_stress
          • result:
            • rebuild passed
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_cache|test_hardening|test_multi_server|test_chaos|test_stress)$'
          • result:
            • 100% tests passed
            • 0 failed
    • one more live Unix fixture contradiction remains after that cleanup pass:
      • tests/fixtures/c/test_chaos.c:test_shm_chaos() still opens the dead pre-session SHM path {run_dir}/{service}.ipcshm
      • this is not just stale cleanup text; it likely means the SHM-chaos path is not actually targeting the live per-session SHM file today
    • that live SHM-chaos contradiction is now resolved:
      • tests/fixtures/c/test_chaos.c:test_shm_chaos() now captures the live session_id from the ready client session and opens {run_dir}/{service}-{session_id}.ipcshm
      • the test no longer treats "SHM file not found" as an acceptable skip on this path
      • validation:
        • cmake --build build --target test_chaos
          • result:
            • rebuild passed
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^test_chaos$'
          • result:
            • 100% tests passed
            • 0 failed
    • current residue scan excluding this TODO file is now clean for the main drift markers:
      • no remaining old {service}.ipcshm path literals
      • no remaining deleted CgroupsHandlers / CgroupsServer API references
      • no remaining deleted raw.NewClient(...) / service::raw::CgroupsClient references
      • no remaining deleted new_single_kind / with_workers_single_kind references
    • broader Unix validation after these cleanup passes is also green:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop|test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go|test_hardening|test_ping_pong|test_multi_server|test_chaos|test_stress)$'
        • result:
          • 100% tests passed
          • 0 failed
          • 19/19 passed
    • local Go proof for the wrapper/benchmark cleanup is now green:
      • cd src/go && go test -count=1 ./pkg/netipc/service/cgroups
        • result:
          • ok
      • cd bench/drivers/go && go test -run '^$' ./...
        • result:
          • compile-only pass
    • real win11 build + matrix proof after those residue fixes is now green:
      • cmake --build build -j4
        • result:
          • build succeeds again after the Rust/Go constructor cleanup
      • timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"
        • result:
          • test_named_pipe_interop: passed
          • test_win_shm_interop: passed
          • test_service_win_interop: passed
          • test_service_win_shm_interop: passed
          • test_cache_win_interop: passed
          • test_cache_win_shm_interop: passed
          • summary:
            • 100% tests passed, 0 tests failed out of 6
    • verified residue scan for the stale constructor names used in this slice is now clean:
      • no remaining raw.NewClient
      • no remaining service::raw::CgroupsClient
      • no remaining RawClient::new(
    • a smaller cross-platform residue cleanup is now also complete:
      • the test-only Rust helper dispatch_single() in src/crates/netipc/src/service/raw.rs is now explicitly marked as dead-code-tolerant under test builds, so Windows lib-test builds no longer emit the stale unused-function warning
      • the remaining public docs/spec wording in this slice was normalized away from the older "method-specific" phrasing where it described the public L2 service surface or service contracts:
        • docs/level1-transport.md
        • docs/codec.md
        • docs/level2-typed-api.md
        • docs/code-organization.md
        • docs/codec-cgroups-snapshot.md
    • local Linux validation after that wording/test-helper cleanup is still green:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 299 passed
          • 0 failed
    • real win11 validation after that cleanup is also still green:
      • timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 176 passed
          • 0 failed
          • 1 ignored
        • factual note:
          • the previous Windows-only dispatch_single unused-function warning is no longer present in this run
        • the Windows guard output still shows the accepted request-resize behavior:
          • transparent recovery
          • exactly one reconnect
          • negotiated request-size growth
    • new verified internal raw-client alignment:
      • fact:
        • the raw managed servers in Go and Rust were already bound to one expected_method_code
        • the remaining client-side drift was that one long-lived raw client context still exposed multiple service-kind calls
      • implementation slice now completed in Go:
        • raw Go clients are now created per service kind:
          • NewSnapshotClient(...)
          • NewIncrementClient(...)
          • NewStringReverseClient(...)
        • each client now stores one expected request code and rejects wrong-kind calls as validation failures instead of pretending one client can legitimately serve multiple service kinds
        • the cache helpers now bind explicitly to cgroups-snapshot
      • exact local Unix proof:
        • cd src/go && go test -count=1 ./pkg/netipc/service/raw
        • result:
          • ok
      • exact real Windows proof on win11:
        • cd ~/src/plugin-ipc.git/src/go && go test -count=1 ./pkg/netipc/service/raw
        • first rerun exposed one Windows-only missed constructor site:
          • pkg/netipc/service/raw/shm_windows_test.go:334
          • stale NewClient(...)
        • after correcting that last Windows-only leftover and resyncing:
          • result:
            • ok
      • factual conclusion:
        • the Go raw helper layer is now materially aligned with the accepted single-service-kind design on both Unix and Windows
        • remaining work is to carry the same invariant through the remaining Rust raw helper surface
    • a full Rust cargo test --lib run is still blocked by one unrelated transport failure outside this rewrite slice:
      • transport::posix::tests::test_receive_batch_count_exceeds_limit
    • remaining heavy work is now concentrated in:
      • proving the accepted resize behavior with the full interop/service/cache matrices on Unix and Windows, not just the targeted raw suites
      • getting real Windows compile/run proof for the edited Rust/Go/C Windows test surfaces
      • reconciling the current C path with the final single-kind + learned-size design language everywhere, then validating all 3 languages together

Analysis

Verified facts about Netdata today

  • cgroups.plugin is not an external executable. It runs inside the Netdata daemon:
    • cgroups_main() is started from src/daemon/static_threads_linux.c.
  • ebpf.plugin is a separate external executable:
    • built by add_executable(ebpf.plugin ...) in CMakeLists.txt.
  • Current cgroups.plugin -> ebpf.plugin integration is a custom SHM + semaphore contract:
    • producer: src/collectors/cgroups.plugin/cgroup-discovery.c
    • shared structs: src/collectors/cgroups.plugin/sys_fs_cgroup.h
    • consumer: src/collectors/ebpf.plugin/ebpf_cgroup.c
  • The shared payload currently transports cgroup metadata, not PID membership:
    • fields: name, hash, options, enabled, path
    • ebpf.plugin still reads each cgroup.procs file itself.
  • Netdata already has a stable per-run invocation identifier:
    • src/libnetdata/log/nd_log-init.c
    • Netdata reads NETDATA_INVOCATION_ID, else INVOCATION_ID, else generates a UUID and exports NETDATA_INVOCATION_ID.
  • External plugins are documented to receive NETDATA_INVOCATION_ID:
    • src/plugins.d/README.md
  • Netdata already exposes plugin environment variables centrally:
    • src/daemon/environment.c
  • Netdata already has the right build roots for all 3 languages:
    • C via top-level CMakeLists.txt
    • Rust workspace in src/crates/Cargo.toml
    • Go module in src/go/go.mod

Verified facts about plugin-ipc today

  • plugin-ipc already has the exact L3 cgroups snapshot API for this use case:
    • docs/level3-snapshot-api.md
  • The typed snapshot schema closely matches Netdata’s current SHM payload:
    • src/libnetdata/netipc/include/netipc/netipc_protocol.h
  • The C API already supports:
    • managed server lifecycle
    • typed cgroups client/cache
    • POSIX transport with negotiated SHM fast path
  • Authentication in plugin-ipc is a uint64_t auth_token:
    • src/libnetdata/netipc/include/netipc/netipc_service.h
    • src/libnetdata/netipc/include/netipc/netipc_uds.h
    • Rust/Go implementations use the same concept.

Important integration implications

  • Phase 1 can replace the metadata transport only.
  • Phase 1 will not remove ebpf.plugin reads of cgroup.procs.
  • The default plugin-ipc response size is too small for real Netdata snapshots on large hosts, so Linux integration must use an explicit large response limit.
  • The best build/distribution model is in-tree vendoring inside Netdata, not an external system dependency.
  • Current Netdata payload sizing evidence already proves this:
    • cgroup_root_max default is 1000 in src/collectors/cgroups.plugin/sys_fs_cgroup.c
    • current per-item SHM body carries name[256] and path[FILENAME_MAX + 1] in src/collectors/cgroups.plugin/sys_fs_cgroup.h
    • FILENAME_MAX on this Linux build environment is 4096
    • this means the current per-item shape is already about 4.3 KiB before protocol framing/alignment
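
The per-item and worst-case figures above can be checked with quick arithmetic. The sketch below uses only the name[256], path[FILENAME_MAX + 1], and cgroup_root_max values quoted from the Netdata sources; it deliberately ignores the smaller hash/options/enabled fields and protocol framing/alignment, so real payloads are slightly larger still.

```go
package main

import "fmt"

// Back-of-the-envelope sizing for the worst-case cgroups snapshot, using
// the figures quoted above. Only the two big char arrays are counted.
const (
	nameBytes  = 256      // name[256] in sys_fs_cgroup.h
	pathBytes  = 4096 + 1 // path[FILENAME_MAX + 1], FILENAME_MAX = 4096 here
	maxCgroups = 1000     // cgroup_root_max default
)

func perItemBytes() int   { return nameBytes + pathBytes }
func worstCaseBytes() int { return perItemBytes() * maxCgroups }

func main() {
	fmt.Printf("per item: %d B (~%.1f KiB)\n",
		perItemBytes(), float64(perItemBytes())/1024)
	fmt.Printf("worst case: %d B (~%.1f MiB)\n",
		worstCaseBytes(), float64(worstCaseBytes())/(1024*1024))
}
```

1000 items at roughly 4.25 KiB each already lands above 4 MiB before any framing, which is why a small fixed response budget cannot cover large hosts.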

Verified design-drift findings

  • The original written phase plan did not describe a multi-method server.
    • Evidence:
      • TODO-plugin-ipc.history.md
      • historical phase plan still says:
        • Define and freeze a minimal v1 typed schema for one RPC method ('increment')
  • The first generated L2 spec also did not need a multi-method server model.
    • Evidence:
      • initial docs/level2-typed-api.md from commit 1722f95
      • handler contract was framed as one typed request view + one response builder per handler callback
      • no raw transport-level switch over multiple method codes in that initial text
  • The history TODO already contained the correct service-oriented discovery model.
    • Evidence:
      • TODO-plugin-ipc.history.md
      • explicit historical decisions already said:
        • discovery is service-oriented, not plugin-oriented
        • service names are the stable public contract
        • one endpoint per service
        • one persistent client context per service
        • startup order can remain random
        • caller owns reconnect cadence via refresh(ctx)
    • Implication:
      • the later multi-method server model was not a missing discussion
      • it was drift away from an already-decided service model
  • The first explicit spec drift appears in commit 53b5e5a on 2026-03-16.
    • Evidence:
      • docs/level2-typed-api.md in commit 53b5e5a
      • handler contract changed to:
        • raw-byte transport handler
        • switch(method_code)
        • INCREMENT
        • STRING_REVERSE
        • CGROUPS
      • this is the first clear documentation model where one server endpoint dispatches multiple request kinds
  • The first strong implementation-level generalization appears the same day in commit 69bb794.
    • Evidence:
      • commit message explicitly says:
        • Add dispatch_increment(), dispatch_string_reverse(), dispatch_cgroups_snapshot()
      • docs/getting-started.md in that commit adds typed helper examples for more than one method family
      • this widened the implementation and examples toward a generic multi-method dispatch surface
  • The drift was then reinforced in public examples in commit 6014b0e on 2026-03-17.
    • Evidence:
      • docs/getting-started.md
      • C example registers:
        • .on_increment
        • .on_cgroups
      • Rust example registers:
        • on_increment
        • on_cgroups
      • Go example registers:
        • OnIncrement
        • OnSnapshot
      • text says:
        • You register typed callbacks for the supported methods
  • The drift became operationally entrenched in interop in commit 099945b on 2026-03-16.
    • Evidence:
      • commit message explicitly says:
        • Cross-language interop now tests all method types
      • interop fixtures for C, Rust, and Go on POSIX and Windows all dispatch:
        • INCREMENT
        • CGROUPS_SNAPSHOT
        • STRING_REVERSE
  • The drift later propagated into current coverage/TODO planning and the repository README.
    • Evidence:
      • TODO-pending-from-rewrite.md planned:
        • snapshot / increment / string-reverse / batch over SHM
      • README.md now says:
        • servers register typed handlers

Current factual conclusion from the drift investigation

  • There is currently no evidence in the TODO history that the original direction from the user was:
    • one server should serve multiple request kinds
  • The strongest historical evidence points the other way:
    • the original phase plan explicitly named one RPC method only
  • Working theory:
    • the drift started when the typed API was generalized from:
      • one typed request kind per server
      • to
      • one generic server dispatching multiple method codes
    • then examples, interop fixtures, tests, coverage plans, and README text copied that model until it felt normal

Decisions

Made

  1. Windows runtime validation host

    • User decision: use win11 over SSH for real Windows proof instead of stopping at source cleanup or cross-compilation from Linux.
    • Constraint:
      • prefer the already-documented win11 workflow from this repository's TODOs/docs
      • do not guess the Windows execution flow when the repo already documents it
    • Implication:
      • touched Windows Rust/Go/C transport/service/interop/cache surfaces should now be proven on a real Windows runtime, not just by static review or Linux-hosted cross-compilation
      • the next implementation slice should follow the existing win11 operational guidance already captured in the repo
  2. Authentication source

    • User decision: use NETDATA_INVOCATION_ID for authentication.
    • Meaning:
      • the auth value changes on every Netdata run
      • only plugins launched under the same Netdata instance can authenticate
    • Evidence:
      • src/libnetdata/log/nd_log-init.c creates/exports NETDATA_INVOCATION_ID
      • src/plugins.d/README.md documents it for external plugins
    • Implication:
      • this is stronger than a machine-stable token for local plugin-to-plugin IPC
      • restarts invalidate old clients automatically
  3. Source layout in Netdata

    • User decision: native Netdata layout.
    • Layout:
      • C in src/libnetdata/netipc/
      • Rust in src/crates/netipc/
      • Go in src/go/pkg/netipc/
    • Implication:
      • the library becomes a first-class internal Netdata component in all 3 languages
      • future sync from plugin-ipc upstream will be manual/curated, not subtree-based
  4. Invocation ID to auth-token mapping

    • User decision: derive the plugin-ipc uint64_t auth_token from NETDATA_INVOCATION_ID using a deterministic hash.
    • Constraint:
      • the mapping must be identical in C, Rust, and Go
    • Implication:
      • only processes launched under the same Netdata run can authenticate
      • Netdata restart rotates auth automatically
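
One candidate for the deterministic string-to-uint64 mapping is FNV-1a 64-bit over the raw bytes of NETDATA_INVOCATION_ID: it is trivial to reimplement byte-for-byte in C, Rust, and Go, which satisfies the identical-in-all-3-languages constraint. The sketch below is illustrative only; the integration may settle on a different hash, and `authTokenFromInvocationID` is a hypothetical name, not a library API.

```go
package main

import (
	"fmt"
	"os"
)

// FNV-1a 64-bit over the invocation-id string. The constants are the
// standard FNV offset basis and prime, so any conforming FNV-1a
// implementation in C or Rust produces the same uint64.
func authTokenFromInvocationID(id string) uint64 {
	const (
		offsetBasis uint64 = 0xcbf29ce484222325
		prime       uint64 = 0x100000001b3
	)
	h := offsetBasis
	for i := 0; i < len(id); i++ {
		h ^= uint64(id[i])
		h *= prime
	}
	return h
}

func main() {
	// The daemon exports NETDATA_INVOCATION_ID per run, so the derived
	// token rotates automatically on every Netdata restart.
	id := os.Getenv("NETDATA_INVOCATION_ID")
	fmt.Printf("auth_token=%#016x\n", authTokenFromInvocationID(id))
}
```

Because the input changes every run, old clients fail authentication after a restart without any extra revocation logic.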
  5. Rollout mode

    • User decision: big-bang switch.
    • Implication:
      • there will be no legacy custom-SHM fallback path for this metadata channel
    • Risk:
      • any bug in the new path blocks ebpf.plugin cgroup metadata integration immediately
  6. Linux response size policy

    • User concern/decision direction:
      • do not accept a large fixed memory cost such as 16 MiB just for this IPC path
      • prefer dynamic behavior that adapts to actual payload size
      • allocation should happen only when needed
    • Implication:
      • the current plugin-ipc response budgeting model needs review before integration
      • response sizing / negotiation may need design changes, not just configuration
  7. Snapshot overflow handling direction

    • User decision direction:
      • reconnect is acceptable for snapshot overflow handling
      • growth policy should be power-of-two
      • SHM L2 should transparently handle overflow-driven resizing, hidden from both L2 clients and L2 servers
    • User design intent:
      • the server should not need to know the final safe snapshot size before the first request
      • the first real overflow during response preparation should trigger the resize path
      • once the server has learned a larger size from a real snapshot, later clients should negotiate into that larger size automatically
    • Implication:
      • current fixed per-session SHM sizing and current HELLO/HELLO_ACK limit semantics are not sufficient as-is for this Netdata use case
      • the growth mechanism likely needs new L2 protocol behavior, not only implementation tweaks
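
The client-visible shape of that overflow recovery can be sketched as a bounded loop, under hypothetical names: a LIMIT_EXCEEDED-style failure triggers disconnect/reconnect and a retry, and more than one round may be needed while capacities double toward the real snapshot size. In the accepted design this loop lives inside SHM L2, hidden from callers.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel standing in for a LIMIT_EXCEEDED transport status.
var errLimitExceeded = errors.New("limit exceeded")

// requestWithResize retries a call across reconnects until it fits or the
// round budget is exhausted. Each reconnect is assumed to negotiate into
// the server's newly learned (doubled) capacity.
func requestWithResize(maxRounds int, reconnect func() error,
	call func() ([]byte, error)) ([]byte, error) {
	for round := 0; round < maxRounds; round++ {
		out, err := call()
		if !errors.Is(err, errLimitExceeded) {
			return out, err // success, or a failure resizing cannot fix
		}
		if rerr := reconnect(); rerr != nil {
			return nil, rerr
		}
	}
	return nil, fmt.Errorf("still over capacity after %d rounds", maxRounds)
}

func main() {
	attempts := 0
	out, err := requestWithResize(8,
		func() error { return nil }, // stand-in reconnect
		func() ([]byte, error) {
			attempts++
			if attempts < 3 { // first two rounds overflow while capacity grows
				return nil, errLimitExceeded
			}
			return []byte("snapshot"), nil
		})
	fmt.Println(string(out), err, attempts)
}
```

This also matches the documented retry wording: ordinary failures retry once, while overflow-driven recovery may legitimately reconnect several times before stabilizing.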
  8. Pre-integration gating

    • User decision:
      • implement this transparent SHM resize behavior in plugin-ipc first
      • do not start Netdata integration before it is done
      • require thorough validation first, including full interop matrices across C/Rust/Go on Unix and Windows
    • Verified evidence that the repo already has the right validation scaffolding:
      • POSIX interop tests in CMakeLists.txt:
        • test_uds_interop
        • test_shm_interop
        • test_service_interop
        • test_service_shm_interop
        • test_cache_interop
        • test_cache_shm_interop
      • Windows interop tests in CMakeLists.txt:
        • test_named_pipe_interop
        • test_win_shm_interop
        • test_service_win_interop
        • test_cache_win_interop
      • Existing transport-specific integration tests already exist:
        • POSIX SHM: tests/fixtures/c/test_shm.c, Rust src/crates/netipc/src/transport/shm_tests.rs
        • Windows SHM: tests/fixtures/c/test_win_shm.c, Rust src/crates/netipc/src/transport/win_shm.rs, Go src/go/pkg/netipc/transport/windows/shm_test.go
    • Implication:
      • the resize feature must be proven at:
        • L1 transport level
        • L2 service/client level
        • cross-language interop level
        • both POSIX and Windows implementations
  9. Design priorities for the resize rewrite

    • User decision:
      • optimize for long-term correctness, reliability, robustness, and performance
      • backward compatibility is not required
      • do not optimize for minimizing work now
      • prefer the right design even if that means a substantial rewrite
    • Implication:
      • decisions should favor clean semantics and maintainability over preserving current handshake/transport structure
      • a third rewrite is acceptable if it produces a better architecture
  10. User design constraints from follow-up discussion

    • IPC servers should service a single request kind.
    • Sessions should be assumed long-lived:
      • connect once
      • serve many requests
      • disconnect on shutdown or exceptional recovery
  11. Benchmark refresh slice disposition

  • User decision:
    • commit and push the refreshed benchmark slice now
    • then investigate the remaining benchmark spreads separately
  • Implication:
    • commit only the benchmark-fix, benchmark-artifact, and benchmark-doc sync files from this slice
    • do not mix this commit with unrelated cleanup or integration work
  12. Current commit scope
  • User decision:
    • commit and push the full remaining work from this task now
  • Implication:
    • stage the remaining drift-removal, SHM-resize, service-kind alignment, test, and doc changes that belong to this task
    • avoid unrelated local or user-owned changes outside this task
  • Steady-state fast path matters far more than the rare resize path.
  • Learned transport sizes are important:
    • adapt automatically
    • stabilize quickly
    • then remain fixed for the lifetime of the process
    • reset on restart
  • Separate request and response sizing should exist.
  • Variable sizing pressure is expected mainly on responses, not requests.
  • Artificial hard caps are not acceptable as a design crutch.
  • Disconnect-based recovery is acceptable if it is reliable and the system stabilizes.
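
The learned-size lifecycle described above can be sketched as a tiny per-process tracker, under assumed names: it starts at a per-server-kind compile-time default, grows by powers of two on overflow, never shrinks, and resets only when the process restarts. This is not the library's real API, just the invariant in code form.

```go
package main

import "fmt"

// learnedSize tracks one direction's capacity (e.g. response bytes) for
// the lifetime of the process. initial must be > 0 or grow would loop.
type learnedSize struct {
	current uint64
}

func newLearnedSize(initial uint64) *learnedSize {
	return &learnedSize{current: initial}
}

// grow doubles the capacity until a payload of needBytes fits and returns
// the new capacity. It never shrinks, so the size stabilizes quickly.
func (l *learnedSize) grow(needBytes uint64) uint64 {
	for l.current < needBytes {
		l.current *= 2
	}
	return l.current
}

func main() {
	resp := newLearnedSize(64 * 1024) // per-server-kind compile-time default
	// The first oversized real snapshot forces power-of-two growth.
	fmt.Println(resp.grow(5 * 1024 * 1024)) // prints 8388608
	// Smaller later snapshots do not shrink the learned capacity.
	fmt.Println(resp.grow(1024)) // prints 8388608
}
```

Later clients that reconnect negotiate into the larger learned capacity, so steady-state traffic never pays the resize cost again.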
  13. Accepted architecture decisions for the SHM resize rewrite
  • User accepted:
    • L2 service model: single-method-per-server
    • Resize signaling path: explicit LIMIT_EXCEEDED signal, then disconnect/reconnect
    • Auto-resize scope: separate learned request and response sizing, both supported
    • Initial size policy: per-server-kind compile-time defaults
    • Learned-size lifetime: in-memory only for the current process lifetime, reset on restart
  • Implication:
    • the current generic multi-method service abstraction is now known design drift
    • the rewrite should simplify transport/service code around one request kind per server
  14. Service discovery and availability model
  • User clarified the intended service model explicitly:
    • clients connect to a service kind, not to a specific plugin implementation
    • each service endpoint serves one request kind only
    • example service kinds include:
      • cgroups-snapshot
      • ip-to-asn
      • pid-traffic
    • the serving plugin is intentionally abstracted away from clients
  • User clarified the intended runtime model explicitly:
    • plugins are asynchronous peers
    • startup order is not guaranteed
    • enrichments from other plugins/services are optional
    • a client plugin may start before the service it needs exists
    • a service may disappear and reappear during runtime
    • clients must reconnect periodically and tolerate service absence
  • Implication:
    • repository docs/specs/TODOs must describe:
      • service-name-based discovery
      • service-type ownership independent from plugin identity
      • optional dependency semantics
      • reconnect / retry behavior for not-yet-available services
  15. Execution mandate for this phase
  • User decision:
    • proceed autonomously to remove the drift from implementation and docs
    • align code, tests, and examples to the single-service-kind model
    • implement the accepted SHM size renegotiation / resize behavior
    • remove contradictory wording and stale examples that preserve the wrong model
  • Implication:
    • this is now a repository-wide consistency and implementation task
    • active docs, public APIs, interop fixtures, and validation must converge on the same model before Netdata integration
  16. Request-kind field semantics
  • User clarification:
    • request type / method code may remain in wire structures and headers
    • its role is validation, not public multi-method dispatch
    • a service endpoint expects exactly one request kind
    • any other request kind must be rejected
  • Implication:
    • we can keep method codes in the protocol
    • service implementations must bind one endpoint to one expected request kind
    • public APIs/tests/docs must not imply that one service endpoint accepts multiple unrelated request kinds
  17. Payload-vs-service boundary
  • User clarification:
    • if a service needs arrays of things, batching belongs to that service payload/codec
    • batching is not a reason for one L2 endpoint to expose multiple public request kinds
  • Implication:
    • the public L2 service layer should not keep generic multi-method or generic batch dispatch as part of its contract
    • INCREMENT, STRING_REVERSE, and batch ping-pong traffic can remain at protocol / transport / benchmark level
    • the public cgroups snapshot service should be snapshot-only

Pending

  1. Service naming and endpoint placement

    • Context:
      • POSIX transport needs a service name and run-dir placement.
      • Netdata already has os_run_dir(true).
    • Open question:
      • exact service name/versioning strategy for the cgroups snapshot endpoint
  2. Exact Linux response-size budget

    • Context:
      • user rejected a large fixed per-connection budget as bad for footprint
      • dynamic/adaptive options must be evaluated against the current plugin-ipc design
    • Current hard payload evidence:
      • 1000 cgroups at roughly 4.3 KiB each already implies multi-megabyte worst-case snapshots
    • Open question:
      • what protocol / implementation change best preserves low idle footprint while still supporting large snapshots
  3. Dynamic response sizing model

    • Context:
      • current plugin-ipc session handshake negotiates agreed_max_response_payload_bytes once
      • current implementations then size buffers against that session-wide maximum
    • Verified evidence:
      • handshake uses min(client, server) in src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • C client allocates request/response/send buffers eagerly in src/libnetdata/netipc/src/service/netipc_service.c
      • C server allocates per-session response buffer sized to the full negotiated maximum in src/libnetdata/netipc/src/service/netipc_service.c
      • Linux SHM region size is fixed from negotiated request/response capacities in src/libnetdata/netipc/src/transport/posix/netipc_shm.c
      • UDS chunked receive is already dynamically grown with realloc in src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • Rust and Go clients are already more dynamic and grow buffers lazily in:
        • src/crates/netipc/src/service/cgroups.rs
        • src/go/pkg/netipc/service/cgroups/client.go
      • Netdata ebpf.plugin refreshes cgroup metadata every 30 seconds:
        • src/collectors/ebpf.plugin/ebpf_process.h
        • src/collectors/ebpf.plugin/ebpf_cgroup.c
    • Decision needed:
      • choose whether to keep the current protocol and improve allocation policy only, or evolve the protocol to support truly dynamic large snapshots
    • Options:
      • A. Keep protocol, make implementation adaptive, and use baseline-only transport for the cgroups snapshot service in phase 1
      • B. Add paginated snapshot requests/responses
      • C. Add out-of-band exact-sized bulk snapshot transfer for large responses
      • D. Keep the current fixed session-wide max model and just configure a large cap
      • E. Keep SHM for data, but negotiate/create SHM capacity per request instead of per session
      • F. Split transport into a tiny control channel plus ephemeral payload channel/object
      • G. Add a small size-probe step before fetching the full snapshot
      • H. Add true server-streamed snapshot responses (multi-message response sequence)
      • I. Allow snapshot responses to return "resize to X bytes and retry", so the client grows once on demand and reuses that larger buffer from then on
      • J. Make SHM L2 transparently reconnect and double capacities on overflow, so resizing is hidden from both clients and servers and the server retains the learned larger size for future sessions
    • Current preferred direction under discussion:
      • J, but it still needs stress-testing against the current HELLO/HELLO_ACK semantics, SHM lifecycle, and L2 retry behavior
  4. Transparent SHM resize semantics

    • Context:
      • user direction is to make SHM L2 resizing automatic and transparent to both clients and servers
      • reconnect is acceptable and growth should be power-of-two on overflow
    • Verified evidence:
      • current server sends NIPC_STATUS_INTERNAL_ERROR on handler/batch failure in src/libnetdata/netipc/src/service/netipc_service.c
      • current C/Go/Rust clients treat any non-OK response transport status as bad layout / failure:
        • src/libnetdata/netipc/src/service/netipc_service.c
        • src/go/pkg/netipc/service/cgroups/client.go
        • src/crates/netipc/src/service/cgroups.rs
      • NIPC_STATUS_LIMIT_EXCEEDED already exists in src/libnetdata/netipc/include/netipc/netipc_protocol.h
    • Corrected layering rule from user discussion:
      • transport/L2 may handle overflow signaling, reconnect, and shared-memory remap mechanics
      • replay detection for mutating RPCs belongs to the request payload and the server business logic, not to transport-level semantic dedupe
    • Clarified implication:
      • transport should not try to "understand" whether a mutation was already applied
      • if a mutating method cares about replay safety, it must carry a request identity / idempotency token in its own payload and the server method must enforce it
    • For the Netdata cgroups snapshot use case:
      • this is not a blocker, because snapshot is read-only
    • Open question:
      • whether transparent reconnect-and-retry should be generic transport behavior for all methods, or exposed as a capability that higher layers opt into when their payload semantics make replay safe
  5. Negotiation semantics for learned SHM size

    • Context:
      • user correctly rejected the current min(client, server) rule for learned snapshot sizing
      • current handshake stores only one scalar per direction, so it cannot distinguish:
        • client hard cap
        • client initial size
        • server learned target size
    • Verified evidence:
      • current HELLO/HELLO_ACK uses fixed agreed_max_* fields in:
        • src/libnetdata/netipc/src/transport/posix/netipc_uds.c
        • src/crates/netipc/src/transport/posix.rs
        • src/crates/netipc/src/transport/windows.rs
    • Open question:
      • should the protocol split "current operational size" from "hard ceiling", so the server can advertise a learned larger target without losing the client’s ability to refuse absurd allocations
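
One possible answer to that open question, sketched with hypothetical names: carry three scalars through the handshake instead of one per direction, so the server can advertise a learned target while the client keeps a hard ceiling it will never exceed. This is not the current HELLO/HELLO_ACK format, only a candidate rule for discussion.

```go
package main

import "fmt"

// negotiateResponseSize picks the operational response capacity for a new
// session: adopt the larger of the client's initial size and the server's
// learned target, but never exceed the client's hard ceiling.
func negotiateResponseSize(clientInitial, clientHardCap, serverLearned uint64) uint64 {
	want := clientInitial
	if serverLearned > want {
		want = serverLearned // adopt the server's learned target when larger
	}
	if want > clientHardCap {
		want = clientHardCap // the client can still refuse absurd allocations
	}
	return want
}

func main() {
	// A fresh 64 KiB client meets a server that learned 8 MiB snapshots.
	fmt.Println(negotiateResponseSize(64*1024, 64*1024*1024, 8*1024*1024))
	// A client with a 1 MiB hard cap clamps the same advertisement.
	fmt.Println(negotiateResponseSize(64*1024, 1*1024*1024, 8*1024*1024))
}
```

Under this rule, new clients skip the resize-reconnect dance entirely once the server has learned a good size, yet a misbehaving server cannot force unbounded allocations on anyone.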
  6. Request-side vs response-side SHM growth asymmetry

    • Verified evidence:
      • POSIX SHM send rejects oversize messages locally before the peer can react:
        • src/libnetdata/netipc/src/transport/posix/netipc_shm.c
      • existing tests already cover this class of failure:
        • tests/fixtures/c/test_shm.c
        • tests/fixtures/c/test_service.c (test_shm_batch_send_overflow_on_negotiated_limit)
        • tests/fixtures/c/test_win_shm.c
        • tests/fixtures/c/test_win_service_guards.c
    • Implication:
      • response-capacity growth can be learned by the server while building a response
      • request-capacity growth cannot be learned the same way, because an oversize request fails client-side before the server sees it
    • Open question:
      • should the first implementation cover:
        • response-side transparent resize only
        • or symmetric request+response resize with separate client-learned request sizing semantics
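
The symmetric option can be sketched with hypothetical names to show where the asymmetry bites: since an oversize request is rejected locally before the server ever sees it, the client itself must learn the larger request capacity, grow it (mirroring the power-of-two response policy), and reconnect into a session sized for it.

```go
package main

import "fmt"

// sendWithRequestGrowth wraps a send with client-learned request sizing.
// reconnect and send are stand-ins for the real transport operations.
func sendWithRequestGrowth(reqCap *uint64, payload []byte,
	reconnect func(newCap uint64) error, send func([]byte) error) error {
	need := uint64(len(payload))
	if need > *reqCap {
		for *reqCap < need {
			*reqCap *= 2 // power-of-two growth, learned client-side
		}
		// A new session is required because SHM capacity is fixed per
		// session; the reconnect carries the learned request size.
		if err := reconnect(*reqCap); err != nil {
			return err
		}
	}
	return send(payload)
}

func main() {
	reqCap := uint64(4096)
	reconnects := 0
	err := sendWithRequestGrowth(&reqCap, make([]byte, 10000),
		func(newCap uint64) error { reconnects++; return nil },
		func(b []byte) error { return nil })
	fmt.Println(err, reqCap, reconnects) // prints <nil> 16384 1
}
```

A response-only first implementation simply omits this wrapper; the sketch shows that adding the request side later does not require new server-side learning, only client-side state plus the same reconnect path.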
  7. Netdata lifecycle ownership details

    • Context:
      • cgroups.plugin runs in-daemon
      • ebpf.plugin is external
    • Open question:
      • exact daemon init/shutdown points for starting/stopping the plugin-ipc cgroups server and for initializing the ebpf.plugin client cache

Plan

  1. Audit the current implementation surfaces that still encode multi-method service behavior.
  2. Define the replacement public model in code terms:
    • one service module per service kind
    • one endpoint per request kind
    • service-specific typed clients/servers/cache helpers
  3. Redesign SHM resize semantics in implementation terms:
    • explicit LIMIT_EXCEEDED
    • disconnect/reconnect recovery
    • separate learned request/response sizes
    • process-lifetime learned sizing
  4. Rewrite the C, Rust, and Go Level 2 service layers to match the corrected model.
  5. Rewrite interop/service fixtures and validation scripts to test one service kind per server.
  6. Rewrite public docs/examples/specs to remove contradictory multi-method wording.
  7. Run targeted tests first, then the full relevant Unix/Windows matrices required to trust the rewrite.
  8. Summarize any residual risk or remaining ambiguity before starting Netdata integration work.
  9. Rerun the current Linux and Windows benchmark matrices on the aligned tree.
  10. Regenerate benchmark artifacts and update all benchmark-derived docs/README summaries.

Implied decisions

  • Preserve Level 1 transport interoperability work where still valid.
  • Preserve codec/message-family work where it remains useful under a service-oriented split.
  • Prefer removal/rename of drifted APIs over keeping compatibility shims, because backward compatibility is not required.
  • Keep request-kind and outer-envelope metadata available to single-kind handlers only for:
    • validating that the endpoint received the expected request kind
    • reading transport batch metadata when a single service kind supports batched payloads
  • Do not use that metadata to reintroduce generic multi-method dispatch at the public Level 2 surface.
  • If a generic Level 2 helper remains for tests/benchmarks, keep it internal and single-kind:
    • one expected request kind per endpoint
    • no public multi-method callback surface
    • no docs/examples presenting it as a production service model

Testing requirements

  • C, Rust, and Go unit tests for the rewritten service APIs
  • POSIX interop matrix for corrected service identities and SHM resize behavior
  • Windows interop matrix for corrected service identities and SHM resize behavior
  • Explicit tests for:
    • late provider startup
    • reconnect after provider restart
    • service absence as a tolerated state
    • SHM resize on response overflow
    • learned-size reuse after reconnect
    • request-side and response-side learned sizing behavior

Documentation updates required

  • Keep README, docs specs, and active TODOs aligned with:
    • service-oriented discovery
    • one request kind per endpoint
    • optional asynchronous enrichments
    • reconnect-driven recovery
    • SHM resize / renegotiation behavior

Plan

  1. Finalize remaining design details above.
  2. Vendor plugin-ipc into Netdata in the chosen native layout.
  3. Add a Linux cgroups typed server inside Netdata daemon lifecycle.
  4. Replace ebpf.plugin shared-memory metadata reader with plugin-ipc cgroups cache client.
  5. Keep existing PID membership logic in ebpf.plugin unchanged in phase 1.
  6. Remove the old custom SHM metadata path as part of the big-bang switch.
  7. Add tests for:
    • normal metadata refresh
    • stale/restarted Netdata invalidating old clients
    • large snapshots
    • ebpf.plugin recovery on server restart

Implied decisions

  • Phase 1 is Linux-only.
  • Phase 1 targets cgroups.plugin -> ebpf.plugin metadata only.
  • Current collectors-ipc/ebpf-ipc.* apps/pid SHM remains untouched.
  • NETDATA_INVOCATION_ID must be available to the ebpf.plugin launcher path and any future external clients.
  • A deterministic invocation-id hashing helper will be needed in C, Rust, and Go.

Testing requirements

  • Unit tests for invocation-id to auth-token derivation in C, Rust, and Go.
  • Integration test proving only same-run plugins can connect.
  • Integration test proving restart rotates auth and old clients fail cleanly.
  • Snapshot scale test with high cgroup counts and long names/paths.
  • ebpf.plugin regression test for existing cgroup discovery semantics.

Documentation updates required

  • Netdata integration design note for the new cgroups metadata transport.
  • Developer docs for the new in-tree netipc layout and per-language use.
  • ebpf.plugin and cgroups.plugin internal docs describing the new IPC path.
  • Rollout/kill-switch documentation if dual-path rollout is selected.

Benchmark remediation progress

  • Verified benchmark-distortion findings before changing code:
    • POSIX shm-batch-ping-pong c/rust spread exceeds the 1.2x threshold:
      • c->c = 64,148,960
      • c->rust = 58,334,803
      • rust->c = 52,277,542
      • rust->rust = 48,220,338
    • The full corrected Linux and Windows matrices also showed broader benchmark-driver artifacts:
      • Go lookup benchmark used a synthetic linear scan instead of the actual cache-style hash lookup.
      • Rust lookup benchmark used a synthetic linear scan too.
      • Rust cache lookup allocated name.to_string() on every lookup.
      • Go and Rust benchmark clients still had hot-loop buffer allocations in batch, pipeline, and ping-pong paths.
  • Implemented first remediation pass:
    • src/crates/netipc/src/service/raw.rs
      • replaced the flat (hash, String) lookup key with nested per-hash maps so Rust cache lookups stop allocating per call
    • bench/drivers/rust/src/main.rs
      • removed hot-loop allocations from SHM batch client
      • removed hot-loop allocations from ping-pong client
      • moved pipeline-batch receive buffer allocation out of the outer loop
      • replaced lookup linear scan with hash-map lookup
    • bench/drivers/rust/src/bench_windows.rs
      • removed the same hot-loop allocations on Windows
      • replaced lookup linear scan with hash-map lookup
    • bench/drivers/go/main.go
      • removed hot-loop allocations from batch, pipeline, pipeline-batch, and ping-pong clients
      • replaced lookup linear scan with hash-map lookup
    • bench/drivers/go/main_windows.go
      • removed the same hot-loop allocations on Windows
      • replaced lookup linear scan with hash-map lookup
  • Validation after the first remediation pass:
    • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
      • 299 passed, 0 failed
    • cd bench/drivers/go && go test -run '^$' ./...
      • compile-only pass
    • cd src/go && go test -count=1 ./pkg/netipc/service/raw ./pkg/netipc/service/cgroups
      • both packages passed
  • Targeted Linux rerun after the first remediation pass:
    • lookup
      • c = 173,132,146
      • rust = 45,886,102
      • go = 47,703,281
      • fact: the fake benchmark scans are gone; the remaining gap is now in the actual lookup data structures
    • shm-batch-ping-pong, target 0
      • c->c = 62,314,895
      • c->rust = 57,112,806
      • rust->c = 51,620,887
      • rust->rust = 47,356,599
      • fact: the Rust client and Rust server penalties are both still real
    • uds-pipeline-d16, target 0
      • c->c = 721,232
      • c->rust = 717,024
      • c->go = 572,552
      • rust->c = 719,458
      • rust->rust = 727,197
      • rust->go = 576,525
      • fact: the remaining delta is mostly a Go server issue, not a client issue
    • uds-pipeline-batch-d16, target 0
      • c->c = 103,250,763
      • c->rust = 91,495,522
      • c->go = 51,623,524
      • rust->c = 102,367,177
      • rust->rust = 89,465,821
      • rust->go = 52,915,850
      • fact: the earlier client-side benchmark distortion is gone; the remaining large delta is mainly the Go server path
  • Next concrete fixes identified from code + rerun evidence:
    • Go and Rust cache lookup should mirror the C open-addressing hash table:
      • evidence:
        • C uses hash ^ djb2(name) with open addressing in src/libnetdata/netipc/src/service/netipc_service.c
        • Go still uses a composite map[{hash,name}] in src/go/pkg/netipc/service/raw/cache.go
        • Rust still uses nested HashMap<u32, HashMap<String, usize>> in src/crates/netipc/src/service/raw.rs
      • implication:
        • Go and Rust still pay full runtime string hashing on every lookup while C does not
    • Go POSIX UDS transport should mirror the C/Rust vectored send path:
      • evidence:
        • C uses sendmsg + two iovecs in src/libnetdata/netipc/src/transport/posix/netipc_uds.c
        • Rust uses raw_send_iov() in src/crates/netipc/src/transport/posix.rs
        • Go still copies header + payload into a merged scratch buffer in src/go/pkg/netipc/transport/posix/uds.go
      • implication:
        • Go server responses on UDS still pay an extra memcpy per message on the hot path
  • Next measurement step:
    • apply the lookup-index and Go UDS send fixes
    • rerun only the affected slices first:
      • Linux: lookup, shm-batch-ping-pong, uds-pipeline-d16, uds-pipeline-batch-d16
      • Windows: lookup, shm-batch-ping-pong, np-pipeline-d16, np-pipeline-batch-d16
    • only after the slice reruns are understood should the full matrices and docs be refreshed again.
  • Second targeted Linux rerun after rebuilding the Rust release benchmark:
    • lookup
      • c = 170,976,986
      • rust = 150,660,413
      • go = 121,278,244
      • fact:
        • Rust lookup is now near C after mirroring the C open-addressing structure
        • Go lookup improved materially too, but it is still above the 1.2x threshold versus C
    • shm-batch-ping-pong, target 0
      • c->c = 60,929,552
      • c->rust = 55,151,867
      • rust->c = 49,426,036
      • rust->rust = 45,104,001
      • fact:
        • Rust still has a real server-side penalty on this path
        • Rust still has a larger real client-side penalty on this path
    • uds-pipeline-d16, target 0
      • c->c = 713,563
      • c->rust = 720,602
      • rust->c = 722,202
      • rust->rust = 712,371
      • c->go = 548,145
      • rust->go = 563,484
      • fact:
        • Rust is now aligned with C on the non-batch UDS pipeline path
        • the remaining delta is almost entirely the Go server path
    • uds-pipeline-batch-d16, target 0
      • c->c = 101,588,680
      • c->rust = 83,396,588
      • rust->c = 99,570,528
      • rust->rust = 86,762,291
      • c->go = 52,899,078
      • rust->go = 51,902,022
      • fact:
        • Rust client-side is now close to C on this path
        • Rust server-side still shows a real batch-path penalty
        • Go server-side is still the dominant outlier
  • Structural batch-path asymmetry verified from code:
    • C managed server exposes a whole-request callback:
      • src/libnetdata/netipc/include/netipc/netipc_service.h:187-192
      • callback receives request_hdr, full request_payload, and whole response_buf
    • C benchmark server uses that whole-request callback to batch-specialize increment in one loop:
      • bench/drivers/c/bench_posix.c:164-216
      • the callback sees NIPC_FLAG_BATCH, loops all items itself, and emits the whole batch response directly
    • Rust managed server exposes only per-item raw dispatch:
      • src/crates/netipc/src/service/raw.rs:1285-1297
      • batch handling is then forced through the managed-server loop:
        • src/crates/netipc/src/service/raw.rs:2002-2047
        • per item: batch_item_get() -> dispatch_single_internal() -> bb.add()
    • Go managed server exposes the same per-item dispatch shape:
      • src/go/pkg/netipc/service/raw/types.go:57-59
      • batch handling is forced through:
        • src/go/pkg/netipc/service/raw/client.go:903-946
        • per item: BatchItemGet() -> dispatchSingle() -> bb.Add()
    • fact:
      • the remaining Rust and Go batch server gaps are not just transport issues
      • C can specialize whole-batch increment handling at the callback boundary; Rust and Go cannot
  • Working theory for the remaining Linux gaps:
    • shm-batch-ping-pong
      • Rust still has both client-side and server-side cost versus C
      • the server-side part aligns with the batch callback asymmetry above
    • uds-pipeline-batch-d16
      • Rust client-side is now nearly aligned with C
      • the remaining Rust delta is mainly server-side batch handling overhead
      • the much larger Go delta is likely server-side too, with the same structural asymmetry plus extra Go dispatch/runtime overhead
  • Decision required before the next implementation step:
    • Background:
      • The remaining batch-path gap is now tied to the managed-server design.
      • Any serious fix must choose whether to optimize only the benchmarks or to change the service/server implementation model.
      1. Batch server optimization strategy
      • Evidence:
        • C whole-request callback:
          • src/libnetdata/netipc/include/netipc/netipc_service.h:187-192
          • bench/drivers/c/bench_posix.c:164-216
        • Rust per-item batch loop:
          • src/crates/netipc/src/service/raw.rs:1285-1297
          • src/crates/netipc/src/service/raw.rs:2002-2047
        • Go per-item batch loop:
          • src/go/pkg/netipc/service/raw/types.go:57-59
          • src/go/pkg/netipc/service/raw/client.go:903-946
      • A. Benchmark-only fast path
        • Implement dedicated Rust/Go benchmark servers that bypass the managed server for increment batch.
        • Pros:
          • fastest way to measure the upper bound
          • smallest code change
        • Implications:
          • benchmark numbers improve, but the library/server path stays asymmetric
        • Risks:
          • hides a real product/library performance issue
          • docs and benchmarks stop representing real library behavior
      • B. Internal managed-server specialization
        • Keep the external single-kind API shape, but add internal fast paths for known service kinds such as increment batch.
        • Pros:
          • fixes real library behavior
          • avoids large public API churn
          • aligned with one-service-kind servers
        • Implications:
          • managed-server internals become aware of service-kind-specific fast paths
        • Risks:
          • hidden complexity if done ad hoc
          • may still leave the public abstraction less explicit than the implementation
      • C. Explicit service-kind-specific server APIs
        • Redesign Rust/Go managed servers so each service kind gets its own whole-request server callback surface, matching the accepted single-kind architecture.
        • Pros:
          • cleanest long-term design
          • makes the fast path explicit instead of hidden
          • best fit for maintainability and performance
        • Implications:
          • broader API/implementation/test/doc rewrite in Rust and Go
        • Risks:
          • largest scope before the next measurement
      • Recommendation:
        • Option C: explicit service-kind-specific server APIs
        • Reason:
          • the evidence shows a real API/implementation asymmetry, not just a hot-loop bug
          • your accepted single-kind-service design already points in this direction
  • Priority check raised by Costa:
    • Background:
      • Current benchmark results are already very high in absolute terms.
      • The remaining gaps are real, but fixing them now would require a broader Rust/Go managed-server redesign for batch-heavy paths.
    • Facts:
      • Clean Linux rerun:
        • lookup
          • c = 170,976,986
          • rust = 150,660,413
          • go = 121,278,244
        • shm-batch-ping-pong
          • c->c = 60,929,552
          • rust->rust = 45,104,001
        • uds-pipeline-batch-d16
          • c->c = 101,588,680
          • rust->rust = 86,762,291
          • go->go = 51,355,370
      • Fact:
        • these are already very high throughputs in absolute terms
        • the remaining work is now mainly about closing relative efficiency gaps, not about making the library viable
    • Working theory:
      • Deferring the remaining batch-path optimization is reasonable if there are more fundamental correctness, architecture, or product-fit issues still open.
      • The benchmark investigation has already done its job by identifying the structural asymmetry and proving where it lives.
  • Updated decision from Costa:
    • continue the benchmark investigation for trust in the framework
    • investigate all remaining >1.20x differences
    • treat the Rust/Go batch-path asymmetry as already identified, and focus next on the remaining unexplained gaps
  • Remaining unexplained Linux gaps after excluding the known batch-path issue:
    • lookup
      • c = 170,976,986
      • rust = 150,660,413
      • go = 121,278,244
    • uds-pipeline-d16
      • c->c = 713,563
      • c->go = 548,145
      • rust->go = 563,484
      • fact:
        • the Go server remains the unexplained outlier on the non-batch pipeline path
  • New concrete finding: Go lookup still pays a by-value item copy on every successful bucket probe
    • Evidence:
      • actual cache lookup:
        • src/go/pkg/netipc/service/raw/cache.go:122-130
        • item := c.items[c.buckets[slot].index] copies the whole CacheItem
      • Go lookup benchmark mirrors the same behavior:
        • bench/drivers/go/main.go:1133-1136
        • bucketItem := cacheItems[lookupIndex[slot].index] copies the whole struct
      • Rust uses a reference:
        • src/crates/netipc/src/service/raw.rs:2376-2379
      • C returns a pointer:
        • src/libnetdata/netipc/src/service/netipc_service.c:1492-1497
    • Implication:
      • the current Go lookup gap is still at least partly a real Go implementation issue, not just a benchmark artifact
  • Follow-up measurement on the Go lookup bucket-copy fix:
    • Applied:
      • src/go/pkg/netipc/service/raw/cache.go
      • bench/drivers/go/main.go
      • changed bucket probes from by-value CacheItem copies to pointer/reference access
    • Rerun results:
      • c = 172,638,775
      • rust = 153,518,048
      • go = 115,783,444
    • Fact:
      • the fix had no material positive effect on Go lookup throughput
      • therefore the by-value bucket copy was not the dominant cause of the remaining Go lookup gap
  • Go lookup profile after the bucket-copy fix:
    • Evidence:
      • live perf profile of bench_posix_go lookup-bench
      • output row:
        • lookup,go,go,127,972,744
      • visible hot frames from /tmp/nipc-go-lookup-perf.data:
        • main.runLookupBench almost all samples
        • time.runtimeNano about 8%
        • runtime.memequal about 2%
    • Fact:
      • no single framework/library helper stands out as the dominant hotspot
      • the operation is so small that benchmark loop overhead and inlining dominate the profile
    • Working theory:
      • the remaining Go lookup gap is not currently a strong signal about the IPC framework itself
      • it is at least partly a benchmark-methodology issue for a tiny in-memory operation
  • Go non-batch pipeline server profile:
    • Evidence:
      • live perf profile of bench_posix_go uds-ping-pong-server under uds-pipeline-d16 load from a C client
      • client result during profile:
        • uds-pipeline-d16,c,c,567,061,...
      • hot frames from /tmp/nipc-go-server-perf.data:
        • Session.Send about 39.5%
        • Session.Receive about 33.8%
        • raw.pollFd about 23.1%
        • increment dispatch does not materially appear
    • Fact:
      • the remaining Go server gap on uds-pipeline-d16 is not in increment handler logic
      • it is dominated by the Go UDS server transport/poll path
    • Supporting fact:
      • Go as a client on the same scenario is only slightly slower than C/Rust:
        • go->c = 699,976 vs c->c = 713,563
        • go->rust = 685,614 vs c->rust = 720,602
      • implication:
        • the big remaining gap is mainly server-side, and pollFd is the strongest server-only suspect
  • New concrete finding: Go non-batch server gap is transport/poll dominated, not dispatch dominated
    • Evidence:
      • the same /tmp/nipc-go-server-perf.data hot-path breakdown above: Session.Send about 39.5%, Session.Receive about 33.8%, raw.pollFd about 23.1%, with increment dispatch not materially present
    • Working theory:
      • the remaining Go server delta on uds-pipeline-d16 is in the Go UDS server transport/wrapper path, especially poll + recvmsg + sendmsg, not in the increment handler logic

Benchmark refresh slice (2026-03-26)

  • TL;DR:
    • rerun the full official benchmark suites on the current worktree for both Linux and Windows
    • regenerate the checked-in benchmark artifacts from those reruns
    • compare the refreshed Linux and Windows matrices and flag any materially strange language deltas
    • review and follow the existing repo TODO guidance for the real Windows win11 benchmark workflow
  • Analysis:
    • current checked-in benchmark artifacts are from 2026-03-25:
      • benchmarks-posix.md
      • benchmarks-windows.md
      • README.md
    • the official full-matrix runners are:
      • Linux:
        • tests/run-posix-bench.sh
        • tests/generate-benchmarks-posix.sh
      • Windows:
        • tests/run-windows-bench.sh
        • tests/generate-benchmarks-windows.sh
    • the verified Windows execution guidance already exists in repo TODOs and README:
      • README.md:342-365
      • TODO-pending-from-rewrite.md:2754-2849
    • current runner/generator methodology facts for Windows trustworthiness:
      • tests/run-windows-bench.sh currently writes exactly one CSV row per benchmark cell:
        • run_pair() parses one client result and immediately appends it to OUTPUT_CSV
        • there is no built-in repetition, aggregation, or instability gate
      • tests/generate-benchmarks-windows.sh validates completeness and floors, but it trusts each CSV row as final truth:
        • it has no notion of repeated samples, medians, spread, or outlier detection
      • implication:
        • a single noisy Windows measurement can currently become the published benchmark artifact if it still parses and keeps throughput above zero
    • benchmark methodology references gathered before changing the Windows workflow:
      • Google Benchmark user guide:
        • repeated benchmarks exist because a single result may not be representative when benchmarks are noisy
        • when repetitions are used, mean / median / standard deviation are reported
        • source examined:
          • /tmp/google-benchmark-20260326/docs/user_guide.md
      • Criterion.rs analysis and user guide:
        • noisy runs should be treated skeptically
        • longer measurement time reduces the influence of outliers
        • outlier classification is a first-class part of reliable benchmark analysis
        • sources examined:
          • /tmp/criterion-rs-20260326/book/src/user_guide/command_line_output.md
          • /tmp/criterion-rs-20260326/book/src/analysis.md
    • verified workflow facts from those docs:
      • real Windows benchmark proof is expected on win11, not via Linux cross-compilation
      • login shell may start as MSYSTEM=MSYS; benchmark runs should set:
        • PATH="/c/Users/costa/.cargo/bin:/c/Program Files/Go/bin:/mingw64/bin:$PATH"
        • MSYSTEM=MINGW64
        • CC=/mingw64/bin/gcc
        • CXX=/mingw64/bin/g++
      • official Windows benchmark commands are:
        • bash tests/run-windows-bench.sh benchmarks-windows.csv 5
        • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
    • the current local worktree is not clean and includes benchmark-related source edits:
      • bench/drivers/go/main.go
      • bench/drivers/go/main_windows.go
      • bench/drivers/rust/src/main.rs
      • bench/drivers/rust/src/bench_windows.rs
      • plus service/transport files that can affect benchmark behavior
    • implication:
      • the refreshed artifacts must reflect this exact current tree
      • benchmark interpretation must distinguish:
        • real implementation/runtime asymmetry
        • normal platform differences
        • measurement distortion or stale artifact drift
  • Decisions:
    • no new user decision required before execution
    • using the existing official full-suite runners is the correct path
    • using the existing real win11 workflow is the correct Windows path
  • Plan:
    • run the full Linux benchmark suite locally on the current tree
    • regenerate benchmarks-posix.md
    • run the full Windows benchmark suite on win11 using the documented native-toolchain environment
    • regenerate benchmarks-windows.md
    • compare refreshed CSVs and summarize the largest cross-language spreads by scenario
    • classify strange deltas as:
      • expected platform/runtime behavior
      • suspicious and possibly measurement-related
      • suspicious and likely implementation-related
    • update benchmark-derived docs if the refreshed artifacts materially change the published snapshot
    • for the Windows trustworthiness fix:
      • change the Windows runner to collect multiple measured repetitions per benchmark cell instead of trusting a single sample
      • aggregate repeated samples into one publication row using a robust statistic instead of one lucky or unlucky run
      • preserve a fail-closed path:
        • if repeated Windows samples for a cell diverge beyond a configured spread threshold, fail the run instead of publishing that cell
      • keep the published CSV shape stable if possible, so the existing generator/report consumers do not need a schema rewrite just to gain trustworthiness
  • Implied decisions:
    • benchmark duration remains the documented default 5 seconds unless the runner fails and forces a diagnostic rerun
    • the first full pass should use the official artifact filenames:
      • benchmarks-posix.csv
      • benchmarks-posix.md
      • benchmarks-windows.csv
      • benchmarks-windows.md
    • if Windows artifacts are produced remotely, copy them back into this repo without resetting unrelated local files
  • Testing requirements:
    • Linux benchmark CSV must contain 201 data rows and pass the generator validation
    • Windows benchmark CSV must contain 201 data rows and pass the generator validation
    • refreshed artifacts must have no duplicate scenario keys and no zero-throughput rows
  • Documentation updates required:
    • update the checked-in benchmark markdown files to match the refreshed CSVs
    • update README.md only if the published generated dates, machine snapshot, or headline benchmark ranges are no longer true after the refresh
  • Execution results:
    • reviewed Windows benchmark handoff guidance before execution:
      • README.md:342-365
      • TODO-pending-from-rewrite.md:2754-2849
    • Linux benchmark refresh completed successfully on the current worktree:
      • command:
        • cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_posix
        • bash tests/run-posix-bench.sh benchmarks-posix.csv 5
        • bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md
      • result:
        • 201 rows
        • generator passed
        • all configured POSIX floors passed
    • Windows benchmark refresh completed on win11 native MSYS/MinGW toolchain path:
      • disposable synced tree:
        • /tmp/plugin-ipc-bench-20260326
      • command:
        • cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows
        • bash tests/run-windows-bench.sh benchmarks-windows.csv 5
        • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
      • factual result:
        • benchmark runner completed 201 rows
        • generator wrote benchmarks-windows.md
        • generator exited non-zero because of one floor violation:
          • shm-ping-pong rust->c @ max = 850,994
          • configured floor: 1,000,000
    • new user requirement after the unstable Windows reruns:
      • make Windows benchmarks trustworthy instead of relying on single noisy runs
      • allowed direction from user:
        • increase duration
        • run multiple repetitions
        • use any stronger methodology needed, as long as the published Windows benchmark artifacts become trustworthy
      • fit-for-purpose clarification:
        • Windows benchmark artifacts must be publication-grade on win11
        • single-run outliers must not be able to define the checked-in benchmark matrix
    • Windows trustworthiness implementation now applied locally:
      • tests/run-windows-bench.sh
        • new default: 5 measured samples per Windows benchmark cell
        • each published CSV row is now the median aggregate of those samples
        • the runner now persists per-cell repeated samples in RUN_DIR during execution
        • initial implementation used a blunt raw spread gate:
          • fail if max(sample_throughput) / min(sample_throughput) > 1.35
      • tests/generate-benchmarks-windows.sh
        • markdown output now states that the current Windows report is based on repeated aggregated measurements instead of one single sample
    • targeted proof of the new Windows trust method on win11:
      • synced the updated Windows runner/generator into the same disposable proof tree:
        • /tmp/plugin-ipc-bench-20260326
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-trust.csv 5
      • factual result:
        • completed successfully with the new 5-sample median path
        • no stability-gate failure
        • the previously suspicious rows are now stable:
          • shm-ping-pong rust->c @ max = 2,527,551
          • shm-ping-pong rust->rust @ 10000 = 9,999
        • all reported SHM sample ratios observed during that proof stayed well below the 1.35 gate
      • implication:
        • the old single-shot Windows SHM collapses were publication-methodology failures
        • with repeated measurement + median aggregation + spread gating, the same win11 host now produces a stable SHM matrix
    • first stability-gate refinement after proof runs on win11:
      • fact:
        • the initial raw max/min gate was too blunt for legitimate runs with one obvious transient outlier
      • evidence:
        • repeated sample file from the first full repeated run:
          • /tmp/netipc-bench-300472/samples-np-ping-pong-c-go-100000.csv
        • measured throughputs:
          • 17,798
          • 19,059
          • 15,586
          • 6,741
          • 18,303
        • implication:
          • one bad transient sample should not discard the whole row if the remaining samples agree tightly
      • attempted follow-up:
        • a Tukey-style outlier fence was tested next
      • fact:
        • with only 5 samples, that approach was too aggressive and incorrectly marked normal edge values as outliers
      • evidence:
        • repeated sample file:
          • /tmp/netipc-bench-287769/samples-np-ping-pong-go-c-0.csv
        • measured throughputs:
          • 17,419
          • 18,049
          • 18,078
          • 18,229
          • 18,533
        • implication:
          • the real spread there is only about 1.06x, so that row is stable and should be published
    • final trust method now applied locally after those proof runs:
      • tests/run-windows-bench.sh
        • keep 5 measured samples per published row
        • publish medians for throughput and latency/CPU columns
        • when there are at least 5 samples, drop exactly one lowest and one highest throughput sample before the stability check
        • require the remaining stable core to contain at least 3 samples
        • require stable-core throughput spread:
          • stable_max / stable_min <= 1.35
        • if the raw extremes are noisy but the stable core is good:
          • publish the row
          • print a warning that records both raw and stable spreads
      • tests/generate-benchmarks-windows.sh
        • methodology text updated to describe the stable-core rule instead of the original raw-spread wording
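The trim-and-gate rule above can be sketched as a small shell helper; `stable_core_gate` is an illustrative name, and the real logic lives in tests/run-windows-bench.sh. Fed the five noisy np-ping-pong c->go samples recorded earlier, it trims the single 6,741 transient plus the 19,059 high sample and still passes the 1.35 gate:

```shell
# Sketch of the stable-core gate for published Windows rows: publish the
# median, drop exactly one lowest and one highest sample, then require
# stable_max / stable_min <= 1.35 over the remaining core.
stable_core_gate() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $0 }
    END {
      if (NR < 5) { print "need >=5 samples"; exit 1 }
      median = v[int((NR + 1) / 2)]
      lo = v[2]; hi = v[NR - 1]   # stable core: drop one low, one high
      ratio = hi / lo
      printf "median=%d stable_min=%d stable_max=%d stable_ratio=%.6f %s\n", median, lo, hi, ratio, (ratio <= 1.35 ? "PASS" : "FAIL")
    }'
}

# The transient-outlier samples from /tmp/netipc-bench-300472:
stable_core_gate 17798 19059 15586 6741 18303
```

The one bad transient (6,741) is excluded from the stability check but the published median (17,798) still reflects all five measurements, which matches the "publish the row, warn about raw spread" behavior described above.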
    • second stability-gate refinement after full-suite evidence on win11:
      • fact:
        • the first repeated full-suite rerun still found a real unstable case at 5s max-throughput duration:
          • snapshot-shm rust->go @ max
      • evidence:
        • repeated sample file:
          • /tmp/netipc-bench-300472/samples-snapshot-shm-rust-go-0.csv
        • measured throughputs:
          • 1,042,824
          • 977,680
          • 648,337
          • 367,491
          • 1,027,273
        • stable core after dropping one low and one high sample:
          • 648,337
          • 977,680
          • 1,027,273
        • stable-core ratio:
          • 1.584474
      • implication:
        • repeated measurement alone was not enough for all Windows max-throughput rows
        • some max rows needed a longer measurement window, not just more samples
    • max-throughput duration refinement now applied locally:
      • tests/run-windows-bench.sh
        • fixed-rate rows still use the CLI duration default:
          • 5s
        • max-throughput rows now use a separate default duration:
          • NIPC_BENCH_MAX_DURATION=10
        • the runner logs both durations at startup
      • targeted proof on win11 for the previously failing case:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=4 NIPC_BENCH_LAST_BLOCK=4 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/snapshot-shm-10s.csv 10
        • factual result:
          • previously failing snapshot-shm rust->go @ max became stable:
            • median throughput 1,053,376
            • stable-core ratio 1.018280
          • another noisy row also stabilized after trimming one low and one high sample:
            • snapshot-shm rust->c @ max
            • raw range:
              • 460,343 .. 1,167,598
            • stable-core range:
              • 1,109,218 .. 1,133,875
            • stable-core ratio:
              • 1.022229
      • implication:
        • the final trustworthy Windows method is now:
          • repeated measurement
          • median publication
          • stable-core gating
          • longer max-throughput samples
    • final proof run status after the trust-method changes:
      • full-suite rerun now in progress on win11 with the final method:
        • fixed-rate rows:
          • 5 samples x 5s
        • max-throughput rows:
          • 5 samples x 10s
        • stability rule:
          • publish only if the trimmed stable core stays within 1.35x
      • live confirmed progress:
        • np-ping-pong block completed cleanly under the final method
        • shm-ping-pong block started cleanly under the final method
    • first full repeated rerun with the 10s max default found one remaining unstable row late in the suite:
      • scenario:
        • np-pipeline-batch-d16 rust->rust @ max
      • preserved sample file:
        • /tmp/netipc-bench-331471/samples-np-pipeline-batch-d16-rust-rust-0.csv
      • measured throughputs:
        • 37,400,757
        • 31,635,302
        • 26,609,207
        • 39,324,202
        • 24,312,207
      • trimmed stable core:
        • 26,609,207 .. 37,400,757
      • stable-core ratio:
        • 1.405557
      • implication:
        • the runner correctly failed closed
        • the remaining instability was no longer global Windows SHM noise
        • it was narrowed to np-pipeline-batch @ max on win11
    • targeted proof for the remaining pipeline-batch max instability:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=9 NIPC_BENCH_LAST_BLOCK=9 NIPC_BENCH_MAX_DURATION=20 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv 5
      • factual result:
        • the full np-pipeline-batch-d16 matrix passed cleanly at 20s
        • previously failing row became stable:
          • rust->rust @ max = 34,184,748
          • stable-core ratio 1.064913
        • previously noisy go->c @ max also tightened materially:
          • 38,364,026
          • stable-core ratio 1.024521
      • implication:
        • the remaining issue was short-window measurement noise for np-pipeline-batch @ max
        • a longer max window fixes it without relaxing the trust gate
    • final Windows trust method now applied locally:
      • tests/run-windows-bench.sh
        • fixed-rate rows:
          • 5s
        • most max-throughput rows:
          • 10s
        • np-pipeline-batch-d16 @ max:
          • 20s
        • runner knobs now include:
          • NIPC_BENCH_MAX_DURATION
          • NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION
      • tests/generate-benchmarks-windows.sh
        • methodology section now documents the 20s pipeline-batch max window explicitly
    • final published Windows artifact assembly:
      • full repeated rerun output from:
        • /tmp/plugin-ipc-bench-20260326/benchmarks-windows.csv
        • used for all stable rows outside np-pipeline-batch-d16
        • notable publishable warning retained from that full rerun:
          • np-pipeline-d16 go->c @ max
          • raw range:
            • 111,201 .. 255,780
            • raw ratio 2.300159
          • trimmed stable core:
            • 234,582 .. 241,982
            • stable ratio 1.031545
          • implication:
            • the outlier-handling path is doing real work on win11
            • the published median row is still trustworthy because the stable core stayed tight
      • targeted validated 20s rerun output from:
        • /tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv
        • used to replace the incomplete/unstable np-pipeline-batch-d16 block
      • locally assembled final CSV:
        • 202 lines total
        • 201 data rows
        • scenario counts all correct
      • local validation:
        • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
        • result:
          • all configured Windows floors pass
          • report generation passes cleanly
    • follow-up approved by Costa after the first trustworthy publish:
      • run one fresh full Windows suite on win11 with the current default methodology
      • objective:
        • remove the remaining "assembled artifact" caveat if the one-shot full run now passes end to end
      • execution rule:
        • sync the current local benchmark-related sources to the disposable win11 proof tree first
        • only replace the checked-in Windows CSV/MD if that single fresh rerun passes with all floors green
    • current fresh-proof-tree rerun on win11 uses a new disposable tree based on origin/main plus the current local benchmark-related worktree files overlaid onto it:
      • fresh tree:
        • /tmp/plugin-ipc-bench-20260327-fullrun-150313
      • factual setup issue discovered before the real rerun:
        • tests/run-windows-bench.sh builds the C and Go benchmark binaries itself, but it only consumes an already-built Rust benchmark binary
        • on a fresh disposable tree, the first launch printed:
          • Rust benchmark binary not found: .../src/crates/netipc/target/release/bench_windows.exe (Rust tests will be skipped)
        • implication:
          • a fresh tree needs an explicit Rust build before the full Windows benchmark suite, or the run degrades to a 2-language matrix and is not publishable
      • corrective action applied on win11 before the real rerun:
        • cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows
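A fresh proof tree could fail fast instead of silently degrading to a 2-language matrix. A minimal pre-flight sketch follows; the helper name `rust_bench_ready` and the fail-fast policy are illustrative, not current runner behavior (today tests/run-windows-bench.sh only warns and skips Rust):

```shell
# Pre-flight check for the Rust benchmark binary on a fresh tree (sketch).
rust_bench_ready() {
  [ -x "$1" ] && return 0
  echo "Rust benchmark binary missing: $1" >&2
  return 1
}

RUST_BENCH="src/crates/netipc/target/release/bench_windows.exe"
if ! rust_bench_ready "$RUST_BENCH"; then
  # a fresh tree needs one explicit build before a publishable full run
  echo "run: cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows"
fi
```

Failing fast here would make the "2-language matrix is not publishable" rule self-enforcing rather than dependent on someone noticing the skip warning.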
      • real rerun then restarted from the same fresh tree with diagnostics enabled:
        • NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv 5
      • live evidence from the ongoing one-shot full rerun:
        • no new diagnostics summary file has appeared so far
        • block 1 (np-ping-pong) is already materially clean end to end:
          • np-ping-pong c->c @ max = 19,627, stable_ratio=1.018133
          • np-ping-pong rust->c @ max = 19,880, stable_ratio=1.045638
          • np-ping-pong go->go @ max = 19,195, with one low and one high outlier trimmed, stable_ratio=1.098122
          • all published 10000/s rows reached target cleanly:
            • examples:
              • rust->c = 9,999, stable_ratio=1.000000
              • rust->rust = 9,999, stable_ratio=1.000000
              • go->go = 10,000, stable_ratio=1.000000
          • the first published 1000/s rows are also landing at target:
            • go->c = 1,000, stable_ratio=1.000000
            • go->go = 1,000, stable_ratio=1.000000
        • the rerun has already crossed into the historically suspicious SHM block without reproducing the old collapse:
          • shm-ping-pong c->c @ max = 2,565,990, stable_ratio=1.042022
          • shm-ping-pong rust->c @ max = 2,443,021, stable_ratio=1.089130
          • shm-ping-pong c->rust @ max = 2,611,306, stable_ratio=1.071212
          • shm-ping-pong rust->rust @ max = 2,617,581, stable_ratio=1.027963
          • shm-ping-pong go->rust @ max = 2,327,904, stable_ratio=1.012447
        • factual interim conclusion:
          • the current one-shot full rerun is already materially stronger evidence than the older failing full runs
          • the earlier full-suite shm-ping-pong rust->c collapse is not reproducing on the same win11 host after the current lifecycle and Windows SHM fixes
      • live continuation coordinates for the long one-shot rerun:
        • win11 source tree:
          • /tmp/plugin-ipc-bench-20260327-fullrun-150313
        • live output files:
          • CSV:
            • /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv
          • log:
            • /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.log
        • last verified progress in this session:
          • 75 lines in the CSV (74 data rows)
          • blocks 1 and 2 completed cleanly
          • block 3 (snapshot-baseline) had started and was publishing stable @ max rows:
            • c->c = 19,872, stable_ratio=1.029521
            • rust->c = 19,291, stable_ratio=1.043116
          • no new diagnostics summary file had appeared yet
      • later checkpoint from the same still-running one-shot rerun:
        • 121 lines in the CSV (120 data rows)
        • blocks 1 through 4 had already cleared cleanly and the run had advanced deep into block 5 (np-batch-ping-pong)
        • live batch evidence:
          • np-batch-ping-pong c->go @ max = 7,699,399, stable_ratio=1.045676
          • np-batch-ping-pong rust->go @ max = 7,532,805, stable_ratio=1.018880
          • np-batch-ping-pong go->go @ max = 7,152,856, stable_ratio=1.030591
          • np-batch-ping-pong c->c @ 100000/s = 7,693,465, stable_ratio=1.011300
          • np-batch-ping-pong rust->c @ 100000/s = 7,497,010, stable_ratio=1.015083
        • no new diagnostics summary file had appeared yet at this checkpoint either
      • completed outcome of the clean one-shot Windows rerun:
        • the long win11 one-shot rerun finished cleanly
        • final CSV size:
          • 202 logical lines
          • 201 data rows
        • no new diagnostics summary file was produced during this rerun
        • tests/generate-benchmarks-windows.sh passed on win11 against the final CSV:
          • All performance floors met
        • the final generated report was copied back into the repo as:
          • benchmarks-windows.csv
          • benchmarks-windows.md
        • the same generator also passed locally after copying the artifacts back:
          • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
          • result:
            • All performance floors met
        • user-approved follow-up after the successful one-shot rerun:
          • commit the Windows artifact refresh and the TODO update as a separate git commit
          • do not include unrelated dirty files from the broader worktree
        • user-approved follow-up after the local commit:
          • push commit 768cca3 to origin/main
          • do not include any of the remaining unrelated dirty files
        • implication:
          • the remaining "assembled artifact" caveat is now removed
          • the checked-in Windows artifacts now come from a single clean one-shot full rerun on win11
        • stable final Windows max-throughput spreads from that clean one-shot artifact:
          • shm-ping-pong:
            • best:
              • rust->rust = 2,617,581
            • worst:
              • go->go = 2,113,834
            • spread:
              • 1.238x
            • conclusion:
              • no strange SHM collapse remains in the final clean artifact
          • lookup:
            • best:
              • rust = 176,259,707
            • worst:
              • go = 98,385,649
            • spread:
              • 1.792x
          • np-pipeline-d16:
            • best:
              • go->rust = 240,205
            • worst:
              • c->go = 216,940
            • spread:
              • 1.107x
          • np-pipeline-batch-d16:
            • best:
              • go->c = 39,065,948
            • worst:
              • c->go = 27,896,181
            • spread:
              • 1.400x
    • first one-shot full rerun attempt with the current defaults did not produce a clean replacement artifact:
      • partial output path:
        • /tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot.csv
      • factual failure observed during block 1:
        • np-ping-pong rust->rust @ 1000/s
        • Rust client exited non-zero
        • streamed client output reported:
          • client: 4207 errors
          • partial line:
            • np-ping-pong,rust,rust,159,75.500,177.400,177.400,5.6,0.0,5.6
      • implication:
        • the one-shot rerun cannot replace the current published Windows artifact
        • before attempting another full rerun, the new failure should be isolated to block 1 to determine whether it is reproducible or a one-off transport/runtime glitch
    • isolated recheck of block 1 completed cleanly on the same win11 proof tree:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/block1-recheck.csv 5
      • output path:
        • /tmp/plugin-ipc-bench-20260326/block1-recheck.csv
      • factual result:
        • all 36 block-1 measurements completed with exit code 0
        • the previously failing row completed cleanly:
          • np-ping-pong rust->rust @ 1000/s = 1000
          • p50=66.200us
          • p95=248.200us
          • p99=369.500us
          • stable_ratio=1.000000
      • implication:
        • the first one-shot block-1 failure is not immediately reproducible
        • this currently looks like a transient host/runtime glitch, not established deterministic instability in the rust->rust @ 1000/s pair
        • the next valid check is another clean one-shot full Windows rerun with the same default methodology
    • second one-shot full rerun with the current defaults also failed to produce a clean replacement artifact:
      • partial output path:
        • /tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot-2.csv
      • factual failure observed during block 2:
        • shm-ping-pong rust->c @ max
        • repeated-sample file:
          • /tmp/netipc-bench-410987/samples-shm-ping-pong-rust-c-0.csv
        • repeated throughputs:
          • 618,076
          • 618,160
          • 1,951,036
          • 2,303,714
          • 2,476,081
        • stable-core gate result:
          • stable_min=618,160
          • stable_max=2,303,714
          • stable_ratio=3.726728
          • configured max: 1.35
      • implication:
        • the current default methodology still does not guarantee a clean one-shot full Windows run on win11
        • the blocker has moved from a random-looking block-1 client failure to a concrete SHM max-throughput instability event
    • focused reproduction of the same SHM pair in isolation did not reproduce the collapse:
      • direct pair under the same synced win11 tree:
        • C server: bench_windows_c.exe shm-ping-pong-server
        • Rust client: bench_windows.exe shm-ping-pong-client
      • isolated rust -> c @ max repeated 10 times with 10s samples:
        • throughput range:
          • 2,446,407 .. 2,578,450
        • all 10 runs stayed in the fast band
      • isolated rust -> c @ max repeated 10 times with 20s samples:
        • throughput range:
          • 2,363,335 .. 2,589,588
        • all 10 runs stayed in the fast band
      • implication:
        • the SHM collapse is not a simple deterministic rust client -> c server bug
        • longer isolated samples are stable, but that alone does not explain the one-shot full-run failure
    • sequence test also failed to reproduce the SHM collapse:
      • setup:
        • one c -> c @ max SHM prime run
        • followed immediately by 5 direct rust -> c @ max SHM runs
        • repeated for 5 cycles on the same RUN_DIR
      • factual result:
        • all 25 post-prime rust -> c runs stayed in the fast band:
          • 2,357,337 .. 2,664,284
      • implication:
        • the failure is not explained by a simple "previous c -> c SHM row poisons the next rust -> c row" theory
        • current best description:
          • rare transient host/runtime glitch during full-matrix execution on win11
          • not immediately reproducible in dedicated pair or simple sequence tests
    • pending user decision before more Windows runner code changes:
      • context:
        • Costa asked for trustworthy Windows benchmarks
        • current state is better than before, but a clean one-shot full run is still not guaranteed
      • user constraint raised during decision review:
        • automatic retries must not hide real failures or real bugs
        • if retries are ever used, first-attempt failures must remain visible and reportable
      • user decision:
        • keep the main Windows benchmark publication path fail-closed
        • do not add silent self-healing retries to publish mode
        • add a separate diagnostic mode that can rerun failed rows in isolation
        • diagnostic mode must preserve and report the original first-attempt failure evidence side by side with any diagnostic rerun evidence
      • option A:
        • add automatic per-row retry on Windows when a row fails because of client error or stability-gate failure
        • keep the current 5-sample median + 1.35 stable-core gate inside each attempt
        • implications:
          • one transient bad row no longer destroys a 2-hour full run
          • a row is still published only if a full fresh attempt passes the same gate
        • risks:
          • published rows may come from retry attempt 2 or 3, not from the first pass
          • the report and logs must say that retries happened, or the methodology becomes misleading
      • option B:
        • keep fail-closed behavior, but increase Windows SHM max collection further:
          • for example 20s per sample and/or 7-9 repeats
        • implications:
          • simpler story than retries
          • every accepted row is still strictly one attempt
        • risks:
          • much longer full-suite runtime
          • evidence so far does not prove that longer duration alone fixes the rare full-run glitch
      • option C:
        • keep the current runner and accept targeted reruns / assembled Windows artifacts when one-shot full runs glitch
        • implications:
          • fastest operationally
          • still produces trustworthy rows when each replacement row is validated carefully
        • risks:
          • no clean single-command reproduction
          • more manual work and more caveats around publication
      • accepted direction:
        • strict publish mode plus separate diagnostic reruns
        • rationale:
          • failures stay visible
          • diagnostic reruns can still accelerate root-cause work without turning the publication path into silent self-healing
    • implemented Windows diagnostic mode for failed rows:
      • file:
        • tests/run-windows-bench.sh
      • new behavior:
        • publish mode remains fail-closed by default
        • opt-in diagnostics via:
          • NIPC_BENCH_DIAGNOSE_FAILURES=1
        • when a row fails in publish mode:
          • the original failure remains authoritative
          • the original RUN_DIR and first-attempt sample file remain preserved
          • the same row is rerun in an isolated diagnostic subdirectory under the preserved RUN_DIR
          • diagnostic rerun output is recorded in:
            • ${RUN_DIR}/diagnostics-summary.txt
          • diagnostic reruns never write rows into the publish CSV
      • implementation details:
        • row-level measurement state is now tracked explicitly:
          • failure reason
          • sample-file path
          • aggregate throughput/latency/CPU values
          • stability metrics
        • diagnostic reruns restore the original first-failure state after logging the isolated rerun evidence
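The control flow can be sketched end to end with stand-ins; `run_row`, `measure`, and the file names here are illustrative, and the real implementation is the row-state bookkeeping in tests/run-windows-bench.sh:

```shell
# Fail-closed publish + opt-in diagnostic rerun (stand-in sketch).
RUN_DIR="$(mktemp -d)"
PUBLISH_CSV="$RUN_DIR/publish.csv"
echo "scenario,result" > "$PUBLISH_CSV"
DIAG_INDEX=0
FAILED=0

run_row() {                                  # stand-in measurement: only "ok" passes
  [ "$1" = ok ]
}

measure() {
  row="$1"
  if run_row "$row"; then
    echo "$row,pass" >> "$PUBLISH_CSV"       # only first-attempt passes are published
  else
    FAILED=1                                 # the original failure stays authoritative
    if [ "${NIPC_BENCH_DIAGNOSE_FAILURES:-0}" = 1 ]; then
      DIAG_INDEX=$((DIAG_INDEX + 1))
      diag_dir="$RUN_DIR/diagnostics/$(printf '%03d' "$DIAG_INDEX")-$row"
      mkdir -p "$diag_dir"
      run_row "$row" > "$diag_dir/rerun.log" 2>&1 || true   # rerun never touches the CSV
      echo "$row: first attempt failed; rerun evidence in $diag_dir" \
        >> "$RUN_DIR/diagnostics-summary.txt"
    fi
  fi
}

NIPC_BENCH_DIAGNOSE_FAILURES=1
measure ok
measure bad
echo "published_rows=$(($(wc -l < "$PUBLISH_CSV") - 1)) publish_failed=$FAILED"
```

The key property the sketch preserves: the diagnostic rerun writes only into its own subdirectory and the summary file, so the publish CSV and the overall non-zero exit status are identical whether diagnostics are enabled or not.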
    • forced validation of the new diagnostic mode on win11:
      • purpose:
        • prove that publish mode still fails closed
        • prove that diagnostic reruns preserve the original evidence and create side-by-side isolated rerun evidence
      • command:
        • NIPC_BENCH_FIRST_BLOCK=7 NIPC_BENCH_LAST_BLOCK=7 NIPC_BENCH_DIAGNOSE_FAILURES=1 NIPC_BENCH_REPETITIONS=3 NIPC_BENCH_MAX_DURATION=1 NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION=1 NIPC_BENCH_MAX_THROUGHPUT_RATIO=0.9 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv 1
      • factual result:
        • runner exited non-zero as expected
        • publish CSV remained header-only:
          • /tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv
        • preserved original run dir:
          • /tmp/netipc-bench-425494
        • diagnostic summary created:
          • /tmp/netipc-bench-425494/diagnostics-summary.txt
        • distinct diagnostic rerun dirs created per failed row:
          • /tmp/netipc-bench-425494/diagnostics/001-lookup-c-c-0
          • /tmp/netipc-bench-425494/diagnostics/002-lookup-rust-rust-0
          • /tmp/netipc-bench-425494/diagnostics/003-lookup-go-go-0
      • implication:
        • the new mode preserves truth in publish mode
        • it also gives immediate isolated rerun evidence for investigation without silently healing the benchmark artifact
    • next-step approval from Costa:
      • commit and push the strict publish + diagnostic-mode runner changes
      • then proceed immediately to the real Windows SHM investigation using the new diagnostic mode on the actual failing slice
    • commit / push completed for the diagnostic-mode runner change:
      • commit:
        • 870fc93
      • subject:
        • bench: add Windows diagnostic reruns
      • pushed to:
        • origin/main
    • real Windows SHM investigation with the new diagnostic mode:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5
      • factual result:
        • block 2 completed successfully with exit code 0
        • no diagnostic rerun triggered for any SHM row
        • the previously suspicious row completed cleanly:
          • shm-ping-pong rust->c @ max = 2,465,857
          • stable_ratio=1.021516
        • the full SHM max matrix stayed stable:
          • c->c = 2,461,053
          • rust->c = 2,465,857
          • go->c = 2,162,135
          • c->rust = 2,597,936
          • rust->rust = 2,530,435
          • go->rust = 2,065,765
          • c->go = 2,570,619
          • rust->go = 2,254,772
          • go->go = 2,079,323
        • all 100000/s, 10000/s, and 1000/s SHM rows also completed stably in the same block run
      • implication:
        • the Windows SHM instability still does not reproduce when block 2 runs in isolation under the real runner
        • current strongest working theory:
          • the failure depends on broader full-suite context on win11
          • not on the standalone SHM block itself
    • targeted confirmation of the Windows SHM anomaly:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-confirm.csv 5
      • confirmed max-throughput rerun on the same win11 tree:
        • c->c = 2,396,963
        • rust->c = 1,708,649
        • go->c = 886,451
        • c->rust = 2,566,391
        • rust->rust = 2,563,582
        • go->rust = 2,053,507
        • c->go = 2,539,899
        • rust->go = 2,215,733
        • go->go = 2,047,115
      • factual conclusion:
        • the original rust->c full-suite collapse is not stable
        • max-throughput Windows SHM rows can swing materially between reruns on win11
        • target-rate Windows SHM rows remain stable near their requested rates
        • implication:
          • the strange Windows SHM max delta is currently a measurement-stability / host-noise issue, not proven deterministic language regression
  • Refreshed max-throughput spread summary:
    • Linux:
      • lookup:
        • fastest c->c = 167,974,040
        • slowest go->go = 127,908,975
        • spread: 1.31x
        • improvement versus checked-in previous artifact: 1.77x -> 1.31x
      • shm-ping-pong:
        • fastest rust->rust = 3,486,454
        • slowest go->go = 1,725,340
        • spread: 2.02x
        • note:
          • this widened versus the previous checked-in artifact because go->go max throughput dropped materially
      • shm-batch-ping-pong:
        • fastest c->c = 61,778,266
        • slowest go->go = 31,810,209
        • spread: 1.94x
      • uds-pipeline-d16:
        • fastest rust->c = 712,544
        • slowest rust->go = 550,630
        • spread: 1.29x
      • uds-pipeline-batch-d16:
        • fastest c->c = 99,746,787
        • slowest go->go = 50,690,629
        • spread: 1.97x
    • Windows:
      • lookup:
        • fastest rust->rust = 178,835,588
        • slowest go->go = 97,109,788
        • spread: 1.84x
      • shm-ping-pong full suite:
        • fastest c->rust = 2,650,754
        • slowest rust->c = 850,994
        • spread: 3.11x
        • but targeted confirmation disproved rust->c as a stable deterministic outlier
      • shm-batch-ping-pong:
        • fastest c->c = 52,520,469
        • slowest go->go = 34,390,650
        • spread: 1.53x
      • np-pipeline-batch-d16:
        • fastest go->rust = 38,249,582
        • slowest go->go = 24,333,588
        • spread: 1.57x
  • Strange delta findings that remain real after the refresh:
    • Linux uds-pipeline-d16:
      • Go server remains the clear slow case across clients:
        • c->go = 559,691
        • rust->go = 550,630
        • go->go = 553,858
        • versus C/Rust servers near 686k-713k
      • implication:
        • this is a stable Go-server transport/runtime cost, not client-specific noise
    • Linux uds-pipeline-batch-d16:
      • server choice dominates:
        • C server: 96.2M-99.7M
        • Rust server: 84.1M-86.3M
        • Go server: 50.7M-51.3M
      • implication:
        • the known batch-path server asymmetry is still real
    • Linux shm-batch-ping-pong:
      • C server stays strongest
      • Rust server is mid-band
      • Go server is slowest
      • implication:
        • still consistent with real server-side implementation overhead, not runner corruption
    • Linux / Windows lookup:
      • Linux:
        • c = 167.97M
        • rust = 146.15M
        • go = 127.91M
      • Windows:
        • rust = 178.84M
        • c = 125.60M
        • go = 97.11M
      • implication:
        • lookup is now measuring runtime/data-structure efficiency more than IPC transport behavior
        • the previous fake linear-scan distortion is gone, but cross-language runtime overhead remains visible
  • Strange delta finding that is currently suspicious but not yet proven real:
    • Windows shm-ping-pong @ max:
      • full-suite run made rust->c miss the floor
      • immediate confirmation run moved the collapse to go->c instead
      • conclusion:
        • this is currently a max-throughput measurement-stability issue on win11
        • do not interpret a single bad max row there as a stable language-specific regression without targeted rerun confirmation
    • second isolated Windows SHM rerun on the same win11 tree reinforced the same conclusion:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-rerun.csv 5
      • @max rows:
        • c->c = 2,516,450
        • rust->c = 2,430,413
        • go->c = 2,179,591
        • c->rust = 2,497,180
        • rust->rust = 2,473,159
        • go->rust = 2,114,944
        • c->go = 2,571,394
        • rust->go = 2,282,433
        • go->go = 2,100,658
      • implication:
        • the full-suite rust->c collapse to 850,994 is definitely not stable
      • additional warning sign from the same isolated rerun:
        • some target_rps=10000 rows also became unstable:
          • c->rust = 5,073
          • rust->rust = 4,098
          • while other rows in the same block stayed near 10,000
        • implication:
          • the Windows SHM benchmark instability is not limited to one language pair or only to the first full-suite run
  • Post-commit diagnostic runner work (870fc93 bench: add Windows diagnostic reruns):
    • committed and pushed:
      • commit: 870fc93
      • pushed to origin/main
    • immediate next investigation on win11:
      • goal:
        • identify the smallest Windows benchmark context that reproduces the earlier full-suite SHM collapse
      • standalone SHM block with diagnostics enabled:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5
        • result:
          • exited 0
          • no diagnostics triggered
        • key shm-ping-pong @ max rows:
          • c->c = 2,461,053 with stable_ratio=1.018190
          • rust->c = 2,465,857 with stable_ratio=1.021516
          • go->c = 2,162,135 with stable_ratio=1.017540
          • c->rust = 2,597,936 with stable_ratio=1.016334
          • rust->rust = 2,530,435 with stable_ratio=1.020250
          • go->rust = 2,065,765 with stable_ratio=1.029206
          • c->go = 2,571,619 with stable_ratio=1.013998
          • rust->go = 2,254,772 with stable_ratio=1.022145
          • go->go = 2,079,323 with stable_ratio=1.010925
        • factual conclusion:
          • block 2 alone is stable under the real repeated-median runner
          • the earlier full-suite rust->c collapse is not a standalone SHM bug
      • combined NP -> SHM prefix with diagnostics enabled:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-diagnose.csv 5
        • result:
          • exited 0
          • no diagnostics triggered
          • total measurements: 72
        • key np-ping-pong @ max rows:
          • c->c = 19,411
          • rust->c = 19,735
          • go->c = 18,744
          • c->rust = 20,188
          • rust->rust = 20,301
          • go->rust = 19,277
          • c->go = 19,383
          • rust->go = 18,558
          • go->go = 19,241
        • key shm-ping-pong @ max rows:
          • c->c = 2,522,584
          • rust->c = 2,522,004
          • go->c = 2,071,095
          • c->rust = 2,580,971
          • rust->rust = 2,511,775
          • go->rust = 2,308,182
          • c->go = 2,657,019
          • rust->go = 2,273,563
          • go->go = 2,109,132
        • factual conclusion:
          • the failure does not reproduce with blocks 1-2
          • the earlier bad rust->c full-suite row requires broader full-suite context than just the NP -> SHM transition
    • updated working theory:
      • speculation:
        • a later block, or cumulative state from multiple later blocks, is needed to trigger the rare full-suite Windows instability
      • not supported by evidence anymore:
        • standalone SHM bug
        • simple NP -> SHM transition bug
    • next diagnostic step:
      • extend the prefix to block 3 and repeat:
        • NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-batch-diagnose.csv 5
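The stable_ratio values quoted for the SHM rows suggest a simple spread gate over repeated medians. The actual gate lives in tests/run-windows-bench.sh and is not reproduced here; this is a hypothetical Rust sketch of one plausible formulation, where a row passes when the worst repeat median stays within a threshold factor of the best. `stable_ratio`, `passes_gate`, and the threshold value are illustrative names, not the runner's API.

```rust
// Hypothetical stability gate: accept a benchmark row only when the spread
// between the best and worst repeat medians is below a threshold factor.
fn stable_ratio(medians: &[f64]) -> Option<f64> {
    if medians.is_empty() {
        return None;
    }
    let min = medians.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = medians.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if min <= 0.0 {
        return None; // a zero or negative median means the row is unusable
    }
    Some(max / min)
}

fn passes_gate(medians: &[f64], threshold: f64) -> bool {
    stable_ratio(medians).map_or(false, |r| r <= threshold)
}
```

Under this formulation, a healthy row like the 2.46M-2.47M msg/s repeats passes easily, while a collapse such as the earlier 5,073 row next to a 2.5M row fails by orders of magnitude.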
  • Decision needed before code, based on the current deep-dive findings after extending the prefix to block 3:
    • 1. A Fix the Windows benchmark runner only.
      • scope:
        • replace hard-kill shutdown with graceful server stop / wait and hard-kill fallback only on timeout
        • make per-repeat server/client output files unique
        • fix diagnostic bookkeeping so preserved run dirs and summaries always match the actual run
      • benefits:
        • directly targets the strongest evidence
        • smallest code-change surface
        • most likely enough to make the benchmark harness trustworthy
      • implications:
        • benchmark methodology changes only, not transport semantics
        • if Windows SHM object-collision handling is also weak, the benchmark harness may become stable while the product bug remains latent
      • risks:
        • could leave a real Windows transport bug hidden until another scenario hits it outside the benchmark harness
    • 1. B Fix the Windows benchmark runner and harden Windows SHM object creation in C, Rust, and Go.
      • scope:
        • everything in 1. A
        • plus explicit ERROR_ALREADY_EXISTS handling for Windows SHM mappings/events and clearer collision errors
      • benefits:
        • addresses both the likely benchmark root cause and a real transport safety gap
        • makes leaked object collisions explicit instead of nondeterministic
      • implications:
        • larger change across multiple language implementations
        • requires more testing
      • risks:
        • broader patch, more review surface, more chance of side effects if the three implementations are not kept perfectly aligned
    • 1. C Continue diagnosis without code changes.
      • scope:
        • more targeted reruns and more artifact collection
      • benefits:
        • lowest code risk
      • implications:
        • more benchmark time burned with a runner we already know mishandles server lifecycle on Windows
      • risks:
        • low leverage
        • likely delays the obvious fix
    • recommendation:
      • 1. B
      • reasoning:
        • the hard-kill runner behavior is the strongest causal explanation for the benchmark instability
        • but the Windows SHM create path also has a real hardening gap
        • if the goal is "Windows benchmarks trustworthy", fixing only the runner is probably enough for the harness, but not enough for the underlying transport robustness
    • user decision:
      • 1. B
      • accepted scope:
        • fix the Windows benchmark runner lifecycle and diagnostics bookkeeping
        • harden Windows SHM object creation in C, Rust, and Go to detect existing named objects explicitly
    • implementation and verification after 1. B:
      • local code changes completed:
        • runner:
          • tests/run-windows-bench.sh
        • Windows SHM hardening:
          • src/libnetdata/netipc/include/netipc/netipc_win_shm.h
          • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c
          • src/crates/netipc/src/transport/win_shm.rs
          • src/go/pkg/netipc/transport/windows/shm.go
        • regression coverage:
          • tests/fixtures/c/test_win_shm.c
          • src/crates/netipc/src/transport/win_shm.rs
          • src/go/pkg/netipc/transport/windows/shm_test.go
      • factual runner behavior after the patch:
        • the Windows runner now:
          • uses a unique per-repeat runtime/artifact directory instead of reusing the same RUN_DIR for every repeat
          • waits for benchmark servers to stop themselves before killing them
          • preserves the root run dir on any measurement-command failure, not only on stability-gate failures
          • records the first-attempt artifact directory in the diagnostics summary
      • factual transport behavior after the patch:
        • C, Rust, and Go Windows SHM server-create paths now reject existing named mappings/events explicitly instead of treating them as successful creates
        • new error surface:
          • C:
            • NIPC_WIN_SHM_ERR_ADDR_IN_USE
          • Rust:
            • WinShmError::AddrInUse
          • Go:
            • ErrWinShmAddrInUse
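The real hardening sits in the Windows SHM create paths (for example `WinShmError::AddrInUse` in src/crates/netipc/src/transport/win_shm.rs) and relies on Windows named-object semantics (CreateFileMappingW reporting ERROR_ALREADY_EXISTS) that cannot be exercised off Windows. As a portable illustration of the same create-must-not-already-exist policy, this Rust sketch models a named object as a file and uses `create_new` so that an existing name becomes an explicit collision error instead of a silent open. `CreateError` and `server_create` are hypothetical names, not the transport's API.

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

// Illustrative error mirroring the shape of the real per-language errors
// (NIPC_WIN_SHM_ERR_ADDR_IN_USE / WinShmError::AddrInUse / ErrWinShmAddrInUse).
#[derive(Debug, PartialEq)]
enum CreateError {
    AddrInUse,
    Other,
}

// Server-side create: must fail loudly if the "named object" already exists,
// instead of treating an open-of-existing as a successful create.
fn server_create(name: &Path) -> Result<std::fs::File, CreateError> {
    OpenOptions::new()
        .write(true)
        .create_new(true) // reject an existing name rather than reusing it
        .open(name)
        .map_err(|e| match e.kind() {
            ErrorKind::AlreadyExists => CreateError::AddrInUse,
            _ => CreateError::Other,
        })
}
```

The design point is the same as in the patched transports: a leaked or colliding named object produces a deterministic, typed error at create time instead of nondeterministic behavior later.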
      • first verification on win11:
        • focused Windows SHM duplicate-create coverage now passes in all three implementations:
          • Go:
            • cd src/go && GOOS=windows GOARCH=amd64 go test -run TestWinShmServerCreateRejectsExistingObjects -count=1 ./pkg/netipc/transport/windows
          • Rust:
            • cargo test --manifest-path src/crates/netipc/Cargo.toml test_server_create_rejects_existing_objects_windows -- --test-threads=1
          • C:
            • cmake --build build -j4 --target test_win_shm
            • ctest --test-dir build --output-on-failure -R '^test_win_shm$'
        • result:
          • all passed
      • factual new issue exposed by the stricter runner:
        • extending the real benchmark rerun to NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 did not reproduce the old random SHM collapse
        • instead, it first exposed a deterministic Rust benchmark-driver shutdown bug:
          • every row using a Rust Windows benchmark server failed with:
            • Server rust (...) did not exit cleanly within 10s; forcing kill
          • preserved server output contained only:
            • READY
          • implication:
            • the stricter runner removed the masking effect of hard-killing benchmark servers and surfaced a real Rust benchmark-driver lifecycle bug
        • root cause:
          • bench/drivers/rust/src/bench_windows.rs still used the old Windows stop pattern:
            • only running_flag.store(false, ...)
            • no wake connection
          • this is the same Windows accept-loop issue already fixed earlier in the Rust Windows tests:
            • ConnectNamedPipe() stays blocked until a connection wakes it
        • fix:
          • bench/drivers/rust/src/bench_windows.rs now mirrors the tested shutdown pattern:
            • after duration+3, set running_flag = false
            • then issue a dummy NpSession::connect(...) so the blocked accept loop can observe shutdown and exit cleanly
        • direct proof on win11:
          • command:
            • timeout 20 src/crates/netipc/target/release/bench_windows.exe np-ping-pong-server /tmp/plugin-ipc-bench-20260327 rust-stop-check 1
          • result:
            • READY
            • SERVER_CPU_SEC=0.000000
          • implication:
            • the Rust Windows benchmark server now exits on its own instead of hanging until killed
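The wake-on-stop fix depends on ConnectNamedPipe blocking until a client connects, which cannot be exercised off Windows. The same pattern can be sketched portably with a TCP accept loop: clearing the running flag alone never wakes a blocked accept, so the stopper must also issue a dummy connection, exactly as bench_windows.rs now issues a dummy NpSession::connect. `run_server` and `stop_server` are illustrative names, not the driver's symbols.

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Cross-platform analogue of the bench_windows.rs shutdown fix: the accept
// loop stays blocked until a connection arrives, so shutdown must both clear
// the running flag and wake the loop with a dummy connection.
fn run_server(listener: TcpListener, running: Arc<AtomicBool>) {
    for conn in listener.incoming() {
        let _ = conn; // a real server would service the session here
        if !running.load(Ordering::SeqCst) {
            return; // observed shutdown after being woken
        }
    }
}

fn stop_server(addr: &str, running: &AtomicBool) {
    running.store(false, Ordering::SeqCst); // alone, this never wakes accept()
    let _ = TcpStream::connect(addr); // dummy connection unblocks the loop
}
```

Without the dummy connect, the analogue hangs exactly like the Rust Windows benchmark server did: READY is printed, the flag flips, and the process still sits in the blocked accept until it is killed.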
      • focused real benchmark proof after all fixes:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/block-2-after-fix.csv 5
        • result:
          • exited 0
          • no diagnostic reruns were needed
          • all 36 shm-ping-pong rows published
        • key evidence:
          • the previously suspicious Windows row is now stable:
            • shm-ping-pong rust->c @ max = 2,458,786
            • stable ratio:
              • 1.009920
          • all SHM @ max rows completed inside the stability gate:
            • c->c = 2,505,981 with stable_ratio=1.038817
            • rust->c = 2,458,786 with stable_ratio=1.009920
            • c->rust = 2,588,642 with stable_ratio=1.028021
            • rust->rust = 2,649,571 with stable_ratio=1.018367
            • rust->go = 2,242,750 with stable_ratio=1.045399
          • the previously suspicious fixed-rate rows are now also stable:
            • rust->c @ 100000/s = 99,997 with stable_ratio=1.000010
            • rust->c @ 10000/s = 9,999 with stable_ratio=1.000000
            • rust->rust @ 10000/s = 9,999 with stable_ratio=1.000000
        • factual conclusion from the focused SHM rerun:
          • the Windows SHM benchmark instability is materially reduced after:
            • runner lifecycle fixes
            • per-repeat runtime isolation
            • explicit Windows SHM collision detection
            • Rust benchmark-server wake-on-stop fix
          • the earlier rust->c SHM collapse no longer reproduces in the real benchmark block that used to be suspicious
      • partial full-suite proof after the focused fixes:
        • command started on win11:
          • NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/full-after-fix.csv 5
        • factual behavior before manual interruption:
          • no diagnostics were emitted
          • no forced-kill Rust benchmark-server failures reappeared
          • the run cleared the exact NP area where the stricter runner had previously exposed the Rust benchmark-server shutdown bug:
            • np-ping-pong @ max rows for Rust servers completed cleanly
            • np-ping-pong @ 100000/s rows for Rust servers completed cleanly
            • np-ping-pong @ 10000/s rows were still running cleanly when the run was stopped intentionally for time
        • reason for interruption:
          • no new technical blocker remained
          • the rest of the work was wall-clock runtime only
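The runner lifecycle change (wait for voluntary server exit, hard-kill only on timeout) is implemented in tests/run-windows-bench.sh shell; the control flow can be sketched in Rust as a poll-until-deadline loop over `try_wait` with a kill fallback. `wait_or_kill` is a hypothetical helper, not part of the runner.

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::{Duration, Instant};

// Sketch of the runner's new shutdown policy: give the benchmark server a
// chance to stop itself, and hard-kill only when the deadline expires.
// Returns true if the child exited on its own.
fn wait_or_kill(child: &mut Child, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait().expect("try_wait failed") {
            Some(_) => return true, // graceful exit, no kill needed
            None if Instant::now() >= deadline => {
                let _ = child.kill(); // hard-kill fallback only on timeout
                let _ = child.wait(); // reap so no zombie is left behind
                return false;
            }
            None => thread::sleep(Duration::from_millis(20)),
        }
    }
}
```

The old behavior corresponded to calling only the kill branch unconditionally, which is what masked the Rust benchmark-server shutdown bug for so long.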