`.agents/sow/current/SOW-0021-20260613-netipc-at-scale.md` (110 lines changed: 110 additions, 0 deletions)
@@ -1187,6 +1187,116 @@ Current interpretation:
- Go remains materially slower than C/Rust in these codec+dispatch microbenchmarks. The obvious avoidable algorithmic and allocation issues found so far are fixed, but the remaining gap still needs full-suite benchmark confirmation and, if required, profiling before declaring it inherent Go/runtime overhead.
- This checkpoint does not close the performance gate. Full POSIX and Windows benchmark regeneration still needs to be run cleanly after commit/push.

## 8192-Item Profiling Checkpoint - 2026-06-15

Context:

- Netdata can call lookup APIs with arrays of `8192` entries on large HPC systems.
- The `8192` case must be treated as normal scale, not as an exceptional or degraded path.
- The benchmarked loop is relevant to Level 2 consumers: it encodes the request, dispatches through the typed builder, and decodes/validates the response.

Profiling evidence before this checkpoint's additional fixes:

- `perf stat` on `build-bench-posix/bin/bench_posix_go lookup-method-bench 5 cgroups-lookup-mixed-8192 0` reported `cgroups-lookup-mixed-8192,go,go,1185,...,99.0` and about `99%` CPU. This is CPU-bound, not wait-bound.
- Isolated Go pprof benchmark before the allocation fix:
  - `BenchmarkProfileCgroupsLookupMixed8192`: about `967462 ns/op`, `73920 B/op`, `4 allocs/op`.
  - `BenchmarkProfileAppsLookupMixed8192`: about `782743 ns/op`, `73920 B/op`, `4 allocs/op`.
- Go memory profiles showed `makePayloadExceededSuffixBytes` accounting for about `97%-99%` of allocated bytes in both apps and cgroups lookup dispatch.
- Go CPU profiles showed the remaining cgroups cost concentrated in:
  - `CgroupsLookupBuilder.Add`, about `40%` cumulative;
  - string/NUL scanning through `bytes.IndexByte`/`indexbytebody`;
  - response decode validation, about `25%` cumulative;
  - defensive checked arithmetic and directory/payload slice validation.
- Go CPU profiles showed the remaining apps cost concentrated in:
  - `AppsLookupBuilder.Add`, about `47%` cumulative;
  - `validateAppsLookupItem`, about `29%` cumulative;
  - `validateAppsLookupSemantics`, about `13%` cumulative before the safe status-switch cleanup.
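
For readers unfamiliar with how `ns/op`, `B/op`, and `allocs/op` figures of this shape are produced, the following is a minimal sketch using Go's standard `testing.Benchmark`. It is not the project's `BenchmarkProfile*` code: `makePaths` and `countTerminated` are hypothetical stand-ins, and the path strings are invented sample data rather than the netipc wire format. The workload only imitates the `bytes.IndexByte` NUL scan named in the CPU profile.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// makePaths builds n NUL-terminated sample paths.
// Hypothetical data: this is not the real netipc request layout.
func makePaths(n int) [][]byte {
	paths := make([][]byte, n)
	for i := range paths {
		paths[i] = []byte(fmt.Sprintf("/sys/fs/cgroup/system.slice/unit-%d.service\x00", i))
	}
	return paths
}

// countTerminated scans each path for its first NUL byte and counts the
// paths whose only NUL is the final byte. It stands in for the string/NUL
// validation cost seen in the profile.
func countTerminated(paths [][]byte) int {
	ok := 0
	for _, p := range paths {
		if bytes.IndexByte(p, 0) == len(p)-1 {
			ok++
		}
	}
	return ok
}

func main() {
	testing.Init() // registers testing flags when running outside "go test"
	paths := makePaths(8192)

	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if countTerminated(paths) != len(paths) {
				b.Fatal("unexpected unterminated path")
			}
		}
	})

	// Same three metrics as quoted in the evidence above.
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		res.NsPerOp(), res.AllocedBytesPerOp(), res.AllocsPerOp())
}
```

Because the scan itself does not allocate, a harness like this separates pure scanning cost from allocation cost, which is the same distinction the memory and CPU profiles draw above.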

Fixes made in this checkpoint:

- Go:
  - APPS_LOOKUP dispatch-owned overflow suffix fitting now uses a fixed-size formula instead of allocating an item-count suffix table.
  - CGROUPS_LOOKUP suffix bytes now use `uint32`, matching the protocol/C representation and halving the table size on 64-bit Go.
  - APPS_LOOKUP semantic validation now uses a status switch while preserving the existing invalid cgroup-status rejection behavior.
- Rust:
  - APPS_LOOKUP dispatch-owned overflow suffix fitting now uses a fixed-size formula instead of allocating a suffix vector.
  - Lookup suffix bytes now use `u32` instead of `usize`, matching the protocol/C representation.
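
The "halving" claim can be checked with simple arithmetic. The sketch below assumes one suffix-table entry per item and an 8-byte element before the fix (the width of `int`/`uint` on 64-bit Go); both are assumptions, since the previous element type is not stated here. `tableBytes` is a hypothetical helper, not project code.

```go
package main

import (
	"fmt"
	"unsafe"
)

const items = 8192 // item count used throughout this checkpoint

// tableBytes returns the backing-array size in bytes of an n-entry table
// whose elements are elemSize bytes wide.
func tableBytes(n int, elemSize uintptr) int {
	return n * int(elemSize)
}

func main() {
	var narrow uint32 // protocol-sized entry
	var wide uint64   // assumed pre-fix width (same as int on 64-bit Go)

	fmt.Println(tableBytes(items, unsafe.Sizeof(narrow))) // 32768
	fmt.Println(tableBytes(items, unsafe.Sizeof(wide)))   // 65536

	// Difference between the two cgroups B/op figures quoted in this checkpoint.
	fmt.Println(73920 - 41152) // 32768
}
```

Under those assumptions the saving is `8192 * 4 = 32768` bytes per table, which equals the difference between the `73920 B/op` and `41152 B/op` cgroups figures reported in this checkpoint. That agreement is suggestive, not proof that the table is the only change in allocated bytes.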

Validation evidence:

- `cd src/go && go test -count=1 ./pkg/netipc/protocol` passed.
- `cargo test --manifest-path src/crates/netipc/Cargo.toml protocol::lookup -- --test-threads=1` passed.
- `cd src/go && go test -count=1 ./pkg/netipc/...` passed.
- `cargo test --manifest-path src/crates/netipc/Cargo.toml -- --test-threads=1` passed with `374 passed`.
- `BenchmarkProfileCgroupsLookupMixed8192`: about `883683 ns/op`, `41152 B/op`, `4 allocs/op`.
- `BenchmarkProfileAppsLookupMixed8192`: about `689551 ns/op`, `208 B/op`, `3 allocs/op`.
- Rebuilt release benchmark sample after fixes:
  - `cgroups-lookup-mixed-8192`: C about `1917`, Rust about `1426`, Go about `1081` requests/s.
  - `apps-lookup-mixed-8192`: C about `2174`, Rust about `1928`, Go about `1422` requests/s.
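
To compare the requests/s sample with the `ns/op` figures, throughput can be converted to mean time per request. This is a small worked calculation, not project tooling; `microsPerRequest` and `relative` are hypothetical helper names, and the inputs are the approximate sample values listed above.

```go
package main

import "fmt"

// microsPerRequest converts a throughput in requests/s into the mean time
// per request in microseconds.
func microsPerRequest(reqPerSec float64) float64 {
	return 1e6 / reqPerSec
}

// relative returns throughput a as a fraction of throughput b.
func relative(a, b float64) float64 {
	return a / b
}

func main() {
	// Approximate sample values from the rebuilt release benchmark.
	fmt.Printf("cgroups: Go %.0f us/request, %.0f%% of C, %.0f%% of Rust\n",
		microsPerRequest(1081), 100*relative(1081, 1917), 100*relative(1081, 1426))
	fmt.Printf("apps:    Go %.0f us/request, %.0f%% of C, %.0f%% of Rust\n",
		microsPerRequest(1422), 100*relative(1422, 2174), 100*relative(1422, 1928))
}
```

On these numbers Go takes roughly 925 µs per cgroups request (about 56% of C's throughput) and roughly 703 µs per apps request (about 65% of C's). The isolated benchmarks report about 884 µs and 690 µs per operation respectively; the two harnesses measure different loops, so only ballpark agreement should be expected.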

Current interpretation:

- The avoidable Go allocation problem is fixed for APPS_LOOKUP and materially reduced for CGROUPS_LOOKUP.
- Rust now matches the protocol-sized suffix representation, but its `8192` cgroups gap remains CPU-bound rather than allocation-bound in the sampled benchmark.
- Go remains materially slower than C/Rust at `8192`, especially for CGROUPS_LOOKUP.
- The main remaining cgroups-specific Go cost is duplicated string validation: request decode validates each path, then handlers commonly pass the validated `CStringView.Bytes()` back to `builder.Add`, which validates the same path bytes again because the public builder API accepts raw `[]byte`.
- Avoiding that duplicated cgroups scan safely requires an explicit design/API choice, such as an internal dispatch helper or a public builder path that accepts already-validated `CStringView` data. This was not changed in this checkpoint because silently trusting raw `[]byte` would weaken corruption detection.
- The performance gate remains open.

Decision - validated cgroups builder path:

- The duplicated cgroups path scan is a code-organization problem, not a Go-specific runtime issue.
- The accepted design is:
  - keep the existing raw public builder methods safe and validating;
  - split validation from item layout/wire writing internally;
  - add/use an already-validated cgroups path flow for request-derived paths;
  - preserve corruption detection for raw application input and peer-decoded response data;
  - apply the same organization in Go, Rust, and C.
- This design must not silently trust arbitrary raw byte slices. Any validated path entry point must be backed by a decoder-produced view or an internal request item path whose provenance is known.
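
The shape of this design can be sketched in Go as follows. This is an illustration under stated assumptions, not the netipc API: `Request`, `DecodeRequest`, `Builder`, `AddFromRequest`, the error values, the "no embedded NUL" rule, and the "bytes plus trailing NUL" layout are all hypothetical, and `CStringView` here is a simplified stand-in for the SDK type of the same name. The point is the split: one validating raw entry point, one shared writer, and one request-backed entry point that checks only the index.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// CStringView is a stand-in. Its field is unexported, so outside this
// package only the decoder could produce a populated view.
type CStringView struct{ b []byte }

func (v CStringView) Bytes() []byte { return v.b }

var (
	errBadPath  = errors.New("path must not contain an embedded NUL")
	errBadIndex = errors.New("request item index out of range")
)

// validatePath is the per-path scan that should run once, not twice.
func validatePath(p []byte) error {
	if bytes.IndexByte(p, 0) >= 0 {
		return errBadPath
	}
	return nil
}

// Request holds paths that were validated at decode time.
type Request struct{ paths []CStringView }

// DecodeRequest validates every path and wraps it in a view.
func DecodeRequest(raw [][]byte) (*Request, error) {
	req := &Request{}
	for _, p := range raw {
		if err := validatePath(p); err != nil {
			return nil, err
		}
		req.paths = append(req.paths, CStringView{b: p})
	}
	return req, nil
}

// Builder accumulates items in a hypothetical layout: bytes plus one NUL.
type Builder struct{ out []byte }

// writeItem is the shared layout/wire-writing step. It never validates.
func (b *Builder) writeItem(p []byte) {
	b.out = append(b.out, p...)
	b.out = append(b.out, 0)
}

// Add is the raw public entry point and always validates its input.
func (b *Builder) Add(p []byte) error {
	if err := validatePath(p); err != nil {
		return err
	}
	b.writeItem(p)
	return nil
}

// AddFromRequest echoes a decoded request path by index. The path scan is
// skipped because provenance is known; only the index is checked.
func (b *Builder) AddFromRequest(req *Request, i int) error {
	if req == nil || i < 0 || i >= len(req.paths) {
		return errBadIndex
	}
	b.writeItem(req.paths[i].Bytes())
	return nil
}

func main() {
	req, err := DecodeRequest([][]byte{[]byte("/sys/fs/cgroup/a")})
	if err != nil {
		panic(err)
	}
	var b Builder
	fmt.Println(b.AddFromRequest(req, 0)) // valid index: no rescan
	fmt.Println(b.AddFromRequest(req, 1)) // invalid index: rejected
	fmt.Println(b.Add([]byte("x\x00y")))  // raw input: still validated
}
```

Taking a request and an index, rather than a bare view or byte slice, is one way to keep provenance checkable at the call site; whether the real entry points use that exact signature is not stated here.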

Implementation update after the decision:

- Added request-backed cgroups response builder entry points in all three SDKs:
- Kept the raw builder entry points validating raw application bytes.
- Split internal builder logic so item layout/wire writing can be shared while path validation is skipped only for decoded request-view paths.
- Updated C, Rust, and Go benchmark cgroups handlers to use the request-backed builder path.
- Added direct C/Rust/Go protocol tests for request-backed builder success and invalid-index rejection.
- Updated `docs/codec-cgroups-lookup.md` and `docs/netipc-integrator-skill.md` so future cgroups-lookup handlers use the request-backed builder only when echoing decoded request paths.

Validation evidence after the request-backed builder update:

- `cd src/go && go test -count=1 -timeout=300s ./pkg/netipc/protocol` passed.
- `cargo test --manifest-path src/crates/netipc/Cargo.toml protocol::lookup -- --test-threads=1` passed with `17 passed`.
- `cd src/go && go test -count=1 -timeout=300s ./pkg/netipc/...` passed.
- `cargo test --manifest-path src/crates/netipc/Cargo.toml -- --test-threads=1` passed with `375 passed`.
- `/usr/bin/ctest --test-dir build --output-on-failure` passed with `48/48` tests. The unqualified `ctest` command failed first because the workstation's `~/.local/bin/ctest` Python wrapper could not import `cmake`; the system CTest binary was then used successfully.

- The request-backed cgroups builder removes the unsafe/code-smelly reason for double-validating decoded request paths in C, Rust, and Go.
- Go cgroups throughput improved in the focused sample and no longer carries the avoidable request-path revalidation penalty in the benchmark handler.
- Go remains slower than C/Rust in these microbenchmarks. That residual gap is now more likely runtime/allocation/implementation overhead outside the specific duplicated path scan, but this is a working theory until a fresh full benchmark/profile pass is reviewed.
- These focused samples do not replace full POSIX and Windows benchmark regeneration before SOW close.