Skip to content

Commit 8a23810

Browse files
committed
Resolve restored scanner findings
1 parent 1b7ce78 commit 8a23810

19 files changed

Lines changed: 351 additions & 125 deletions

File tree

.agents/sow/done/SOW-0010-20260602-static-analysis-finding-cleanup.md

Lines changed: 97 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,8 @@
44

55
Status: completed
66

7-
Sub-state: Completed after restoring the approved hygiene checks and validating
8-
the scanner/test matrix.
7+
Sub-state: Completed after fixing the remaining restored-rule GitHub Code
8+
Scanning findings and validating the scanner/test matrix.
99

1010
## Requirements
1111

@@ -614,3 +614,98 @@ Artifact updates:
614614
because public integration guidance did not change.
615615
- SOW lifecycle: this reopened regression is completed and the SOW will be
616616
moved back to `done/` in the same commit as the restored scanner changes.
617+
618+
## Regression - 2026-06-03 Remote Alerts After Hygiene Restoration
619+
620+
What broke:
621+
622+
- The restored CodeQL and OSV rules were useful and found remaining real
623+
hygiene/security issues after commit
624+
`1b7ce780b7c4c54902e1ff0e957aad1542fe3733`.
625+
626+
Evidence:
627+
628+
- Codacy Cloud and local Codacy are clean for the restored commit, so the
629+
remaining backlog is GitHub Code Scanning, not Codacy local configuration.
630+
- GitHub Code Scanning reported 32 open alerts after the restored-rule commit:
631+
one Go integer-conversion alert in `bench/drivers/go/main.go`, nine Go
632+
standard-library OSV alerts across the three Go modules, two C unused-code
633+
alerts, three C unused-local alerts, thirteen C constant-comparison alerts,
634+
two C TOCTOU alerts, one Go unchecked writable-close test alert, and one Go
635+
useless-assignment test alert.
636+
- Official Go release history states that `go1.25.11` and `go1.26.4`, both
637+
released on 2026-06-02, include security fixes for `crypto/x509`, `mime`,
638+
and `net/textproto`; the repository Go module directives still declare
639+
`go 1.25.10`.
640+
641+
Why previous validation missed it:
642+
643+
- Local `govulncheck` used the workstation Go runtime, currently `go1.26.3-X`,
644+
while GitHub OSV scans the module `go` directive and therefore still sees
645+
`go 1.25.10` as vulnerable.
646+
- The restored CodeQL rules only re-ran remotely after the follow-up commit was
647+
pushed.
648+
649+
Repair plan:
650+
651+
- Update all Go module directives from `go 1.25.10` to the patched supported
652+
Go line.
653+
- Fix the benchmark sample-count conversion by keeping the arithmetic bounded
654+
before converting to `int`.
655+
- Check writable file close errors in the SHM edge test and remove the useless
656+
UDS test assignment.
657+
- Remove unused C helpers and local variables.
658+
- Rewrite redundant overflow checks so the code preserves real guards without
659+
constant comparisons.
660+
- Resolve the stale socket/shared-memory cleanup TOCTOU alerts with a
661+
code-level security decision that is narrow and documented in code.
662+
663+
Validation:
664+
665+
- `cd src/go && go test ./... && go vet ./... && "$(go env GOPATH)/bin/govulncheck" ./... && "$(go env GOPATH)/bin/staticcheck" ./... && "$(go env GOPATH)/bin/gosec" -quiet -fmt json -out /tmp/plugin-ipc-gosec-src-go-followup.json -exclude=G404 ./...`
666+
passed with no Go vulnerabilities or gosec findings.
667+
- `cd tests/fixtures/go && go test ./... && go vet ./... && "$(go env GOPATH)/bin/govulncheck" ./... && "$(go env GOPATH)/bin/staticcheck" ./... && "$(go env GOPATH)/bin/gosec" -quiet -fmt json -out /tmp/plugin-ipc-gosec-fixtures-go-followup.json -exclude=G404 ./...`
668+
passed with no Go vulnerabilities or gosec findings.
669+
- `cd bench/drivers/go && go test ./... && go vet ./... && "$(go env GOPATH)/bin/govulncheck" ./... && "$(go env GOPATH)/bin/staticcheck" ./... && "$(go env GOPATH)/bin/gosec" -quiet -fmt json -out /tmp/plugin-ipc-gosec-bench-go-followup.json -exclude=G404 ./...`
670+
passed with no Go vulnerabilities or gosec findings.
671+
- `cargo test --manifest-path src/crates/netipc/Cargo.toml --all-targets --all-features --no-run && cargo clippy --manifest-path src/crates/netipc/Cargo.toml --all-targets --all-features -- -D clippy::correctness -D clippy::suspicious`
672+
passed. Clippy emitted existing warning-only hygiene output outside the hard
673+
correctness/suspicious gate.
674+
- `cargo audit && cargo deny check advisories bans sources` passed in
675+
`src/crates/netipc`.
676+
- `make` passed and rebuilt C, Rust, Go fixtures, and benchmark drivers.
677+
- `/usr/bin/ctest --test-dir build --output-on-failure` first passed 45/46 and
678+
had one `go_FuzzDecodeCgroupsLookupRequest` timeout flake. The exact target
679+
passed on rerun, and a second full `/usr/bin/ctest --test-dir build --output-on-failure`
680+
run passed all 46 tests.
681+
- `actionlint` passed.
682+
- The local C static workflow commands passed: configure `build-static`, build
683+
`netipc_protocol`, `netipc_uds`, `netipc_shm`, and `netipc_service`, then run
684+
`clang-tidy`, `cppcheck`, and `flawfinder --minlevel=5 --error-level=5`.
685+
- After the final SHM cleanup adjustment, `make` passed again and a focused
686+
`clang-tidy` plus `cppcheck` pass on `netipc_shm.c` exited 0.
687+
- `osv-scanner scan --recursive --format sarif --output-file /tmp/plugin-ipc-osv-followup.sarif .`
688+
exited 0, and the SARIF result count was 0.
689+
- `codacy-analysis analyze . --install-dependencies --output-format json --output /tmp/plugin-ipc-codacy-followup.json --parallel-tools 2 --tool-timeout 900000`
690+
exited 0 with 0 issues and 0 tool errors.
691+
- `codacy-analysis analyze . --install-dependencies --output-format sarif --output /tmp/plugin-ipc-codacy-followup.sarif --parallel-tools 2 --tool-timeout 900000`
692+
exited 0 and generated SARIF with 0 results.
693+
- `bash .agents/sow/audit.sh` passed.
694+
- `git diff --check && git diff --cached --check` passed.
695+
- `git check-ignore -v .env` confirmed `.env` is ignored by `.gitignore`.
696+
697+
Artifact updates:
698+
699+
- AGENTS.md: no update needed; existing project validation commands and SOW
700+
rules remain accurate.
701+
- Runtime project skills: no update needed; the repository still has no runtime
702+
`project-*` skill, and no reusable repo workflow was missing from AGENTS.md.
703+
- Specs: updated `docs/level1-transport.md` and `docs/level1-posix-shm.md` to
704+
document the POSIX private-runtime-directory rule for automatic stale unlink.
705+
- End-user/operator docs: updated the public transport docs listed above.
706+
- End-user/operator skills: updated `docs/netipc-integrator-skill.md` so
707+
downstream integrators keep provider runtime directories private enough for
708+
stale cleanup.
709+
- SOW lifecycle: this regression was appended after the prior SOW content; the
710+
SOW is marked `completed` and will be moved back to `done/` in the same
711+
commit as the implementation and docs.

bench/drivers/go/go.mod

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
module github.com/netdata/plugin-ipc/bench/drivers/go
22

3-
go 1.25.10
3+
go 1.25.11
44

55
require github.com/netdata/plugin-ipc/go v0.0.0
66

bench/drivers/go/main.go

Lines changed: 4 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -77,10 +77,6 @@ func cpuNS() uint64 {
7777
return sec*1_000_000_000 + nsec
7878
}
7979

80-
func maxIntValue() int {
81-
return int(^uint(0) >> 1)
82-
}
83-
8480
func u32FromNonNegativeInt(value int) (uint32, bool) {
8581
if value < 0 || uint64(value) > uint64(^uint32(0)) {
8682
return 0, false
@@ -110,15 +106,12 @@ func estimateSamples(defaultSamples int, targetRPS uint64, durationSec int) int
110106
if targetRPS == 0 || durationSec <= 0 {
111107
return defaultSamples
112108
}
113-
limit := uint64(maxIntValue() / durationSec) // #nosec G115 -- maxIntValue and durationSec are positive here.
114-
if targetRPS > limit {
115-
return maxLatencySamples
116-
}
117-
samples := int(targetRPS) * durationSec // #nosec G115 -- targetRPS is bounded by limit above.
118-
if samples > maxLatencySamples {
109+
duration := uint64(durationSec) // #nosec G115 -- durationSec is checked positive above.
110+
if targetRPS > uint64(maxLatencySamples)/duration {
119111
return maxLatencySamples
120112
}
121-
return samples
113+
samples := targetRPS * duration
114+
return int(samples) // #nosec G115 -- samples is bounded by maxLatencySamples above.
122115
}
123116

124117
func randomBatchSize() uint32 {

docs/level1-posix-shm.md

Lines changed: 14 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -246,15 +246,20 @@ When a session closes (graceful or broken):
246246
Both `server_create` (per-session) and `cleanup_stale` (startup scan)
247247
use the same stale detection logic:
248248

249-
1. Open the file and mmap the header. If `open()` fails with a
249+
1. Verify `{run_dir}` is safe for automatic stale unlink: owned by the
250+
effective process user and not group- or world-writable. If the directory is
251+
unsafe, stale entries are left in place and treated as in-use.
252+
2. Open the file and mmap the header. If `open()` fails with a
250253
permission error (EACCES, EPERM), the file is left in place — it
251254
may belong to another user or process.
252-
2. Validate magic. If invalid or file is undersized: stale — unlink.
253-
3. Check `owner_pid` and `owner_generation`:
255+
3. Validate magic. If invalid or file is undersized: stale — unlink only when
256+
`{run_dir}` passed the safety check.
257+
4. Check `owner_pid` and `owner_generation`:
254258
- If `owner_pid` is alive AND `owner_generation` is non-zero:
255259
the region is live — leave it.
256260
- Otherwise (PID dead, or generation is zero indicating an
257-
uninitialized/legacy region): stale — unlink.
261+
uninitialized/legacy region): stale — unlink only when `{run_dir}` passed
262+
the safety check.
258263

259264
The `owner_generation` check catches PID reuse: a new process that
260265
reuses an old PID will not have the same generation value. A zero
@@ -268,7 +273,8 @@ behind by a previous server instance that crashed or was killed:
268273

269274
1. Scan `{run_dir}` for files matching `{service_name}-*.ipcshm`.
270275
2. For each file: apply the stale detection logic above.
271-
3. Stale files are unlinked. Live files are left in place.
276+
3. Stale files are unlinked only in a safe `{run_dir}`. Live files and files in
277+
unsafe shared directories are left in place.
272278

273279
This cleanup runs once at server startup, before the listener begins
274280
accepting connections. It is the safety net for crashes and hard
@@ -278,5 +284,6 @@ reboots.
278284

279285
When `server_create` is called for a new session, it applies the same
280286
stale detection logic to the target path before attempting `O_EXCL`
281-
create. If a stale file exists, it is unlinked first. If a live file
282-
exists, the create fails with address-in-use.
287+
create. If a stale file exists and `{run_dir}` is safe, it is unlinked first.
288+
If a live file exists, or `{run_dir}` is unsafe for automatic stale unlink, the
289+
create fails with address-in-use.

docs/level1-transport.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -531,6 +531,10 @@ stale if a server process crashes without cleanup. Level 1 handles this:
531531
it (unlinks and recreates).
532532
- On listen: if an active endpoint exists from a live process, Level 1 fails
533533
with an appropriate error (address already in use).
534+
- On POSIX: automatic stale unlink is allowed only when `run_dir` is owned by
535+
the effective process user and is not group- or world-writable. In unsafe
536+
shared directories, stale entries are treated as in-use so Level 1 does not
537+
unlink paths that another local user could race or replace.
534538
- Stale detection uses process ownership metadata (PID and generation) to
535539
avoid false reclamation due to PID reuse.
536540

docs/netipc-integrator-skill.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -273,6 +273,9 @@ Netdata-specific integration rules:
273273
- use `os_run_dir(true)` on the provider side
274274
- use `os_run_dir(false)` on the consumer side
275275
- treat these as a matched pair
276+
- keep the provider runtime directory service-owned and not group- or
277+
world-writable; POSIX stale cleanup will not unlink old socket/SHM paths from
278+
unsafe shared directories
276279
- if provider and consumer do not agree on auth token or run dir, the client
277280
will stay in `AUTH_FAILED` or `NOT_FOUND`
278281

src/crates/netipc/src/transport/posix.rs

Lines changed: 24 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ use crate::protocol::{
1212
use std::collections::HashSet;
1313
use std::ffi::CString;
1414
use std::io;
15+
use std::os::unix::fs::MetadataExt;
1516
use std::os::unix::io::RawFd;
1617
use std::path::{Path, PathBuf};
1718
use std::sync::atomic::{AtomicU64, Ordering};
@@ -593,7 +594,7 @@ impl UdsListener {
593594
let path = build_socket_path(run_dir, service_name)?;
594595

595596
// Stale recovery
596-
match check_and_recover_stale(&path) {
597+
match check_and_recover_stale(&path, run_dir_allows_stale_unlink(run_dir)) {
597598
StaleResult::LiveServer => return Err(UdsError::AddrInUse),
598599
StaleResult::Stale | StaleResult::NotExist => { /* proceed */ }
599600
}
@@ -910,7 +911,17 @@ enum StaleResult {
910911
LiveServer,
911912
}
912913

913-
fn check_and_recover_stale(path: &str) -> StaleResult {
914+
fn run_dir_allows_stale_unlink(run_dir: &str) -> bool {
915+
let metadata = match std::fs::metadata(run_dir) {
916+
Ok(m) => m,
917+
Err(_) => return false,
918+
};
919+
metadata.is_dir()
920+
&& metadata.uid() == unsafe { libc::geteuid() }
921+
&& metadata.mode() & 0o022 == 0
922+
}
923+
924+
fn check_and_recover_stale(path: &str, allow_stale_unlink: bool) -> StaleResult {
914925
if !Path::new(path).exists() {
915926
return StaleResult::NotExist;
916927
}
@@ -926,10 +937,17 @@ fn check_and_recover_stale(path: &str) -> StaleResult {
926937
// Connected => live server
927938
StaleResult::LiveServer
928939
}
929-
Err(UdsError::Connect(e)) if e == libc::ECONNREFUSED || e == libc::ENOENT => {
930-
// Connection refused or no such socket => stale, unlink
931-
let _ = std::fs::remove_file(path);
932-
StaleResult::Stale
940+
Err(UdsError::Connect(e)) if e == libc::ENOENT => StaleResult::NotExist,
941+
Err(UdsError::Connect(e)) if e == libc::ECONNREFUSED => {
942+
if !allow_stale_unlink {
943+
return StaleResult::LiveServer;
944+
}
945+
// Connection refused means stale; unlink only in a private run dir.
946+
match std::fs::remove_file(path) {
947+
Ok(()) => StaleResult::Stale,
948+
Err(err) if err.kind() == io::ErrorKind::NotFound => StaleResult::NotExist,
949+
Err(_) => StaleResult::LiveServer,
950+
}
933951
}
934952
Err(_) => {
935953
// Other errors (EACCES, etc.) — can't determine ownership,

0 commit comments

Comments
 (0)