|
| 1 | +# SOW-0017 - Rust POSIX tests: umask-independent run directories |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Status: completed |
| 6 | + |
| 7 | +Sub-state: fix implemented, validated under umask 0002 and 0022, committed with this SOW. |
| 8 | + |
| 9 | +## Requirements |
| 10 | + |
| 11 | +### Purpose |
| 12 | + |
| 13 | +Rust netipc tests must pass regardless of the developer/CI process umask, without weakening the stale-unlink safety guard. |
| 14 | + |
| 15 | +### User Request |
| 16 | + |
| 17 | +User shared an external discovery note (`netipc.md`) reporting that `transport::shm::tests::test_stale_recovery` and related tests fail with `umask 0002`, then asked: "fix it and commit please". The fix direction chosen in discussion was the surgical test-setup change (chmod 0700 on the test run dir), not weakening the guard. |
| 18 | + |
| 19 | +### Assistant Understanding |
| 20 | + |
| 21 | +Facts: |
| 22 | + |
| 23 | +- `run_dir_allows_stale_unlink` (`src/crates/netipc/src/transport/shm.rs:976`) refuses stale unlinks when the run dir is group/other-writable (`mode & 0o022 != 0`). The same guard exists for UDS (`src/crates/netipc/src/transport/posix.rs`), in C (`src/libnetdata/netipc/src/transport/posix/`), and in Go (`src/go/pkg/netipc/transport/posix/`). |
| 24 | +- Four Rust test files create their run dirs with bare `std::fs::create_dir_all`, which inherits the process umask: `transport/shm_tests.rs`, `transport/posix_tests.rs`, `service/raw_unix_tests.rs`, `service/cgroups_unix_tests.rs`. |
| 25 | +- Under `umask 0002` the dirs land at mode `0775`, the guard refuses to unlink stale files, and 6 tests fail (full list in Analysis). |
| 26 | +- C tests use `mkdir(TEST_RUN_DIR, 0700)` and Go tests use `os.MkdirAll(dir, 0700)`; `0700` is unaffected by any umask, so C and Go are not exposed. |
| 27 | +- The external note also claimed the guard is missing from this upstream repo; that claim was verified false — it compared against a stale clone at `48c74fe` (2026-04-06), 105 commits behind. The guard landed in `8a23810` (2026-06-03). |
| 28 | + |
| 29 | +Inferences: |
| 30 | + |
| 31 | +- Any CI runner or developer machine with a permissive umask (0002 is the default on several distros) silently fails these tests; the guard behavior is correct, the test setup is environmentally fragile. |
| 32 | + |
| 33 | +Unknowns: |
| 34 | + |
| 35 | +- None blocking. |
| 36 | + |
| 37 | +### Acceptance Criteria |
| 38 | + |
| 39 | +- `cargo test` in `src/crates/netipc` passes with `umask 0002` (previously 6 failures). |
| 40 | +- `cargo test` in `src/crates/netipc` passes with `umask 0022` (no regression). |
| 41 | +- No production code changed; the safety guard semantics are untouched. |
| 42 | + |
| 43 | +## Analysis |
| 44 | + |
| 45 | +Sources checked: |
| 46 | + |
| 47 | +- `src/crates/netipc/src/transport/shm.rs:972-1062` (guard and call sites) |
| 48 | +- `src/crates/netipc/src/transport/shm_tests.rs:13-15`, `posix_tests.rs:11-13`, `service/raw_unix_tests.rs:22-24`, `service/cgroups_unix_tests.rs:16-18` (run-dir setup) |
| 49 | +- C tests `tests/fixtures/c/*.c` and Go tests `src/go/pkg/netipc/**/*_test.go` (same-failure scan) |
| 50 | +- External discovery note `netipc.md` (user-provided, claims cross-checked against git history) |
| 51 | + |
| 52 | +Current state (pre-fix evidence, `umask 0002`, full `cargo test` in `src/crates/netipc`): |
| 53 | + |
| 54 | +- FAILED: `transport::posix::tests::test_stale_recovery` |
| 55 | +- FAILED: `transport::shm::tests::test_cleanup_stale_invalid_entries` |
| 56 | +- FAILED: `transport::shm::tests::test_cleanup_stale_unlinks_dangling_symlink` |
| 57 | +- FAILED: `transport::shm::tests::test_cleanup_stale_unlinks_directory_symlink_when_mmap_fails` |
| 58 | +- FAILED: `transport::shm::tests::test_server_create_recovers_invalid_stale_file` |
| 59 | +- FAILED: `transport::shm::tests::test_stale_recovery` |
| 60 | +- Result: 327 passed; 6 failed. |
| 61 | + |
| 62 | +Risks: |
| 63 | + |
| 64 | +- None significant: change is test-only, 4 files, identical 1-line addition each. |
| 65 | + |
| 66 | +## Pre-Implementation Gate |
| 67 | + |
| 68 | +Status: ready |
| 69 | + |
| 70 | +Problem / root-cause model: |
| 71 | + |
| 72 | +- Test run dirs under `/tmp` inherit the process umask; with `umask 0002` they are group-writable (`0775`), and the stale-unlink safety guard correctly refuses to unlink inside group/other-writable directories. Stale files survive, subsequent `O_EXCL` creates fail with `EEXIST`/`AddrInUse`, and stale-recovery assertions fail. |
| 73 | + |
| 74 | +Evidence reviewed: |
| 75 | + |
| 76 | +- See Analysis. Reproduced locally: 6 failures under `umask 0002`, all pass under `umask 0022`. |
| 77 | + |
| 78 | +Affected contracts and surfaces: |
| 79 | + |
| 80 | +- Rust test setup only. No public API, wire format, transport behavior, or production code. |
| 81 | + |
| 82 | +Existing patterns to reuse: |
| 83 | + |
| 84 | +- C tests: `mkdir(TEST_RUN_DIR, 0700)` (`tests/fixtures/c/test_hardening.c:47`). |
| 85 | +- Go tests: `os.MkdirAll(dir, 0700)` (`src/go/pkg/netipc/transport/posix/uds_test.go:36` and others). |
| 86 | +- Rust mirrors this by `std::fs::set_permissions(.., from_mode(0o700))` after `create_dir_all`, which also repairs a pre-existing dir left with a permissive mode by an earlier run. |
| 87 | + |
| 88 | +Risk and blast radius: |
| 89 | + |
| 90 | +- Test-only. Worst case: a test environment where `set_permissions` fails — error is ignored like the existing `create_dir_all` error, and the affected test fails exactly as before. |
| 91 | + |
| 92 | +Sensitive data handling plan: |
| 93 | + |
| 94 | +- No secrets, customer data, or personal data involved. SOW cites only repo paths, public commit hashes, and test names. |
| 95 | + |
| 96 | +Implementation plan: |
| 97 | + |
| 98 | +1. Add `set_permissions(0o700)` to `ensure_run_dir()` in the four Rust POSIX test files. |
| 99 | + |
| 100 | +Validation plan: |
| 101 | + |
| 102 | +- Full `cargo test` in `src/crates/netipc` under `umask 0002` and `umask 0022`, with `/tmp/nipc_*` test dirs removed first to exercise fresh creation. |
| 103 | + |
| 104 | +Artifact impact plan: |
| 105 | + |
| 106 | +- AGENTS.md: unaffected — no workflow/process change. |
| 107 | +- Runtime project skills: none exist; nothing to update. |
| 108 | +- Specs: unaffected — no protocol/API/transport behavior change; guard semantics untouched. |
| 109 | +- End-user/operator docs: unaffected — test-internal change; `docs/` documents the product, not test scaffolding. |
| 110 | +- End-user/operator skills: unaffected — `docs/netipc-integrator-skill.md` covers integration workflow, not test internals. |
| 111 | +- SOW lifecycle: single SOW, completed and committed with the change. |
| 112 | + |
| 113 | +Open-source reference evidence: |
| 114 | + |
| 115 | +- None checked; the failure is local and fully reproducible, no external reference needed. |
| 116 | + |
| 117 | +Open decisions: |
| 118 | + |
| 119 | +- Fix approach (chmod 0700 vs per-test tempdir) was presented to the user before this SOW; user accepted the recommended surgical chmod-0700 variant by requesting the fix. |
| 120 | + |
| 121 | +## Implications And Decisions |
| 122 | + |
| 123 | +1. Fix approach: (A) `chmod 0700` the existing shared run dirs — zero new dependencies, mirrors C/Go; (B) per-test `tempfile::tempdir()` — stronger isolation, new dev-dependency, larger diff. Selected: A (surgical), per user request following the assistant's recommendation. |
| 124 | + |
| 125 | +## Plan |
| 126 | + |
| 127 | +1. Patch `ensure_run_dir()` in the four Rust POSIX test files (low risk, no dependencies). |
| 128 | +2. Validate under both umasks; complete and commit. |
| 129 | + |
| 130 | +## Execution Log |
| 131 | + |
| 132 | +### 2026-06-10 |
| 133 | + |
| 134 | +- Reproduced 6 failures under `umask 0002` on clean `/tmp` state. |
| 135 | +- Patched `ensure_run_dir()` in `src/crates/netipc/src/transport/shm_tests.rs`, `src/crates/netipc/src/transport/posix_tests.rs`, `src/crates/netipc/src/service/raw_unix_tests.rs`, `src/crates/netipc/src/service/cgroups_unix_tests.rs`. |
| 136 | +- Verified full Rust suite green under `umask 0002` and `umask 0022`. |
| 137 | + |
| 138 | +## Validation |
| 139 | + |
| 140 | +Acceptance criteria evidence: |
| 141 | + |
| 142 | +- `umask 0002`: `cargo test` in `src/crates/netipc` — 333 passed; 0 failed (previously 327 passed; 6 failed). |
| 143 | +- `umask 0022`: `cargo test` in `src/crates/netipc` — 333 passed; 0 failed. |
| 144 | +- `git diff --stat` shows only the four test files changed; no production code touched. |
| 145 | + |
| 146 | +Tests or equivalent validation: |
| 147 | + |
| 148 | +- Full `cargo test` run twice (umask 0002 and 0022), `/tmp/nipc_*` dirs removed before each run. |
| 149 | + |
| 150 | +Real-use evidence: |
| 151 | + |
| 152 | +- The tests are the runnable path; both runs above are real executions on this workstation. |
| 153 | + |
| 154 | +Reviewer findings: |
| 155 | + |
| 156 | +- External reviewers not run: 4-file test-only setup change falls under the documented "wasteful reviewer calls" category. |
| 157 | + |
| 158 | +Same-failure scan: |
| 159 | + |
| 160 | +- Grepped all C (`tests/fixtures/c/*.c`) and Go (`**/*_test.go`) test dir creation: all use mode `0700` explicitly; not exposed to umask. Grepped all Rust `create_dir_all` test call sites: the four patched files were the complete set of POSIX run-dir creators (Windows test files are not subject to POSIX modes). |
| 161 | + |
| 162 | +Sensitive data gate: |
| 163 | + |
| 164 | +- No secrets, credentials, customer or personal data in this SOW, the diff, or comments. |
| 165 | + |
| 166 | +Artifact maintenance gate: |
| 167 | + |
| 168 | +- AGENTS.md: no update — no workflow, guardrail, or process change. |
| 169 | +- Runtime project skills: none exist (per AGENTS.md index); no update. |
| 170 | +- Specs: no update — no behavior, contract, wire-format, or operational-guarantee change; guard semantics untouched. |
| 171 | +- End-user/operator docs: no update — change is internal to test scaffolding, not visible to integrators or operators. |
| 172 | +- End-user/operator skills: no update — `docs/netipc-integrator-skill.md` triggers (public APIs, examples, workflow, transports, validation commands) are unaffected. |
| 173 | +- SOW lifecycle: `Status: completed`, file placed in `.agents/sow/done/`, committed together with the fix in one commit. |
| 174 | + |
| 175 | +Specs update: |
| 176 | + |
| 177 | +- Not needed — see artifact maintenance gate. |
| 178 | + |
| 179 | +Project skills update: |
| 180 | + |
| 181 | +- Not needed — no runtime project skills exist. |
| 182 | + |
| 183 | +End-user/operator docs update: |
| 184 | + |
| 185 | +- Not needed — see artifact maintenance gate. |
| 186 | + |
| 187 | +End-user/operator skills update: |
| 188 | + |
| 189 | +- Not needed — no docs/spec changes occurred. |
| 190 | + |
| 191 | +Lessons: |
| 192 | + |
| 193 | +- POSIX test scaffolding that creates directories must set modes explicitly; inheriting the process umask makes test outcomes environment-dependent, especially when production code legitimately inspects directory permissions. C and Go already followed this; Rust did not. |
| 194 | +- External discovery notes must be re-verified against current upstream HEAD before acting: the triggering note compared against a 105-commit-stale clone and drew a false "divergence" conclusion. |
| 195 | + |
| 196 | +Follow-up mapping: |
| 197 | + |
| 198 | +- The note's netdata-side recommendations (CI umask check in `netdata/netdata`, vendor-sync commit trailers) concern the downstream repo, not this one; this fix flows to netdata via the normal vendor sync. No follow-up SOW needed here. |
| 199 | + |
| 200 | +## Outcome |
| 201 | + |
| 202 | +All Rust netipc tests pass independent of the process umask. Test run directories are now created (or repaired) with mode `0700`, matching the existing C and Go test conventions. No production code changed. |
| 203 | + |
| 204 | +## Lessons Extracted |
| 205 | + |
| 206 | +See Lessons under Validation. |
| 207 | + |
| 208 | +## Followup |
| 209 | + |
| 210 | +None. |
| 211 | + |
| 212 | +## Regression Log |
| 213 | + |
| 214 | +None yet. |
| 215 | + |
| 216 | +Append regression entries here only after this SOW was completed or closed and later testing or use found broken behavior. Use a dated `## Regression - YYYY-MM-DD` heading at the end of the file. Never prepend regression content above the original SOW narrative. |
0 commit comments