Skip to content

Commit f91393f

Browse files
committed
Make POSIX stale endpoint recovery liveness-only
Stale UDS socket and SHM region recovery was gated on run-directory permissions: automatic unlink was refused unless the run dir was owned by the process euid and not group/other-writable. netdata's systemd unit ships RuntimeDirectoryMode=0775, so on every standard install a crashed server could never reclaim its endpoint on restart — the service stayed AddrInUse until manual cleanup or reboot. The guard came from a CodeQL TOCTOU alert and assumed hostile writers in a shared directory; the run dir is the embedding service's private runtime directory, so that threat model does not apply. Liveness is now the only criterion, in C, Rust, and Go, for both transports: a live server's endpoint (SHM owner PID alive with non-zero generation; UDS connect probe succeeds) is never deleted and a second server gets address-in-use. Everything else found at an endpoint name — dead server leftovers, foreign files, symlinks, unreadable entries, empty directories — is silently reclaimed so the service starts. The one exception: fd exhaustion (EMFILE/ENFILE) during the liveness check keeps the entry, since liveness could not be evaluated and deleting could remove a live endpoint. Descriptor-relative fstatat/unlinkat cleanup mechanics are kept. Tests updated across all three languages, with new coverage for crash recovery in 0775 run dirs and foreign-file reclaim; the public specs and the integrator skill describe the new contract. Tracked as SOW-0018; supersedes the SOW-0017 test-side umask workaround.
1 parent 9b5cde8 commit f91393f

23 files changed

Lines changed: 612 additions & 332 deletions

.agents/sow/done/SOW-0018-20260610-liveness-only-stale-recovery.md

Lines changed: 239 additions & 0 deletions
Large diffs are not rendered by default.

docs/level1-posix-shm.md

Lines changed: 16 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -246,20 +246,24 @@ When a session closes (graceful or broken):
246246
Both `server_create` (per-session) and `cleanup_stale` (startup scan)
247247
use the same stale detection logic:
248248

249-
1. Verify `{run_dir}` is safe for automatic stale unlink: owned by the
250-
effective process user and not group- or world-writable. If the directory is
251-
unsafe, stale entries are left in place and treated as in-use.
252-
2. Open the file and mmap the header. If `open()` fails with a
253-
permission error (EACCES, EPERM), the file is left in place — it
254-
may belong to another user or process.
255-
3. Validate magic. If invalid or file is undersized: stale — unlink only when
256-
`{run_dir}` passed the safety check.
257-
4. Check `owner_pid` and `owner_generation`:
249+
1. Open the file and mmap the header. If `open()` fails with ENOENT,
250+
nothing is there. If it fails with EMFILE/ENFILE (fd exhaustion),
251+
liveness cannot be evaluated — the entry is kept and treated as
252+
in-use. Any other open failure (unreadable file, symlink, foreign
253+
junk at the endpoint name) marks the entry as junk — silently unlink.
254+
2. Validate magic. If invalid, or the file is undersized or not a
255+
regular file: junk — silently unlink.
256+
3. Check `owner_pid` and `owner_generation`:
258257
- If `owner_pid` is alive AND `owner_generation` is non-zero:
259-
the region is live — leave it.
258+
the region is live — leave it, report address-in-use.
260259
- Otherwise (PID dead, or generation is zero indicating an
261-
uninitialized/legacy region): stale — unlink only when `{run_dir}` passed
262-
the safety check.
260+
uninitialized/legacy region): stale — silently unlink.
261+
262+
Liveness is the only criterion: a live server's endpoint is never
263+
deleted, and everything else found at an endpoint name is reclaimed so
264+
the service can start. Run-directory permissions do not gate stale
265+
recovery — the run dir is expected to be the embedding service's
266+
private runtime directory.
263267

264268
The `owner_generation` check catches PID reuse: a new process that
265269
reuses an old PID will not have the same generation value. A zero

docs/level1-posix-uds.md

Lines changed: 12 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -118,19 +118,25 @@ and therefore a different SHM file path.
118118
- Server checks if the socket path exists before bind.
119119
- If it exists: attempt to connect to it. If connection succeeds, a
120120
live server owns it — fail with address-in-use. If connection fails
121-
with ECONNREFUSED or ENOENT, the socket is stale — unlink and
122-
recreate. Other connect errors (EACCES, EPERM, etc.) are not treated
123-
as stale — the file is left in place and the server fails.
121+
with ENOENT, nothing is there. Any other connect failure
122+
(ECONNREFUSED, EACCES, a foreign file squatting on the socket path,
123+
etc.) means the endpoint is not a live server — it is silently
124+
unlinked and recreated.
125+
- The only exception is fd exhaustion (EMFILE/ENFILE) while creating
126+
the probe socket: liveness cannot be evaluated, so the endpoint is
127+
kept and the server fails with address-in-use rather than risk
128+
deleting a live socket.
124129

125130
### SHM file
126131

127132
- Server checks the `.ipcshm` file for ownership metadata (PID and
128133
generation) stored in the SHM region header.
129134
- If the owner PID is alive and the generation matches, the region is
130135
active — fail with address-in-use.
131-
- If the owner PID is dead or the region is invalid/undersized, the
132-
region is stale — unlink and recreate. If the file cannot be opened
133-
due to permission errors (EACCES, EPERM), it is left in place.
136+
- If the owner PID is dead or the region is invalid/undersized/unreadable,
137+
the region is stale or junk — silently unlink and recreate. Only fd
138+
exhaustion (EMFILE/ENFILE) keeps the file in place, because liveness
139+
could not be evaluated.
134140
- Clients that open an undersized/unpopulated SHM file (server has
135141
created it but not yet initialized the header) treat it as a
136142
retryable protocol-not-ready condition.

docs/level1-transport.md

Lines changed: 9 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -531,12 +531,17 @@ stale if a server process crashes without cleanup. Level 1 handles this:
531531
it (unlinks and recreates).
532532
- On listen: if an active endpoint exists from a live process, Level 1 fails
533533
with an appropriate error (address already in use).
534-
- On POSIX: automatic stale unlink is allowed only when `run_dir` is owned by
535-
the effective process user and is not group- or world-writable. In unsafe
536-
shared directories, stale entries are treated as in-use so Level 1 does not
537-
unlink paths that another local user could race or replace.
534+
- Liveness is the only criterion: anything found at an endpoint name that is
535+
not a live server (dead server's endpoint, foreign file accidentally placed
536+
there, invalid or unreadable entry) is silently reclaimed so the service
537+
can start. The one exception is fd exhaustion during the liveness check
538+
itself — then the entry is kept and reported as in-use, so a live endpoint
539+
is never deleted on an unverifiable check.
538540
- Stale detection uses process ownership metadata (PID and generation) to
539541
avoid false reclamation due to PID reuse.
542+
- `run_dir` is expected to be the embedding service's private runtime
543+
directory (e.g. netdata's run dir); its permissions do not gate stale
544+
recovery.
540545

541546
## Testing requirements
542547

docs/netipc-integrator-skill.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -286,9 +286,9 @@ Netdata-specific integration rules:
286286
- use `os_run_dir(true)` on the provider side
287287
- use `os_run_dir(false)` on the consumer side
288288
- treat these as a matched pair
289-
- keep the provider runtime directory service-owned and not group- or
290-
world-writable; POSIX stale cleanup will not unlink old socket/SHM paths from
291-
unsafe shared directories
289+
- stale socket/SHM paths from dead processes are reclaimed automatically on
290+
server start, regardless of run-dir permissions; only a live server's
291+
endpoint blocks a new server (address-in-use)
292292
- if provider and consumer do not agree on auth token or run dir, the client
293293
will stay in `AUTH_FAILED` or `NOT_FOUND`
294294

src/crates/netipc/src/service/cgroups_unix_tests.rs

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -14,11 +14,7 @@ const RESPONSE_BUF_SIZE: usize = 65536;
1414
static SERVICE_COUNTER: AtomicU64 = AtomicU64::new(0);
1515

1616
fn ensure_run_dir() {
17-
use std::os::unix::fs::PermissionsExt;
1817
let _ = std::fs::create_dir_all(TEST_RUN_DIR);
19-
// The stale-unlink guard refuses group/other-writable run dirs; pin the
20-
// mode so the process umask cannot decide test outcomes.
21-
let _ = std::fs::set_permissions(TEST_RUN_DIR, std::fs::Permissions::from_mode(0o700));
2218
}
2319

2420
fn unique_service(prefix: &str) -> String {

src/crates/netipc/src/service/raw_unix_tests.rs

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -20,11 +20,7 @@ const RESPONSE_BUF_SIZE: usize = 65536;
2020
static RAW_SERVICE_COUNTER: std::sync::atomic::AtomicU64 = std::sync::atomic::AtomicU64::new(0);
2121

2222
fn ensure_run_dir() {
23-
use std::os::unix::fs::PermissionsExt;
2423
let _ = std::fs::create_dir_all(TEST_RUN_DIR);
25-
// The stale-unlink guard refuses group/other-writable run dirs; pin the
26-
// mode so the process umask cannot decide test outcomes.
27-
let _ = std::fs::set_permissions(TEST_RUN_DIR, std::fs::Permissions::from_mode(0o700));
2824
}
2925

3026
fn cleanup_all(service: &str) {
@@ -1160,7 +1156,10 @@ fn test_server_falls_back_to_baseline_when_linux_shm_prepare_fails() {
11601156

11611157
let shm_path = format!("{TEST_RUN_DIR}/{svc}-{:016x}.ipcshm", 1u64);
11621158
let _ = std::fs::remove_dir_all(&shm_path);
1159+
// A non-empty directory cannot be reclaimed by stale recovery, so SHM
1160+
// prepare keeps failing and the server must fall back to baseline.
11631161
std::fs::create_dir(&shm_path).expect("create SHM obstruction directory");
1162+
std::fs::write(format!("{shm_path}/keep"), b"x").expect("populate obstruction");
11641163

11651164
let mut server = TestServer::start_with(
11661165
svc,

src/crates/netipc/src/transport/posix.rs

Lines changed: 17 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,6 @@ use crate::protocol::{
1212
use std::collections::HashSet;
1313
use std::ffi::CString;
1414
use std::io;
15-
use std::os::unix::fs::MetadataExt;
1615
use std::os::unix::io::RawFd;
1716
use std::path::{Path, PathBuf};
1817
use std::sync::atomic::{AtomicU64, Ordering};
@@ -602,7 +601,7 @@ impl UdsListener {
602601
let path = build_socket_path(run_dir, service_name)?;
603602

604603
// Stale recovery
605-
match check_and_recover_stale(&path, run_dir_allows_stale_unlink(run_dir)) {
604+
match check_and_recover_stale(&path) {
606605
StaleResult::LiveServer => return Err(UdsError::AddrInUse),
607606
StaleResult::Stale | StaleResult::NotExist => { /* proceed */ }
608607
}
@@ -919,25 +918,17 @@ enum StaleResult {
919918
LiveServer,
920919
}
921920

922-
fn run_dir_allows_stale_unlink(run_dir: &str) -> bool {
923-
let metadata = match std::fs::metadata(run_dir) {
924-
Ok(m) => m,
925-
Err(_) => return false,
926-
};
927-
metadata.is_dir()
928-
&& metadata.uid() == unsafe { libc::geteuid() }
929-
&& metadata.mode() & 0o022 == 0
930-
}
931-
932-
fn check_and_recover_stale(path: &str, allow_stale_unlink: bool) -> StaleResult {
921+
fn check_and_recover_stale(path: &str) -> StaleResult {
933922
if !Path::new(path).exists() {
934923
return StaleResult::NotExist;
935924
}
936925

937926
// Try connecting to see if a live server is there
938927
let probe = unsafe { libc::socket(libc::AF_UNIX, libc::SOCK_SEQPACKET, 0) };
939928
if probe < 0 {
940-
return StaleResult::NotExist;
929+
// Cannot probe liveness (fd exhaustion) — keep the endpoint; the
930+
// subsequent bind() fails instead of risking a live socket.
931+
return StaleResult::LiveServer;
941932
}
942933

943934
let result = match connect_unix(probe, path) {
@@ -946,23 +937,21 @@ fn check_and_recover_stale(path: &str, allow_stale_unlink: bool) -> StaleResult
946937
StaleResult::LiveServer
947938
}
948939
Err(UdsError::Connect(e)) if e == libc::ENOENT => StaleResult::NotExist,
949-
Err(UdsError::Connect(e)) if e == libc::ECONNREFUSED => {
950-
if !allow_stale_unlink {
951-
StaleResult::LiveServer
952-
} else {
953-
// Connection refused means stale; unlink only in a private run dir.
954-
match std::fs::remove_file(path) {
955-
Ok(()) => StaleResult::Stale,
956-
Err(err) if err.kind() == io::ErrorKind::NotFound => StaleResult::NotExist,
957-
Err(_) => StaleResult::LiveServer,
940+
Err(_) => {
941+
// Nothing accepted the connection: a dead server's socket, or a
942+
// foreign file squatting on the endpoint path. Reclaim it.
943+
match std::fs::remove_file(path) {
944+
Ok(()) => StaleResult::Stale,
945+
Err(err) if err.kind() == io::ErrorKind::NotFound => StaleResult::NotExist,
946+
Err(err) if err.raw_os_error() == Some(libc::EISDIR) => {
947+
match std::fs::remove_dir(path) {
948+
Ok(()) => StaleResult::Stale,
949+
Err(_) => StaleResult::LiveServer,
950+
}
958951
}
952+
Err(_) => StaleResult::LiveServer,
959953
}
960954
}
961-
Err(_) => {
962-
// Other errors (EACCES, etc.) — can't determine ownership,
963-
// treat as live to prevent overwriting
964-
StaleResult::LiveServer
965-
}
966955
};
967956

968957
unsafe {

src/crates/netipc/src/transport/posix_tests.rs

Lines changed: 48 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -9,11 +9,7 @@ const TEST_RUN_DIR: &str = "/tmp/nipc_rust_test";
99
const AUTH_TOKEN: u64 = 0xDEADBEEFCAFEBABE;
1010

1111
fn ensure_run_dir() {
12-
use std::os::unix::fs::PermissionsExt;
1312
let _ = std::fs::create_dir_all(TEST_RUN_DIR);
14-
// The stale-unlink guard refuses group/other-writable run dirs; pin the
15-
// mode so the process umask cannot decide test outcomes.
16-
let _ = std::fs::set_permissions(TEST_RUN_DIR, std::fs::Permissions::from_mode(0o700));
1713
}
1814

1915
fn cleanup_socket(service: &str) {
@@ -477,20 +473,14 @@ fn test_request_payload_over_cap() {
477473
// Test 7: Stale socket recovery
478474
// -----------------------------------------------------------------------
479475

480-
#[test]
481-
fn test_stale_recovery() {
482-
ensure_run_dir();
483-
let svc = unique_service("rs_stale");
484-
cleanup_socket(&svc);
485-
486-
let path = format!("{TEST_RUN_DIR}/{svc}.sock");
487-
488-
// Create a stale socket file (bound but not listening)
476+
/// Create a socket file that is bound but not listening — what a crashed
477+
/// server leaves behind.
478+
fn create_stale_socket_file(path: &str) {
489479
unsafe {
490480
let sock = libc::socket(libc::AF_UNIX, libc::SOCK_SEQPACKET, 0);
491481
assert!(sock >= 0);
492482

493-
let c_path = CString::new(path.as_str()).unwrap();
483+
let c_path = CString::new(path).unwrap();
494484
let mut addr: libc::sockaddr_un = std::mem::zeroed();
495485
addr.sun_family = libc::AF_UNIX as libc::sa_family_t;
496486
let path_bytes = c_path.as_bytes_with_nul();
@@ -505,7 +495,16 @@ fn test_stale_recovery() {
505495
// Close without unlink => stale
506496
libc::close(sock);
507497
}
498+
}
508499

500+
#[test]
501+
fn test_stale_recovery() {
502+
ensure_run_dir();
503+
let svc = unique_service("rs_stale");
504+
cleanup_socket(&svc);
505+
506+
let path = format!("{TEST_RUN_DIR}/{svc}.sock");
507+
create_stale_socket_file(&path);
509508
assert!(Path::new(&path).exists(), "stale socket should exist");
510509

511510
// listen should recover it
@@ -515,6 +514,41 @@ fn test_stale_recovery() {
515514
cleanup_socket(&svc);
516515
}
517516

517+
#[test]
518+
fn test_stale_recovery_in_group_writable_run_dir() {
519+
use std::os::unix::fs::PermissionsExt;
520+
// Crash recovery must work regardless of run-dir permissions
521+
// (netdata's systemd unit ships RuntimeDirectoryMode=0775).
522+
let dir = format!("{TEST_RUN_DIR}_0775");
523+
let _ = std::fs::create_dir_all(&dir);
524+
std::fs::set_permissions(&dir, std::fs::Permissions::from_mode(0o775)).expect("chmod 0775");
525+
526+
let svc = unique_service("rs_stale_gw");
527+
let path = format!("{dir}/{svc}.sock");
528+
let _ = std::fs::remove_file(&path);
529+
create_stale_socket_file(&path);
530+
531+
let listener = UdsListener::bind(&dir, &svc, default_server_config())
532+
.expect("stale recovery must work in a group-writable run dir");
533+
drop(listener);
534+
let _ = std::fs::remove_file(&path);
535+
}
536+
537+
#[test]
538+
fn test_foreign_file_at_socket_path_is_reclaimed() {
539+
ensure_run_dir();
540+
let svc = unique_service("rs_foreign");
541+
cleanup_socket(&svc);
542+
543+
let path = format!("{TEST_RUN_DIR}/{svc}.sock");
544+
std::fs::write(&path, b"accidentally copied file").expect("write foreign file");
545+
546+
let listener = UdsListener::bind(TEST_RUN_DIR, &svc, default_server_config())
547+
.expect("listen should silently replace a foreign file at the socket path");
548+
drop(listener);
549+
cleanup_socket(&svc);
550+
}
551+
518552
// -----------------------------------------------------------------------
519553
// Test 8: Disconnect detection
520554
// -----------------------------------------------------------------------

0 commit comments

Comments
 (0)