Skip to content

Commit c73c20d

Browse files
committed
Use descriptor-relative stale cleanup
1 parent 0a3ec7a commit c73c20d

4 files changed

Lines changed: 124 additions & 52 deletions

File tree

.agents/sow/done/SOW-0010-20260602-static-analysis-finding-cleanup.md

Lines changed: 22 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -730,26 +730,28 @@ Evidence:
730730
stale cleanup `unlink()` calls in
731731
`src/libnetdata/netipc/src/transport/posix/netipc_shm.c` and
732732
`src/libnetdata/netipc/src/transport/posix/netipc_uds.c`.
733-
- The stale cleanup race cannot be fully eliminated with POSIX path unlink
734-
semantics while preserving automatic stale socket/shared-memory cleanup. The
735-
implemented product constraint is that automatic stale unlink only runs in a
736-
process-owned run directory that is not group/world writable; the remaining
737-
CodeQL finding is intentionally narrow and source-documented.
738-
- The previous inline `// codeql[cpp/toctou-race-condition]` comments did not
739-
affect GitHub SARIF because the CodeQL configuration did not include the
740-
C/C++ `AlertSuppression.ql` query.
733+
- The previous full-path stale cleanup used `lstat()` before `unlink()`. Even
734+
with the process-owned private run directory requirement, CodeQL correctly
735+
treats that pattern as a path check/use race.
736+
- A local Unix socket probe showed that connecting to a regular file at the
737+
socket path returns `ECONNREFUSED`, so blindly unlinking after a failed
738+
connect would be unsafe; the socket/file type and same-inode checks still
739+
have to exist.
741740

742741
Why previous validation missed it:
743742

744743
- The local environment does not have the CodeQL CLI installed, so the only
745744
authoritative CodeQL validation point is the GitHub run after push.
746745
- The prior local validation proved the code built and local tools were clean,
747-
but it did not prove CodeQL SARIF suppression handling.
746+
but the remote CodeQL query source showed it specifically models full-path
747+
`lstat()` guarding full-path `unlink()`.
748748

749749
Repair plan:
750750

751-
- Keep the CodeQL rule enabled globally and add the C/C++ alert suppression
752-
query so the two reviewed stale-unlink suppressions are represented in SARIF.
751+
- Keep the CodeQL rule enabled globally and remove the full-path
752+
`lstat()`/`unlink()` stale cleanup pattern. POSIX stale cleanup now opens the
753+
validated private run directory and performs descriptor-relative
754+
`fstatat()`/`unlinkat()` on generated entry names.
753755
- Compile the remaining addition-overflow guards only on 32-bit `size_t`
754756
platforms, where they are real portability checks, so the 64-bit CodeQL build
755757
no longer sees constant-false comparisons.
@@ -763,8 +765,8 @@ Validation:
763765
passed.
764766
- `cmake --build build-static --parallel --target netipc_protocol netipc_uds netipc_shm netipc_service`
765767
passed.
766-
- `clang-tidy -p build-static` on the C library sources passed with the
767-
repository's existing warning-only baseline.
768+
- `clang-tidy -p build-static` on the changed POSIX transport sources passed
769+
with the repository's existing warning-only baseline.
768770
- The first attempted `cppcheck --project=build-static/compile_commands.json`
769771
validation failed because it scanned the whole compile database and surfaced
770772
existing test/benchmark findings unrelated to this SOW; that is not the
@@ -778,6 +780,11 @@ Validation:
778780
exited 0, and the SARIF result count was 0.
779781
- `codacy-analysis analyze . --install-dependencies --output-format json --output /tmp/plugin-ipc-codacy-final.json --parallel-tools 2 --tool-timeout 900000`
780782
exited 0 with 0 issues and 0 tool errors.
783+
- After the descriptor-relative stale cleanup change, `make`,
784+
`cmake --build build-static --parallel --target netipc_protocol netipc_uds netipc_shm netipc_service`,
785+
focused `clang-tidy`, workflow-equivalent `cppcheck`,
786+
`flawfinder --minlevel=5 --error-level=5 src/libnetdata/netipc tests`, and
787+
`/usr/bin/ctest --test-dir build --output-on-failure` all passed again.
781788

782789
Artifact updates:
783790

@@ -787,8 +794,8 @@ Artifact updates:
787794
`project-*` skill.
788795
- Specs: no update needed; the stale cleanup behavior was already documented in
789796
the prior regression update.
790-
- End-user/operator docs: no update needed; this follow-up only makes CodeQL
791-
suppression handling explicit and preserves the already documented behavior.
797+
- End-user/operator docs: no update needed; this follow-up preserves the
798+
already documented private runtime directory behavior.
792799
- End-user/operator skills: no update needed; the public integrator skill was
793800
already updated for the private runtime directory requirement.
794801
- SOW lifecycle: this residual remote CodeQL regression was appended after the

.github/codeql.yml

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,6 @@ queries:
66
- uses: security-extended
77
- uses: security-and-quality
88

9-
packs:
10-
c-cpp:
11-
- codeql/cpp-queries:AlertSuppression.ql
12-
139
query-filters:
1410
# Current baseline rule: CodeQL queries are enabled only when this branch has
1511
# no current alerts for them, or when the query is fixed in the same change.

src/libnetdata/netipc/src/transport/posix/netipc_shm.c

Lines changed: 50 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -55,16 +55,32 @@ static int validate_service_name(const char *name)
5555
return 0;
5656
}
5757

58+
static int build_shm_name(char *dst, size_t dst_len,
59+
const char *service_name,
60+
uint64_t session_id)
61+
{
62+
if (validate_service_name(service_name) < 0)
63+
return -2; /* invalid service name */
64+
65+
int n = snprintf(dst, dst_len, "%s-%016llx.ipcshm",
66+
service_name, (unsigned long long)session_id);
67+
if (n < 0 || (size_t)n >= dst_len)
68+
return -1;
69+
return 0;
70+
}
71+
5872
/* Build per-session SHM file path: {run_dir}/{service_name}-{session_id:016x}.ipcshm */
5973
static int build_shm_path(char *dst, size_t dst_len,
6074
const char *run_dir, const char *service_name,
6175
uint64_t session_id)
6276
{
63-
if (validate_service_name(service_name) < 0)
64-
return -2; /* invalid service name */
77+
char shm_name[256];
78+
int name_rc = build_shm_name(shm_name, sizeof(shm_name), service_name,
79+
session_id);
80+
if (name_rc < 0)
81+
return name_rc;
6582

66-
int n = snprintf(dst, dst_len, "%s/%s-%016llx.ipcshm",
67-
run_dir, service_name, (unsigned long long)session_id);
83+
int n = snprintf(dst, dst_len, "%s/%s", run_dir, shm_name);
6884
if (n < 0 || (size_t)n >= dst_len)
6985
return -1;
7086
return 0;
@@ -153,28 +169,28 @@ static inline uint32_t *shm_u32_ptr(void *base, int offset)
153169
* -1 = doesn't exist
154170
* -2 = exists but undersized / invalid (treated as stale, unlinked)
155171
*/
156-
static int unlink_same_file(const char *path, const struct stat *expected,
172+
static int unlink_same_file(int dir_fd, const char *name, const struct stat *expected,
157173
bool allow_stale_unlink)
158174
{
159175
if (!allow_stale_unlink)
160176
return -1;
161177

162178
struct stat current;
163-
if (lstat(path, &current) != 0)
179+
if (fstatat(dir_fd, name, &current, AT_SYMLINK_NOFOLLOW) != 0)
164180
return (errno == ENOENT) ? 0 : -1;
165181

166182
if (current.st_dev != expected->st_dev || current.st_ino != expected->st_ino)
167183
return -1;
168184

169-
/* Stale recovery only unlinks same-inode files in a private run directory. */
170-
// codeql[cpp/toctou-race-condition]
171-
if (unlink(path) == 0 || errno == ENOENT)
185+
/* Stale recovery only unlinks same-inode entries in a private run directory. */
186+
if (unlinkat(dir_fd, name, 0) == 0 || errno == ENOENT)
172187
return 0;
173188

174189
return -1;
175190
}
176191

177-
static int check_shm_stale(const char *path, bool allow_stale_unlink)
192+
static int check_shm_stale(const char *path, int dir_fd, const char *name,
193+
bool allow_stale_unlink)
178194
{
179195
#ifdef O_NOFOLLOW
180196
int fd = open(path, O_RDONLY | O_NOFOLLOW);
@@ -203,21 +219,21 @@ static int check_shm_stale(const char *path, bool allow_stale_unlink)
203219
/* Must be at least header-sized to inspect. */
204220
if (st.st_size < (off_t)NIPC_SHM_HEADER_LEN) {
205221
close(fd);
206-
return (unlink_same_file(path, &st, allow_stale_unlink) == 0) ? -2 : 1;
222+
return (unlink_same_file(dir_fd, name, &st, allow_stale_unlink) == 0) ? -2 : 1;
207223
}
208224

209225
void *map = mmap(NULL, NIPC_SHM_HEADER_LEN, PROT_READ, MAP_SHARED, fd, 0);
210226
close(fd);
211227
if (map == MAP_FAILED) {
212-
return (unlink_same_file(path, &st, allow_stale_unlink) == 0) ? -2 : 1;
228+
return (unlink_same_file(dir_fd, name, &st, allow_stale_unlink) == 0) ? -2 : 1;
213229
}
214230

215231
const nipc_shm_region_header_t *hdr = (const nipc_shm_region_header_t *)map;
216232

217233
/* Validate magic first. */
218234
if (hdr->magic != NIPC_SHM_REGION_MAGIC) {
219235
munmap(map, NIPC_SHM_HEADER_LEN);
220-
return (unlink_same_file(path, &st, allow_stale_unlink) == 0) ? -2 : 1;
236+
return (unlink_same_file(dir_fd, name, &st, allow_stale_unlink) == 0) ? -2 : 1;
221237
}
222238

223239
int32_t owner = hdr->owner_pid;
@@ -229,7 +245,7 @@ static int check_shm_stale(const char *path, bool allow_stale_unlink)
229245
}
230246

231247
/* Dead owner or zero generation (uninitialized/legacy) — stale */
232-
return (unlink_same_file(path, &st, allow_stale_unlink) == 0) ? 0 : 1;
248+
return (unlink_same_file(dir_fd, name, &st, allow_stale_unlink) == 0) ? 0 : 1;
233249
}
234250

235251
/* ------------------------------------------------------------------ */
@@ -258,6 +274,14 @@ nipc_shm_error_t nipc_shm_server_create(const char *run_dir,
258274
if (path_rc < 0)
259275
return NIPC_SHM_ERR_PATH_TOO_LONG;
260276

277+
char shm_name[256];
278+
int name_rc = build_shm_name(shm_name, sizeof(shm_name), service_name,
279+
session_id);
280+
if (name_rc == -2)
281+
return NIPC_SHM_ERR_BAD_PARAM;
282+
if (name_rc < 0)
283+
return NIPC_SHM_ERR_PATH_TOO_LONG;
284+
261285
/* Round capacities up to alignment. */
262286
req_capacity = align64(req_capacity);
263287
resp_capacity = align64(resp_capacity);
@@ -277,7 +301,15 @@ nipc_shm_error_t nipc_shm_server_create(const char *run_dir,
277301
/* If O_EXCL failed because file exists, do stale recovery and retry. */
278302
if (fd < 0 && errno == EEXIST) {
279303
bool allow_stale_unlink = run_dir_allows_stale_unlink(run_dir);
280-
int stale = check_shm_stale(path, allow_stale_unlink);
304+
int dir_fd = allow_stale_unlink
305+
? open(run_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC)
306+
: -1;
307+
if (dir_fd < 0)
308+
allow_stale_unlink = false;
309+
310+
int stale = check_shm_stale(path, dir_fd, shm_name, allow_stale_unlink);
311+
if (dir_fd >= 0)
312+
close(dir_fd);
281313
if (stale == 1)
282314
return NIPC_SHM_ERR_ADDR_IN_USE;
283315
/* Stale file was unlinked, retry create */
@@ -797,7 +829,8 @@ void nipc_shm_cleanup_stale(const char *run_dir, const char *service_name)
797829
if (!dir)
798830
return;
799831

800-
bool allow_stale_unlink = run_dir_allows_stale_unlink(run_dir);
832+
int dir_fd = dirfd(dir);
833+
bool allow_stale_unlink = dir_fd >= 0 && run_dir_allows_stale_unlink(run_dir);
801834

802835
struct dirent *ent;
803836
while ((ent = readdir(dir)) != NULL) {
@@ -819,7 +852,7 @@ void nipc_shm_cleanup_stale(const char *run_dir, const char *service_name)
819852

820853
/* check_shm_stale unlinks stale files and returns:
821854
* 0 = stale (unlinked), +1 = live, -1 = gone, -2 = invalid (unlinked) */
822-
check_shm_stale(path, allow_stale_unlink);
855+
check_shm_stale(path, dir_fd, ent->d_name, allow_stale_unlink);
823856
}
824857

825858
closedir(dir);

src/libnetdata/netipc/src/transport/posix/netipc_uds.c

Lines changed: 52 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -62,14 +62,27 @@ static int validate_service_name(const char *name)
6262
return 0;
6363
}
6464

65+
static int build_socket_name(char *dst, size_t dst_len, const char *service_name)
66+
{
67+
if (validate_service_name(service_name) < 0)
68+
return -2; /* invalid service name */
69+
70+
int n = snprintf(dst, dst_len, "%s.sock", service_name);
71+
if (n < 0 || (size_t)n >= dst_len)
72+
return -1;
73+
return 0;
74+
}
75+
6576
/* Build socket path into dst, return 0 on success, -1 if too long. */
6677
static int build_socket_path(char *dst, size_t dst_len,
6778
const char *run_dir, const char *service_name)
6879
{
69-
if (validate_service_name(service_name) < 0)
70-
return -2; /* invalid service name */
80+
char socket_name[sizeof(((struct sockaddr_un *)0)->sun_path)];
81+
int name_rc = build_socket_name(socket_name, sizeof(socket_name), service_name);
82+
if (name_rc < 0)
83+
return name_rc;
7184

72-
int n = snprintf(dst, dst_len, "%s/%s.sock", run_dir, service_name);
85+
int n = snprintf(dst, dst_len, "%s/%s", run_dir, socket_name);
7386
if (n < 0 || (size_t)n >= dst_len)
7487
return -1; /* path too long */
7588
return 0;
@@ -506,28 +519,42 @@ static nipc_uds_error_t server_handshake(int fd,
506519
/* Stale endpoint recovery */
507520
/* ------------------------------------------------------------------ */
508521

509-
static int unlink_stale_socket_path(const char *path, bool allow_stale_unlink)
522+
static int unlink_stale_socket_path(const char *run_dir,
523+
const char *socket_name,
524+
bool allow_stale_unlink)
510525
{
511526
if (!allow_stale_unlink)
512527
return -1;
513528

529+
int dir_fd = open(run_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
530+
if (dir_fd < 0)
531+
return -1;
532+
514533
struct stat st;
515-
if (lstat(path, &st) != 0)
516-
return (errno == ENOENT) ? 0 : -1;
534+
if (fstatat(dir_fd, socket_name, &st, AT_SYMLINK_NOFOLLOW) != 0) {
535+
int ret = (errno == ENOENT) ? 0 : -1;
536+
close(dir_fd);
537+
return ret;
538+
}
517539

518-
if (!S_ISSOCK(st.st_mode))
540+
if (!S_ISSOCK(st.st_mode)) {
541+
close(dir_fd);
519542
return -1;
543+
}
520544

521-
/* Stale recovery only unlinks entries in a private run directory. */
522-
// codeql[cpp/toctou-race-condition]
523-
if (unlink(path) == 0 || errno == ENOENT)
524-
return 0;
545+
int ret = -1;
546+
if (unlinkat(dir_fd, socket_name, 0) == 0 || errno == ENOENT)
547+
ret = 0;
525548

526-
return -1;
549+
close(dir_fd);
550+
return ret;
527551
}
528552

529553
/* Returns: 0 = stale (unlinked), 1 = live server, -1 = doesn't exist */
530-
static int check_and_recover_stale(const char *path, bool allow_stale_unlink)
554+
static int check_and_recover_stale(const char *run_dir,
555+
const char *socket_name,
556+
const char *path,
557+
bool allow_stale_unlink)
531558
{
532559
/* Try connecting to check if a live server is there */
533560
int probe = socket(AF_UNIX, SOCK_SEQPACKET, 0);
@@ -548,11 +575,12 @@ static int check_and_recover_stale(const char *path, bool allow_stale_unlink)
548575
int saved_errno = errno;
549576
close(probe);
550577
/* Only ECONNREFUSED proves a stale socket path. Other errors
551-
* (including regular files at the path) must not remove anything. */
578+
* must not remove anything. */
552579
if (saved_errno == ENOENT) {
553580
ret = -1;
554581
} else if (saved_errno == ECONNREFUSED) {
555-
ret = (unlink_stale_socket_path(path, allow_stale_unlink) == 0) ? 0 : 1;
582+
ret = (unlink_stale_socket_path(run_dir, socket_name,
583+
allow_stale_unlink) == 0) ? 0 : 1;
556584
} else {
557585
/* Can't determine ownership — treat as live to prevent overwriting */
558586
ret = 1;
@@ -581,9 +609,17 @@ nipc_uds_error_t nipc_uds_listen(const char *run_dir,
581609
if (path_rc < 0)
582610
return NIPC_UDS_ERR_PATH_TOO_LONG;
583611

612+
char socket_name[sizeof(((struct sockaddr_un *)0)->sun_path)];
613+
int name_rc = build_socket_name(socket_name, sizeof(socket_name), service_name);
614+
if (name_rc == -2)
615+
return NIPC_UDS_ERR_BAD_PARAM;
616+
if (name_rc < 0)
617+
return NIPC_UDS_ERR_PATH_TOO_LONG;
618+
584619
/* Stale recovery */
585620
bool allow_stale_unlink = run_dir_allows_stale_unlink(run_dir);
586-
int stale = check_and_recover_stale(path, allow_stale_unlink);
621+
int stale = check_and_recover_stale(run_dir, socket_name, path,
622+
allow_stale_unlink);
587623
if (stale == 1)
588624
return NIPC_UDS_ERR_ADDR_IN_USE;
589625

0 commit comments

Comments
 (0)