Summary
A race condition in libgfapi allows glfs_open() and 30+ other entry points to create file descriptors on a graph that glfs_fini() is actively tearing down. The fd references freed inodes and xlator private data, causing use-after-free.
This problem has been discussed informally for years (see #404, "Implement proper cleanup sequence", filed 2018). The drain loop comment at glfs.c:1303 — "leaked frames may exist, we ignore" — acknowledges the symptom. This issue provides the first formal specification of the race and a proposed fix.
Root Cause
__GLFS_ENTRY_VALIDATE_FS (glfs-internal.h:434) checks only if (!fs). It never checks cleanup_started, migration_in_progress, or any shutdown flag. glfs_fini() does not set any shutdown flag at entry; ctx->cleanup_started is set inside the drain loop (glfs.c:1294) only after call_pool->cnt reaches 0.
Notably, cleanup_started IS checked in 2 of 30+ entry paths (glfs-handleops.c:2208 and glfs-fops.c:6268 for upcalls), showing that the need for a shutdown gate was recognized but only partially applied.
Race Sequence
Thread B (glfs_fini) Thread A (glfs_open)
-------------------- --------------------
enters drain loop (line 1282)
NO shutdown flag set
polls call_pool->cnt == 0 -> breaks
sets ctx->cleanup_started = 1
__GLFS_ENTRY_VALIDATE_FS -> passes (no check)
glfs_active_subvol -> succeeds (no migration)
subvol->winds++
fd_create(loc.inode) <- inode from dying table
syncop_open -> STACK_WIND into dying stack
calls glfs_active_subvol (line 1316)
xlator_notify(PARENT_DOWN)
waits for graph->used == 0
...CHILD_DOWN arrives...
inode_table_destroy_all <- Thread A's inode is freed here
glusterfs_graph_deactivate <- Thread A's xlator->private freed here
syncenv_destroy <- Thread A's syncop engine destroyed here
Two Distinct Bugs
Bug A -- New fd after fini starts: glfs_open() can succeed during or after the drain loop because no shutdown flag is checked. Creates an fd on a graph entering teardown.
Bug B -- Destruction during use: The drain loop's call_pool->cnt == 0 is a point-in-time check. New operations starting between drain-exit and inode_table_destroy_all are not caught. glfs_fini proceeds unconditionally to destroy inodes, deactivate the graph, and destroy the syncenv.
Why Existing Guards Don't Help
| Guard |
Why it doesn't prevent the race |
__GLFS_ENTRY_VALIDATE_FS |
Checks only fs != NULL |
glfs_lock(fs, _gf_true) |
Waits for migration_in_progress (graph switch), not shutdown |
subvol->winds++ |
glfs_fini sends PARENT_DOWN explicitly and proceeds regardless of wind count |
graph->used |
Boolean set by CHILD_UP/DOWN events, not a reference counter for in-flight ops |
call_pool->cnt in drain loop |
Point-in-time check; new operations can start after it passes |
Formal Verification
A TLA+ model of the gfapi session lifecycle proves this race at the protocol level. Key modeling insights:
-
Non-atomic decomposition: glfs_fini() is split into GlfsFiniStart (sets flag) and GlfsFiniComplete (transitions to DRAINING) -- two separate TLA+ actions, because another thread can act between them.
-
Ghost variable: stale_fds tracks a semantic property invisible in the C code -- "this fd references a graph that has begun teardown." The NoStaleFds invariant (stale_fds = 0) is violated at depth 5 with 19 distinct states.
-
The model is conservative: It assumes a flag is set at fini entry. The real C code sets no flag at all -- the actual race window is wider than modeled.
Reproduce the verification:
java -cp /usr/share/java/tla2tools.jar tlc2.TLC gfapi_session -workers auto
# -> Invariant NoStaleFds is violated.
Affected Consumers
Any multi-threaded gfapi consumer is vulnerable: QEMU (iothreads), NFS-Ganesha (thread pool), custom gfapi applications. Samba's fork-per-client model mitigates in the common case, but Samba's directory fd handling interacts with this (see Samba MR !4474).
Proposed Fix
Add a shutting_down flag to struct glfs, set it at the top of pub_glfs_fini (before the drain loop), and check it at two choke points:
__GLFS_ENTRY_VALIDATE_FS -- fast-path rejection (no mutex; safe because the flag is monotonic)
priv_glfs_active_subvol -- authoritative gate (under fs->mutex)
Since glfs_fini itself is blocked by the new gate, replace its glfs_active_subvol call with a direct fs->active_subvol read under mutex, mirroring the old_subvol cleanup from priv_glfs_active_subvol.
Cleanup paths (glfs_fd_destroy, glfs_subvol_done) use glfs_lock with _gf_false and are unaffected.
This addresses Bug A (no new operations after fini starts). Bug B (in-flight operations between drain-loop exit and graph destruction) is a separate fix requiring a barrier or re-check mechanism.
Patch: branch fix-gfapi-fini-open-race on ThalesBarretto/glusterfs
Related Issues
Summary
A race condition in libgfapi allows
glfs_open()and 30+ other entry points to create file descriptors on a graph thatglfs_fini()is actively tearing down. The fd references freed inodes and xlator private data, causing use-after-free.This problem has been discussed informally for years (see #404, "Implement proper cleanup sequence", filed 2018). The drain loop comment at
glfs.c:1303— "leaked frames may exist, we ignore" — acknowledges the symptom. This issue provides the first formal specification of the race and a proposed fix.Root Cause
__GLFS_ENTRY_VALIDATE_FS(glfs-internal.h:434) checks onlyif (!fs). It never checkscleanup_started,migration_in_progress, or any shutdown flag.glfs_fini()does not set any shutdown flag at entry;ctx->cleanup_startedis set inside the drain loop (glfs.c:1294) only aftercall_pool->cntreaches 0.Notably,
cleanup_startedIS checked in 2 of 30+ entry paths (glfs-handleops.c:2208andglfs-fops.c:6268for upcalls), showing that the need for a shutdown gate was recognized but only partially applied.Race Sequence
Two Distinct Bugs
Bug A -- New fd after fini starts:
glfs_open()can succeed during or after the drain loop because no shutdown flag is checked. Creates an fd on a graph entering teardown.Bug B -- Destruction during use: The drain loop's
call_pool->cnt == 0is a point-in-time check. New operations starting between drain-exit andinode_table_destroy_allare not caught.glfs_finiproceeds unconditionally to destroy inodes, deactivate the graph, and destroy the syncenv.Why Existing Guards Don't Help
__GLFS_ENTRY_VALIDATE_FSfs != NULLglfs_lock(fs, _gf_true)migration_in_progress(graph switch), not shutdownsubvol->winds++glfs_finisends PARENT_DOWN explicitly and proceeds regardless of wind countgraph->usedcall_pool->cntin drain loopFormal Verification
A TLA+ model of the gfapi session lifecycle proves this race at the protocol level. Key modeling insights:
Non-atomic decomposition:
glfs_fini()is split intoGlfsFiniStart(sets flag) andGlfsFiniComplete(transitions to DRAINING) -- two separate TLA+ actions, because another thread can act between them.Ghost variable:
stale_fdstracks a semantic property invisible in the C code -- "this fd references a graph that has begun teardown." TheNoStaleFdsinvariant (stale_fds = 0) is violated at depth 5 with 19 distinct states.The model is conservative: It assumes a flag is set at fini entry. The real C code sets no flag at all -- the actual race window is wider than modeled.
Reproduce the verification:
java -cp /usr/share/java/tla2tools.jar tlc2.TLC gfapi_session -workers auto # -> Invariant NoStaleFds is violated.Affected Consumers
Any multi-threaded gfapi consumer is vulnerable: QEMU (iothreads), NFS-Ganesha (thread pool), custom gfapi applications. Samba's fork-per-client model mitigates in the common case, but Samba's directory fd handling interacts with this (see Samba MR !4474).
Proposed Fix
Add a
shutting_downflag tostruct glfs, set it at the top ofpub_glfs_fini(before the drain loop), and check it at two choke points:__GLFS_ENTRY_VALIDATE_FS-- fast-path rejection (no mutex; safe because the flag is monotonic)priv_glfs_active_subvol-- authoritative gate (underfs->mutex)Since
glfs_finiitself is blocked by the new gate, replace itsglfs_active_subvolcall with a directfs->active_subvolread under mutex, mirroring theold_subvolcleanup frompriv_glfs_active_subvol.Cleanup paths (
glfs_fd_destroy,glfs_subvol_done) useglfs_lockwith_gf_falseand are unaffected.This addresses Bug A (no new operations after fini starts). Bug B (in-flight operations between drain-loop exit and graph destruction) is a separate fix requiring a barrier or re-check mechanism.
Patch: branch
fix-gfapi-fini-open-raceon ThalesBarretto/glusterfsRelated Issues
afr_notify()due to access after free inafr_has_quorum#4644 --afr_notifynull deref during fini (open)glfs_finitakes too much time (closed)