Skip to content

gfapi: use-after-free race between glfs_open and glfs_fini (no shutdown gate) #4666

@ThalesBarretto

Description

@ThalesBarretto

Summary

A race condition in libgfapi allows glfs_open() and 30+ other entry points to create file descriptors on a graph that glfs_fini() is actively tearing down. The fd references freed inodes and xlator private data, causing use-after-free.

This problem has been discussed informally for years (see #404, "Implement proper cleanup sequence", filed 2018). The drain loop comment at glfs.c:1303"leaked frames may exist, we ignore" — acknowledges the symptom. This issue provides the first formal specification of the race and a proposed fix.

Root Cause

__GLFS_ENTRY_VALIDATE_FS (glfs-internal.h:434) checks only if (!fs). It never checks cleanup_started, migration_in_progress, or any shutdown flag. glfs_fini() does not set any shutdown flag at entry; ctx->cleanup_started is set inside the drain loop (glfs.c:1294) only after call_pool->cnt reaches 0.

Notably, cleanup_started IS checked in 2 of 30+ entry paths (glfs-handleops.c:2208 and glfs-fops.c:6268 for upcalls), showing that the need for a shutdown gate was recognized but only partially applied.

Race Sequence

Thread B (glfs_fini)                    Thread A (glfs_open)
--------------------                    --------------------
enters drain loop (line 1282)
  NO shutdown flag set
  polls call_pool->cnt == 0 -> breaks
  sets ctx->cleanup_started = 1
                                        __GLFS_ENTRY_VALIDATE_FS -> passes (no check)
                                        glfs_active_subvol -> succeeds (no migration)
                                        subvol->winds++
                                        fd_create(loc.inode) <- inode from dying table
                                        syncop_open -> STACK_WIND into dying stack
calls glfs_active_subvol (line 1316)
xlator_notify(PARENT_DOWN)
waits for graph->used == 0
  ...CHILD_DOWN arrives...
inode_table_destroy_all                 <- Thread A's inode is freed here
glusterfs_graph_deactivate              <- Thread A's xlator->private freed here
syncenv_destroy                         <- Thread A's syncop engine destroyed here

Two Distinct Bugs

Bug A -- New fd after fini starts: glfs_open() can succeed during or after the drain loop because no shutdown flag is checked. Creates an fd on a graph entering teardown.

Bug B -- Destruction during use: The drain loop's call_pool->cnt == 0 is a point-in-time check. New operations starting between drain-exit and inode_table_destroy_all are not caught. glfs_fini proceeds unconditionally to destroy inodes, deactivate the graph, and destroy the syncenv.

Why Existing Guards Don't Help

Guard Why it doesn't prevent the race
__GLFS_ENTRY_VALIDATE_FS Checks only fs != NULL
glfs_lock(fs, _gf_true) Waits for migration_in_progress (graph switch), not shutdown
subvol->winds++ glfs_fini sends PARENT_DOWN explicitly and proceeds regardless of wind count
graph->used Boolean set by CHILD_UP/DOWN events, not a reference counter for in-flight ops
call_pool->cnt in drain loop Point-in-time check; new operations can start after it passes

Formal Verification

A TLA+ model of the gfapi session lifecycle proves this race at the protocol level. Key modeling insights:

  1. Non-atomic decomposition: glfs_fini() is split into GlfsFiniStart (sets flag) and GlfsFiniComplete (transitions to DRAINING) -- two separate TLA+ actions, because another thread can act between them.

  2. Ghost variable: stale_fds tracks a semantic property invisible in the C code -- "this fd references a graph that has begun teardown." The NoStaleFds invariant (stale_fds = 0) is violated at depth 5 with 19 distinct states.

  3. The model is conservative: It assumes a flag is set at fini entry. The real C code sets no flag at all -- the actual race window is wider than modeled.

Reproduce the verification:

java -cp /usr/share/java/tla2tools.jar tlc2.TLC gfapi_session -workers auto
# -> Invariant NoStaleFds is violated.

Affected Consumers

Any multi-threaded gfapi consumer is vulnerable: QEMU (iothreads), NFS-Ganesha (thread pool), custom gfapi applications. Samba's fork-per-client model mitigates in the common case, but Samba's directory fd handling interacts with this (see Samba MR !4474).

Proposed Fix

Add a shutting_down flag to struct glfs, set it at the top of pub_glfs_fini (before the drain loop), and check it at two choke points:

  1. __GLFS_ENTRY_VALIDATE_FS -- fast-path rejection (no mutex; safe because the flag is monotonic)
  2. priv_glfs_active_subvol -- authoritative gate (under fs->mutex)

Since glfs_fini itself is blocked by the new gate, replace its glfs_active_subvol call with a direct fs->active_subvol read under mutex, mirroring the old_subvol cleanup from priv_glfs_active_subvol.

Cleanup paths (glfs_fd_destroy, glfs_subvol_done) use glfs_lock with _gf_false and are unaffected.

This addresses Bug A (no new operations after fini starts). Bug B (in-flight operations between drain-loop exit and graph destruction) is a separate fix requiring a barrier or re-check mechanism.

Patch: branch fix-gfapi-fini-open-race on ThalesBarretto/glusterfs

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions