gfapi: use-after-free race between glfs_open and glfs_fini (no shutdown gate)

## Summary

A race condition in libgfapi allows `glfs_open()` and 30+ other entry points to create file descriptors on a graph that `glfs_fini()` is actively tearing down. The fd references freed inodes and xlator private data, causing use-after-free.

This problem has been discussed informally for years (see #404, "Implement proper cleanup sequence", filed 2018). The drain loop comment at `glfs.c:1303` — *"leaked frames may exist, we ignore"* — acknowledges the symptom. This issue provides the first formal specification of the race and a proposed fix.

## Root Cause

`__GLFS_ENTRY_VALIDATE_FS` (`glfs-internal.h:434`) checks only `if (!fs)`. It never checks `cleanup_started`, `migration_in_progress`, or any shutdown flag. `glfs_fini()` does not set any shutdown flag at entry; `ctx->cleanup_started` is set **inside** the drain loop (`glfs.c:1294`) only after `call_pool->cnt` reaches 0.

Notably, `cleanup_started` IS checked in 2 of 30+ entry paths (`glfs-handleops.c:2208` and `glfs-fops.c:6268` for upcalls), showing that the need for a shutdown gate was recognized but only partially applied.

## Race Sequence

```
Thread B (glfs_fini)                    Thread A (glfs_open)
--------------------                    --------------------
enters drain loop (line 1282)
  NO shutdown flag set
  polls call_pool->cnt == 0 -> breaks
  sets ctx->cleanup_started = 1
                                        __GLFS_ENTRY_VALIDATE_FS -> passes (no check)
                                        glfs_active_subvol -> succeeds (no migration)
                                        subvol->winds++
                                        fd_create(loc.inode) <- inode from dying table
                                        syncop_open -> STACK_WIND into dying stack
calls glfs_active_subvol (line 1316)
xlator_notify(PARENT_DOWN)
waits for graph->used == 0
  ...CHILD_DOWN arrives...
inode_table_destroy_all                 <- Thread A's inode is freed here
glusterfs_graph_deactivate              <- Thread A's xlator->private freed here
syncenv_destroy                         <- Thread A's syncop engine destroyed here
```

## Two Distinct Bugs

**Bug A -- New fd after fini starts:** `glfs_open()` can succeed during or after the drain loop because no shutdown flag is checked. Creates an fd on a graph entering teardown.

**Bug B -- Destruction during use:** The drain loop's `call_pool->cnt == 0` is a point-in-time check. New operations starting between drain-exit and `inode_table_destroy_all` are not caught. `glfs_fini` proceeds unconditionally to destroy inodes, deactivate the graph, and destroy the syncenv.

## Why Existing Guards Don't Help

| Guard | Why it doesn't prevent the race |
|-------|-------------------------------|
| `__GLFS_ENTRY_VALIDATE_FS` | Checks only `fs != NULL` |
| `glfs_lock(fs, _gf_true)` | Waits for `migration_in_progress` (graph switch), not shutdown |
| `subvol->winds++` | `glfs_fini` sends PARENT_DOWN explicitly and proceeds regardless of wind count |
| `graph->used` | Boolean set by CHILD_UP/DOWN events, not a reference counter for in-flight ops |
| `call_pool->cnt` in drain loop | Point-in-time check; new operations can start after it passes |

## Formal Verification

A TLA+ model of the gfapi session lifecycle proves this race at the protocol level. Key modeling insights:

1. **Non-atomic decomposition:** `glfs_fini()` is split into `GlfsFiniStart` (sets flag) and `GlfsFiniComplete` (transitions to DRAINING) -- two separate TLA+ actions, because another thread can act between them.

2. **Ghost variable:** `stale_fds` tracks a semantic property invisible in the C code -- "this fd references a graph that has begun teardown." The `NoStaleFds` invariant (`stale_fds = 0`) is violated at depth 5 with 19 distinct states.

3. **The model is conservative:** It assumes a flag is set at fini entry. The real C code sets no flag at all -- the actual race window is wider than modeled.

Reproduce the verification:
```bash
java -cp /usr/share/java/tla2tools.jar tlc2.TLC gfapi_session -workers auto
# -> Invariant NoStaleFds is violated.
```

## Affected Consumers

Any multi-threaded gfapi consumer is vulnerable: QEMU (iothreads), NFS-Ganesha (thread pool), custom gfapi applications. Samba's fork-per-client model mitigates in the common case, but Samba's directory fd handling interacts with this (see Samba MR !4474).

## Proposed Fix

Add a `shutting_down` flag to `struct glfs`, set it at the **top** of `pub_glfs_fini` (before the drain loop), and check it at two choke points:

1. `__GLFS_ENTRY_VALIDATE_FS` -- fast-path rejection (no mutex; safe because the flag is monotonic)
2. `priv_glfs_active_subvol` -- authoritative gate (under `fs->mutex`)

Since `glfs_fini` itself is blocked by the new gate, replace its `glfs_active_subvol` call with a direct `fs->active_subvol` read under mutex, mirroring the `old_subvol` cleanup from `priv_glfs_active_subvol`.

Cleanup paths (`glfs_fd_destroy`, `glfs_subvol_done`) use `glfs_lock` with `_gf_false` and are unaffected.

This addresses Bug A (no new operations after fini starts). Bug B (in-flight operations between drain-loop exit and graph destruction) is a separate fix requiring a barrier or re-check mechanism.

Patch: branch `fix-gfapi-fini-open-race` on ThalesBarretto/glusterfs

## Related Issues

- #404 -- "Implement proper cleanup sequence" (closed stale, 2018)
- #4644 -- `afr_notify` null deref during fini (open)
- #2606 -- `glfs_fini` takes too much time (closed)
- PR #4527 -- Don't fire callback during cleanup (open)

Guard	Why it doesn't prevent the race
`__GLFS_ENTRY_VALIDATE_FS`	Checks only `fs != NULL`
`glfs_lock(fs, _gf_true)`	Waits for `migration_in_progress` (graph switch), not shutdown
`subvol->winds++`	`glfs_fini` sends PARENT_DOWN explicitly and proceeds regardless of wind count
`graph->used`	Boolean set by CHILD_UP/DOWN events, not a reference counter for in-flight ops
`call_pool->cnt` in drain loop	Point-in-time check; new operations can start after it passes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gfapi: use-after-free race between glfs_open and glfs_fini (no shutdown gate) #4666

Summary

Root Cause

Race Sequence

Two Distinct Bugs

Why Existing Guards Don't Help

Formal Verification

Affected Consumers

Proposed Fix

Related Issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

gfapi: use-after-free race between glfs_open and glfs_fini (no shutdown gate) #4666

Description

Summary

Root Cause

Race Sequence

Two Distinct Bugs

Why Existing Guards Don't Help

Formal Verification

Affected Consumers

Proposed Fix

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions