Commit e65daa8
vfs: fix MemFS cloneMu recursive RLock deadlock
Three MemFS methods deadlock against a concurrent CrashClone because
they acquire cloneMu.RLock() recursively. Go's sync.RWMutex is
writer-preferring: once CrashClone is waiting for the write lock, any
new RLock call blocks behind it. When a method that already holds RLock
calls another method that also takes RLock, the nested RLock waits for
the pending writer while the writer waits for the outer RLock to be
released, so neither can proceed.
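The blocking behavior described above can be reproduced in isolation. The following is a minimal sketch, not code from this commit: nestedRLockWouldBlock is a hypothetical helper, the anonymous goroutine merely stands in for CrashClone, and TryRLock (Go 1.18+) is used as a probe in place of a real nested RLock so that the program terminates instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nestedRLockWouldBlock reports whether a second read-lock attempt fails
// while the caller already holds a read lock and a writer is queued.
// Hypothetical helper for illustration only.
func nestedRLockWouldBlock() bool {
	var mu sync.RWMutex

	mu.RLock() // outer read lock, as taken at method entry

	writerDone := make(chan struct{})
	go func() {
		mu.Lock() // stands in for CrashClone; waits for the outer reader
		mu.Unlock()
		close(writerDone)
	}()

	// Poll until the writer is queued. TryRLock succeeds as long as no
	// writer is pending and fails as soon as one is. A plain RLock here
	// would block forever at that point.
	blocked := false
	for i := 0; i < 1000; i++ {
		if !mu.TryRLock() {
			blocked = true
			break
		}
		mu.RUnlock()
		time.Sleep(time.Millisecond)
	}

	mu.RUnlock() // release the outer lock so the writer can finish
	<-writerDone
	return blocked
}

func main() {
	fmt.Println("nested RLock would block:", nestedRLockWouldBlock())
}
```

Replacing the TryRLock probe with a real RLock turns this sketch into the deadlock itself: the writer never gets the lock because the outer reader never reaches its RUnlock.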
The three affected methods and their recursive locking chains:
- ReuseForWrite: takes RLock, then calls Rename, which takes RLock
- Link: takes RLock at entry, then takes RLock again inside the inner
  walk callback (copy-paste of the locking boilerplate)
- Lock: takes RLock, then calls Create, which takes RLock
The bug was introduced in a70d5b3 ("vfs: redesign MemFS strict mode",
2024-08-26), which replaced the old SetIgnoreSyncs/ResetToSyncedState
approach with the CrashClone model. That commit systematically added
cloneMu.RLock() to every mutation method, but did not trace call graphs
to catch methods that compose other MemFS methods.
The deadlock has existed for ~20 months as a sporadic CI flake (#4677,
#5199, #5327). CrashClone fires only once per test iteration (when the
error injector's write counter hits zero), so the contention window is
extremely narrow on fast hardware. It surfaces reliably only on slow
platforms: s390x QEMU (~10x slower), Windows CI, linux-race. Each
occurrence was closed without a root cause until #5987 provided a full
goroutine dump showing the RWMutex.Lock / RWMutex.RLock deadlock.
Fix: extract lock-free internal methods (rename, create) that assume the
caller already holds cloneMu. The public Rename/Create methods delegate
to these after acquiring the lock. ReuseForWrite calls rename() and Lock
calls create() to avoid the recursive acquisition. For Link, the inner
RLock/RUnlock is simply removed since the outer lock at method entry
already covers the entire body. Add a comment on the cloneMu field
documenting the recursive-locking constraint to prevent recurrence.
Add TestMemFSCrashCloneConcurrency which exercises concurrent CrashClone
against ReuseForWrite, Link, and Lock. Without this fix, the test
deadlocks within seconds.
Fixes #5987.
Informs #5959, #5972.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
1 parent 52cdf9e commit e65daa8
2 files changed
Lines changed: 133 additions & 4 deletions