Add timeout protection to join mechanism#563
Conversation
- Add timeout to whilePaused() to prevent deadlock if mutation queue is stuck - Crash with it_fatalError on timeout since skipping joined block would leave state inconsistent (DEBUG: 2s max, RELEASE: 15s max) - Add cancellation flag to prevent async block from setting joined state after timeout - Add DEBUG-only reentrancy guard to detect overlapping joins - Update performSynchroDanceWithBlock: to return BOOL for API clarity
|
Have you encountered a case where the mutation queue stalls or is this meant to be defensive? |
|
Both. I did hit a deadlock, but after further consideration, it wouldn't actually catch the deadlock I'm struggling with, since in my case, the Mutation queue closure is past sema2.signal() and stuck at sema.wait() I'll prioritize recreating this deadlock. I think it went away after I implemented the reentrance check (but that seems tangentially related, at best) and I shifted my focus to thinking about how to prevent TaskNotifier from getting into the wait() state in the first place, i.e. #560 |
|
The most likely way that would happen is that something forgot to unpause a |
Summary
Add timeouts and fatal errors in
whilePaused()to avoid indefinite hangs when the mutation queue is stuck.Motivation
whilePaused()could wait indefinitely if the mutation queue stalls. A hang gives no diagnostics, while a crash on timeout yields a useful crash log pinpointing the stuck state. (And we are likely to hit this deadlock more frequently as a consequence of #552, #553, 0f7425e, etc.)Changes
whilePaused()semaphore waits (DEBUG: 1–2s, RELEASE: 5–15s)it_fatalErroron timeout — skipping a joined block would leave mutable state inconsistentperformSynchroDanceWithBlock:return type toBOOLfor clarityConsiderations
This might be best left as debug-only, since the UI does not fully freeze in this deadlock. Tab switching still works, and users might want to observe some state before closing. It also might be best to leave the terminal in a hung state if dying means killing tasks that are still running. At the same time, continuing is really not possible afaict, and the freeze impairs automated testing, CI (or my local approximation of it), etc., so I think it's good to fail fast in debug.
Maybe there's a middle ground where the user gets some indication that the end is nigh, but is not forced to close?