Fix deadlock in whilePaused when re-entered via nested runloop#559
Fix deadlock in whilePaused when re-entered via nested runloop#559chall37 wants to merge 1 commit into
Conversation
Check iTermGCD.joined at entry to avoid scheduling a second semaphore dance when the mutation queue is already waiting.
| // Runs block synchronously while token executor is stopped. | ||
| func whilePaused(_ block: () -> (), onExecutorQueue: Bool) { | ||
| dispatchPrecondition(condition: .onQueue(.main)) | ||
| if iTermGCD.joined { |
There was a problem hiding this comment.
I'm fairly convinced this can't happen in practice, although it is certainly possible that a future change could create problems.
There are two ways we enter the window of vulnerability:
Via performBlockWithJoinedThreads
- The main thread calls performBlockWithJoinedThreads
- performBlockWithJoinedThreads calls performSynchroDanceWithBlock.
- performSynchroDanceWithBlock calls whilePaused.
- While in whilePaused, sema2 is signaled. iTermGCD.joined is set to true. The window of vulnerability opens.
- whilePaused calls its block, which goes to performSynchroDanceWithBlock's block.
- performSynchroDanceWithBlock calls its block, which goes to performBlockWithJoinedThreads's block.
- It calls reallyPerformBlockWithJoinedThreads. At this point, gPerformingJoinedBlock is set to 1. The window of vulnerability has closed.
For a deadlock to occur, some code between line 442 here and atomic_exchange(&gPerformingJoinedBlock, 1) in reallyPerformBlockWithJoinedThreads would need to create a nested runloop.
Via performLightweightBlockWithJoinedThreads
- The main thread calls performLightweightBlockWithJoinedThreads
- performLightweightBlockWithJoinedThreads calls performSynchroDanceWithBlock
- performSynchroDanceWithBlock calls whilePaused.
- While in whilePaused, sema2 is signaled. iTermGCD.joined is set to true. The window of vulnerability opens.
- whilePaused calls its block, which goes to performSynchroDanceWithBlock's block.
- performSynchroDanceWithBlock calls its block, which goes to performLightweightBlockWithJoinedThreads block.
- It calls reallyPerformLightweightBlockWithJoinedThreads
- It does
VT100ScreenMutableState.performingJoinedBlock = YES, which closes the vulnerability window.
With this improvement, we could still have nasty problems during the vulnerability window because performSynchroDanceWithBlock's block would do [iTermGCD setMainQueueSafe:NO] even though that's not actually true.
That being said, it would be good to detect any bug that could lead to this condition because it would be horrible to track down. What about doing this instead?
it_assert(!iTermGCD.joined, "Reentrant call to whilePaused would deadlock")
This code is unfortunately complex and brittle and I am grateful that you're talking the time to think deeply about it.
|
Agreed, this is exceedingly unlikely to happen as-is, probably to the point of impossibility. That's why I waffled between submitting it or not. But since I found it while trying to hunt down an actual deadlock, I figured it was at least worth mentioning. Your suggestion makes more sense than trying to continue though! |
|
FYI 6c8672d fixes a deadlock in the case where VT100ScreenMutableState is dealloced while paused. |
Summary
Fixes a potential deadlock in
whilePausedwhen re-entered via a nested runloop.The Problem
When
whilePausedis called, the following sequence occurs:sema2sema2, setsiTermGCD.joined = true, waits onsemasema2, runs the block parameterreallyPerformBlockWithJoinedThreadssetsgPerformingJoinedBlock = trueThere is a window between steps 4 and 5 where
iTermGCD.joinedis true butgPerformingJoinedBlockis false. If code in step 4 triggers a nested runloop that causes another join attempt, the existinggPerformingJoinedBlockcheck will not catch it.Without this fix, the nested
whilePausedcall would schedule another async block on the mutation queue and wait on a new semaphore. But the mutation queue is blocked waiting onsemafrom step 3, so the new async block never runs, and main queue waits forever—deadlock.The Fix
Check
iTermGCD.joinedat the start ofwhilePaused. If true, we are already in a joined state (mutation queue is paused), so just run the block directly and return. This complements the existinggPerformingJoinedBlockchecks, providing defense-in-depth for the narrow window before that flag is set.