Skip to content

Fix deadlock in whilePaused when re-entered via nested runloop#559

Open
chall37 wants to merge 1 commit into
gnachman:masterfrom
chall37:fix/whilePaused-reentrant-deadlock
Open

Fix deadlock in whilePaused when re-entered via nested runloop#559
chall37 wants to merge 1 commit into
gnachman:masterfrom
chall37:fix/whilePaused-reentrant-deadlock

Conversation

@chall37
Copy link
Copy Markdown
Contributor

@chall37 chall37 commented Jan 20, 2026

Summary

Fixes a potential deadlock in whilePaused when re-entered via a nested runloop.

The Problem

When whilePaused is called, the following sequence occurs:

  1. Main queue schedules async block on mutation queue
  2. Main queue waits on sema2
  3. Mutation queue runs async block: signals sema2, sets iTermGCD.joined = true, waits on sema
  4. Main queue receives sema2, runs the block parameter
  5. Inside the block, reallyPerformBlockWithJoinedThreads sets gPerformingJoinedBlock = true

There is a window between steps 4 and 5 where iTermGCD.joined is true but gPerformingJoinedBlock is false. If code in step 4 triggers a nested runloop that causes another join attempt, the existing gPerformingJoinedBlock check will not catch it.

Without this fix, the nested whilePaused call would schedule another async block on the mutation queue and wait on a new semaphore. But the mutation queue is blocked waiting on sema from step 3, so the new async block never runs, and main queue waits forever—deadlock.

The Fix

Check iTermGCD.joined at the start of whilePaused. If true, we are already in a joined state (mutation queue is paused), so just run the block directly and return. This complements the existing gPerformingJoinedBlock checks, providing defense-in-depth for the narrow window before that flag is set.

Check iTermGCD.joined at entry to avoid scheduling a second
semaphore dance when the mutation queue is already waiting.
@chall37 chall37 closed this Jan 20, 2026
@chall37 chall37 reopened this Jan 20, 2026
@chall37 chall37 marked this pull request as ready for review January 20, 2026 19:04
// Runs block synchronously while token executor is stopped.
func whilePaused(_ block: () -> (), onExecutorQueue: Bool) {
dispatchPrecondition(condition: .onQueue(.main))
if iTermGCD.joined {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly convinced this can't happen in practice, although it is certainly possible that a future change could create problems.

There are two ways we enter the window of vulnerability:

Via performBlockWithJoinedThreads

  1. The main thread calls performBlockWithJoinedThreads
  2. performBlockWithJoinedThreads calls performSynchroDanceWithBlock.
  3. performSynchroDanceWithBlock calls whilePaused.
  4. While in whilePaused, sema2 is signaled. iTermGCD.joined is set to true. The window of vulnerability opens.
  5. whilePaused calls its block, which goes to performSynchroDanceWithBlock's block.
  6. performSynchroDanceWithBlock calls its block, which goes to performBlockWithJoinedThreads's block.
  7. It calls reallyPerformBlockWithJoinedThreads. At this point, gPerformingJoinedBlock is set to 1. The window of vulnerability has closed.

For a deadlock to occur, some code between line 442 here and atomic_exchange(&gPerformingJoinedBlock, 1) in reallyPerformBlockWithJoinedThreads would need to create a nested runloop.

Via performLightweightBlockWithJoinedThreads

  1. The main thread calls performLightweightBlockWithJoinedThreads
  2. performLightweightBlockWithJoinedThreads calls performSynchroDanceWithBlock
  3. performSynchroDanceWithBlock calls whilePaused.
  4. While in whilePaused, sema2 is signaled. iTermGCD.joined is set to true. The window of vulnerability opens.
  5. whilePaused calls its block, which goes to performSynchroDanceWithBlock's block.
  6. performSynchroDanceWithBlock calls its block, which goes to performLightweightBlockWithJoinedThreads block.
  7. It calls reallyPerformLightweightBlockWithJoinedThreads
  8. It does VT100ScreenMutableState.performingJoinedBlock = YES, which closes the vulnerability window.

With this improvement, we could still have nasty problems during the vulnerability window because performSynchroDanceWithBlock's block would do [iTermGCD setMainQueueSafe:NO] even though that's not actually true.

That being said, it would be good to detect any bug that could lead to this condition because it would be horrible to track down. What about doing this instead?

  it_assert(!iTermGCD.joined, "Reentrant call to whilePaused would deadlock")                                                                                                                                                                                                                                                                                                                                                             

This code is unfortunately complex and brittle and I am grateful that you're talking the time to think deeply about it.

@chall37
Copy link
Copy Markdown
Contributor Author

chall37 commented Jan 23, 2026

Agreed, this is exceedingly unlikely to happen as-is, probably to the point of impossibility. That's why I waffled between submitting it or not. But since I found it while trying to hunt down an actual deadlock, I figured it was at least worth mentioning.

Your suggestion makes more sense than trying to continue though!

@gnachman
Copy link
Copy Markdown
Owner

FYI 6c8672d fixes a deadlock in the case where VT100ScreenMutableState is dealloced while paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants