Skip to content

Add timeout protection to join mechanism#563

Open
chall37 wants to merge 1 commit into
gnachman:masterfrom
chall37:fix/join-mechanism-timeout
Open

Add timeout protection to join mechanism#563
chall37 wants to merge 1 commit into
gnachman:masterfrom
chall37:fix/join-mechanism-timeout

Conversation

@chall37
Copy link
Copy Markdown
Contributor

@chall37 chall37 commented Jan 20, 2026

Summary

Add timeouts and fatal errors in whilePaused() to avoid indefinite hangs when the mutation queue is stuck.

Motivation

whilePaused() could wait indefinitely if the mutation queue stalls. A hang gives no diagnostics, while a crash on timeout yields a useful crash log pinpointing the stuck state. (And we are likely to hit this deadlock more frequently as a consequence of #552, #553, 0f7425e, etc.)

Changes

  • Add timeout to whilePaused() semaphore waits (DEBUG: 1–2s, RELEASE: 5–15s)
  • Crash with it_fatalError on timeout — skipping a joined block would leave mutable state inconsistent
  • Add cancellation flag to prevent async block from setting joined state after timeout
  • Add DEBUG-only reentrancy guard to detect overlapping joins
  • Update performSynchroDanceWithBlock: return type to BOOL for clarity

Considerations

This might be best left as debug-only, since the UI does not fully freeze in this deadlock. Tab switching still works, and users might want to observe some state before closing. It also might be best to leave the terminal in a hung state if dying means killing tasks that are still running. At the same time, continuing is really not possible afaict, and the freeze impairs automated testing, CI (or my local approximation of it), etc., so I think it's good to fail fast in debug.

Maybe there's a middle ground where the user gets some indication that the end is nigh, but is not forced to close?

- Add timeout to whilePaused() to prevent deadlock if mutation queue is stuck
- Crash with it_fatalError on timeout since skipping joined block would leave
  state inconsistent (DEBUG: 2s max, RELEASE: 15s max)
- Add cancellation flag to prevent async block from setting joined state after
  timeout
- Add DEBUG-only reentrancy guard to detect overlapping joins
- Update performSynchroDanceWithBlock: to return BOOL for API clarity
@chall37 chall37 marked this pull request as ready for review January 20, 2026 18:58
@gnachman
Copy link
Copy Markdown
Owner

Have you encountered a case where the mutation queue stalls or is this meant to be defensive?

@chall37
Copy link
Copy Markdown
Contributor Author

chall37 commented Jan 23, 2026

Both. I did hit a deadlock, but after further consideration, it wouldn't actually catch the deadlock I'm struggling with, since in my case, the Mutation queue closure is past sema2.signal() and stuck at sema.wait()

  Thread 1 - TaskNotifier (blocked 725 samples):                                                                      
  TaskNotifier.run                                                                                                    
    → PTYTask.processRead                                                                                             
      → VT100ScreenMutableState.addTokens                                                                             
        → TokenExecutor.addTokens                                                                                     
          → semaphore.wait()
                                                                                                                      
  Thread 2 - Mutation Queue (blocked):                                                                                
  TokenExecutorImpl.whilePaused closure                                                                               
    → sema.wait()
                                                                                                                      
  Thread 3 - Main Thread (NOT blocked - in run loop):                                                                 
  Normal run loop - mach_msg  

I'll prioritize recreating this deadlock. I think it went away after I implemented the reentrance check (but that seems tangentially related, at best) and I shifted my focus to thinking about how to prevent TaskNotifier from getting into the wait() state in the first place, i.e. #560

@gnachman
Copy link
Copy Markdown
Owner

The most likely way that would happen is that something forgot to unpause a iTermTokenExecutorUnpauser. Can you tell if the main thread was in a nested runloop situation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants