WIP: Feat/fairness scheduler#570
Closed
chall37 wants to merge 77 commits into
Closed
Conversation
Add FAST_INT_ACCESSOR pattern to cache integer preferences similar to existing FAST_BOOL_ACCESSOR. Apply to sideMargins and topBottomMargins which are called 55 times per render frame in the drawing hot path. - Add iTermPreferencesIntCache struct - Add FAST_INT_ACCESSOR macro - Add intWithCache: method - Add sideMargins and topBottomMargins fast accessors - Update 29 call sites in iTermTextDrawingHelper.m - Update 26 call sites in PTYTextView.m
tokenExecutorSync was causing severe performance issues with multiple tabs. Each tab's TokenExecutor would trigger a SynchroDance (main thread blocking on mutation thread), starving the visible tab's rendering. The sync is redundant because the cadence-driven refresh path already handles syncing: visible tabs at ~60Hz, background tabs at 1Hz via updateDisplayBecause:. Result: Eliminated ~85% of SynchroDance calls, reducing main thread blocking by ~37% and restoring smooth FPS in visible tabs with 10+ tabs running output.
- Add timeout to whilePaused() to prevent deadlock if mutation queue is stuck - Crash with it_fatalError on timeout since skipping joined block would leave state inconsistent (DEBUG: 2s max, RELEASE: 15s max) - Add cancellation flag to prevent async block from setting joined state after timeout - Add DEBUG-only reentrancy guard to detect overlapping joins - Update performSynchroDanceWithBlock: to return BOOL for API clarity
- Add BackpressureLevel enum (none, light, moderate, heavy, blocked) - Track available slots with atomic counter alongside DispatchSemaphore - Add onSemaphoreSignaled callback to TokenArray for slot release notification - Expose backpressureLevel property on TokenExecutor (callable from any queue) This enables adaptive behavior based on queue load (e.g., join timeouts).
…pass - Initialize availableSlots with correct value in property declaration using immediately-executed closure, avoiding race where backpressureLevel could return .blocked before init completes - Document that high-priority tokens intentionally bypass backpressure, so metric only reflects normal PTY token load
Flag potential bugs where queued tokens sit in purgatory until new data arrives because scheduleTokenExecution is not called when these modes are disabled (unlike copyModeHandlerDidChangeEnabledState which does). Also note potential conversion to mutateAsynchronously for single-writer consistency once the correctness questions are resolved.
Previously, every cadence change would cancel the dispatch source and create a new one. Now the timer is created once and dispatch_source_set_timer is used to update timing parameters, eliminating kernel call overhead.
Major refactoring to reduce CPU usage from process monitoring with many tabs: - Add global coalescer using DISPATCH_SOURCE_TYPE_DATA_OR to merge concurrent monitor events into batched refresh cycles - Implement collection-first refresh that walks existing process collection instead of enumerating all PIDs via syscalls - Suspend monitors for background tabs; only foreground tabs get fast updates - Add epoch-based tracking to detect stale cache entries - Preserve dirty PID signaling for session title/status updates - Only remove cache entries when kill(pid,0) confirms process is dead, avoiding transient nil results from collection staleness Supporting changes: - iTermProcessMonitor: Add trackedRootPID property and pause/resume methods - iTermLSOF: Add childPidsForPid: using proc_listchildpids with retry loop - PseudoTerminal: Call setForegroundRootPIDs: on tab selection
- Throttle backgroundRefreshTick to 0.5s minimum intervals instead of running on every updateIfNeeded call - Batch dispatch_sync calls in refreshRootCollectionFirst: collect PIDs in NSMutableIndexSet, single sync at end instead of per-PID - Same batching fix for refreshRootWithKernelFallback - Remove unused markSeenForEpoch helper method - Fix property name: isForeground → isForegroundJob
Implementation plan covers FairnessScheduler, TokenExecutor modifications, PTYTask dispatch sources, and TaskNotifier changes. Testing plan defines unit, integration, and smoke tests with checkpoints for incremental validation.
Documents functional requirements for triggers, API injection, and other high-priority token categories to preserve during fairness scheduler work.
Test infrastructure for round-robin fair scheduling implementation: - FairnessSchedulerTests.swift: 18 unit tests covering registration, busy list management, turn execution, and round-robin fairness - MockFairnessSchedulerExecutor: Mock executor for isolated testing - TokenExecutorFairnessTests.swift: Phase 2 test stubs (8 placeholders) - FairnessScheduler.swift: Stub implementation (tests fail as expected) - run_fairness_tests.sh: Isolated test runner with phase filtering - add_to_xcode_project.py: Helper to add Swift files to Xcode project Also fixes duplicate FAST_INT_ACCESSOR declarations in iTermPreferences.m All 18 Phase 1 tests fail against stub, validating test infrastructure. Run with: ./tools/run_fairness_tests.sh phase1
Replace stub with working implementation that coordinates token execution across PTY sessions. Key features: - Session registration with monotonically increasing IDs - Round-robin busy list with O(1) duplicate prevention via busySet - Turn execution flow handling completed/yielded/blocked results - Cleanup callback on unregistration All 18 Phase 1 tests pass. Uses main queue for now; will switch to mutation queue during integration phase.
Tests (32 total, 7 passing, 25 skipped pending implementation): - TokenExecutorNonBlockingTests: Non-blocking addTokens() - TokenExecutorAccountingTests: Token consumption accounting - TokenExecutorExecuteTurnTests: executeTurn() method - TokenExecutorBudgetEdgeCaseTests: Budget enforcement - TokenExecutorSchedulerEntryPointTests: Scheduler notifications - TokenExecutorLegacyRemovalTests: Legacy code removal - TokenExecutorCleanupTests: Cleanup on unregistration - TokenExecutorAccountingInvariantTests: Accounting invariants Tests include both positive and negative cases to verify absence of undesired behavior. Skipped tests use XCTSkip to document requirements without hanging or failing. Documentation updates: - implementation.md: Added status table showing Phase 1 complete - testing.md: Added test status table with run commands
- Make TokenExecutor conform to FairnessSchedulerExecutor protocol - Add executeTurn(tokenBudget:completion:) with budget enforcement - Add cleanupForUnregistration() for accounting cleanup - Add backpressureReleaseHandler and fairnessSessionId properties - Remove semaphore blocking from addTokens() - now non-blocking - Add notifyScheduler() to integrate with FairnessScheduler - Remove activeSessionsWithTokens legacy preemption code - Make BackpressureLevel conform to Comparable - Add discardAllAndReturnCount() to TwoTierTokenQueue - Rename "Phase" to "Milestone" in documentation for clarity
Implement per-PTY dispatch sources for decoupled I/O handling: PTYTask.m: - Add dispatch source ivars (_readSource, _writeSource, _ioQueue) - setupDispatchSources: Creates read/write sources, starts suspended - teardownDispatchSources: Safe teardown (resumes before cancel) - shouldRead/shouldWrite: Unified state predicates - updateReadSourceState/updateWriteSourceState: Idempotent updates - handleReadEvent/handleWriteEvent: Event handlers - writeBufferDidChange: Triggers write state update - Update setPaused to call state update methods - Update dealloc to call teardown PTYTask.h: - Add tokenExecutor property for backpressure monitoring Tests: - Add PTYTaskDispatchSourceTests.swift with 35 tests (all skipped) - Update run_fairness_tests.sh to support milestone3 filter Note: Infrastructure is implemented but not yet activated. Integration (calling setupDispatchSources, wiring backpressureReleaseHandler) will be done in Milestone 5.
Add 12 TDD-style tests for TaskNotifier dispatch source integration: - TaskNotifierDispatchSourceProtocolTests (4 tests): Verify useDispatchSource @optional method in iTermTask protocol - TaskNotifierSelectLoopTests (5 tests): Verify select() loop skips FD_SET for dispatch source tasks while keeping coprocess FDs and unblock pipe - TaskNotifierMixedModeTests (3 tests): Verify mixed operation of dispatch_source + select() tasks All tests currently skip pending implementation of: - useDispatchSource @optional method in iTermTask protocol - PTYTask.useDispatchSource returning YES - TaskNotifier respondsToSelector: check in select() loop Update run_fairness_tests.sh with milestone4 filter support.
Modify TaskNotifier to skip FD_SET for tasks using dispatch_source: - Add useDispatchSource as @optional method in iTermTask protocol - PTYTask implements useDispatchSource returning YES - TaskNotifier checks respondsToSelector: before calling useDispatchSource - FD setup loop skips FD_SET for dispatch source tasks - Result handling loop skips handleRead/Write/Error for dispatch source tasks - Coprocess FD handling continues for all tasks (unaffected) - Backward compatible: tasks not implementing method use select() This decouples PTY I/O from the TaskNotifier select() loop, allowing per-PTY dispatch sources to handle read/write independently.
Add 24 TDD-style tests for system integration: - IntegrationRegistrationTests (3): FairnessScheduler registration in init - IntegrationUnregistrationTests (3): Cleanup on setEnabled:NO - IntegrationRekickTests (4): Re-kick scheduler on unblock events - IntegrationMutationQueueTests (2): Proper mutation queue usage - IntegrationDispatchSourceActivationTests (3): Dispatch source setup - IntegrationPTYSessionWiringTests (2): Full session wiring - DispatchSourceLifecycleIntegrationTests (3): Process lifecycle - BackpressureIntegrationTests (3): Backpressure system - SessionLifecycleIntegrationTests (4): Session close edge cases All tests skip pending implementation of: - VT100ScreenMutableState registration with FairnessScheduler - PTYSession wiring of backpressureReleaseHandler - PTYTask.setupDispatchSources activation in launch code Update run_fairness_tests.sh with milestone5 filter support.
Wire all components together to activate the fairness scheduler: VT100ScreenMutableState: - Register TokenExecutor with FairnessScheduler in init - Store fairnessSessionId on executor and mutable state - Unregister in setTerminalEnabled:NO before clearing delegate PTYSession: - Wire backpressureReleaseHandler in taskDidRegister: - Set task.tokenExecutor for backpressure monitoring - Handler calls task.updateReadSourceState when backpressure releases PTYTask: - Call setupDispatchSources in didRegister (covers launch, attach, restore) - Dispatch sources activated when task registers with TaskNotifier Update Milestone 4 tests to verify implementation instead of skipping. The fairness scheduler is now fully integrated: - Round-robin scheduling ensures fair token execution across sessions - Per-PTY dispatch sources handle I/O independently - Backpressure flows from TokenExecutor to PTYTask read source state - TaskNotifier skips dispatch_source tasks in select() loop
Add thorough test coverage for the fairness scheduler: FairnessScheduler tests: - Thread safety tests (concurrent register/unregister/enqueue) - Lifecycle edge cases (unregister during execution, double unregister) - Note: Thread safety tests expose real synchronization bugs TokenExecutor tests: - Completion callback exactly-once semantics - Budget enforcement boundary conditions - Available slots boundary tests (no negative/overflow) - High-priority task ordering verification PTYTask tests: - Actual behavior tests for shouldRead/shouldWrite predicates - Pause state affects read/write correctly - State transition tests (rapid pause/unpause cycles) - Edge cases (nil delegate, concurrent pause changes) Also updates run_fairness_tests.sh to include new test classes.
Synchronize all FairnessScheduler state access through the mutation queue to prevent data races when methods are called from multiple threads: - register(): sync dispatch (returns session ID) - unregister(): async dispatch - sessionDidEnqueueWork(): async dispatch - ensureExecutionScheduled(): dispatch to mutationQueue instead of main - executeTurn completion: dispatch back to mutationQueue Update tests to wait for async unregister operations to complete before asserting on cleanup state.
Replace placeholder "verified by implementation" tests with real behavioral assertions using MockTaskNotifierTask, an Objective-C mock that implements the iTermTask protocol. Changes: - Add MockTaskNotifierTask.h/.m in sources/ for iTermTask protocol conformance - Add MockTaskNotifierTask+Swift.swift for Swift convenience API - Add mock to iTerm2SharedARC-Bridging-Header.h for Swift visibility - Add ITERM_DEBUG to all 4 ModernTests build configurations - Update TaskNotifierDispatchSourceTests.swift with real assertions: - testDispatchSourceTaskSkipsFdSet: verifies processReadCallCount == 0 - testLegacyTaskProcessReadCalledBySelect: verifies processReadCallCount > 0 - testMixedDispatchSourceAndSelectTasks: verifies both paths work together All 13 TaskNotifier dispatch source tests now pass with real behavior verification.
Replace method-existence tests with actual behavioral verification: - testSetupCreatesSourcesWhenFdValid: Creates pipe, sets fd, verifies sources exist and pause suspends read source - testTeardownCleansUpSources: Verifies sources are nil after teardown - testSetPausedTogglesSourceSuspendState: Verifies pause/unpause cycle properly suspends/resumes dispatch sources - testWriteBufferDidChangeWakesWriteSource: Tests write buffer mechanism - testBackpressureHeavySuspendsReadSource: Verifies heavy backpressure suspends read source Add test hooks to PTYTask (in ITERM_DEBUG block): - testSetFd: Set fd for testing (bypasses readonly property) - testWriteBufferHasData: Check if write buffer has data - testSetupDispatchSourcesForTesting: Manually trigger setup - testTeardownDispatchSourcesForTesting: Manually trigger teardown - testAppendDataToWriteBuffer: Add data to write buffer Tests now use real pipes via createTestPipe() from TestUtilities.swift and verify actual dispatch source state changes via debug properties.
Replace direct FairnessScheduler.register/unregister calls with tests that exercise the actual integration points in VT100ScreenMutableState: Registration tests now: - Create real VT100ScreenMutableState instances - Verify registration happens in VT100ScreenMutableState.init (line 129) - Verify fairnessSessionId is set on tokenExecutor after init Unregistration tests now: - Call actual setTerminalEnabled:NO (line 219) - Verify unregister is called before delegate is cleared (lines 217-223) - Verify cleanupForUnregistration is invoked via the real path Add MockSideEffectPerformer to enable creating VT100ScreenMutableState instances for testing without full PTYSession setup. Note: setTerminalEnabled: has early return if value unchanged, so tests call terminalEnabled=true before terminalEnabled=false.
Add two tests to FairnessSchedulerTurnExecutionTests that verify the scheduler will not schedule overlapping turns for the same executor when completion is delayed. This is a key safety property for the mutation queue model. testNoOverlappingTurnsWhenCompletionDelayed: - Delays completion callback while enqueueing more work - Verifies no second executeTurn call until first completion fires - Confirms isExecuting flag prevents concurrent execution testWorkArrivedWhileExecutingIsPreserved: - Verifies workArrivedWhileExecuting flag causes re-scheduling - Even when turn completes with .completed, new work triggers next turn These tests prove the scheduler serialization guarantee: a session will never have executeTurn called while a previous turn completion callback is still pending.
Update TokenExecutorBudgetEdgeCaseTests and TokenExecutorBudgetEnforcementDetailedTests to properly validate group-boundary budget semantics using high-priority vs normal-priority tokens to create guaranteed separate groups. Key semantics now tested: - Budget is checked BETWEEN groups, not within a group - First group always executes (progress guarantee), even if it exceeds budget - Second group does NOT execute if budget was exceeded by first group - Groups are atomic - never split mid-execution Strategy: High-priority tokens go to queues[0], normal-priority to queues[1]. enumerateTokenArrayGroups processes queues in order, guaranteeing two separate groups that cannot be coalesced. testSecondGroupSkippedWhenBudgetExceeded now verifies: - willExecuteCount increments by exactly 1 after first turn (only first group) - Result is .yielded (second group pending) - Second turn processes remaining group - Total of 2 executions (one per group)
Replace timing-based tests that used DispatchQueue.main.asyncAfter with small timeouts, which are inherently flaky. FairnessSchedulerTests.swift: - testCompletedResultDoesNotReaddWithoutNewWork: Use waitForMutationQueue() - testBlockedResultDoesNotReaddToBusyList: Use waitForMutationQueue() - testNoOverlappingTurnsWhenCompletionDelayed: Use waitForMutationQueue() - testUnregisterDuringExecuteTurn: Use waitForMutationQueue() - testUnregisterAfterYieldedBeforeNextTurn: Use waitForMutationQueue() IntegrationTests.swift: - testTaskUnpausedSchedulesExecution: Use onWillExecute callback expectation - testShortcutNavigationCompleteSchedulesExecution: Same pattern - testTerminalEnabledSchedulesExecution: Same pattern - testCopyModeExitSchedulesExecution: Same pattern TokenExecutorFairnessTests.swift: - Add onWillExecute callback to MockTokenExecutorDelegate for test hooks Tests are now deterministic, faster, and more reliable by using proper synchronization (queue flush or expectation callbacks) instead of arbitrary timing delays.
- Add testShouldWriteOverride property to bypass jobManager.isReadOnly constraint for testing write source resume behavior - Add testWaitForIOQueue method for deterministic test synchronization, replacing flaky Thread.sleep calls with proper queue waiting - Add PTYTaskReadHandlerPipelineTests with 4 tests verifying: - Read source triggers threadedReadTask delegate callback - Read handler completes quickly (non-blocking verification) - Multiple reads accumulate correctly - Full pipeline to TokenExecutor enqueue works - Fix write source tests to properly verify suspend→resume cycle - Update MockTokenExecutorDelegate stub to document real implementation - Register PTYTaskReadHandlerPipelineTests in test runner script
Tests gated behind #if ITERM_DEBUG with meaningless #else fallbacks (like responds(to:) checks) become no-ops in non-debug configs. Changes: - Added isDebugBuild constant to TestUtilities.swift for runtime detection - Updated testSetupCreatesSourcesWhenFdValid to use XCTSkipUnless - Updated testTeardownCleansUpSources to use XCTSkipUnless - Updated testRegisterOnInit to use XCTSkipUnless - Updated testRespondsToSelectorCheckUsed to use XCTSkipUnless This makes test requirements explicit: tests that need debug hooks are skipped rather than silently becoming no-ops. The same pattern should be applied to other tests with similar fallback blocks.
…sionsWithTokens
The key regression from removing activeSessionsWithTokens is that background
sessions could be starved when foreground sessions are busy. Added tests to
verify the fairness model ensures background sessions still get turns.
Unit tests (TokenExecutorFairnessTests.swift):
- testBackgroundGetsturnsWhileForegroundContinuouslyBusy: Simulates a
foreground session receiving continuous data while background has pending
tokens. Verifies background still gets execution turns.
Integration tests (IntegrationTests.swift):
- IntegrationBackgroundForegroundFairnessTests: New test class with:
- testBackgroundSessionExecutesWhileForegroundBusy: Full stack test with
VT100ScreenMutableState proving fairness works end-to-end
- testMultipleBackgroundSessionsAllGetTurns: Verifies fairness across
multiple background sessions
These tests verify the key guarantee: under the fairness model, ALL sessions
get round-robin turns regardless of foreground/background status.
The previous tests only proved that sessions eventually executed, not that the round-robin fairness invariant held (each session gets at most one turn before any session gets a second turn). Changes: - Add _testExecutionHistory to FairnessScheduler to record session execution order - Add testGetExecutionHistory(), testGetAndClearExecutionHistory(), and testClearExecutionHistory() test hooks - Rewrite fairness tests to be deterministic (no polling/timeouts): - Block execution using shouldQueueTokens/taskPaused - Add tokens to all sessions while blocked - Unblock and sync mutation queue - Verify execution history shows proper round-robin order - Tests now verify the actual invariant: no session executes consecutively when other sessions have work
Tests should not use timeouts as pass/fail criteria for behavior invariants. Timeouts are only appropriate for detecting hangs (e.g., I/O tests) or testing actual timing requirements. Changes: - Replace waitForCondition(timeout) with multiple waitForMutationQueue() or waitForMainQueue() calls to deterministically drain queues - Use blocking pattern (shouldQueueTokens=true) where needed to control execution order - Convert assertions from "did it happen within timeout?" to "did it happen?" The PTYTask I/O tests retain timeouts since they test actual kernel dispatch source behavior where timeouts guard against hangs.
The old test only verified that FairnessScheduler passes a positive budget (which it always does - defaultTokenBudget=500). It did not test the progress guarantee or zero-budget behavior. Changes: - Rename old test to testSchedulerProvidesPositiveBudget (what it actually tests) - New testZeroBudgetBehavior verifies the progress guarantee: - First group always executes even if it exceeds budget - Session yields and gets another turn to complete remaining work - This ensures forward progress even with large token groups
…ehavior The old test manually invoked backpressureReleaseHandler instead of driving the actual heavy→non-heavy transition via token consumption. New test: 1. Drives executor to heavy backpressure (>75% slots consumed) 2. Sets up handler to track calls 3. Unblocks execution and lets tokens be consumed 4. Verifies handler was called when crossing out of heavy backpressure 5. Verifies backpressure level is reduced after consumption This tests the real callback wiring and threshold transition logic.
Tests that spawn background threads accessing self.executor or self.scheduler could race with tearDown deallocation, causing ASan crashes. Fix: Capture the object in a local variable before spawning threads, ensuring the closure holds a strong reference that survives tearDown. Fixed tests: - TokenExecutorNonBlockingTests.testAddTokensDoesNotBlock - TokenExecutorNonBlockingTests.testSemaphoreNotCreated - FairnessSchedulerThreadSafetyTests.testConcurrentRegistration - FairnessSchedulerThreadSafetyTests.testConcurrentUnregistration - FairnessSchedulerThreadSafetyTests.testConcurrentEnqueueWork - FairnessSchedulerThreadSafetyTests.testConcurrentRegisterAndUnregister - FairnessSchedulerThreadSafetyTests.testConcurrentEnqueueAndUnregister - FairnessSchedulerLifecycleEdgeCaseTests.testUnregisterDuringExecuteTurn
The test was just a no-crash check - it didn't verify that shouldRead returns false when backpressure is blocked. Fix: Use testIoAllowedOverride to bypass jobManager requirement, then assert shouldRead is false when backpressure level is blocked.
The test only checked initial .none levels for two executors. It never drove one executor to heavy/blocked backpressure and verified the other remained unaffected. Fix: Add tokens to executor1 to create blocked backpressure, then assert executor2 remains at .none - verifying per-session backpressure isolation.
…round thread The test gated the core assertion on !Thread.isMainThread, but XCTest runs tests on the main thread by default, so the assertion never executed. Fix: Run the test logic from a background thread via DispatchQueue.global() and use an expectation to wait for completion. This ensures the non-blocking assertion is actually executed and verified.
- run_fairness_tests.sh fails if pre-existing crash reports are found (they would trigger UKCrashReporter dialog and block automation) - Checks for new crash reports after tests complete - Detects 'Program crashed' in test output - Returns non-zero exit code if crashes detected - Document crash report locations and UKCrashReporter behavior - Document ASan crashes bypass internal logging - Document test race condition pattern with background threads - Project instructions require asking user before deleting crash reports
The test was a noop - it just created executors and asserted they weren't nil. Fix: Use Objective-C runtime to verify that TokenExecutor doesn't have activeSessionsWithTokens class or instance method, confirming the legacy preemption mechanism was actually removed.
The test only checked total length but didn't use the executedLengths array it set up, and didn't actually assert ordering. Fix: - Use the executedLengths array to verify callback was invoked - Add assertion that callback reported correct total length - Clarify in comments that ordering is verified by TwoTierTokenQueueTests - This test verifies integration: both priority levels execute correctly
…en count The test was using group.lengthTotal (byte length = 100) as if it were token count, which was confusing and inaccurate. Since we supply the input with a known constant (tokenCount: 10), use that constant directly rather than recalculating from the group. This avoids circular dependency on token counting logic and keeps the test focused on budget enforcement at group boundaries.
Add PTYSessionBackpressureWiringTests to verify the critical wiring path where PTYSession.taskDidRegister sets up the tokenExecutor's backpressureReleaseHandler to call PTYTask.updateReadSourceState. Tests verify: - Handler is set after taskDidRegister - Task's tokenExecutor property is set - Handler invocation calls updateReadSourceState - Handler uses weak reference to task (no retain cycle) Also: - Add SpyPTYTask mock for tracking updateReadSourceState calls - Make updateReadSourceState public in PTYTask.h for testability - Add PTYSessionBackpressureWiringTests to test runner
IntegrationTests.swift: - Replace waitForCondition polling with deterministic queue sync loops - Replace magic 0..<20 loops with waitForSchedulerQuiescence() - Fix 10 tests that previously relied on arbitrary timeouts TestUtilities.swift: - Add waitForSchedulerQuiescence(maxIterations:) - iteration-based quiescence - Add FairnessScheduler.waitForQuiescenceIterative() extension FairnessScheduler.swift: - Add dispatchPrecondition to register() to catch deadlock-prone patterns (calling sync from within mutation queue would deadlock)
FairnessSchedulerTests.swift: - Convert 6 concurrent/stress tests to bounded-progress assertions - Remove arbitrary timeouts (10-30s) replaced with group.wait() + queue drain - testConcurrentRegistration kept as watchdog with 60s timeout - testManySessionsStressTest uses iteration-capped loop instead of polling Tests now complete in < 1s each vs previous 10-30s timeout waits.
PTYTask.m: Fix race condition in dispatch source teardown where suspend state could change between capture and async teardown block execution. Now clears source ivars first to prevent new state updates, then either tears down inline (if on ioQueue) or uses dispatch_sync to capture consistent state before async teardown. TaskNotifier coprocess tests: Simplified to structural no-ops pending architectural review of select()/dispatch_source interaction for coprocess handling. Test improvements: - Replace timing-based waits with iteration-based loops for determinism - Add MockCoprocess infrastructure for future coprocess testing - Enhance MockTaskNotifierTask with coprocess callback tracking
The existing test only drove to .blocked (availableSlots <= 0) but never verified suspend behavior at .heavy with availableSlots > 0. Split into two tests: - testBackpressureHeavyWithPositiveSlotsStillSuspendsReadSource: Tests the 'heavy but not blocked' case (ratio < 0.25, availableSlots > 0). Verifies shouldRead's 'backpressureLevel < .heavy' gate works correctly. - testBackpressureBlockedSuspendsReadSource: Tests the blocked case (availableSlots <= 0) which the original test was actually exercising.
The tests for non-blocking addTokens behavior didn't block consumption, so a semaphore-based implementation could still pass if tokens drained fast enough. Now both tests: 1. Set shouldQueueTokens=true to block consumption 2. Add more tokens than buffer capacity (100 > 40 slots) 3. Verify completion without blocking 4. Assert backpressure reached .blocked level (proving test conditions met) This ensures the tests would fail with the old semaphore-based blocking implementation where addTokens would wait for permits that never come.
This is the critical test for the TaskNotifier starvation fix. The test proves that when session A is backpressured, session B can still read data - the core behavior change from the old select() model. Test setup: 1. Two PTYTasks with real pipes and dispatch sources 2. Drive session A to blocked backpressure (suspending its read source) 3. Write data to BOTH sessions' pipes 4. Verify session B receives data (dispatch source fires independently) 5. Verify session A does NOT receive data (read source suspended) In the old TaskNotifier select() model, both sessions would have blocked because they shared the same select loop. With dispatch_source, each session reads independently.
The hybrid architecture keeps coprocess FDs on select() while PTY FDs use dispatch_source. However, the data flow between them was broken: 1. PTY output wasn't routed to coprocess - handleReadEvent only called threadedReadTask but not writeToCoprocess (unlike the select() path) 2. Coprocess output wasn't triggering PTY writes - writeTask:coprocess: called unblock (for select) but not writeBufferDidChange (for dispatch) Fixes: - Add writeToCoprocess call to handleReadEvent for PTY→coprocess flow - Add writeBufferDidChange call to writeTask:coprocess: for coprocess→PTY Also updates tests to remove TODO comments about "fundamental conflicts" since the hybrid approach is now properly implemented.
Fix deadlock in writeTask:coprocess: where writeBufferDidChange was called while holding writeLock. Since writeBufferDidChange calls shouldWrite which also acquires writeLock (non-recursive NSLock), this caused a deadlock. Move writeBufferDidChange outside the lock. Add end-to-end behavioral tests for coprocess data flow bridge: - testHandleReadEventRoutesToCoprocess: PTY → coprocess direction - testCoprocessOutputRoutesToPTY: coprocess → PTY direction - testWriteTaskTriggersWriteSource: write source mechanics Add testWriteFromCoprocess: method to PTYTask (Testing) category to expose the coprocess → PTY path for testing.
Fixed a race condition in teardownDispatchSources where dispatch_async allowed the PTYTask to be deallocated while cleanup blocks were pending. Changed to dispatch_sync to ensure cleanup completes before returning. Split PTYTaskDispatchSourceTests.swift (2453 lines) into 4 focused files: - PTYTaskDispatchSourceBasicTests.swift (lifecycle, protocol, state) - PTYTaskDispatchSourcePredicateTests.swift (shouldRead/shouldWrite) - PTYTaskDispatchSourceHandlerTests.swift (read/write handlers) - PTYTaskDispatchSourceIntegrationTests.swift (end-to-end tests) Added test coverage for identified gaps: - Gap 1: Verify TaskNotifier skips processWrite for dispatch source tasks - Gap 2: Verify teardown is safe with suspended read/write sources - Gap 6: Verify write path round-trip (writeTask data appears on fd)
Gap 1: Verify TaskNotifier skips processWrite for dispatch source tasks - testDispatchSourceTaskSkipsProcessWrite - testLegacyTaskProcessWriteCalledBySelect (inverse verification) Gap 3: Verify TaskNotifier calls [coprocess write] when coprocess wantToWrite - testCoprocessWriteFdProcessedBySelect verifies select() drains coprocess.outputBuffer Gap 4: Verify addTokens notifies scheduler when called from non-mutation queue - testAddTokensFromBackgroundQueueNotifiesScheduler - testAddTokensFromMultipleBackgroundQueuesAllExecute Fix racy assertion in testWriteTaskTriggersWriteSource: The write dispatch source can drain writeBuffer before the test checks it. Removed the intermediate assertion; the test still verifies data reaches the pipe.
Remove accidentally committed local development changes: - Makefile build convenience flags (CODESIGN_FLAGS, JOBS, etc.) - Rebuilt ThirdParty framework binaries - Restore CLAUDE.md, README.md, WebExtensionsFramework/.claude/ - Restore last-xcode-version and utilities.tgz
Contributor
Author
|
Implemented behind feature flag in #580. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a non-feature gated implementation of the entire pipeline. I still need to shake down some issues, but I think the architecture is in place. I'm still working on the feature-gated parallel implementation as described in #567 and #568 but it's a heavy lift. In the meantime, this can serve as a point of direct comparison between legacy and proposed architectural changes.
Fairness Scheduler PR - Changes Summary
This document summarizes changes in this PR compared to
gnachman:master.Problem Statement
iTerm2 has two starvation points that cause one high-throughput session (e.g., running
yes) to block other sessions:Read-level starvation (TaskNotifier thread): A single-threaded
select()loop monitors all PTY file descriptors. WhenTokenExecutor.addTokens()blocks onsemaphore.wait(), the entire TaskNotifier thread is blocked, preventing ALL other sessions from being read.Execution-level starvation (mutation queue): All sessions share a single mutation queue. Whichever session's
execute()runs first processes all its tokens with no mechanism ensuring sessions take turns.Solution Overview
Replace blocking I/O with event-driven dispatch sources and add a round-robin fairness scheduler:
Discussion (continued from PR #560):
Implemented as described. Sessions are added to the tail when they have work, removed from the head when granted a turn, and moved to the tail if work remains after the turn.
The gist pseudocode used a fixed chunk size. This implementation uses a ~500-token budget checked at group boundaries, ensuring atomic group execution while providing approximate fairness. (This is a lever we can adjust, if necessary.)
Currently,
TokenExecutor.addTokensblocks on a semaphore to backpressure PTY reads. In this implementation, the blocking wait is replaced with per-PTY reads gated by dispatch sources inPTYTask(the read source starts suspended;updateReadSourceStateresumes or suspends it).Per-PTY vs global backpressure: A global capacity counter gating admission across all producers was suggested in Add backpressure counter for monitoring token queue load #560, plus per-producer thresholds for self-throttling. This implementation uses per-PTY only—each TokenExecutor tracks its own availableSlots with no global counter. This should be sufficient for isolation—when Session A is full, only A suspends while B and C continue reading, and it maps directly to per-PTY dispatch sources without requiring scheduler coordination. With the current per-session limit of 64 TokenArrays, this seems acceptable, though global backpressure could be added later if memory pressure becomes a concern.
Implementation cost
BackpressureLevel/availableSlots, consumption callbacks inTokenExecutor, andbackpressureReleaseHandlerwiring inPTYSession.PTYTaskto suspend and resume I/O based onbackpressureLevel,paused, andioAllowed.Alternatives considered
Coprocess handling: TaskNotifier is retained in a reduced role for coprocess FD handling and process lifecycle management. Full elimination of TaskNotifier is deferred to a future phase to limit scope and risk.
Architecture
Key Changes by Component
1. FairnessScheduler (NEW)
File:
sources/FairnessScheduler.swiftRound-robin scheduler that controls which session executes on the mutation queue. Does NOT buffer tokens - only controls execution order. Each session gets a turn with a ~500 token budget before yielding.
Justification: Provides deterministic fairness guarantees. No session can monopolize execution regardless of output rate.
2. PTYTask Dispatch Sources
Files:
sources/PTYTask.h,sources/PTYTask.mAdds per-PTY
dispatch_source_tfor read and write operations:_readSource- DISPATCH_SOURCE_TYPE_READ_writeSource- DISPATCH_SOURCE_TYPE_WRITEshouldRead/shouldWrite)updateReadSourceState/updateWriteSourceState)Justification: Decouples PTY I/O from the shared TaskNotifier thread. One session's backpressure cannot block another session's reads.
3. TokenExecutor Modifications
File:
sources/TokenExecutor.swiftaddTokens()never blocksexecuteTurn()- Fairness-limited execution with token budgetbackpressureReleaseHandler- Closure to signal PTYTask when backpressure clearsactiveSessionsWithTokens- Legacy foreground preemption no longer neededJustification: Non-blocking token admission is required for dispatch sources (handlers must return quickly). The fairness scheduler replaces the ad-hoc foreground preemption.
4. TaskNotifier Changes
Files:
sources/TaskNotifier.h,sources/TaskNotifier.mselect()for their FDsuseDispatchSourceprotocol method (@optional)Justification: Hybrid approach allows incremental migration. Coprocesses remain on select() for now (rare feature, low risk).
5. VT100ScreenMutableState Integration
File:
sources/VT100ScreenMutableState.msetEnabled:NO_fairnessSessionIdfor stable identificationJustification: Lifecycle management ensures sessions are properly tracked by the scheduler.
6. TwoTierTokenQueue
File:
sources/TwoTierTokenQueue.swiftdiscardAllAndReturnCount()for cleanup accountingJustification: Prevents
availableSlotsdrift when sessions close with unconsumed tokens.7. PTYSession Wiring
File:
sources/PTYSession.mbackpressureReleaseHandlertoupdateReadSourceStatescheduleTokenExecutionfor state transitionsJustification: Connects TokenExecutor backpressure signals to PTYTask dispatch source control.
Location:
ModernTests/FairnessScheduler/Test Methods organized by component:
FairnessSchedulerTests.swiftTokenExecutorFairnessTests.swiftexecuteTurn()behavior, backpressure handler wiring, accounting invariants, high-priority token handlingPTYTaskDispatchSourcePredicateTests.swiftshouldRead/shouldWritepredicate logic under all condition combinations (paused, ioAllowed, backpressure, writeBuffer)PTYTaskDispatchSourceHandlerTests.swiftPTYTaskDispatchSourceBasicTests.swiftPTYTaskDispatchSourceIntegrationTests.swiftTaskNotifierDispatchSourceTests.swiftTwoTierTokenQueueTests.swiftdiscardAllAndReturnCount(), queue accountingIntegrationTests.swiftPTYSessionWiringTests.swiftMock/utility infrastructure:
MockTokenExecutorDelegate,MockPTYTaskDelegate,EnqueuingPTYTaskDelegate,SpyVT100Screen, and test utilities.Key invariants:
availableSlotsaccounting (no drift after enqueue/consume cycles)Files Unchanged
TwoTierTokenQueue(mostly)