Fix server hang after "Running scheduled diagnostics refresh" #3628
Conversation
- Make `RefreshDiagnostics` fire-and-forget: send the request but don't wait for a response (the client always returns null). Any response from the client is silently ignored by the read loop.
- Add a 1s timeout to `updateWatch` for `WatchFiles`/`UnwatchFiles` client requests to prevent holding `watchesMu` indefinitely if the client is slow or unresponsive.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/9d82d127-ea83-455b-bd36-297c9fe864b7
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Pull request overview
Addresses an LSP server hang triggered by “Running scheduled diagnostics refresh” by removing an unnecessary client-response wait and bounding how long watch registration holds a critical mutex.
Changes:
- Make `RefreshDiagnostics` send `workspace/diagnostic/refresh` as fire-and-forget (no response channel / no wait).
- Add a 1s timeout for `WatchFiles`/`UnwatchFiles` client requests made while holding `session.watchesMu`.
- Introduce a `sendClientRequestFireAndForget` helper for non-essential client responses.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/project/session.go | Adds a watch request timeout context for client watcher register/unregister calls to avoid indefinite mutex holds. |
| internal/lsp/server.go | Changes diagnostics refresh to fire-and-forget and adds a helper to send client requests without awaiting responses. |
@copilot fix review comments
…lock, rollback + retry on WatchFiles failure

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/b1cea66d-c309-4c26-97cc-f815ab98e5fe
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Addressed all review comments in
…apshot updates re-attempt naturally

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/02f6f16f-047c-455f-981d-b83603fc1df8
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/b572b8f8-1888-4b53-a86f-59468b89d099
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
All review comments have been addressed across
@copilot write a test for the updateWatch timeout and retry. I think you should be able to mock the client implementation using synctest.
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/ddc2f2cd-a802-46ea-a7a2-3977a30c87f0
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Added in
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/4d583367-5604-427b-8cfb-be0e60198035
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Rewrote the test using real session operations in
…ternal functions

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/db5b9454-2d1d-4250-b97e-e9f919dd9116
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
…hanged identity

When a WatchFiles call times out, the watcher ID is now marked as pending in the watchRegistry. On the next snapshot update, updateWatches checks for pending watchers and retries them as fresh adds, even when the watcher identity hasn't changed between snapshots. Rewrote the test to use a single character DidChangeFile edit (no watcher identity change) followed by GetLanguageService to flush the change and verify the retry succeeds.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/85f8d007-9ec1-47fa-bf47-4e9a07aba587
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
internal/project/session.go:1203
- pending watcher IDs are only cleared on successful registration in the newWatcher path. If a watcher times out (MarkPending) and is later removed (updateWatch called with newWatcher == nil), its pending entry is never cleared, so watchRegistry.pending can grow over the lifetime of a session. Consider calling ClearPending for oldWatcher.Watchers().WatcherID when removing/unwatching, or otherwise pruning pending entries when the corresponding watcher is no longer present.
```go
if oldWatcher != nil {
	w := oldWatcher.Watchers()
	watchers := append(w.WorkspaceWatchers, w.OutsideWorkspaceWatchers...)
	if len(watchers) > 0 {
		var removedIDs []WatcherID
		for _, watcher := range watchers {
			if id, removed := session.watches.Release(watcher); removed {
				removedIDs = append(removedIDs, id)
			}
		}
		for _, id := range removedIDs {
			if err := session.client.UnwatchFiles(watchCtx, id); err != nil {
				errors = append(errors, err)
			} else if logger != nil && newWatcher == nil {
				logger.Log(fmt.Sprintf("Removed watch: %s", id))
			}
		}
	}
}
```
```go
// Retry config watchers whose IDs didn't change but whose previous registration failed.
for path, newEntry := range newSnapshot.ConfigFileRegistry.configs {
	if oldEntry, ok := oldSnapshot.ConfigFileRegistry.configs[path]; ok {
		if oldEntry.rootFilesWatch.ID() == newEntry.rootFilesWatch.ID() && s.watches.IsPending(newEntry.rootFilesWatch.ID()) {
			errors = append(errors, updateWatch(ctx, s, s.logger, nil, newEntry.rootFilesWatch)...)
		}
```
The pending retry path calls updateWatch(ctx, s, ..., nil, newWatcher). If the original updateWatch call partially succeeded (some WatchFiles calls returned nil before a later one timed out), the registry will still contain refcounts for the successful watchers. Retrying with oldWatcher==nil will Acquire again and permanently increment those refcounts (no matching Release), which can prevent future UnwatchFiles from ever firing and leak client registrations. To keep refcounts consistent, consider making the failure path roll back the entire parent watcher’s bookkeeping when any error occurs (not just the failed watcher), or implement a retry helper that registers only missing keys without incrementing refcounts for already-tracked entries.
```go
// watchRegistry tracks the current watch globs and how many individual
// WatchedFiles reference each glob. It guards concurrent access with a mutex
// and provides ref-count helpers so callers don't manipulate the map directly.
// It also tracks which parent watcher IDs have pending (failed) registrations
// so that updateWatches can retry them even when the watcher identity hasn't changed.
type watchRegistry struct {
	mu      sync.Mutex
	entries map[fileSystemWatcherKey]*fileSystemWatcherValue
	pending map[WatcherID]struct{}
}

func newWatchRegistry() *watchRegistry {
	return &watchRegistry{
		entries: make(map[fileSystemWatcherKey]*fileSystemWatcherValue),
		pending: make(map[WatcherID]struct{}),
	}
}

// Acquire increments the ref count for a watcher. If this is the first
// reference (count goes from 0 to 1), it returns true so the caller knows
// to register the watcher with the client.
func (r *watchRegistry) Acquire(watcher *lsproto.FileSystemWatcher, id WatcherID) (isNew bool) {
	key := toFileSystemWatcherKey(watcher)
	value := r.entries[key]
	if value == nil {
		value = &fileSystemWatcherValue{id: id}
		r.entries[key] = value
	}
	value.count++
	return value.count == 1
}
```
watchRegistry is documented as guarding concurrent access with its mutex, but Acquire/Release/MarkPending/ClearPending do not lock r.mu (only IsPending does). Since session.go now locks session.watches.mu directly, the locking discipline is split across call sites, which makes it easy to accidentally call these methods without holding the mutex (data race) and undermines the stated encapsulation. Consider either (1) moving the locking inside watchRegistry methods (and removing external r.mu access), or (2) updating the type/method docs and API to make the required locking explicit (e.g., AcquireLocked/ReleaseLocked or a WithLock helper) so future callers can’t misuse it.
```diff
+	session.watches.mu.Lock()
+	defer session.watches.mu.Unlock()
+	// Use a timeout for client requests to prevent holding the mutex indefinitely
+	// if the client is slow or unresponsive.
+	watchCtx, watchCancel := context.WithTimeout(ctx, watchRequestTimeout)
+	defer watchCancel()
 	if newWatcher != nil {
 		w := newWatcher.Watchers()
 		watchers := append(w.WorkspaceWatchers, w.OutsideWorkspaceWatchers...)
 		if len(watchers) > 0 {
 			var newWatchers collections.OrderedMap[WatcherID, *lsproto.FileSystemWatcher]
 			for i, watcher := range watchers {
-				key := toFileSystemWatcherKey(watcher)
-				value := session.watches[key]
 				globId := WatcherID(fmt.Sprintf("%s.%d", w.WatcherID, i))
-				if value == nil {
-					value = &fileSystemWatcherValue{id: globId}
-					session.watches[key] = value
-				}
-				value.count++
-				if value.count == 1 {
+				if session.watches.Acquire(watcher, globId) {
 					newWatchers.Set(globId, watcher)
 				}
 			}
 			for id, watcher := range newWatchers.Entries() {
-				if err := session.client.WatchFiles(ctx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+				if err := session.client.WatchFiles(watchCtx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+					session.watches.Release(watcher)
 					errors = append(errors, err)
 				} else if logger != nil {
```
updateWatch creates a single watchCtx with a 1s deadline and reuses it for all WatchFiles/UnwatchFiles calls in this update. If multiple registrations/unregistrations are needed, later calls may inherit very little time remaining (e.g., first call takes 900ms → second gets ~100ms), causing avoidable timeouts even when the client is responsive. Consider creating a fresh per-request timeout context inside the WatchFiles/UnwatchFiles loops (or otherwise resetting the deadline per call) while still ensuring the mutex is not held indefinitely.
```diff
 			for i, watcher := range watchers {
-				key := toFileSystemWatcherKey(watcher)
-				value := session.watches[key]
 				globId := WatcherID(fmt.Sprintf("%s.%d", w.WatcherID, i))
-				if value == nil {
-					value = &fileSystemWatcherValue{id: globId}
-					session.watches[key] = value
-				}
-				value.count++
-				if value.count == 1 {
+				if session.watches.Acquire(watcher, globId) {
 					newWatchers.Set(globId, watcher)
 				}
 			}
 			for id, watcher := range newWatchers.Entries() {
-				if err := session.client.WatchFiles(ctx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+				if err := session.client.WatchFiles(watchCtx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+					session.watches.Release(watcher)
 					errors = append(errors, err)
 				} else if logger != nil {
```
On WatchFiles error, updateWatch rolls back bookkeeping via session.watches.Release(watcher) exactly once for that watcher. If the input watcher list contains duplicate patterns (same fileSystemWatcherKey), Acquire will increment the refcount multiple times but only one client registration is attempted; a failed registration would then leave the refcount >0, preventing future retries from ever treating that watch as “new”. This is reachable because some glob lists are only sorted (not deduped) before becoming watchers. Consider deduping watcher keys before calling Acquire, or tracking how many increments were applied per key and fully rolling them back on registration failure.
```go
assert.Assert(t, len(successfulIDs) >= 1,
	"expected at least one watcher to be registered after retry, got %d", len(successfulIDs))
```
The assertion block says it verifies WatchFiles was retried “with the same watcher IDs”, but the test only checks that there were additional calls and that at least one succeeded. To make the test match its intent (and guard against accidental ID churn), consider asserting that the retry IDs set matches the initial attempted IDs set (or at least that each initial ID is retried).
Suggested change:

```go
assert.Assert(t, len(successfulIDs) >= 1,
	"expected at least one watcher to be registered after retry, got %d", len(successfulIDs))
firstAttemptSet := make(map[any]struct{}, len(firstAttemptIDs))
for _, id := range firstAttemptIDs {
	firstAttemptSet[id] = struct{}{}
}
retrySet := make(map[any]struct{}, len(retryIDs))
for _, id := range retryIDs {
	retrySet[id] = struct{}{}
}
assert.Equal(t, len(retrySet), len(firstAttemptSet),
	"expected retry to use the same watcher IDs as the first attempt, got first attempt IDs=%v retry IDs=%v",
	firstAttemptIDs, retryIDs)
for id := range firstAttemptSet {
	_, ok := retrySet[id]
	assert.Assert(t, ok,
		"expected watcher ID %v from the first attempt to be retried; first attempt IDs=%v retry IDs=%v",
		id, firstAttemptIDs, retryIDs)
}
successfulRetriedWatchers := 0
for _, id := range successfulIDs {
	if _, ok := retrySet[id]; ok {
		successfulRetriedWatchers++
	}
}
assert.Assert(t, successfulRetriedWatchers >= 1,
	"expected at least one retried watcher to be registered after retry, got %d successful retried watchers (successful IDs=%v, retry IDs=%v)",
	successfulRetriedWatchers, successfulIDs, retryIDs)
```
@copilot address review comments |
…up watchers, rollback all on failure, stronger test assertions

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/81267cbb-20f5-41d0-b92b-9f1b30604e5d
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Addressed all five review comments in
Two unbounded waits on the critical path fired by "Running scheduled diagnostics refresh" can deadlock the server when the client is slow to respond, leaving the LSP completely unresponsive until manually restarted.

RefreshDiagnostics — fire and forget
- `workspace/diagnostic/refresh` always returns `null`; waiting for it is unnecessary

updateWatch — per-request 1s timeout on client requests with rollback
- `WatchFiles`/`UnwatchFiles` calls hold `session.watchesMu` while synchronously waiting on `client/registerCapability` responses
- Responses are delivered by the read loop, which can end up stuck behind `updateWatch` callers behind the mutex, and if `requestQueue` fills up, the read loop can't deliver responses — creating a self-locking deadlock
- Each `WatchFiles`/`UnwatchFiles` call gets its own 1s `context.WithTimeout` so earlier calls don't consume the deadline for later ones
- Watcher keys are deduped before `Acquire` to prevent refcount inflation from duplicate patterns
- On `WatchFiles` failure, ALL newly-acquired watchers in that batch are rolled back (not just the failed one) to keep refcounts clean for retry
- Failed watcher IDs are marked pending so `updateWatches` retries them on the next snapshot update, even when the watcher identity hasn't changed

watchRegistry — encapsulated watch bookkeeping
- New `watchRegistry` type in `internal/project/watch.go` encapsulates the ref-counted map and mutex and provides `Acquire` and `Release` methods
- `MarkPending`, `ClearPending`, and `IsPending` methods track watcher IDs whose registration failed
- All access goes through methods that take `mu`, making the locking discipline explicit and uniform
- `updateWatch` now delegates all bookkeeping to the registry instead of manipulating the map directly, making the function easier to follow
- `updateWatches` checks `IsPending` (with explicit lock/unlock around each check) for all watcher types (config, program files, typings, auto-imports) and retries as fresh adds when the watcher identity is unchanged but the previous registration failed

Tests
- `TestUpdateWatchTimeoutAndRollback` in `internal/project/watchtimeout_test.go` uses `synctest` and `projecttestutil` to verify timeout, rollback, and retry behavior with real session operations
- The test calls `DidOpenFile` to open a file (triggering watches that time out with a mocked slow client), then `DidChangeFile` with a single character edit (which does not change any watcher identity), followed by `GetLanguageService` to flush the pending change and trigger `updateWatches`
- Asserts that `updateWatches` retries the failed registrations and that the retry succeeds, even though the watcher identities are unchanged