Symptom
In typescript-native-preview.log the last line before a long, total silence is Running scheduled diagnostics refresh. After that the log shows nothing — no handled method, no Cloning snapshot, no periodic Runtime Metrics — for tens of minutes (observed cases of 12, 17, 30, 33, and 78 minutes on different projects).
While the server is in this state:
the editor receives no diagnostics, hover, completion, goto-definition, or code actions;
there's no error popup or progress indicator — VSCode just silently freezes for tsgo features;
the only way to recover is TypeScript Native Preview: Restart Server, and the first click hits Stopping server timed out (#3601: Restart Server: "Stopping timed out" on first click when server is hung), so two clicks are required to actually get a fresh process.
What the log looks like
... Scheduling new diagnostics refresh...
... Delaying scheduled diagnostics refresh...
... Delaying scheduled diagnostics refresh...
... Delaying scheduled diagnostics refresh...
... Delaying scheduled diagnostics refresh...
(60+ Delaying lines over ~200 ms — typical fingerprint of a file system event burst)
... Delaying scheduled diagnostics refresh...
... Running scheduled diagnostics refresh
[server stops responding]
... Restarting language server...
[error] Stopping server timed out
... Restarting language server...
... Resolved client capabilities: { ... } ← new process
Reproduction in real workspaces
Not deterministic. From observing several occurrences back-to-back, these conditions appear correlated:
Large workspace with multiple ConfiguredProjects (e.g. monorepo). Captured cases: workspace with 3 ConfiguredProjects, live Go heap ≈ 1.5–2 GB at the moment of the hang.
Tens of editors open at once. Captured case: 224 didOpen / 155 didClose before the hang ⇒ ~70 active editors at the time of Running scheduled diagnostics refresh.
Burst of file system events shortly before the hang. A long sequence of Delaying scheduled diagnostics refresh (60+ over ~200 ms) is the typical fingerprint of git checkout / monorepo sync / mass touch in the workspace. A sketch of the debounce behind those lines follows this list.
No special server-side load at the moment of the hang. In the captured case only 23 textDocument/diagnostic requests had been handled in the entire session up to that point; hover/completion latencies were sub-millisecond.
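For context on why a burst prints this way: the log reads as a resettable debounce, where every new watch event pushes the refresh deadline out, so a burst yields one Scheduling new line, a run of Delaying lines, and a single Running line once events stop. A minimal Go sketch of that pattern (our reading of the log; the names and delay are hypothetical, not tsgo code):

```go
package main

import (
	"fmt"
	"time"
)

// refreshScheduler coalesces bursts of file system events into a single
// diagnostics refresh: every new event pushes the deadline out, so a burst
// logs many "Delaying" lines and exactly one "Running" line at the end.
type refreshScheduler struct {
	timer *time.Timer
	delay time.Duration
}

func newRefreshScheduler(delay time.Duration, run func()) *refreshScheduler {
	s := &refreshScheduler{delay: delay}
	s.timer = time.AfterFunc(delay, func() {
		fmt.Println("Running scheduled diagnostics refresh")
		run()
	})
	s.timer.Stop() // start disarmed; the first event arms it
	return s
}

// onFileEvent is called for every watch event. Resetting an already-armed
// timer produces a "Delaying" line; the refresh only fires once the burst
// has been quiet for the full delay.
func (s *refreshScheduler) onFileEvent() {
	if s.timer.Stop() {
		fmt.Println("Delaying scheduled diagnostics refresh...")
	} else {
		fmt.Println("Scheduling new diagnostics refresh...")
	}
	s.timer.Reset(s.delay)
}

func main() {
	s := newRefreshScheduler(50*time.Millisecond, func() {})
	for i := 0; i < 5; i++ { // simulate a small event burst
		s.onFileEvent()
		time.Sleep(3 * time.Millisecond)
	}
	time.Sleep(100 * time.Millisecond) // let the refresh fire
}
```

Run it and the output mirrors the excerpt above: one Scheduling new line, four Delaying lines, then Running scheduled diagnostics refresh once the burst goes quiet.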
What we tried in synthetic stress testing
We drove a real tsgo --lsp --stdio from a Node-based test client, replaying the captured workload (220 open files from a real hang, 3 ConfiguredProjects, full VSCode-style capabilities, ~2 GB warmed heap) plus extra pressure: idle traffic at 120 rps across codeLens / inlayHint / semanticTokens / documentSymbol / foldingRange / hover, watch-event bursts, repeated snapshot updates contending with idleCacheCleanTimer, and delayed client replies to workspace/diagnostic/refresh. The server kept up — parallel textDocument/diagnostic finished in ~1 s and probe hovers stayed sub-millisecond throughout.
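For anyone who wants to replay a similar workload without our harness: the transport is plain LSP Content-Length framing over stdio. A compressed Go equivalent of the client's transport layer (the real harness is Node-based; this sketch assumes tsgo is on PATH and sends only a bare initialize):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
	"strconv"
	"strings"
)

// writeFrame sends one LSP message using the standard Content-Length framing.
func writeFrame(w io.Writer, body string) error {
	_, err := fmt.Fprintf(w, "Content-Length: %d\r\n\r\n%s", len(body), body)
	return err
}

// readFrame reads one framed message; note the first message back may be a
// server notification rather than the response to our request.
func readFrame(r *bufio.Reader) (string, error) {
	length := 0
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return "", err
		}
		line = strings.TrimRight(line, "\r\n")
		if line == "" {
			break // blank line ends the headers
		}
		if v, ok := strings.CutPrefix(line, "Content-Length: "); ok {
			if length, err = strconv.Atoi(v); err != nil {
				return "", err
			}
		}
	}
	buf := make([]byte, length)
	_, err := io.ReadFull(r, buf)
	return string(buf), err
}

func main() {
	cmd := exec.Command("tsgo", "--lsp", "--stdio")
	stdin, _ := cmd.StdinPipe()
	stdout, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// Bare initialize; the real harness replays full VSCode-style
	// capabilities plus the captured didOpen workload.
	initReq := `{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"capabilities":{}}}`
	if err := writeFrame(stdin, initReq); err != nil {
		panic(err)
	}
	msg, err := readFrame(bufio.NewReader(stdout))
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```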
We did reproduce a permanent stall once, after pushing further: 1 500 open files, the same 120 rps idle traffic, and a watch-event burst that triggered the diagnostic refresh. The server went silent for 2 minutes before our watchdog tripped and we captured a goroutine dump via SIGQUIT. The dump shows:
3 goroutines in sendClientRequest waiting on <-responseChan for 2 minutes: Server.WatchFiles (holding session.watchesMu), Server.RefreshDiagnostics, and serverProgressReporter.createWorkDoneProgress;
12 goroutines parked on session.watchesMu.Lock() inside updateWatch;
Server.readLoop parked in chansend on a full s.requestQueue (cap 100) — so the client responses to the three pending requests above could never be delivered, and the deadlock sustains itself.
This is consistent with the Running scheduled diagnostics refresh symptom, but the trigger (1 500 simultaneously open files plus 120 rps client traffic) is far more aggressive than anything real VSCode produces, and lower-volume settings did not reproduce. The symptom matches; the production root cause may still be different.
The full dump and a unit test that demonstrates the updateWatch mutex pattern violation in 3 s are available on request.
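To make the cycle concrete, here is a compressed, self-contained Go model of the shape the dump shows. The names mirror the goroutines above, but the code is ours, not tsgo's:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var watchesMu sync.Mutex
	responseChan := make(chan string)    // per-request reply channel, as in sendClientRequest
	requestQueue := make(chan string, 2) // stand-in for s.requestQueue (cap 100 in the dump)

	// "Server.WatchFiles": holds watchesMu across a synchronous client
	// round-trip, waiting for a reply that can never be delivered.
	go func() {
		watchesMu.Lock()
		defer watchesMu.Unlock()
		<-responseChan
	}()

	// The "12 goroutines in updateWatch": parked behind the mutex, so the
	// handlers that would drain requestQueue never run.
	for i := 0; i < 3; i++ {
		go func() {
			watchesMu.Lock()
			defer watchesMu.Unlock()
			<-requestQueue
		}()
	}

	// "Server.readLoop": must enqueue each incoming client request before it
	// can deliver any response. Once the queue is full it blocks in chansend,
	// and the reply WatchFiles is waiting for never arrives. Cycle closed.
	go func() {
		for i := 0; ; i++ {
			requestQueue <- fmt.Sprintf("client-request-%d", i)
		}
		// unreachable: responseChan <- "watch registered"
	}()

	time.Sleep(300 * time.Millisecond)
	fmt.Println("every party is blocked; no goroutine can make progress")
}
```

In the real server other goroutines (timers, metrics) presumably keep the process alive, so nothing panics or logs; the loop simply never makes progress again, which matches the total log silence.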
Code paths worth a look while we wait for a dump
Two unbounded waits sitting directly on the path that fires at Running scheduled diagnostics refresh:
Session.RefreshDiagnostics (internal/project/session.go line 414): sends workspace/diagnostic/refresh on s.backgroundCtx with no per-request timeout or cancel handle. If the client never responds, this goroutine waits forever — and it's exactly the request the log shows being issued right before the silence. (A bounded-wait sketch follows this list.)
updateWatch (internal/project/session.go lines 1136–1209): holds session.watchesMu across synchronous client.WatchFiles / client.UnwatchFiles calls. Any delay in the client's response to client/registerCapability blocks every other updateWatch behind this mutex. (We have a unit test that demonstrates this pattern violation in 3 s using only the existing ClientMock; happy to attach if useful. A lock-narrowing sketch also follows this list.)
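For the first item, the conventional fix is to bound the wait with a context deadline. A sketch against a hypothetical sendClientRequest stand-in (the real tsgo signature may differ; the point is only that workspace/diagnostic/refresh gets a deadline instead of inheriting s.backgroundCtx's unbounded lifetime):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendClientRequest is a stand-in for the real helper: it waits for the
// client's reply on a per-request channel, or gives up when ctx is done.
func sendClientRequest(ctx context.Context, method string) error {
	responseChan := make(chan struct{})
	// In the real server the read loop closes responseChan when the reply
	// arrives; here nothing ever does, simulating an unresponsive client.
	select {
	case <-responseChan:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("%s: %w", method, ctx.Err())
	}
}

// refreshDiagnostics bounds the wait instead of blocking forever, so a
// stuck client costs 30 s, not an open-ended hang.
func refreshDiagnostics(backgroundCtx context.Context) {
	ctx, cancel := context.WithTimeout(backgroundCtx, 30*time.Second)
	defer cancel()
	if err := sendClientRequest(ctx, "workspace/diagnostic/refresh"); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Log and move on; the refresh is advisory, not critical.
			fmt.Println("client did not answer workspace/diagnostic/refresh:", err)
		}
	}
}

func main() {
	// Demo: a 1 s outer deadline stands in for the 30 s production timeout.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	refreshDiagnostics(ctx) // returns after ~1 s instead of hanging
}
```

Since the refresh is advisory, timing out and logging is strictly better than parking a goroutine forever while it may be holding other resources.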
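For the second item, the standard cure is to shrink the critical section so watchesMu never spans a client round-trip: compute the watch diff under the lock, release it, then do the RPCs. A hedged sketch with hypothetical types (not the actual session code):

```go
package project

import "sync"

// client abstracts the two synchronous LSP round-trips updateWatch performs.
type client interface {
	WatchFiles(globs []string) error
	UnwatchFiles(globs []string) error
}

type watcher struct {
	mu      sync.Mutex // protects watched only; never held across an RPC
	watched map[string]bool
	client  client
}

func newWatcher(c client) *watcher {
	return &watcher{watched: make(map[string]bool), client: c}
}

func (w *watcher) updateWatch(add, remove []string) error {
	// Critical section: compute the diff and mutate shared state, nothing else.
	w.mu.Lock()
	var toAdd, toRemove []string
	for _, g := range add {
		if !w.watched[g] {
			w.watched[g] = true
			toAdd = append(toAdd, g)
		}
	}
	for _, g := range remove {
		if w.watched[g] {
			delete(w.watched, g)
			toRemove = append(toRemove, g)
		}
	}
	w.mu.Unlock() // released BEFORE any client round-trip

	// The blocking RPCs run outside the lock: a stalled
	// client/registerCapability reply now delays this call only, not every
	// goroutine contending on the mutex.
	if len(toAdd) > 0 {
		if err := w.client.WatchFiles(toAdd); err != nil {
			return err
		}
	}
	if len(toRemove) > 0 {
		return w.client.UnwatchFiles(toRemove)
	}
	return nil
}
```

The trade-off is that concurrent callers can now issue registrations out of order, so the real fix probably wants a small serialized dispatch queue behind the lock; the sketch only shows the lock-scope change.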
Environment
Extension: typescriptteam.native-preview 0.20260421.1 (darwin-arm64)
Server: microsoft/typescript-go, main at commit ba858e5c6