Commit e7ac499
authored
Retry PET refresh on process crash (ConnectionError), not just timeout (microsoft#1447)
### Problem
Telemetry from Kusto shows that ~99% of all manager registration
failures happen at the `nativeFinderRefresh` stage. These failures break
down into two categories:
| Error Type | % of Failures | Meaning |
|---|---|---|
| `spawn_timeout` | ~70% | PET process started but didn't respond within
the 30s timeout |
| `process_crash` | ~20% | PET process died mid-request (JSON-RPC
connection dropped) |
The `spawn_timeout` failures are already retried — when a
`RpcTimeoutError` is caught, the code kills the hung process, sets
`processExited = true`, and retries on a fresh PET instance. This retry
path has existed since the retry logic was introduced.
However, **`process_crash` failures get zero retries**. When PET crashes
mid-request, the `vscode-jsonrpc` library throws a `ConnectionError`.
The existing retry condition only checks for `RpcTimeoutError`, so
`ConnectionError` falls straight through to `throw ex` — the request
fails permanently even though the exact same restart infrastructure
could recover it.
This means ~20% of all failures are unrecoverable for no architectural
reason. The restart machinery (`ensureProcessRunning` → `restart()` with
exponential backoff) is already built and works — it just wasn't wired
up for crash recovery.
### Fix
Extend the retry condition in three locations to also handle
`rpc.ConnectionError`:
1. **`doRefresh()` retry loop** — The outer loop that decides whether to
`continue` to the next attempt. Adding `ConnectionError` here lets
crashed requests get a retry with a fresh PET process, identical to how
timeouts are already handled.
2. **`doRefreshAttempt()` catch block** — The inner catch that marks the
process as exited before rethrowing. Without this, a `ConnectionError`
could propagate with `processExited` still `false` (if the async `exit`
event handler hasn't fired yet), causing `ensureProcessRunning()` on the
retry to skip the restart.
3. **`resolve()` catch block** — `resolve()` has no retry loop, but it
needs to mark the process as exited on `ConnectionError` so the *next*
request triggers a restart instead of trying to use the dead connection.
### How the retry works
The retry doesn't reuse the crashed process — it starts a brand new one:
1. `doRefresh` catches `ConnectionError` → `killProcess()` (no-op if
already dead) + `processExited = true` → `continue`
2. Next iteration → `doRefreshAttempt()` → `ensureProcessRunning()`
3. `ensureProcessRunning` sees `processExited === true` → calls
`restart()`
4. `restart()` spawns a fresh PET child process with a new JSON-RPC
connection (with exponential backoff)
5. The refresh request runs against the new process
This is the exact same path already taken for timeouts. The
`MAX_RESTART_ATTEMPTS = 3` limit still applies — if PET keeps crashing,
we don't retry forever.
### Safety
- **`killProcess()` is idempotent** — checks `proc.exitCode === null`
before killing; no-op on an already-dead process.
- **No callers depend on `ConnectionError` propagating** — the only
consumer is `classifyError()` in the telemetry error classifier, which
still works correctly since errors only reach callers when retries are
exhausted.
- **No new retry budget** — uses the existing `MAX_REFRESH_RETRIES = 1`
(one retry) and `MAX_RESTART_ATTEMPTS = 3` limits. No change to
worst-case timing.
### Expected impact
- ~20% of currently-unrecoverable failures become recoverable (process
crash → restart → retry succeeds)
- No change to timeout behavior (existing path unchanged)
- Improved log messages distinguish "crashed" vs "timed out" for easier
diagnostics1 parent d9e2031 commit e7ac499
1 file changed
+15
-10
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
260 | 260 | | |
261 | 261 | | |
262 | 262 | | |
263 | | - | |
| 263 | + | |
264 | 264 | | |
265 | | - | |
266 | | - | |
| 265 | + | |
| 266 | + | |
| 267 | + | |
267 | 268 | | |
268 | 269 | | |
269 | 270 | | |
| |||
574 | 575 | | |
575 | 576 | | |
576 | 577 | | |
577 | | - | |
578 | | - | |
| 578 | + | |
| 579 | + | |
| 580 | + | |
| 581 | + | |
579 | 582 | | |
| 583 | + | |
580 | 584 | | |
581 | | - | |
| 585 | + | |
582 | 586 | | |
583 | 587 | | |
584 | 588 | | |
| |||
588 | 592 | | |
589 | 593 | | |
590 | 594 | | |
591 | | - | |
| 595 | + | |
592 | 596 | | |
593 | 597 | | |
594 | 598 | | |
| |||
652 | 656 | | |
653 | 657 | | |
654 | 658 | | |
655 | | - | |
| 659 | + | |
656 | 660 | | |
657 | | - | |
658 | | - | |
| 661 | + | |
| 662 | + | |
| 663 | + | |
659 | 664 | | |
660 | 665 | | |
661 | 666 | | |
| |||
0 commit comments