Commit e7ac499

Retry PET refresh on process crash (ConnectionError), not just timeout (microsoft#1447)
### Problem

Telemetry from Kusto shows that ~99% of all manager registration failures happen at the `nativeFinderRefresh` stage. These failures break down into two categories:

| Error Type | % of Failures | Meaning |
|---|---|---|
| `spawn_timeout` | ~70% | PET process started but didn't respond within the 30s timeout |
| `process_crash` | ~20% | PET process died mid-request (JSON-RPC connection dropped) |

The `spawn_timeout` failures are already retried: when a `RpcTimeoutError` is caught, the code kills the hung process, sets `processExited = true`, and retries on a fresh PET instance. This retry path has existed since the retry logic was introduced.

However, **`process_crash` failures get zero retries**. When PET crashes mid-request, the `vscode-jsonrpc` library throws a `ConnectionError`. The existing retry condition only checks for `RpcTimeoutError`, so `ConnectionError` falls straight through to `throw ex`, and the request fails permanently even though the exact same restart infrastructure could recover it.

This means ~20% of all failures are unrecoverable for no architectural reason. The restart machinery (`ensureProcessRunning` → `restart()` with exponential backoff) is already built and works; it just wasn't wired up for crash recovery.

### Fix

Extend the retry condition in three locations to also handle `rpc.ConnectionError`:

1. **`doRefresh()` retry loop**: The outer loop that decides whether to `continue` to the next attempt. Adding `ConnectionError` here lets crashed requests get a retry with a fresh PET process, identical to how timeouts are already handled.
2. **`doRefreshAttempt()` catch block**: The inner catch that marks the process as exited before rethrowing. Without this, a `ConnectionError` could propagate with `processExited` still `false` (if the async `exit` event handler hasn't fired yet), causing `ensureProcessRunning()` on the retry to skip the restart.
3. **`resolve()` catch block**: `resolve()` has no retry loop, but it needs to mark the process as exited on `ConnectionError` so the *next* request triggers a restart instead of trying to use the dead connection.

### How the retry works

The retry doesn't reuse the crashed process; it starts a brand new one:

1. `doRefresh` catches `ConnectionError` → `killProcess()` (no-op if already dead) + `processExited = true` → `continue`
2. Next iteration → `doRefreshAttempt()` → `ensureProcessRunning()`
3. `ensureProcessRunning` sees `processExited === true` → calls `restart()`
4. `restart()` spawns a fresh PET child process with a new JSON-RPC connection (with exponential backoff)
5. The refresh request runs against the new process

This is the exact same path already taken for timeouts. The `MAX_RESTART_ATTEMPTS = 3` limit still applies: if PET keeps crashing, we don't retry forever.

### Safety

- **`killProcess()` is idempotent**: it checks `proc.exitCode === null` before killing, so it is a no-op on an already-dead process.
- **No callers depend on `ConnectionError` propagating**: the only consumer is `classifyError()` in the telemetry error classifier, which still works correctly since errors only reach callers when retries are exhausted.
- **No new retry budget**: uses the existing `MAX_REFRESH_RETRIES = 1` (one retry) and `MAX_RESTART_ATTEMPTS = 3` limits. No change to worst-case timing.

### Expected impact

- ~20% of currently-unrecoverable failures become recoverable (process crash → restart → retry succeeds)
- No change to timeout behavior (existing path unchanged)
- Improved log messages distinguish "crashed" vs "timed out" for easier diagnostics
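
The flow described under "How the retry works" can be modeled in a few lines. The sketch below is a simplified, synchronous stand-in and not the code from `nativePythonFinder.ts`: `FinderModel`, `FakeTimeoutError`, `FakeConnectionError`, and the `behaviors` script are hypothetical names introduced only for illustration, and the real methods are asynchronous and talk to an actual PET child process. Only `MAX_REFRESH_RETRIES = 1`, the `processExited` flag, and the method names come from the commit.

```typescript
// Hypothetical stand-ins for RpcTimeoutError and rpc.ConnectionError.
class FakeTimeoutError extends Error {
    constructor(public readonly method: string) {
        super(`Request ${method} timed out`);
    }
}
class FakeConnectionError extends Error {}

const MAX_REFRESH_RETRIES = 1; // one retry, as stated in the commit message

// Simplified synchronous model of the finder (assumption: the real one is async).
class FinderModel {
    processExited = true; // no process has been spawned yet
    spawnCount = 0;
    log: string[] = [];

    // `behaviors` scripts what each spawned process does: one entry per process.
    constructor(private readonly behaviors: Array<'ok' | 'crash' | 'timeout'>) {}

    private ensureProcessRunning(): void {
        // Models restart(): a fresh process is spawned only when the flag is set.
        if (this.processExited) {
            this.spawnCount++;
            this.processExited = false;
        }
    }

    private killProcess(): void {
        // Nothing to kill in this model; the real method is described as idempotent.
    }

    private doRefreshAttempt(): string {
        this.ensureProcessRunning();
        const behavior = this.behaviors[this.spawnCount - 1] ?? 'ok';
        if (behavior === 'crash') {
            throw new FakeConnectionError('connection dropped');
        }
        if (behavior === 'timeout') {
            throw new FakeTimeoutError('refresh');
        }
        return `environments from process #${this.spawnCount}`;
    }

    doRefresh(): string {
        for (let attempt = 0; attempt <= MAX_REFRESH_RETRIES; attempt++) {
            try {
                return this.doRefreshAttempt();
            } catch (ex) {
                const isRetryable =
                    (ex instanceof FakeTimeoutError && ex.method !== 'configure') ||
                    ex instanceof FakeConnectionError;
                if (isRetryable && attempt < MAX_REFRESH_RETRIES) {
                    const reason = ex instanceof FakeConnectionError ? 'crashed' : 'timed out';
                    this.log.push(`Refresh ${reason} (attempt ${attempt + 1}/${MAX_REFRESH_RETRIES + 1})`);
                    this.killProcess();
                    this.processExited = true; // forces a fresh spawn on the next attempt
                    continue;
                }
                throw ex; // non-retryable error, or retries exhausted
            }
        }
        throw new Error('unreachable');
    }
}
```

In this model, `new FinderModel(['crash', 'ok']).doRefresh()` recovers on the second process, while `new FinderModel(['crash', 'crash']).doRefresh()` rethrows the connection error once the single retry is used up.
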
1 parent d9e2031 commit e7ac499

1 file changed

+15
-10
lines changed

src/managers/common/nativePythonFinder.ts

Lines changed: 15 additions & 10 deletions
@@ -260,10 +260,11 @@ class NativePythonFinderImpl implements NativePythonFinder {
     this.restartAttempts = 0;
     return environment;
 } catch (ex) {
-    // On resolve timeout (not configure — configure handles its own timeout),
+    // On resolve timeout or connection error (not configure — configure handles its own timeout),
     // kill the hung process so next request triggers restart
-    if (ex instanceof RpcTimeoutError && ex.method !== 'configure') {
-        this.outputChannel.warn('[pet] Resolve request timed out, killing hung process for restart');
+    if ((ex instanceof RpcTimeoutError && ex.method !== 'configure') || ex instanceof rpc.ConnectionError) {
+        const reason = ex instanceof rpc.ConnectionError ? 'crashed' : 'timed out';
+        this.outputChannel.warn(`[pet] Resolve request ${reason}, killing process for restart`);
         this.killProcess();
         this.processExited = true;
     }
@@ -574,11 +575,14 @@ class NativePythonFinderImpl implements NativePythonFinder {
 } catch (ex) {
     lastError = ex;
 
-    // Only retry on timeout errors
-    if (ex instanceof RpcTimeoutError && ex.method !== 'configure') {
+    // Retry on timeout or connection errors (PET hung or crashed mid-request)
+    const isRetryable =
+        (ex instanceof RpcTimeoutError && ex.method !== 'configure') || ex instanceof rpc.ConnectionError;
+    if (isRetryable) {
         if (attempt < MAX_REFRESH_RETRIES) {
+            const reason = ex instanceof rpc.ConnectionError ? 'crashed' : 'timed out';
             this.outputChannel.warn(
-                `[pet] Refresh timed out (attempt ${attempt + 1}/${MAX_REFRESH_RETRIES + 1}), restarting and retrying...`,
+                `[pet] Refresh ${reason} (attempt ${attempt + 1}/${MAX_REFRESH_RETRIES + 1}), restarting and retrying...`,
             );
             // Kill and restart for retry
             this.killProcess();
@@ -588,7 +592,7 @@ class NativePythonFinderImpl implements NativePythonFinder {
             // Final attempt failed
             this.outputChannel.error(`[pet] Refresh failed after ${MAX_REFRESH_RETRIES + 1} attempts`);
         }
-        // Non-timeout errors or final timeout - rethrow
+        // Non-retryable errors or final attempt - rethrow
         throw ex;
     }
 }
@@ -652,10 +656,11 @@ class NativePythonFinderImpl implements NativePythonFinder {
         this.outputChannel.info(`[pet] Refresh succeeded on retry attempt ${attempt + 1}`);
     }
 } catch (ex) {
-    // On refresh timeout (not configure — configure handles its own timeout),
+    // On refresh timeout or connection error (not configure — configure handles its own timeout),
     // kill the hung process so next request triggers restart
-    if (ex instanceof RpcTimeoutError && ex.method !== 'configure') {
-        this.outputChannel.warn('[pet] Request timed out, killing hung process for restart');
+    if ((ex instanceof RpcTimeoutError && ex.method !== 'configure') || ex instanceof rpc.ConnectionError) {
+        const reason = ex instanceof rpc.ConnectionError ? 'crashed' : 'timed out';
+        this.outputChannel.warn(`[pet] PET process ${reason}, killing for restart`);
         this.killProcess();
         this.processExited = true;
     }
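
The same condition and `reason` computation appear in all three changed locations. One way to read them together is as a single predicate that returns the restart reason, or nothing when the error should propagate untouched. The sketch below is illustrative only and is not part of this commit: `restartReason`, `FailureReason`, `FakeTimeoutError`, and `FakeConnectionError` are hypothetical names, with the two error classes standing in for `RpcTimeoutError` and `rpc.ConnectionError`.

```typescript
// Hypothetical stand-ins for RpcTimeoutError and rpc.ConnectionError.
class FakeTimeoutError extends Error {
    constructor(public readonly method: string) {
        super(`Request ${method} timed out`);
    }
}
class FakeConnectionError extends Error {}

type FailureReason = 'crashed' | 'timed out';

// Returns why the process should be killed and restarted,
// or undefined when the error is not a restart trigger.
function restartReason(ex: unknown): FailureReason | undefined {
    if (ex instanceof FakeConnectionError) {
        return 'crashed';
    }
    // Timeouts of 'configure' are excluded, mirroring the condition in the diff.
    if (ex instanceof FakeTimeoutError && ex.method !== 'configure') {
        return 'timed out';
    }
    return undefined;
}
```

A catch block could then branch on `restartReason(ex) !== undefined` and reuse the returned string in its log message. Whether such a helper is worth extracting is a judgment call; the commit keeps the condition inline at each site.
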
