fix: prevent refresh coordinator deadlock and reset missing-env reporting on reconfigure#409
Conversation
Performance Report (macOS)
Legend
|
Performance Report (Linux) ➖
Legend
|
Test Coverage Report (Linux)
Coverage increased! Great work! |
Performance Report (Windows) ✅
Legend
|
Test Coverage Report (Windows)
Coverage increased! Great work! |
There was a problem hiding this comment.
Pull request overview
This PR addresses two reliability issues in PET’s JSONRPC refresh flow: preventing a refresh coordinator deadlock when the refresh-owning thread exits unexpectedly, and ensuring missing-environment reporting can run again after a reconfigure.
Changes:
- Updates
RefreshCoordinator::force_complete_request()to recover from aRunningstate (in addition toCompleting) and to log mismatched-key calls. - Adds
RefreshSafetyGuardto ensure the coordinator returns toIdleif the refresh thread exits theStartpath without creating aRefreshCompletionGuard. - Resets
MISSING_ENVS_REPORTING_STATEduringhandle_configurewhen the configuration generation is bumped, plus adds tests for these scenarios.
Show a summary per file
| File | Description |
|---|---|
crates/pet/src/jsonrpc.rs |
Adds coordinator recovery + safety guard for panic/unwind scenarios, resets missing-env reporting on configure, and introduces new unit tests for the new recovery paths. |
Copilot's findings
Comments suppressed due to low confidence (1)
crates/pet/src/jsonrpc.rs:983
RefreshCompletionGuard::begin(...)(and the subsequent reply/error finishing) is still outside thecatch_unwind. Ifbegin_completion()panics (e.g., unexpected key), the safety guard will reset the coordinator, but the current refresh request IDs may never receive a JSONRPC response. Consider extending thecatch_unwindto cover completion-guard creation + reply/error finishing, or changingbegin_completionto return aResultso you can send an error reply before exiting.
match refresh_result {
Ok(execution) => {
let refresh_result = execution.result.clone();
let mut completion_guard = RefreshCompletionGuard::begin(
&context.refresh_coordinator,
&refresh_key,
);
safety_guard.disarm();
finish_refresh_replies(&mut completion_guard, &refresh_result);
report_refresh_follow_up(execution);
}
Err(_) => {
error!(
"Refresh panicked for generation {} and options {:?}",
configuration_state.generation, refresh_options
);
let mut completion_guard = RefreshCompletionGuard::begin(
&context.refresh_coordinator,
&refresh_key,
);
safety_guard.disarm();
finish_refresh_errors(
&mut completion_guard,
"Refresh failed unexpectedly",
);
- Files reviewed: 1/1 changed files
- Comments generated: 2
…tra key clone (PR #409)
There was a problem hiding this comment.
Pull request overview
This PR hardens PET’s JSONRPC refresh coordination and missing-environment reporting so the server can recover cleanly from edge-case failures and reconfiguration cycles, avoiding long-lived stuck state.
Changes:
- Extend
RefreshCoordinator::force_complete_request()to recover from aRunningstate (and log mismatched-key calls) to prevent coordinator deadlock (#396). - Reset process-global
MISSING_ENVS_REPORTING_STATEduringconfiguregeneration bumps so missing-env reporting can run again after reconfigure (#395). - Add unit tests covering the recovery paths and the missing-env reset behavior.
Show a summary per file
| File | Description |
|---|---|
crates/pet/src/jsonrpc.rs |
Adds RefreshSafetyGuard + expands coordinator recovery logic; resets missing-env reporting on reconfigure; introduces new regression tests. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 0 new
Summary
Two high-impact bug fixes for the refresh coordinator and missing-env reporting state machine.
Changes
Bug: force_complete_request does not handle Running state — potential coordinator deadlock #396 —
force_complete_requestdeadlock: AddedRefreshSafetyGuardthat ensures the coordinator transitions back toIdleif the refresh-owning thread exits without constructing aRefreshCompletionGuard(e.g., ifbegin_completion()panics). Also updatedforce_complete_request()to handle theRunningstate and log mismatched-key cases instead of silently ignoring them.Bug: MISSING_ENVS_REPORTING_STATE is permanently exhausted after first refresh #395 —
MISSING_ENVS_REPORTING_STATEpermanently exhausted: ResetMISSING_ENVS_REPORTING_STATEtoMISSING_ENVS_AVAILABLEinsidehandle_configurewhen the generation is bumped. The store is placed inside the configuration write lock to avoid a TOCTOU window with concurrent refresh threads.Added 4 tests covering the new recovery paths and reset behavior.
Fixes #396
Fixes #395