Skip to content

fix: prevent refresh coordinator deadlock and reset missing-env reporting on reconfigure#409

Merged
karthiknadig merged 2 commits intomainfrom
bug/fix-396-395-refresh-bugs
Apr 6, 2026
Merged

fix: prevent refresh coordinator deadlock and reset missing-env reporting on reconfigure#409
karthiknadig merged 2 commits intomainfrom
bug/fix-396-395-refresh-bugs

Conversation

@karthiknadig
Copy link
Copy Markdown
Member

Summary

Two high-impact bug fixes for the refresh coordinator and missing-env reporting state machine.

Changes

Fixes #396
Fixes #395

@karthiknadig karthiknadig requested a review from Copilot April 6, 2026 18:51
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Performance Report (macOS)

Metric PR (P50) PR (P95) Baseline (P50) Delta
Server Startup 83ms 710ms 60ms 23ms
Full Refresh 157ms 40886ms 122ms 35ms

Results based on 10 iterations. P50 = median, P95 = 95th percentile.


Legend
  • 🚀 Significant speedup (>100ms faster)
  • ✅ Faster than baseline
  • ➖ No significant change
  • 🔺 Slower than baseline (>100ms)
  • ⚠️ Significant slowdown (>500ms)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Performance Report (Linux) ➖

Metric PR (P50) PR (P95) Baseline (P50) Delta Change
Server Startup 1ms 1ms 1ms 0ms 0%
Full Refresh 70ms 287ms 70ms 0ms 0%

Results based on 10 iterations. P50 = median, P95 = 95th percentile.


Legend
  • 🚀 Significant speedup (>100ms faster)
  • ✅ Faster than baseline
  • ➖ No significant change
  • 🔺 Slower than baseline (>100ms)
  • ⚠️ Significant slowdown (>500ms)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Test Coverage Report (Linux)

Metric Value
Current Coverage 73.4%
Base Branch Coverage 73.2%
Delta .2% ✅

Coverage increased! Great work!

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Performance Report (Windows) ✅

Metric PR (P50) PR (P95) Baseline (P50) Delta Change
Server Startup 9ms 12ms 9ms 0ms 0%
Full Refresh 145ms 1724ms 147ms -2ms -1.4%

Results based on 10 iterations. P50 = median, P95 = 95th percentile.


Legend
  • 🚀 Significant speedup (>100ms faster)
  • ✅ Faster than baseline
  • ➖ No significant change
  • 🔺 Slower than baseline (>100ms)
  • ⚠️ Significant slowdown (>500ms)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

Test Coverage Report (Windows)

Metric Value
Current Coverage 69.64%
Base Branch Coverage 69.4%
Delta 0.24% ✅

Coverage increased! Great work!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two reliability issues in PET’s JSONRPC refresh flow: preventing a refresh coordinator deadlock when the refresh-owning thread exits unexpectedly, and ensuring missing-environment reporting can run again after a reconfigure.

Changes:

  • Updates RefreshCoordinator::force_complete_request() to recover from a Running state (in addition to Completing) and to log mismatched-key calls.
  • Adds RefreshSafetyGuard to ensure the coordinator returns to Idle if the refresh thread exits the Start path without creating a RefreshCompletionGuard.
  • Resets MISSING_ENVS_REPORTING_STATE during handle_configure when the configuration generation is bumped, plus adds tests for these scenarios.
Show a summary per file
File Description
crates/pet/src/jsonrpc.rs Adds coordinator recovery + safety guard for panic/unwind scenarios, resets missing-env reporting on configure, and introduces new unit tests for the new recovery paths.

Copilot's findings

Comments suppressed due to low confidence (1)

crates/pet/src/jsonrpc.rs:983

  • RefreshCompletionGuard::begin(...) (and the subsequent reply/error finishing) is still outside the catch_unwind. If begin_completion() panics (e.g., unexpected key), the safety guard will reset the coordinator, but the current refresh request IDs may never receive a JSONRPC response. Consider extending the catch_unwind to cover completion-guard creation + reply/error finishing, or changing begin_completion to return a Result so you can send an error reply before exiting.
                            match refresh_result {
                                Ok(execution) => {
                                    let refresh_result = execution.result.clone();
                                    let mut completion_guard = RefreshCompletionGuard::begin(
                                        &context.refresh_coordinator,
                                        &refresh_key,
                                    );
                                    safety_guard.disarm();
                                    finish_refresh_replies(&mut completion_guard, &refresh_result);
                                    report_refresh_follow_up(execution);
                                }
                                Err(_) => {
                                    error!(
                                        "Refresh panicked for generation {} and options {:?}",
                                        configuration_state.generation, refresh_options
                                    );
                                    let mut completion_guard = RefreshCompletionGuard::begin(
                                        &context.refresh_coordinator,
                                        &refresh_key,
                                    );
                                    safety_guard.disarm();
                                    finish_refresh_errors(
                                        &mut completion_guard,
                                        "Refresh failed unexpectedly",
                                    );
  • Files reviewed: 1/1 changed files
  • Comments generated: 2

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens PET’s JSONRPC refresh coordination and missing-environment reporting so the server can recover cleanly from edge-case failures and reconfiguration cycles, avoiding long-lived stuck state.

Changes:

  • Extend RefreshCoordinator::force_complete_request() to recover from a Running state (and log mismatched-key calls) to prevent coordinator deadlock (#396).
  • Reset process-global MISSING_ENVS_REPORTING_STATE during configure generation bumps so missing-env reporting can run again after reconfigure (#395).
  • Add unit tests covering the recovery paths and the missing-env reset behavior.
Show a summary per file
File Description
crates/pet/src/jsonrpc.rs Adds RefreshSafetyGuard + expands coordinator recovery logic; resets missing-env reporting on reconfigure; introduces new regression tests.

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 0 new

@karthiknadig karthiknadig merged commit bbe763f into main Apr 6, 2026
34 checks passed
@karthiknadig karthiknadig deleted the bug/fix-396-395-refresh-bugs branch April 6, 2026 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants