Skip to content

test(engine): cover envoy Sleep crash policy wake and recovery paths#4816

Draft
NathanFlurry wants to merge 1 commit intoengine-stabilize/sqlite-storage-vbare-protocolfrom
engine-stabilize/envoy-sleep-crash-policy-tests
Draft

test(engine): cover envoy Sleep crash policy wake and recovery paths#4816
NathanFlurry wants to merge 1 commit intoengine-stabilize/sqlite-storage-vbare-protocolfrom
engine-stabilize/envoy-sleep-crash-policy-tests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

PR Review

PR: test(engine): cover envoy Sleep crash policy wake and recovery paths

This is a clean, additive test-only PR that covers three previously untested crash/sleep/wake scenarios. The tests are well-structured and consistent with the surrounding file conventions. A few observations below.


Issues

1. No timeout guard on initial sleep-wait polling loops (moderate)

In both envoy_crash_policy_sleep_wakes_on_request and envoy_crash_policy_sleep_recovers_after_wake_crash, the loop waiting for sleep_ts.is_some() && connectable_ts.is_none() has no inner timeout. If that transition never happens, the test will hang silently until the outer common::run budget expires (75s and 90s), making failures hard to diagnose. A tokio::time::timeout wrapping the loop with a descriptive panic message would make failures actionable without changing behavior:

tokio::time::timeout(Duration::from_secs(30), async {
    loop {
        // ... poll sleep_ts ...
    }
})
.await
.expect("actor did not enter crash-induced sleep within 30s");

This is consistent with the 30s inner guard already used for the connectable-wait loop in those same tests.

2. Exact crash count assertion may be brittle (low)

In envoy_crash_policy_sleep_recovers_after_wake_crash, assert_eq!(*crash_count.lock(), 2) assumes the engine will call the actor handler exactly twice. If the workflow internally retries within a single pegboard generation before surfacing a sleep state, the count could exceed 2 without the test's logic being wrong. Either document why exactly 2 is guaranteed by CrashNTimesThenSucceedActor::new(2, ...), or use assert!(*crash_count.lock() >= 2) to be defensive.

3. Discarded JoinHandle in spawn_wake_ping (cosmetic)

The spawned task's handle is silently dropped. If the test runtime exits before the task fires, the wake ping is cancelled without any diagnostic. This is fine in practice given the test structure, but a short comment explaining the fire-and-forget intent (similar to the existing comment) keeps it clear for readers.


Test Coverage

The three new tests cover:

  • Wake from crash-induced sleep via incoming request (happy path).
  • Multi-crash recovery (crash → sleep → wake → crash → sleep → wake → connectable).
  • crash_policy field persistence at actor creation time.

One gap worth noting (out of scope for this PR): there is no test for timeout-based wake, where the actor wakes from sleep after the sleep deadline expires with no incoming request.


Summary

No blocking issues. The main actionable items before leaving DRAFT are the missing inner timeout guard on the sleep-wait loops (for diagnosability) and clarifying or softening the exact crash-count assertion. Otherwise this is ready to review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant