Skip to content

test(engine): stabilize runner and envoy suites#4798

Closed
NathanFlurry wants to merge 1 commit intodriver-fixes/refresh-runner-config-after-envoy-connectfrom
04-26-test_engine_stabilize_runner_and_envoy_suites
Closed

test(engine): stabilize runner and envoy suites#4798
NathanFlurry wants to merge 1 commit intodriver-fixes/refresh-runner-config-after-envoy-connectfrom
04-26-test_engine_stabilize_runner_and_envoy_suites

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Code Review: PR #4798 — test(engine): stabilize runner and envoy suites

This is a large test-stabilization PR (12k+ additions, 458 deletions) with several production fixes mixed in. The overall direction is solid — restoring test/runner protocol separation, fixing real bugs (wrong key type in evict, missing crash policy propagation, command-ack race, SQLite migration invalidation), and mitigating pre-existing flakes by skipping broken legacy runner tests with comments.


Confirmed Bug Fixes (Correct)

  • pegboard/ops/envoy/evict_actors.rs — Critical fix. Eviction op decoded Envoy actor keys as keys::runner::ActorKey causing 1011 core.internal_error: bad code, found 21. Switching to keys::envoy::ActorKey is correct.
  • pegboard/ops/actor/util.rsbuild_actors_from_workflows was hardcoding CrashPolicy::Sleep regardless of the stored actor state. Now reads s.crash_policy. This was silently dropping the stored crash policy for all actor2 actors.
  • pegboard/workflows/actor/mod.rs — Crash policy is now threaded through when spawning an actor2 workflow from the legacy actor1 workflow path.
  • envoy-client/src/actor.rssend_fetch_error_response is now called on fetch error so callers get an HTTP 500 instead of a silent timeout.
  • envoy-client/src/envoy.rssend_command_ack is now called immediately after handle_commands, closing the ack race that caused fast restart tests to stall.
  • envoy-client/src/events.rs — The received_stop guard is removed: actor entries are cleaned up on any ActorStateStopped event, not only when Rivet initiated the stop.
  • pegboard-envoy/src/lib.rs — Subscribe to the Envoy pubsub topic before calling init_conn (before insertion into the load balancer), closing the start-command drop window on reconnect.
  • sqlite-storage/src/takeover.rs — Changed the ensure! that rejected native V2 metadata during V1-migration invalidation to a soft early-return when require_stage_in_progress is set.
  • pegboard-envoy/src/sqlite_runtime.rs — Added actor_migration_lock serialization in maybe_load_sqlite_startup_data to prevent concurrent takeover from two start commands.
  • pegboard/src/workflows/actor2/runtime.rsstate.sleep_ts and state.reschedule_ts are cleared on allocation, and serverful restart sets Transition::Starting (not Transition::Allocating).

Issues to Address

1. Potential indentation corruption in sqlite_runtime.rs

File: engine/packages/pegboard-envoy/src/sqlite_runtime.rs, protocol_sqlite_meta function

The diff shows modified indentation around the closing braces of this function. Please verify the compiled output is correct — max_delta_bytes should be at one indent level and the struct literal should close cleanly before the function }.

2. Missing content-type header in send_fetch_error_response

File: engine/sdks/rust/envoy-client/src/actor.rs

The function constructs a JSON body but only sets content-length and x-rivet-error headers. There is no content-type: application/json header. Clients that check Content-Type before parsing JSON will not recognize the error body. Add:

headers.insert("content-type".to_string(), "application/json".to_string());

3. std::sync::Mutex in async test code (CLAUDE.md violation)

File: engine/packages/engine/tests/common/test_runner.rs

Several Arc<Mutex<Option<oneshot::Sender<()>>>> fields use std::sync::Mutex (imported via use std::sync::{Arc, Mutex}), but these are locked inside async on_start callbacks running in Tokio tasks. Per CLAUDE.md: "Async Rust code defaults to tokio::sync::Mutex... Do not use std::sync::Mutex." Use tokio::sync::Mutex or restructure to pass the oneshot::Sender directly without wrapping in Mutex.

4. Misleading _kv_tx naming in test_envoy.rs

File: engine/packages/engine/tests/common/test_envoy.rs

let (_kv_tx, kv_rx) = mpsc::unbounded_channel();
// ...
ActorConfig { kv_request_tx: _kv_tx, ... }

The leading underscore conventionally signals "intentionally unused / will be dropped," but _kv_tx is actually moved into ActorConfig. This is misleading. Rename to kv_tx to match the parallel event_tx pattern.

5. crash_policy serde default — verify the default value

File: engine/packages/pegboard/src/workflows/actor2/mod.rs

The new crash_policy field on State uses #[serde(default)]. In-flight workflow states from before this change will deserialize with CrashPolicy::default(). Please verify that CrashPolicy::default() equals CrashPolicy::Sleep (the previous hardcoded behavior), or this introduces a silent behavior change for actors mid-flight during a deploy.

6. Stale // NOTE: Assumes input is validated. comment

File: engine/packages/pegboard/src/workflows/actor2/mod.rs

The comment above Input still reads // NOTE: Assumes input is validated. but validation now happens inside the workflow via the ValidateInput activity. Update the comment to reflect the actual contract.

7. ValidateInput.input type — verify compatibility

File: engine/packages/pegboard/src/workflows/actor2/mod.rs

ValidateInput::input is typed Option<String> but actor2::Input::input appears to be Option<Vec<u8>>. Confirm these are compatible when the activity input is constructed from input.input.clone(), or add a conversion/note.

8. Polling loops in tests lack required justification comments

File: engine/packages/engine/tests/envoy/actors_lifecycle.rs

The wait_for_envoy_actor helper (and similar loop { sleep(50ms) } loops) lacks an inline comment explaining why event-driven coordination is not used here. Per CLAUDE.md testing conventions: "every vi.waitFor call must have a one-line comment explaining why polling rather than direct awaiting is necessary." The same principle applies to Rust polling loops in tests. Add a comment like // Poll because the actor's start notification arrives asynchronously from a separate Tokio task.


Minor Style Notes

  • Several #[ignore = "..."] strings use fragmented phrasing. Per CLAUDE.md, comments should be complete sentences — the ignore reason strings could be cleaner (e.g. "This test is broken because..." rather than "broken: ...").
  • The new upsert_normal_runner_config retry loop polls at 100ms intervals up to 60s. Consider a short exponential backoff (100ms → 200ms → 400ms, capped at 1s) to reduce call volume during startup contention.

Summary

Category Items
Must-fix correctness Missing content-type header; crash_policy default verification
Must-fix CLAUDE.md violations std::sync::Mutex in async test code; misleading _kv_tx name
Should-fix Stale comment; ValidateInput type compatibility; polling loop justification comments
Suggestions Retry backoff in test helpers

The core production fixes are correct and well-targeted. The test framework rebuild is architecturally sound. Please address the std::sync::Mutex in async contexts and the missing content-type header before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant