Skip to content

fix(envoy-client): complete command and request lifecycles#4801

Draft
NathanFlurry wants to merge 1 commit intoengine-stabilize/envoy-protocol-versionfrom
engine-stabilize/envoy-client-lifecycle
Draft

fix(envoy-client): complete command and request lifecycles#4801
NathanFlurry wants to merge 1 commit intoengine-stabilize/envoy-protocol-versionfrom
engine-stabilize/envoy-client-lifecycle

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Code Review: fix(envoy-client): complete command and request lifecycles

Three targeted fixes across the envoy-client. Overall the changes are correct and close real protocol holes. A few minor issues worth addressing are noted below.


engine/sdks/rust/envoy-client/src/actor.rssend_fetch_error_response

Good fix. Before this PR, when the actor's fetch callback returned an error the code logged the error and returned without sending anything back over the tunnel, leaving the gateway request open indefinitely. The new send_fetch_error_response helper closes that lifecycle correctly.

Minor observations:

  1. Hard-coded error body. The JSON body {"code":"envoy_fetch_failed","message":"actor fetch failed"} drops the actual error details. For internal/trusted paths (envoy → engine) this is fine from a security perspective, but the original error value is already logged and never surfaced. Consider adding a content-type: application/json header so downstream consumers do not have to guess at the body format. (The existing send_response path does not set this header either, so this is a pre-existing omission rather than a regression.)

  2. message_index: 0 with stream: false. This correctly mirrors how send_response handles a non-streaming, single-shot response. Looks intentional and correct.


engine/sdks/rust/envoy-client/src/envoy.rssend_command_ack after handle_commands

Good fix. Previously, command acknowledgements were only sent on the periodic ACK_COMMANDS_INTERVAL_MS timer (every 5 minutes). This meant the engine could wait up to 5 minutes before knowing the envoy had processed a CommandStartActor or CommandStopActor. The PR now sends an ack immediately after each batch of commands is processed, which is the correct behaviour.

No issues with this change.


engine/sdks/rust/envoy-client/src/events.rs — remove received_stop guard

Good fix. The old code only removed an actor entry from the local registry when a stopped event arrived and received_stop was true (i.e. a CommandStopActor had already been received). The deleted TODO comment explicitly acknowledged the gap: if the envoy disconnected after the actor stopped but before the stop command arrived, the actor entry would leak. Removing the actor unconditionally on a stopped-state event is correct because an actor cannot un-stop.

One minor follow-up item:

  1. Dead parameter in test helper. The insert_actor helper in the #[cfg(test)] block still accepts a received_stop: bool argument and sets the field on the inserted entry. Since received_stop no longer affects actor removal, the parameter is now unused logic. Both call sites happen to pass true, but the stop_event_only_removes_the_stopped_generation test passes received_stop: false for the live actor — previously this was load-bearing; now it is dead. Consider removing the received_stop parameter from the test helper to avoid misleading future readers. The field on ActorEntry can likely be removed entirely in a follow-up since it is only set in commands.rs and its only consumer (the event guard) has been removed.

General observations

  • Logging follows project conventions (tracing::error!(?error, ...)). No issues.
  • No new std::sync::Mutex, no Mutex<HashMap>. No issues.
  • No new _ => fall-through arms on enums. No issues.
  • The PR is small, focused, and does not touch unrelated code.

Summary

All three fixes address real correctness gaps: stuck tunnel requests, delayed command acks, and actor entry leaks on reconnect. The changes are safe to merge. The only actionable item before merging is the stale received_stop parameter in the test helper, which is low-risk but should be cleaned up to avoid confusion. If the received_stop field on ActorEntry is now truly dead code, removing it in a follow-up would complete the simplification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant