Skip to content

fix(envoy-client): fix handling actor event ack#4599

Open
MasterPtato wants to merge 1 commit into04-09-do_not_merge_sandbox_benchesfrom
04-09-fix_envoy-client_fix_handling_actor_event_ack
Open

fix(envoy-client): fix handling actor event ack#4599
MasterPtato wants to merge 1 commit into04-09-do_not_merge_sandbox_benchesfrom
04-09-fix_envoy-client_fix_handling_actor_event_ack

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Contributor Author

MasterPtato commented Apr 9, 2026

@MasterPtato MasterPtato mentioned this pull request Apr 9, 2026
11 tasks
@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

PR Review: fix(envoy-client): fix handling actor event ack

Summary

This PR fixes actor event ack handling in the envoy-client by introducing a received_stop flag to distinguish between Rivet-initiated stops and other actor exits. It also includes minor fixes for the serverless router token precedence, a Node.js CloseEvent polyfill, and removal of old load-test scripts.


Issues

1. Actor removal removes all generations, not just the current one

In events.rs (handle_send_events):

if entry.received_stop {
    ctx.actors.remove(&event.checkpoint.actor_id);  // removes ALL generations
}

This removes the entire actor ID entry including all generations. The old cleanup code correctly scoped removal to the specific generation:

actor_entry.remove(&checkpoint.generation);
if actor_entry.is_empty() {
    ctx.actors.remove(&checkpoint.actor_id);
}

If multiple generations of the same actor exist (e.g., rapid restart scenarios), this could prematurely remove entries for other generations that are still active.

2. Regression: actor entries from crashed actors never cleaned up

The old handle_ack_events cleaned up entries when all events were acked and handle.is_closed() — covering actors that exited without a stop command. The new implementation only cleans up entries in handle_send_events when received_stop is true. Actors that crash or exit for other reasons (outside a Rivet-initiated stop) will have their entries persist indefinitely.

The acknowledged TODO compounds this:

// TODO: If the envoy disconnects, actor stops, then envoy reconnects, we will send the stop event
// but there is no mechanism to remove the actor entry afterwards.

That's two known entry-leak paths introduced by this change. Consider whether a TTL or the handle.is_closed() check can be restored in handle_ack_events.

3. Stop event pushed to history then entry is immediately removed

In handle_send_events, the stop event is pushed to event_history and then the entire entry (including that history) is removed. This means the stop event can never be re-sent if the runner didn't ack it. For a stop transition this is likely fine since the actor is being torn down, but a comment stating this is intentional would prevent future confusion.


Suggestions

CloseEvent polyfill (wrapper.js)

The polyfill looks functionally correct for practical use. One minor note: the native CloseEvent constructor sets code, reason, and wasClean as read-only properties via the event init dict. The polyfill sets them as writable instance properties. This is unlikely to matter in practice, but worth noting if downstream code checks property descriptors.

serverless router (router.ts)

token: config.token ?? token,

Simple and correct. The intent (prefer explicitly configured token over generated) is clear.

LTO removal (sqlite-native/Cargo.toml)

Removing package-level lto = true defers to workspace-level LTO settings (if any). Fine as a cleanup, but worth confirming the workspace Cargo.toml has appropriate LTO settings for release builds.

Script deletions

The removed scripts (actor_spam.ts, load-test/) represent useful load-testing infrastructure. If they're being removed rather than migrated, confirm any CI/CD jobs or package.json scripts referencing them are also updated.


PR Checklist

The PR description template was not filled in — no summary of the issue being fixed, no test description, and checklist items are unchecked. Please fill in at minimum the type of change (bug fix) and a brief description of what was broken and how this fixes it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant