Skip to content

fix(envoy-client): fix handling actor event ack#4599

Merged
MasterPtato merged 1 commit intomainfrom
04-09-fix_envoy-client_fix_handling_actor_event_ack
Apr 13, 2026
Merged

fix(envoy-client): fix handling actor event ack#4599
MasterPtato merged 1 commit intomainfrom
04-09-fix_envoy-client_fix_handling_actor_event_ack

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Contributor Author

MasterPtato commented Apr 9, 2026

@MasterPtato MasterPtato mentioned this pull request Apr 9, 2026
11 tasks
@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Code Review: PR #4599fix(envoy-client): fix handling actor event ack

PR Status: MERGED
Author: MasterPtato (Max)
Net change: +35 / -1671 (mostly deletions of old test/load-test scripts)


Summary of Changes

  1. engine/sdks/rust/envoy-client/src/events.rs — Bug fix: properly remove actor entries from the context map after a stop event is acknowledged, and simplify the ack handler.
  2. engine/sdks/rust/envoy-client/src/commands.rs + envoy.rs — Add received_stop: bool field to ActorEntry to track whether Rivet sent a stop command.
  3. rivetkit-typescript/packages/rivetkit-native/wrapper.js — Polyfill CloseEvent for Node.js < v22.
  4. rivetkit-typescript/packages/rivetkit/src/serverless/router.ts — Fix: prefer config.token over the ephemeral start token in the runner config.
  5. rivetkit-typescript/packages/sqlite-native/Cargo.toml — Remove [profile.release] lto = true that was placed in the wrong crate.
  6. scripts/tests/actor_spam.ts, scripts/tests/load-test/ — Delete stale test scripts.

Core Bug Fix (events.rs) — Positive Assessment

The primary fix is logically correct. The old handle_ack_events had a two-level cleanup: trimming acknowledged events from history, then removing the entry if history was empty and the actor handle was closed. This created a bug where actors stopped by Rivet would linger in ctx.actors if the ack arrived before the ActorStateStopped event. The new approach is cleaner:

  • Actor removal now happens immediately when an ActorStateStopped event is received and a stop command was previously sent (received_stop == true). This is deterministic and event-driven.
  • handle_ack_events is simplified to only retain unacknowledged events, with no cleanup responsibility.

Issue 1 — Actor entry removal uses actor_id key without considering generation (Medium)

ctx.actors.remove(&event.checkpoint.actor_id) removes the entire outer map entry for actor_id (all generations), rather than removing only the specific generation (event.checkpoint.generation). If two generations of the same actor ever coexist in ctx.actors, stopping one would silently evict the other. The previous code was more precise. Suggested fix:

if let Some(gens) = ctx.actors.get_mut(&event.checkpoint.actor_id) {
    gens.remove(&event.checkpoint.generation);
    if gens.is_empty() {
        ctx.actors.remove(&event.checkpoint.actor_id);
    }
}

Issue 2 — Actor self-stop leaves entry in ctx.actors until lost-timeout fires (Low)

The new handle_ack_events never removes entries. Cleanup now relies solely on handle_send_events observing ActorStateStopped with received_stop == true. Actors that self-stopped without a prior CommandStopActor are never removed from ctx.actors until the lost_timeout path fires. This is exactly the memory leak scenario documented in the TODO comment, but it also applies to normal actor self-termination without a prior stop command.

Issue 3 — Deferred work item not tracked (Low)

The TODO comment describes a known memory leak scenario (envoy disconnect, actor self-stops, envoy reconnects, stop event resent but entry never cleaned up). Per CLAUDE.md conventions, deferred work items with context should be tracked in .agent/todo/*.md rather than left only as inline comments.


serverless/router.ts Fix — Positive

The change from token to config.token ?? token is correct. The previous code always overwrote config.token with the ephemeral start-request token, breaking setups where a static application token should take precedence. This aligns runnerConfig with the existing pattern in clientConfig.


wrapper.js CloseEvent Polyfill — Looks Good

Correct and minimal. Node.js added CloseEvent as a global in v22, so the guard on typeof CloseEvent === 'undefined' is appropriate. The implementation matches the browser CloseEvent spec for the properties used (code, reason, wasClean).


sqlite-native/Cargo.toml — LTO Removal

Correct. Per Cargo semantics, [profile.*] in a non-root crate is silently ignored — it only applies at the workspace root. This was dead configuration.


CLAUDE.md Convention Compliance

All new Rust code uses hard tabs, tracing structured logging (no println!/eprintln!), and follows existing patterns. Comment style is sentence-case and complete. No new dependencies introduced. Clean.


Summary

Severity Location Issue
Medium events.rs ctx.actors.remove(&actor_id) removes all generations, not just the stopped one
Low events.rs Actor self-stop without prior CommandStopActor leaves entry until lost-timeout fires
Low events.rs TODO Known memory leak not tracked in .agent/todo/

The PR is a net improvement. The original ack handler had a real bug where event history trimming could prevent cleanup and cause duplicate resends on reconnect. The new event-driven approach is cleaner. Issue 1 (generation-scoped removal) is the most important thing to address in a follow-up.

@MasterPtato MasterPtato changed the base branch from 04-09-do_not_merge_sandbox_benches to graphite-base/4599 April 12, 2026 21:21
@MasterPtato MasterPtato force-pushed the 04-09-fix_envoy-client_fix_handling_actor_event_ack branch from db25562 to c694cb5 Compare April 12, 2026 21:21
@MasterPtato MasterPtato changed the base branch from graphite-base/4599 to 04-08-fix_envoy-client_add_tls April 12, 2026 21:22
@NathanFlurry NathanFlurry changed the base branch from 04-08-fix_envoy-client_add_tls to graphite-base/4599 April 12, 2026 22:30
@MasterPtato MasterPtato force-pushed the 04-09-fix_envoy-client_fix_handling_actor_event_ack branch from c694cb5 to 91328b0 Compare April 12, 2026 23:38
@MasterPtato MasterPtato changed the base branch from graphite-base/4599 to 04-08-fix_envoy-client_add_tls April 12, 2026 23:38
@MasterPtato MasterPtato force-pushed the 04-08-fix_envoy-client_add_tls branch from 084b772 to 9ae76e7 Compare April 13, 2026 00:20
@MasterPtato MasterPtato force-pushed the 04-09-fix_envoy-client_fix_handling_actor_event_ack branch from 91328b0 to 2a54f2e Compare April 13, 2026 00:20
Copy link
Copy Markdown
Contributor Author

MasterPtato commented Apr 13, 2026

Merge activity

  • Apr 13, 12:21 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 13, 12:27 AM UTC: Graphite rebased this pull request as part of a merge.
  • Apr 13, 12:27 AM UTC: @MasterPtato merged this pull request with Graphite.

@MasterPtato MasterPtato changed the base branch from 04-08-fix_envoy-client_add_tls to graphite-base/4599 April 13, 2026 00:23
@MasterPtato MasterPtato changed the base branch from graphite-base/4599 to main April 13, 2026 00:25
@MasterPtato MasterPtato force-pushed the 04-09-fix_envoy-client_fix_handling_actor_event_ack branch from 2a54f2e to 8747631 Compare April 13, 2026 00:26
@MasterPtato MasterPtato merged commit ae1122b into main Apr 13, 2026
5 of 12 checks passed
@MasterPtato MasterPtato deleted the 04-09-fix_envoy-client_fix_handling_actor_event_ack branch April 13, 2026 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants