Skip to content

fix(envoy-client): clear processed_command_idx after ack send#4812

Draft
NathanFlurry wants to merge 1 commit intoengine-stabilize/envoy-client-command-dedupfrom
04-27-fix_envoy-client_clear_processed_command_idx_after_ack_send
Draft

fix(envoy-client): clear processed_command_idx after ack send#4812
NathanFlurry wants to merge 1 commit intoengine-stabilize/envoy-client-command-dedupfrom
04-27-fix_envoy-client_clear_processed_command_idx_after_ack_send

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Code Review

Overview

This PR fixes a memory management issue in envoy-client by clearing processed_command_idx entries after the command ack is sent to pegboard-envoy. The two-line change to call signature + the cleanup loop together address a scenario where the dedup map grows unboundedly for long-lived processes with many actor lifecycles.


Correctness: Scope of the Clear is Right

The cleanup loop iterates last_command_checkpoints, which is built from ctx.actors — meaning only currently active actors get their entries cleared. Entries for already-stopped actors (the ones that survive remove_actor intentionally) are untouched. This is correct per the field's doc comment:

Persists across remove_actor so a replayed CommandStartActor for an already-stopped actor cannot resurrect it.


Race Condition (Acknowledged TODO)

The TODO comment is accurate and well-written. The specific scenario:

  1. ws_send returns (bytes accepted by OS)
  2. processed_command_idx entries cleared ← this PR
  3. WS drops / pegboard-envoy crashes before committing the FDB clear_range
  4. Reconnect → pegboard-envoy replays the commands
  5. No dedup entry → CommandStartActor re-executes → live actor gets replaced

The window is narrow (bytes in flight vs. FDB commit), but it exists. A proper fix would require an explicit ack-of-ack from pegboard-envoy before clearing. Leaving this as a TODO is acceptable, but consider filing it in .agent/todo/ so it doesn't get lost.


Minor Issues

Unnecessary full-Vec clone (commands.rs:108)

last_command_checkpoints: last_command_checkpoints.clone(),

The clone is needed because last_command_checkpoints is later iterated to build the removal keys. But cloning the whole Vec<ActorCheckpoint> (each entry contains a heap-allocated String UUID) is avoidable. Prefer extracting keys before the send to keep the original ownership clear:

// Extract keys before consuming last_command_checkpoints
let keys_to_remove: Vec<(String, u32)> = last_command_checkpoints
    .iter()
    .map(|cp| (cp.actor_id.clone(), cp.generation))
    .collect();

ws_send(
    &ctx.shared,
    protocol::ToRivet::ToRivetAckCommands(protocol::ToRivetAckCommands {
        last_command_checkpoints,  // moved, no clone
    }),
)
.await;

for key in keys_to_remove {
    ctx.processed_command_idx.remove(&key);
}

This is semantically clearer (the original goes into the wire message) and avoids allocating a second copy of the index: i64 fields.

Inner clone in the removal loop (commands.rs:124)

ctx.processed_command_idx.remove(&(cp.actor_id.clone(), cp.generation));

HashMap::remove requires an owned key here because the key type is (String, u32). If this ever becomes a hot path, HashMap<(Arc<str>, u32), i64> would allow remove with a borrow. Not a blocker for this change.


What's NOT Addressed (pre-existing)

  • Stopped-actor entries accumulate forever. After remove_actor, entries stay in processed_command_idx indefinitely. This is intentional to block resurrection, but there's no pruning once pegboard-envoy has confirmed it deleted those commands from FDB. This PR doesn't make it worse, but the map will keep growing for processes that run many actor lifecycles.

Checklist

  • Signature change &EnvoyContext&mut EnvoyContext is minimal and correct
  • Only active actors (in ctx.actors) are cleared; stopped-actor dedup protection preserved
  • Race condition is documented with an actionable TODO
  • PR description not filled out (template only) — please describe the motivation
  • Consider tracking the ack-of-ack TODO in .agent/todo/
  • Minor: restructure to avoid the full-Vec clone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant