Skip to content

feat: US-001 - Define SqliteKv trait in rivetkit-sqlite-native#4584

Open
NathanFlurry wants to merge 1 commit into02-23-feat_rust_engine-runner_feature_parity_with_typescriptfrom
rivetkit-native
Open

feat: US-001 - Define SqliteKv trait in rivetkit-sqlite-native#4584
NathanFlurry wants to merge 1 commit into02-23-feat_rust_engine-runner_feature_parity_with_typescriptfrom
rivetkit-native

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app bot commented Apr 7, 2026

🚅 Deployed to the rivet-pr-4584 environment in rivet-frontend

Service Status Web Updated (UTC)
website 😴 Sleeping (View Logs) Web Apr 8, 2026 at 9:31 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 7, 2026 at 5:43 pm
frontend-cloud ❌ Build Failed (View Logs) Web Apr 7, 2026 at 5:43 pm
kitchen-sink ❌ Build Failed (View Logs) Web Apr 7, 2026 at 5:43 pm
mcp-hub ✅ Success (View Logs) Web Apr 7, 2026 at 5:09 pm
ladle ❌ Build Failed (View Logs) Web Apr 7, 2026 at 5:09 pm

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR introduces a new rivet-envoy-client Rust crate (a 1:1 port of the TypeScript envoy client), a rivetkit-native N-API module bridging the Rust client to JavaScript, SQLite KV trait integration, and a large restructuring of the driver test suite. The architecture is sound and the code is generally well-structured. Below are findings organized by severity.


Bugs / Correctness Issues

handle.rs — O(n²) key matching in kv_get + unnecessary pre-clone of keys

kv_get clones the keys vec before passing ownership into the request (line 99), then uses a nested loop to match response keys back to requested keys (lines 110-122). The known-acknowledged P10 note in .agent/notes/rust-envoy-client-issues.md tracks the O(n²) complexity but the pre-clone on line 99 is an extra copy that can be avoided. Since the response keys are owned after the call, you can build a HashMap from them instead:

// keys is consumed by send_kv_request; capture the count first
let key_count = keys.len();
// ... send request ...
// In the response handler, build a lookup map from response keys to values
let mut lookup: HashMap<Vec<u8>, Vec<u8>> = resp.keys.into_iter().zip(resp.values).collect();
let result: Vec<Option<Vec<u8>>> = request_keys.into_iter().map(|k| lookup.remove(&k)).collect();

This avoids both the pre-clone and the quadratic loop.

actor.rsreq.headers cloned unnecessarily (line 301)

req is owned at that point (moved into handle_req_start), so req.headers.iter().map(|(k, v)| (k.clone(), v.clone())) can be replaced with req.headers.into_iter(). The field is not used after this line.

actor.rs — double full_headers clone in handle_ws_open (lines 404-405, 428)

full_headers is cloned once into request.headers and again passed directly as headers to the websocket callback. Since request is passed before full_headers, you can reorder the callback args or move full_headers into request and reconstruct from it, but at minimum the two clones of the same map should be noted.

handle.rsstart_serverless_actor checks version equality only (lines 302-307)

The version check uses strict equality (!=), which will reject any future protocol minor-version bump. This may be intentional for security but should be called out with a comment explaining why a range check was not used.

connection.rsexpect on protocol encode (line 229)

.expect("failed to encode message");

A panic on a serialization failure in a hot path (every outbound WebSocket message) is unrecoverable. This should propagate as an error or at minimum log and close the connection gracefully.


CLAUDE.md Convention Violations

anyhow! macro used instead of .context()

The project convention is to prefer .context() over anyhow!. Multiple files use anyhow::anyhow!(...) directly:

  • connection.rs:79anyhow::anyhow!("failed to build ws request: {e}")
  • handle.rs:332, 350, 351 — channel-closed error paths
  • envoy.rs:277, 306 — shutdown paths
  • kv.rs:55, 129 — error/timeout paths
  • bridge_actor.rs:85, 107, 144 — callback channel-closed paths

Where the source is an existing ?-propagated error, use .context("...") instead. For terminal bail! calls where there is no underlying error, anyhow::bail! (not anyhow::anyhow!) is acceptable.

Arc<Mutex<HashMap>> used in bridge_actor.rs (line 19)

pub type ResponseMap = Arc<Mutex<HashMap<String, oneshot::Sender<serde_json::Value>>>>;

Per CLAUDE.md, prefer scc::HashMap or moka::Cache over Arc<Mutex<HashMap>> for concurrent containers. The ResponseMap is accessed from multiple threads (Rust async tasks and NAPI callbacks).

Arc<Mutex<_>> redundancy in context.rs (lines 15-16)

SharedContext is already behind Arc<SharedContext>. The inner Arc<Mutex<...>> wrappers on ws_tx and protocol_metadata add a second heap allocation and refcount. This was acknowledged as P8 in .agent/notes/rust-envoy-client-issues.md as "negligible," but CLAUDE.md's preference for scc structures still applies to ws_tx—a tokio::sync::watch or an scc-based pattern could replace the mutex.

Log messages with uppercase / non-code-symbol words

CLAUDE.md requires log messages to be lowercase unless mentioning specific code symbols:

  • bridge_actor.rs:74"calling JS actor_start callback via TSFN""calling js actor_start callback via TSFN" (or "calling JS on_actor_start callback" since TSFN is a code-adjacent acronym)
  • bridge_actor.rs:76"TSFN call returned" → acceptable since TSFN is a code symbol
  • kv.rs:62 (log in handle_kv_response)"received kv response for unknown request id" — this is correct
  • commands.rs:10"received commands" — correct

Comment style violations

CLAUDE.md requires comments to be complete sentences. Some comments are fragments:

  • connection.rs:214/// Send a message over the WebSocket. Returns true if the message could not be sent. — "Returns true if the message could not be sent" is inverted/confusing; should be "Returns true if the socket was unavailable."
  • envoy.rs:400-401 — multi-line comment // Wait for all actors to finish. The process manager (Docker,\n// k8s, etc.) provides the ultimate shutdown deadline. — this is fine, but uses an abbreviation list style that's acceptable.
  • Several inline comments use // ... fragment style but are acceptable.

Performance / Design Notes

send_actor_message in actor.rs clones message_kind and msg on every call (lines 877, 880)

Every outbound tunnel message constructs msg, then immediately clones it to buffer_msg before sending (even when the WS send succeeds and buffering never happens). The buffer_msg clone could be deferred behind the if failed branch by restructuring the buffering logic, since ws_send consumes msg by value already.

kv.rsdata.clone() in send_single_kv_request (line 91)

request.data.clone() is called because request is a &mut KvRequestEntry and data must be borrowed to construct the outbound message. This requires KvRequestData: Clone. For large KV payloads this clone is on the hot retry path. Consider storing data as Option<KvRequestData> and taking it on first send to avoid the clone.

Double ws_tx lock acquisition in handle_kv_request and process_unsent_kv_requests

handle_kv_request and process_unsent_kv_requests each acquire shared.ws_tx.lock() only to check guard.is_some(), then release it before calling send_single_kv_request, which acquires the lock again. This double-acquisition is a TOCTOU window. Consider passing the lock guard down or restructuring so the send path holds the lock once.


Architecture / Structural Concerns

events.rs — dead code in handle_ack_events (lines 23-25)

if matches!(state_update.state, protocol::ActorState::ActorStateStopped(_)) {
    // Mark handle as done - actor task will finish on its own
}

This if block is empty and does nothing. It should either be removed or have the intended behavior implemented.

actor.rshandle_ws_open pending request is not cleaned up on can_hibernate error

If the websocket callback returns an Err, the pending request entry inserted at line 390 is removed (line 543) but the request_to_actor map in EnvoyContext already has an entry for this (gateway_id, request_id) pair (inserted in tunnel.rs:handle_ws_open). That request_to_actor entry is never cleaned up on error in the actor task.

Draft PR with open blockers

The PR is in DRAFT state and .agent/notes/native-bridge-bugs.md explicitly documents two unresolved bugs:

  • WebSocket connections time out client-side (client never receives open event).
  • SQLite sqlite3_open_v2 fails with code 14 (SQLITE_CANTOPEN).

The .agent/notes/driver-test-status.md also documents that the EngineActorDriver test suite is blocked on the engine's v2 actor workflow not processing the Running event. These are significant and confirm the draft status is appropriate—this review should be revisited once those blockers are resolved.

Test coverage

The PR removes all existing engine-runner integration tests (tests/e2e_counter_runner.rs, e2e_counter_serverless.rs, e2e_counter_serverless_upsert.rs, e2e_websocket.rs) without replacing them with equivalent tests for the new Rust envoy client. The crate currently has no test coverage at the integration level.


Minor / Nits

  • connection.rs:257unwrap_or("localhost") on host extraction is a silent fallback that will produce a misleading Host header on malformed endpoints.
  • envoy.rs:148uuid::Uuid::new_v4().to_string() for envoy key generation is fine but the envoy_key field on EnvoyConfig being Option<String> means callers who forget to set it get a random UUID silently. A builder pattern or explicit required field would make this contract clearer.
  • .opencode/package-lock.json appears to be a stray editor config/lock file committed unintentionally.
  • .agent/notes/ and .agent/specs/ files are tracked per CLAUDE.md convention, which is correct.

@claude
Copy link
Copy Markdown

claude bot commented Apr 7, 2026

PR Review: feat: US-001 - Define SqliteKv trait / Rust envoy-client + rivetkit-native

This PR replaces the TypeScript engine/sdks/typescript/envoy-client with a new Rust implementation (engine/sdks/rust/envoy-client) and adds a new rivetkit-typescript/packages/rivetkit-native NAPI crate that bridges the Rust envoy client into Node.js. The self-contained notes in .agent/notes/ honestly document current status (actions pass, WebSocket and SQLite fail).


Bugs

B1: WebSocket outgoing message index collides with send_actor_message

In engine/sdks/rust/envoy-client/src/actor.rs, the spawned outgoing-WS task sends ToRivetWebSocketMessage with an independent idx counter starting at 0. send_actor_message also uses a per-request envoy_message_index starting at 0 via PendingRequest. Both counters advance independently producing duplicate message_index values. The comment block even allocates an Arc<AtomicU16> and then immediately does nothing with it. The .agent/notes/native-bridge-bugs.md correctly identifies this as a likely cause of the WebSocket failure. This needs a unified counter before the feature can ship.

B2: Hibernation restore silently drops all outgoing WebSocket messages

handle_hws_restore creates (hws_outgoing_tx, _hws_outgoing_rx) and passes hws_sender to the websocket() callback. The receiver is immediately dropped. Any message the actor sends during restore via that sender is silently discarded. The restore path needs to either spawn the outgoing forward task (same as the normal open path) or explicitly reject sends.

B3: shutting_down is duplicated with inconsistent types

SharedContext.shutting_down is an AtomicBool (context.rs) while EnvoyContext.shutting_down is a plain bool (envoy.rs). Only EnvoyContext.shutting_down is read in handle_shutdown; SharedContext.shutting_down is set in start_envoy_sync but never read. One of these is dead.

B4: start_envoy_js advertises async but is synchronous

In rivetkit-typescript/packages/rivetkit-native/src/lib.rs, start_envoy_js is documented as "Start the native envoy client asynchronously" but its body calls start_envoy_sync_js and returns immediately. The misleading name/doc should be corrected or the function should be removed.

B5: respond_callback silently drops responses after the first

JsEnvoyHandle::respond_callback removes the oneshot sender from response_map and sends. If JS accidentally calls respond_callback twice with the same response_id, the second call silently does nothing. The function should log a warning when no sender is found.


Dead Code / Incomplete Implementation

  • actor.rs: ws_msg_counter (Arc<AtomicU16>) is created and immediately unused. Remove it or connect it to the outgoing task.
  • handle.rs start_serverless_actor: if the envoy never starts (e.g., engine unreachable), self.started().await hangs forever. Consider adding a timeout or documenting the expectation.
  • bridge_actor.rs: can_hibernate always returns false but is_hibernatable is passed to the websocket() callback and silently ignored. Add debug_assert!(!is_hibernatable) to make the intent clear.

Performance

  • connection.rs ws_url(): calls .replace() three times, allocating on each call. A single-pass approach would be cleaner.
  • handle.rs kv_get: clones the full keys vec before the request just to do a linear scan for response correlation. A HashMap keyed by the key bytes would eliminate both.
  • kv.rs cleanup_old_kv_requests: collects expired IDs into a Vec for a two-phase delete. A retain-based approach avoids the allocation.
  • utils.rs BufferMap: uses cyrb53 (53-bit non-cryptographic hash) as the sole key for WebSocket connection state. A collision would route messages from one connection into another. Prefer a composite (gateway_id, request_id) key.

Security / Robustness

  • utils.rs cyrb53: hash collisions are possible and would cross-contaminate WebSocket connection state. Use the composite (gateway_id, request_id) tuple directly as the key.
  • connection.rs extract_host: uses string replace() and splits on /. If the URL ever contains a username/password, the host extraction will include user:pass@host. Use the url crate or tokio_tungstenite::tungstenite::http::Uri instead.
  • rivetkit-native/src/lib.rs: each call to start_envoy_sync_js creates a new tokio::runtime::Runtime. Multiple runtimes will coexist if the JS layer calls this more than once. Consider a global lazy runtime or document the single-call contract.

Code Style

  • Comments using em dashes violate CLAUDE.md conventions. Use periods to separate sentences instead.
  • bridge_actor.rs on_actor_start contains tracing::info! calls that appear to be debug instrumentation left from development. Standardize or remove.
  • test-envoy/src/server.rs start_serverless: if a second serverless start arrives before the first envoy is ready, the first handle is silently dropped. Consider returning a 409 if an envoy is already running.
  • rivetkit-typescript/packages/rivetkit-native/index.js and wrapper.js are checked in as JS in the source tree. Ensure these are generated by the build and not maintained by hand, otherwise they will drift from the Rust definitions.

Test Coverage

The PR deletes all tests from engine/sdks/rust/engine-runner/tests/ (~1,370 lines across 6 files) with no replacement. Only fixture files are added to the driver test suite. No new test files that run against the native driver are present. The PR should not merge with zero test coverage of the new Rust envoy-client code. At minimum, the test-envoy integration scenario (actions: ping, counter) should be gated in CI.


Summary

The architecture is solid and the porting approach is methodical. The two blocking issues before merge are:

  1. WebSocket message index collision (B1) - the counter coordination is half-implemented and actively causes the WebSocket failure.
  2. Deleted tests with no replacement - coverage gap on the most important new code.

The hibernation restore outgoing-channel issue (B2) and the BufferMap hash-collision risk are the next-highest priority items.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant