`.agent/notes/driver-test-status.md`
# Driver Test Suite Status

## What works
- rivet-envoy-client (Rust) fully functional
- rivetkit-native NAPI module, TSFN callbacks, envoy lifecycle all work
- Standalone test: create actor + ping = 22-32ms (both test-envoy and native)
- Gateway query path (getOrCreate) works: 112ms
- E2E actor test passes (HTTP ping + WS echo)
- Driver test suite restored (2282 tests); it type-checks and loads

## Blocker: engine actor2 workflow doesn't process Running event

### Evidence
- Fresh namespace, fresh pool config, generation 1
- Actor starts in 11ms, Running event sent immediately
- Guard times out after 10s with `actor_ready_timeout`
- The Running event goes: envoy WS → pegboard-envoy → actor_event_demuxer → signal to actor2 workflow
- But actor2 workflow never marks the actor as connectable

### Why test-envoy works but EngineActorDriver doesn't
- test-envoy uses a PERSISTENT envoy on a PERSISTENT pool
- The pool existed before the engine restarted, so the actor workflow may be v1 (not actor2)
- v1 actors process events through the serverless/conn SSE path, which works
- The force-v2 change routes ALL new serverless actors to actor2, where events aren't processed

### Root cause
The engine's v2 actor workflow (`pegboard_actor2`) receives the `Events` signal from `pegboard-envoy`'s `actor_event_demuxer`, but it does not correctly transition to the connectable state. The guard polls `connectable_ts` in the DB, which is never set.

### Fix needed (engine side)
Check `engine/packages/pegboard/src/workflows/actor2/mod.rs` - specifically how `process_signal` handles the `Events` signal with `EventActorStateUpdate{Running}`. It should set `connectable_ts` in the DB and transition to `Transition::Running`.
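The expected transition can be sketched as a tiny state update, with the caveat that `ActorRow`, `ActorState`, and `apply_event` are hypothetical names standing in for the real `pegboard_actor2` workflow state: on a Running update, record `connectable_ts` so the guard's readiness poll can succeed.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical state mirroring EventActorStateUpdate.
#[derive(PartialEq)]
enum ActorState {
    Running,
    Stopped,
}

// Hypothetical stand-in for the DB row the guard polls.
#[derive(Default)]
struct ActorRow {
    connectable_ts: Option<u64>,
}

// On a Running update, set connectable_ts; the guard's poll only
// succeeds once this field is non-None.
fn apply_event(row: &mut ActorRow, state: ActorState) {
    if state == ActorState::Running && row.connectable_ts.is_none() {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis() as u64;
        row.connectable_ts = Some(now);
    }
}

fn main() {
    let mut row = ActorRow::default();
    apply_event(&mut row, ActorState::Running);
    println!("connectable: {}", row.connectable_ts.is_some());
}
```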
---

`.agent/notes/native-bridge-bugs.md`
# Native Bridge Bugs

## Status
- **Actions**: PASS (increment, getCount, and state persistence all work)
- **WebSocket**: FAIL (client-side connect timeout)
- **SQLite**: FAIL (`sqlite3_open_v2` code 14, SQLITE_CANTOPEN)

## Test command (actions - works)
```bash
cd rivetkit-typescript/packages/rivetkit
npx tsx tests/standalone-native-test.mts
```
Requires the engine and test-envoy to be running, with the default namespace's metadata refreshed.

## WebSocket Bug

### Symptom
Client SDK `handle.connect()` times out. The server side works fully: envoy receives `ToEnvoyWebSocketOpen`, the wrapper fires `config.websocket()`, `EngineActorDriver.#envoyWebSocket` attaches listeners, the open event dispatches, and the actor sends a 128-byte message back. Envoy sends `ToRivetWebSocketOpen` AND `ToRivetWebSocketMessage`. But the client-side WS never opens.

### Root cause
The engine's guard/gateway receives `ToRivetWebSocketOpen` from the envoy but does NOT complete the client-side WS upgrade. This is likely a guard bug with v2 actors: the guard's WS proxy code path may not handle the v2 tunnel response correctly.

### Evidence
- Envoy sends `ToRivetWebSocketOpen{canHibernate: false}` at timestamp X ✓
- Envoy sends `ToRivetWebSocketMessage{128 bytes}` immediately after ✓
- Engine log: `websocket failed: Connection reset without closing handshake` for the gateway WS
- Client-side WS closes without ever receiving the open event

### NOT a rivetkit-native issue
The server-side flow (TSFN, EventTarget ws, ws.send via WebSocketSender, the actor framework) works correctly end to end. The bug is in how the engine's guard handles v2 actor WS connections.

### Additional issue: message_index conflict
The outgoing task in `actor.rs` (line ~459) sends `ToRivetWebSocketMessage` with hardcoded `message_index: 0`. But `send_actor_message` also sends messages starting at index 0. The guard may see duplicate indices and drop messages. Need to coordinate the message index between both paths.
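One way to coordinate the index, sketched with stdlib types only (the real fix belongs in `actor.rs` and `send_actor_message`; `MessageIndex` is an illustrative name): both send paths draw the next value from a single shared atomic counter instead of each starting at 0.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Shared monotonically increasing message index. Both the outgoing
// task and send_actor_message would take values from the same counter
// instead of hardcoding / independently starting at 0.
#[derive(Clone)]
struct MessageIndex(Arc<AtomicU64>);

impl MessageIndex {
    fn new() -> Self {
        Self(Arc::new(AtomicU64::new(0)))
    }

    // Returns the current index and advances the counter.
    fn next(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    let idx = MessageIndex::new();
    let path_a = idx.clone(); // e.g. the outgoing task
    let path_b = idx.clone(); // e.g. send_actor_message
    println!("{} {} {}", path_a.next(), path_b.next(), path_a.next());
}
```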

### Reproduce
```bash
cd rivetkit-typescript/packages/rivetkit
npx tsx tests/standalone-native-test.mts
```
Actions pass (3/3), WebSocket fails with connect timeout. Check Rust logs with `RIVET_LOG_LEVEL=debug`.

### Code locations
- `engine/packages/guard-core/src/proxy_service.rs` line 1548-1554 - CustomServe WS handler
- `engine/packages/guard-core/src/proxy_service.rs` line 927 - handle_websocket_upgrade
- `engine/sdks/rust/envoy-client/src/actor.rs` line ~459 - outgoing task with hardcoded message_index
- The guard's CustomServe handler (from the routing fn) should proxy ToRivetWebSocketOpen back to the client but doesn't complete the upgrade

## SQLite Bug

### Symptom
`sqlite3_open_v2 failed with code 14` (SQLITE_CANTOPEN)

### Root cause
The native SQLite VFS (`rivetkit-native/src/database.rs`) creates an `EnvoyKv` adapter that routes KV operations through the `EnvoyHandle`. But the VFS registration or database open may fail because:
1. The actor isn't ready when the DB tries to open
2. The VFS name conflicts
3. The KV batch_get returns unexpected data format

### What to investigate
- Add logging to `EnvoyKv` trait methods in `rivetkit-native/src/database.rs`
- Check if `open_database_from_envoy` is called at the right time
- Verify the envoy handle's KV methods work for the actor
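A stdlib-only sketch of the kind of logging wrapper that helps with the first bullet (the `Kv` trait and its `batch_get` signature here are illustrative, not the real `EnvoyKv` interface): wrap the adapter so every call and its result shape gets printed before SQLite sees it.

```rust
// Illustrative KV trait standing in for the real EnvoyKv adapter.
trait Kv {
    fn batch_get(&self, keys: &[Vec<u8>]) -> Vec<Option<Vec<u8>>>;
}

// Toy backing store so the example runs standalone.
struct InMemory;
impl Kv for InMemory {
    fn batch_get(&self, keys: &[Vec<u8>]) -> Vec<Option<Vec<u8>>> {
        keys.iter().map(|k| Some(k.clone())).collect()
    }
}

// Wrapper that logs each call; dropping this around the adapter passed
// to the VFS shows whether batch_get runs at all and what it returns.
struct LoggingKv<K: Kv>(K);
impl<K: Kv> Kv for LoggingKv<K> {
    fn batch_get(&self, keys: &[Vec<u8>]) -> Vec<Option<Vec<u8>>> {
        eprintln!("kv.batch_get: {} keys requested", keys.len());
        let out = self.0.batch_get(keys);
        let hits = out.iter().filter(|v| v.is_some()).count();
        eprintln!("kv.batch_get: {hits} hits");
        out
    }
}

fn main() {
    let kv = LoggingKv(InMemory);
    let res = kv.batch_get(&[b"page0".to_vec()]);
    println!("{} results", res.len());
}
```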

### Code locations
- `rivetkit-native/src/database.rs` - EnvoyKv impl + open_database_from_envoy
- `rivetkit-typescript/packages/sqlite-native/src/vfs.rs` - KvVfs::register + open_database
- `src/drivers/engine/actor-driver.ts` line ~570 - getNativeSqliteProvider
---

`.agent/notes/rust-envoy-client-issues.md`
# Rust Envoy Client: Known Issues

Audit of `engine/sdks/rust/envoy-client/src/` performed 2026-04-07.

---

## Behavioral Bugs

### B1: `destroy_actor` bypasses engine protocol -- FIXED

**File:** `handle.rs:60-65`, `envoy.rs:239-241`, `actor.rs:192-195`

The TS version sends `ActorIntentStop` and waits for the engine to issue a `CommandStopActor` with `StopActorReason::Destroy`. The Rust version sent `ToActor::Destroy` directly to the actor, force-killing it locally without engine confirmation.

**Fix applied:** `destroy_actor` now sends `ActorIntentStop` event (matching TS behavior). Removed `DestroyActor` message variant and `ToActor::Destroy`.

### B2: Graceful shutdown force-kills after 1 second -- FIXED

**File:** `envoy.rs:409-416`

`handle_shutdown` spawned a task that slept 1 second then sent `Stop`, which dropped all actor channels. Actors got no `on_actor_stop` callback.

**Fix applied:** Now polls actor handle closure with a deadline from `serverlessDrainGracePeriod` in `ProtocolMetadata` (falls back to 30s). All actors get a chance to stop cleanly.
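The polling-with-deadline shape of the fix, as a stdlib-only sketch (the real code is async in `envoy.rs`; `drain_until` and its closure are illustrative): check actor-handle closure repeatedly until everything has stopped or the grace period expires, falling back to 30s when `serverlessDrainGracePeriod` is absent.

```rust
use std::time::{Duration, Instant};

// Poll `all_stopped` until it returns true or the drain deadline
// passes. `grace` stands in for serverlessDrainGracePeriod from
// ProtocolMetadata; None falls back to 30s. Returns true on a clean
// drain, false if the deadline was hit.
fn drain_until<F: FnMut() -> bool>(mut all_stopped: F, grace: Option<Duration>) -> bool {
    let deadline = Instant::now() + grace.unwrap_or(Duration::from_secs(30));
    while Instant::now() < deadline {
        if all_stopped() {
            return true;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    false
}

fn main() {
    // Simulated actors that all finish stopping on the third poll.
    let mut ticks = 0;
    let clean = drain_until(
        || {
            ticks += 1;
            ticks >= 3
        },
        Some(Duration::from_secs(1)),
    );
    println!("clean shutdown: {clean}");
}
```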

---

## Performance: Fixed

### P1: `ws_send` clones entire outbound message -- FIXED

**File:** `connection.rs:213-223`

Took `&protocol::ToRivet`, then called `message.clone()` because `wrap_latest` needed ownership.

**Fix applied:** Changed signature to take `protocol::ToRivet` by value. All callers construct fresh values inline (except `send_actor_message` which clones for potential buffering).
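The signature change in miniature (types here are toy stand-ins for `protocol::ToRivet` and the wrapped form): taking the message by value lets the wrapper take ownership directly, so the clone the old `&`-taking signature forced simply disappears.

```rust
// Toy stand-in for protocol::ToRivet.
#[derive(Clone, PartialEq, Debug)]
struct ToRivet(Vec<u8>);

struct Wrapped {
    inner: ToRivet,
    index: u64,
}

// Before: fn wrap_latest(msg: &ToRivet, ...) forced msg.clone()
// internally. After: take the value and move it into the wrapper.
fn wrap_latest(msg: ToRivet, index: u64) -> Wrapped {
    Wrapped { inner: msg, index } // move, no clone
}

fn main() {
    let w = wrap_latest(ToRivet(vec![1, 2, 3]), 7);
    println!("{} bytes at index {}", w.inner.0.len(), w.index);
}
```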

### P2: Stringify allocates even when debug logging is off -- FIXED

**File:** `connection.rs:160, 214`, `actor.rs:897`

`tracing::debug!(data = stringify_to_envoy(&decoded), ...)` eagerly evaluated the stringify function regardless of log level.

**Fix applied:** Gated behind `tracing::enabled!(tracing::Level::DEBUG)`.

### P3: `handle_commands` clones config, hibernating_requests, preloaded_kv -- FIXED

**File:** `commands.rs:18-23`

`val.config.clone()`, `val.hibernating_requests.clone()`, `val.preloaded_kv.clone()` when the fields could be moved.

**Fix applied:** Only clone `val.config.name` for the `ActorEntry`, then move the rest into `create_actor`.

### P4: `_config` field in `ActorContext` is unused, forces a clone -- FIXED

**File:** `actor.rs:81, 129`

`_config: config.clone()` stored a full `ActorConfig` that was never read.

**Fix applied:** Removed the `_config` field. `config` is passed directly to `on_actor_start` without cloning.

### P5: `kv_put` clones keys and values -- FIXED

**File:** `handle.rs:202-203`

`entries.iter().map(|(k, _)| k.clone()).collect()` when `entries` is owned and could be consumed.

**Fix applied:** `let (keys, values): (Vec<_>, Vec<_>) = entries.into_iter().unzip();`
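The unzip pattern in miniature: consuming the owned `Vec` of pairs once splits it into two `Vec`s with no per-element clones.

```rust
fn main() {
    let entries: Vec<(String, Vec<u8>)> = vec![
        ("a".into(), vec![1]),
        ("b".into(), vec![2, 3]),
    ];
    // into_iter() consumes the Vec; unzip moves each key and value
    // into its own collection, so nothing is cloned.
    let (keys, values): (Vec<String>, Vec<Vec<u8>>) = entries.into_iter().unzip();
    println!("{keys:?} {values:?}");
}
```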

### P6: `parse_list_response` clones values -- FIXED

**File:** `handle.rs:358-366`

Keys were consumed via `into_iter()` but values were indexed and cloned.

**Fix applied:** `resp.keys.into_iter().zip(resp.values).collect()`
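The zip-and-collect pattern in miniature: both `Vec`s are consumed, so values are moved into the pairs instead of being indexed and cloned.

```rust
fn main() {
    let keys = vec!["k1".to_string(), "k2".to_string()];
    let values = vec![vec![1u8], vec![2u8]];
    // zip consumes both Vecs in lockstep; no indexing, no cloning.
    let pairs: Vec<(String, Vec<u8>)> = keys.into_iter().zip(values).collect();
    println!("{} pairs", pairs.len());
}
```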

---

## Performance: Not Worth the Effort

### P7: BufferMap uses `HashMap<String, T>` with string key allocation

**File:** `utils.rs:122-172`

`cyrb53()` returns a hex `String` on every lookup, and it is used on hot paths (tunnel message dispatch). However, the inputs are only 8 bytes total, producing ~13-char strings; such tiny allocations are typically served from the allocator's thread-local caches.

### P8: Redundant inner `Arc` on `ws_tx` and `protocol_metadata`

**File:** `context.rs:15-16`

`SharedContext` is already behind `Arc<SharedContext>`. Inner `Arc` wrappers add one extra indirection and refcount. Negligible impact.

### P9: `tokio::sync::Mutex` vs `std::sync::Mutex`

**File:** `context.rs:15-16`

Neither lock is held across `.await`. `protocol_metadata` is a clear-cut candidate for `std::sync::Mutex`. `ws_tx` holds the lock during `serde_bare` encode, making `tokio::sync::Mutex` defensible.

### P10: O(n*m) key matching in `kv_get`

**File:** `handle.rs:107-119`

Nested loop over requested keys and response keys. Real quadratic complexity, but n is typically small for KV gets.
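If it ever did matter, the O(n + m) version is straightforward (a sketch, not the real `kv_get` signature): index the response by key once, then look each requested key up.

```rust
use std::collections::HashMap;

// O(n + m) matching: build a map over the response once, then probe
// it per requested key, instead of the nested loop in kv_get.
// Missing keys come back as None.
fn match_values(
    requested: &[Vec<u8>],
    resp_keys: Vec<Vec<u8>>,
    resp_values: Vec<Vec<u8>>,
) -> Vec<Option<Vec<u8>>> {
    let mut by_key: HashMap<Vec<u8>, Vec<u8>> =
        resp_keys.into_iter().zip(resp_values).collect();
    requested.iter().map(|k| by_key.remove(k)).collect()
}

fn main() {
    let out = match_values(
        &[b"a".to_vec(), b"missing".to_vec()],
        vec![b"a".to_vec()],
        vec![b"1".to_vec()],
    );
    println!("{out:?}");
}
```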

### P11: Double actor lookups in tunnel.rs

**File:** `tunnel.rs:45-59, 112-148`

`get_actor` is called twice (once to check existence, once to use the result). The borrow checker prevents the naive fix, since `get_actor(&self)` borrows all of `ctx`.

### P12: Headers cloned from HashableMap instead of moved

**File:** `actor.rs:309`, `tunnel.rs:140`

`req.headers.iter().map(|(k, v)| (k.clone(), v.clone())).collect()` when `req` is owned. Could use `into_iter()`.

### P13: `handle_ack_events` iterates checkpoints twice

**File:** `events.rs:37-67`

First pass retains events, second pass checks for actor removal. Removals could instead be collected during the first pass.
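The single-pass shape, sketched with plain stdlib collections (the real checkpoint types in `events.rs` differ; `ack_events` and its queue layout are illustrative): retain unacked events per actor and record emptied actors in the same loop, so there is no second scan over all checkpoints.

```rust
use std::collections::HashMap;

// Drop events acked through `acked_through` and collect actors whose
// queues emptied in the same pass, then remove them. Returns the
// removed actor ids.
fn ack_events(
    queues: &mut HashMap<String, Vec<u64>>, // actor id -> pending event indices
    acked_through: u64,
) -> Vec<String> {
    let mut emptied = Vec::new();
    for (actor, pending) in queues.iter_mut() {
        pending.retain(|&idx| idx > acked_through);
        if pending.is_empty() {
            emptied.push(actor.clone());
        }
    }
    // Only iterates the (usually tiny) set that emptied, not all queues.
    for actor in &emptied {
        queues.remove(actor);
    }
    emptied
}

fn main() {
    let mut q = HashMap::new();
    q.insert("a".to_string(), vec![1u64, 2]);
    q.insert("b".to_string(), vec![5u64]);
    let removed = ack_events(&mut q, 3);
    println!("removed {removed:?}, {} queue(s) remain", q.len());
}
```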
---

`.agent/notes/v2-metadata-delay-bug.md`
# Bug: v2 actor dispatch requires ~5s delay after metadata refresh

## Reproduce

```bash
# 1. Start engine with the force-v2 hack (see below)
rm -rf ~/.local/share/rivet-engine/db
cargo run --bin rivet-engine -- start

# 2. Start test-envoy
RIVET_ENDPOINT=http://127.0.0.1:6420 RIVET_TOKEN=dev RIVET_NAMESPACE=default \
RIVET_POOL_NAME=test-envoy AUTOSTART_ENVOY=0 AUTOSTART_SERVER=1 \
AUTOCONFIGURE_SERVERLESS=0 cargo run -p rivet-test-envoy

# 3. In another terminal, run this:
NS="repro-$(date +%s)"
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
http://localhost:6420/namespaces -d "{\"name\":\"$NS\",\"display_name\":\"$NS\"}"
curl -s -X PUT -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/runner-configs/test-envoy?namespace=$NS" \
-d '{"datacenters":{"default":{"serverless":{"url":"http://localhost:5051/api/rivet","request_lifespan":300,"max_concurrent_actors":10000,"slots_per_runner":1,"min_runners":0,"max_runners":10000}}}}'
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/runner-configs/test-envoy/refresh-metadata?namespace=$NS" -d '{}'

# THIS FAILS (no delay):
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/actors?namespace=$NS" \
-d "{\"name\":\"test\",\"key\":\"k-$(date +%s)\",\"runner_name_selector\":\"test-envoy\",\"crash_policy\":\"sleep\"}" \
| python3 -c "import json,sys; a=json.load(sys.stdin)['actor']['actor_id']; print(a)" \
| xargs -I{} curl -s --max-time 12 -H "X-Rivet-Token: dev" -H "X-Rivet-Target: actor" -H "X-Rivet-Actor: {}" http://localhost:6420/ping
# Expected: actor_ready_timeout

# THIS WORKS (5s delay):
sleep 5
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/actors?namespace=$NS" \
-d "{\"name\":\"test\",\"key\":\"k2-$(date +%s)\",\"runner_name_selector\":\"test-envoy\",\"crash_policy\":\"sleep\"}" \
| python3 -c "import json,sys; a=json.load(sys.stdin)['actor']['actor_id']; print(a)" \
| xargs -I{} curl -s --max-time 12 -H "X-Rivet-Token: dev" -H "X-Rivet-Target: actor" -H "X-Rivet-Actor: {}" http://localhost:6420/ping
# Expected: 200 with JSON body
```

## Symptom

Actor is created (200), envoy receives CommandStartActor, actor starts in ~10ms, EventActorStateUpdate{Running} is sent back via WS, but the guard returns `actor_ready_timeout` after 10 seconds. The actor never becomes connectable.

## Root cause

After `refresh-metadata` stores `envoyProtocolVersion` in the DB, the runner pool workflow (`pegboard_runner_pool`) needs to restart its serverless connection cycle to use v2 POST instead of v1 GET. This takes ~2-5 seconds because:

1. The `pegboard_runner_pool_metadata_poller` workflow runs on a polling interval
2. The `pegboard_serverless_conn` workflow needs to cycle its existing connections
3. The `pegboard_runner_pool` workflow reads the updated config and spawns new v2 connections

Until this happens, the engine dispatches via v1 GET SSE which doesn't deliver the start payload to the envoy.

## Code locations

### Force-v2 hack (temporary)
`engine/packages/pegboard/src/workflows/actor/runtime.rs` line ~268:
```rust
// Changed from: if pool.and_then(|p| p.protocol_version).is_some()
// To force v2 for all serverless pools:
if pool.as_ref().and_then(|p| p.protocol_version).is_some() || for_serverless {
```

### Where protocol_version is stored
`engine/packages/pegboard/src/workflows/runner_pool_metadata_poller.rs` line ~214:
```rust
if let Some(protocol_version) = metadata.envoy_protocol_version {
tx.write(&protocol_version_key, protocol_version)?;
}
```

### Where protocol_version is read for v1→v2 migration decision
`engine/packages/pegboard/src/workflows/actor/runtime.rs` in `allocate_actor_v2`:
```rust
let pool_res = ctx.op(crate::ops::runner_config::get::Input { ... }).await?;
// ...
if pool.and_then(|p| p.protocol_version).is_some() {
return Ok(AllocateActorOutputV2 { status: AllocateActorStatus::MigrateToV2, ... });
}
```

### Where runner config is cached (may need invalidation)
`engine/packages/pegboard/src/ops/runner_config/get.rs` - reads ProtocolVersionKey from DB

### Where v1 (GET) vs v2 (POST) connection is made
- v1: `engine/packages/pegboard/src/workflows/serverless/conn.rs` line ~301: `client.get(endpoint_url)`
- v2: `engine/packages/pegboard-outbound/src/lib.rs` line ~316: `client.post(endpoint_url).body(payload)`

## Fix needed

After `refresh-metadata` stores `envoyProtocolVersion`, the runner pool should immediately use v2 POST without waiting for the metadata poller cycle. Options:
1. Signal the runner pool workflow to restart connections when metadata changes
2. Make the `refresh-metadata` API synchronously update the runner pool state
3. Have the serverless conn workflow check protocol_version before each connection attempt instead of relying on the metadata poller cycle
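Option 3 in miniature, as a hedged sketch (`Dispatch` and `choose_dispatch` are illustrative names, not engine code): the v1-vs-v2 decision is made from the stored `protocol_version` at each connection attempt, so a fresh DB write takes effect on the very next attempt instead of after a poller cycle.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Dispatch {
    V1Get,  // v1 serverless GET/SSE path
    V2Post, // v2 POST path
}

// Hypothetical per-attempt read: decide from the protocol_version as
// stored in the DB right now, not from a value cached when the
// connection cycle started.
fn choose_dispatch(protocol_version: Option<u32>) -> Dispatch {
    if protocol_version.is_some() {
        Dispatch::V2Post
    } else {
        Dispatch::V1Get
    }
}

fn main() {
    // Before refresh-metadata lands: no version stored -> v1 GET.
    println!("{:?}", choose_dispatch(None));
    // Right after the DB write, the next attempt sees it -> v2 POST.
    println!("{:?}", choose_dispatch(Some(2)));
}
```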