`.agent/notes/driver-test-status.md`
# Driver Test Suite Status

## What works
- rivet-envoy-client (Rust) fully functional
- rivetkit-native NAPI module, TSFN callbacks, envoy lifecycle all work
- Standalone test: create actor + ping = 22-32ms (both test-envoy and native)
- Gateway query path (getOrCreate) works: 112ms
- E2E actor test passes (HTTP ping + WS echo)
- Driver test suite restored (2282 tests); it type-checks and loads

## Blocker: engine actor2 workflow doesn't process Running event

### Evidence
- Fresh namespace, fresh pool config, generation 1
- Actor starts in 11ms, Running event sent immediately
- Guard times out after 10s with `actor_ready_timeout`
- The Running event goes: envoy WS → pegboard-envoy → actor_event_demuxer → signal to actor2 workflow
- But actor2 workflow never marks the actor as connectable

### Why test-envoy works but EngineActorDriver doesn't
- test-envoy uses a PERSISTENT envoy on a PERSISTENT pool
- The pool existed before the engine restarted, so the actor workflow may be v1 (not actor2)
- v1 actors process events through the serverless/conn SSE path, which works
- The force-v2 change routes ALL new serverless actors to actor2, where events aren't processed

### Root cause
The engine's v2 actor workflow (`pegboard_actor2`) receives the `Events` signal from `pegboard-envoy`'s `actor_event_demuxer`, but it does not correctly transition to the connectable state. The guard polls `connectable_ts` in the DB, which is never set.

### Fix needed (engine side)
Check `engine/packages/pegboard/src/workflows/actor2/mod.rs` - specifically how `process_signal` handles the `Events` signal with `EventActorStateUpdate{Running}`. It should set `connectable_ts` in the DB and transition to `Transition::Running`.
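The expected transition can be sketched as a tiny state update, with the caveat that `ActorRow`, `ActorState`, and `apply_event` are hypothetical names standing in for the real `pegboard_actor2` workflow state: on a Running update, record `connectable_ts` so the guard's readiness poll can succeed.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical state mirroring EventActorStateUpdate.
#[derive(PartialEq)]
enum ActorState {
    Running,
    Stopped,
}

// Hypothetical stand-in for the DB row the guard polls.
#[derive(Default)]
struct ActorRow {
    connectable_ts: Option<u64>,
}

// On a Running update, set connectable_ts; the guard's poll only
// succeeds once this field is non-None.
fn apply_event(row: &mut ActorRow, state: ActorState) {
    if state == ActorState::Running && row.connectable_ts.is_none() {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis() as u64;
        row.connectable_ts = Some(now);
    }
}

fn main() {
    let mut row = ActorRow::default();
    apply_event(&mut row, ActorState::Running);
    println!("connectable: {}", row.connectable_ts.is_some());
}
```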
---

`.agent/notes/native-bridge-bugs.md`
# Native Bridge Bugs

## Status
- **Actions**: PASS (increment, getCount, and state persistence all work)
- **WebSocket**: FAIL (client-side connect timeout)
- **SQLite**: FAIL (`sqlite3_open_v2` code 14, SQLITE_CANTOPEN)

## Test command (actions - works)
```bash
cd rivetkit-typescript/packages/rivetkit
npx tsx tests/standalone-native-test.mts
```
Requires the engine and test-envoy to be running, with the default namespace's metadata refreshed.

## WebSocket Bug

### Symptom
Client SDK `handle.connect()` times out. The server side works fully: envoy receives `ToEnvoyWebSocketOpen`, the wrapper fires `config.websocket()`, `EngineActorDriver.#envoyWebSocket` attaches listeners, the open event dispatches, and the actor sends a 128-byte message back. Envoy sends `ToRivetWebSocketOpen` AND `ToRivetWebSocketMessage`. But the client-side WS never opens.

### Root cause
The engine's guard/gateway receives `ToRivetWebSocketOpen` from the envoy but does NOT complete the client-side WS upgrade. This is likely a guard bug with v2 actors: the guard's WS proxy code path may not handle the v2 tunnel response correctly.

### Evidence
- Envoy sends `ToRivetWebSocketOpen{canHibernate: false}` at timestamp X ✓
- Envoy sends `ToRivetWebSocketMessage{128 bytes}` immediately after ✓
- Engine log: `websocket failed: Connection reset without closing handshake` for the gateway WS
- Client-side WS closes without ever receiving the open event

### NOT a rivetkit-native issue
The server-side flow (TSFN, EventTarget ws, ws.send via WebSocketSender, the actor framework) works correctly end to end. The bug is in how the engine's guard handles v2 actor WS connections.

### Additional issue: message_index conflict
The outgoing task in `actor.rs` (line ~459) sends `ToRivetWebSocketMessage` with hardcoded `message_index: 0`. But `send_actor_message` also sends messages starting at index 0. The guard may see duplicate indices and drop messages. Need to coordinate the message index between both paths.
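One way to coordinate the index, sketched with stdlib types only (the real fix belongs in `actor.rs` and `send_actor_message`; `MessageIndex` is an illustrative name): both send paths draw the next value from a single shared atomic counter instead of each starting at 0.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Shared monotonically increasing message index. Both the outgoing
// task and send_actor_message would take values from the same counter
// instead of hardcoding / independently starting at 0.
#[derive(Clone)]
struct MessageIndex(Arc<AtomicU64>);

impl MessageIndex {
    fn new() -> Self {
        Self(Arc::new(AtomicU64::new(0)))
    }

    // Returns the current index and advances the counter.
    fn next(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    let idx = MessageIndex::new();
    let path_a = idx.clone(); // e.g. the outgoing task
    let path_b = idx.clone(); // e.g. send_actor_message
    println!("{} {} {}", path_a.next(), path_b.next(), path_a.next());
}
```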

### Reproduce
```bash
cd rivetkit-typescript/packages/rivetkit
npx tsx tests/standalone-native-test.mts
```
Actions pass (3/3), WebSocket fails with connect timeout. Check Rust logs with `RIVET_LOG_LEVEL=debug`.

### Code locations
- `engine/packages/guard-core/src/proxy_service.rs` line 1548-1554 - CustomServe WS handler
- `engine/packages/guard-core/src/proxy_service.rs` line 927 - handle_websocket_upgrade
- `engine/sdks/rust/envoy-client/src/actor.rs` line ~459 - outgoing task with hardcoded message_index
- The guard's CustomServe handler (from the routing fn) should proxy ToRivetWebSocketOpen back to the client but doesn't complete the upgrade

## SQLite Bug

### Symptom
`sqlite3_open_v2 failed with code 14` (SQLITE_CANTOPEN)

### Root cause
The native SQLite VFS (`rivetkit-native/src/database.rs`) creates an `EnvoyKv` adapter that routes KV operations through the `EnvoyHandle`. But the VFS registration or database open may fail because:
1. The actor isn't ready when the DB tries to open
2. The VFS name conflicts
3. The KV batch_get returns unexpected data format

### What to investigate
- Add logging to `EnvoyKv` trait methods in `rivetkit-native/src/database.rs`
- Check if `open_database_from_envoy` is called at the right time
- Verify the envoy handle's KV methods work for the actor
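A stdlib-only sketch of the kind of logging wrapper that helps with the first bullet (the `Kv` trait and its `batch_get` signature here are illustrative, not the real `EnvoyKv` interface): wrap the adapter so every call and its result shape gets printed before SQLite sees it.

```rust
// Illustrative KV trait standing in for the real EnvoyKv adapter.
trait Kv {
    fn batch_get(&self, keys: &[Vec<u8>]) -> Vec<Option<Vec<u8>>>;
}

// Toy backing store so the example runs standalone.
struct InMemory;
impl Kv for InMemory {
    fn batch_get(&self, keys: &[Vec<u8>]) -> Vec<Option<Vec<u8>>> {
        keys.iter().map(|k| Some(k.clone())).collect()
    }
}

// Wrapper that logs each call; dropping this around the adapter passed
// to the VFS shows whether batch_get runs at all and what it returns.
struct LoggingKv<K: Kv>(K);
impl<K: Kv> Kv for LoggingKv<K> {
    fn batch_get(&self, keys: &[Vec<u8>]) -> Vec<Option<Vec<u8>>> {
        eprintln!("kv.batch_get: {} keys requested", keys.len());
        let out = self.0.batch_get(keys);
        let hits = out.iter().filter(|v| v.is_some()).count();
        eprintln!("kv.batch_get: {hits} hits");
        out
    }
}

fn main() {
    let kv = LoggingKv(InMemory);
    let res = kv.batch_get(&[b"page0".to_vec()]);
    println!("{} results", res.len());
}
```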

### Code locations
- `rivetkit-native/src/database.rs` - EnvoyKv impl + open_database_from_envoy
- `rivetkit-typescript/packages/sqlite-native/src/vfs.rs` - KvVfs::register + open_database
- `src/drivers/engine/actor-driver.ts` line ~570 - getNativeSqliteProvider
---

`.agent/notes/rust-envoy-client-issues.md`
# Rust Envoy Client: Known Issues

Audit of `engine/sdks/rust/envoy-client/src/` performed 2026-04-07.

---

## Behavioral Bugs

### B1: `destroy_actor` bypasses engine protocol -- FIXED

**File:** `handle.rs:60-65`, `envoy.rs:239-241`, `actor.rs:192-195`

The TS version sends `ActorIntentStop` and waits for the engine to issue a `CommandStopActor` with `StopActorReason::Destroy`. The Rust version sent `ToActor::Destroy` directly to the actor, force-killing it locally without engine confirmation.

**Fix applied:** `destroy_actor` now sends `ActorIntentStop` event (matching TS behavior). Removed `DestroyActor` message variant and `ToActor::Destroy`.

### B2: Graceful shutdown force-kills after 1 second -- FIXED

**File:** `envoy.rs:409-416`

`handle_shutdown` spawned a task that slept 1 second then sent `Stop`, which dropped all actor channels. Actors got no `on_actor_stop` callback.

**Fix applied:** Now polls actor handle closure with a deadline from `serverlessDrainGracePeriod` in `ProtocolMetadata` (falls back to 30s). All actors get a chance to stop cleanly.
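The polling-with-deadline shape of the fix, as a stdlib-only sketch (the real code is async in `envoy.rs`; `drain_until` and its closure are illustrative): check actor-handle closure repeatedly until everything has stopped or the grace period expires, falling back to 30s when `serverlessDrainGracePeriod` is absent.

```rust
use std::time::{Duration, Instant};

// Poll `all_stopped` until it returns true or the drain deadline
// passes. `grace` stands in for serverlessDrainGracePeriod from
// ProtocolMetadata; None falls back to 30s. Returns true on a clean
// drain, false if the deadline was hit.
fn drain_until<F: FnMut() -> bool>(mut all_stopped: F, grace: Option<Duration>) -> bool {
    let deadline = Instant::now() + grace.unwrap_or(Duration::from_secs(30));
    while Instant::now() < deadline {
        if all_stopped() {
            return true;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    false
}

fn main() {
    // Simulated actors that all finish stopping on the third poll.
    let mut ticks = 0;
    let clean = drain_until(
        || {
            ticks += 1;
            ticks >= 3
        },
        Some(Duration::from_secs(1)),
    );
    println!("clean shutdown: {clean}");
}
```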

---

## Performance: Fixed

### P1: `ws_send` clones entire outbound message -- FIXED

**File:** `connection.rs:213-223`

Took `&protocol::ToRivet`, then called `message.clone()` because `wrap_latest` needed ownership.

**Fix applied:** Changed signature to take `protocol::ToRivet` by value. All callers construct fresh values inline (except `send_actor_message` which clones for potential buffering).
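The signature change in miniature (types here are toy stand-ins for `protocol::ToRivet` and the wrapped form): taking the message by value lets the wrapper take ownership directly, so the clone the old `&`-taking signature forced simply disappears.

```rust
// Toy stand-in for protocol::ToRivet.
#[derive(Clone, PartialEq, Debug)]
struct ToRivet(Vec<u8>);

struct Wrapped {
    inner: ToRivet,
    index: u64,
}

// Before: fn wrap_latest(msg: &ToRivet, ...) forced msg.clone()
// internally. After: take the value and move it into the wrapper.
fn wrap_latest(msg: ToRivet, index: u64) -> Wrapped {
    Wrapped { inner: msg, index } // move, no clone
}

fn main() {
    let w = wrap_latest(ToRivet(vec![1, 2, 3]), 7);
    println!("{} bytes at index {}", w.inner.0.len(), w.index);
}
```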

### P2: Stringify allocates even when debug logging is off -- FIXED

**File:** `connection.rs:160, 214`, `actor.rs:897`

`tracing::debug!(data = stringify_to_envoy(&decoded), ...)` eagerly evaluated the stringify function regardless of log level.

**Fix applied:** Gated behind `tracing::enabled!(tracing::Level::DEBUG)`.

### P3: `handle_commands` clones config, hibernating_requests, preloaded_kv -- FIXED

**File:** `commands.rs:18-23`

`val.config.clone()`, `val.hibernating_requests.clone()`, `val.preloaded_kv.clone()` when the fields could be moved.

**Fix applied:** Only clone `val.config.name` for the `ActorEntry`, then move the rest into `create_actor`.

### P4: `_config` field in `ActorContext` is unused, forces a clone -- FIXED

**File:** `actor.rs:81, 129`

`_config: config.clone()` stored a full `ActorConfig` that was never read.

**Fix applied:** Removed the `_config` field. `config` is passed directly to `on_actor_start` without cloning.

### P5: `kv_put` clones keys and values -- FIXED

**File:** `handle.rs:202-203`

`entries.iter().map(|(k, _)| k.clone()).collect()` when `entries` is owned and could be consumed.

**Fix applied:** `let (keys, values): (Vec<_>, Vec<_>) = entries.into_iter().unzip();`
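The unzip pattern in miniature: consuming the owned `Vec` of pairs once splits it into two `Vec`s with no per-element clones.

```rust
fn main() {
    let entries: Vec<(String, Vec<u8>)> = vec![
        ("a".into(), vec![1]),
        ("b".into(), vec![2, 3]),
    ];
    // into_iter() consumes the Vec; unzip moves each key and value
    // into its own collection, so nothing is cloned.
    let (keys, values): (Vec<String>, Vec<Vec<u8>>) = entries.into_iter().unzip();
    println!("{keys:?} {values:?}");
}
```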

### P6: `parse_list_response` clones values -- FIXED

**File:** `handle.rs:358-366`

Keys were consumed via `into_iter()` but values were indexed and cloned.

**Fix applied:** `resp.keys.into_iter().zip(resp.values).collect()`
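The zip-and-collect pattern in miniature: both `Vec`s are consumed, so values are moved into the pairs instead of being indexed and cloned.

```rust
fn main() {
    let keys = vec!["k1".to_string(), "k2".to_string()];
    let values = vec![vec![1u8], vec![2u8]];
    // zip consumes both Vecs in lockstep; no indexing, no cloning.
    let pairs: Vec<(String, Vec<u8>)> = keys.into_iter().zip(values).collect();
    println!("{} pairs", pairs.len());
}
```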

---

## Performance: Not Worth the Effort

### P7: BufferMap uses `HashMap<String, T>` with string key allocation

**File:** `utils.rs:122-172`

`cyrb53()` returns a hex `String` on every lookup, and it is used on hot paths (tunnel message dispatch). However, the inputs are only 8 bytes total, producing ~13-char strings; such tiny allocations are typically served from the allocator's thread-local caches.

### P8: Redundant inner `Arc` on `ws_tx` and `protocol_metadata`

**File:** `context.rs:15-16`

`SharedContext` is already behind `Arc<SharedContext>`. Inner `Arc` wrappers add one extra indirection and refcount. Negligible impact.

### P9: `tokio::sync::Mutex` vs `std::sync::Mutex`

**File:** `context.rs:15-16`

Neither lock is held across `.await`. `protocol_metadata` is a clear-cut candidate for `std::sync::Mutex`. `ws_tx` holds the lock during `serde_bare` encode, making `tokio::sync::Mutex` defensible.

### P10: O(n*m) key matching in `kv_get`

**File:** `handle.rs:107-119`

Nested loop over requested keys and response keys. Real quadratic complexity, but n is typically small for KV gets.
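If it ever did matter, the O(n + m) version is straightforward (a sketch, not the real `kv_get` signature): index the response by key once, then look each requested key up.

```rust
use std::collections::HashMap;

// O(n + m) matching: build a map over the response once, then probe
// it per requested key, instead of the nested loop in kv_get.
// Missing keys come back as None.
fn match_values(
    requested: &[Vec<u8>],
    resp_keys: Vec<Vec<u8>>,
    resp_values: Vec<Vec<u8>>,
) -> Vec<Option<Vec<u8>>> {
    let mut by_key: HashMap<Vec<u8>, Vec<u8>> =
        resp_keys.into_iter().zip(resp_values).collect();
    requested.iter().map(|k| by_key.remove(k)).collect()
}

fn main() {
    let out = match_values(
        &[b"a".to_vec(), b"missing".to_vec()],
        vec![b"a".to_vec()],
        vec![b"1".to_vec()],
    );
    println!("{out:?}");
}
```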

### P11: Double actor lookups in tunnel.rs

**File:** `tunnel.rs:45-59, 112-148`

`get_actor` is called twice (once to check existence, once to use the result). The borrow checker prevents the naive fix, since `get_actor(&self)` borrows all of `ctx`.

### P12: Headers cloned from HashableMap instead of moved

**File:** `actor.rs:309`, `tunnel.rs:140`

`req.headers.iter().map(|(k, v)| (k.clone(), v.clone())).collect()` when `req` is owned. Could use `into_iter()`.

### P13: `handle_ack_events` iterates checkpoints twice

**File:** `events.rs:37-67`

First pass retains events, second pass checks for actor removal. Removals could instead be collected during the first pass.
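The single-pass shape, sketched with plain stdlib collections (the real checkpoint types in `events.rs` differ; `ack_events` and its queue layout are illustrative): retain unacked events per actor and record emptied actors in the same loop, so there is no second scan over all checkpoints.

```rust
use std::collections::HashMap;

// Drop events acked through `acked_through` and collect actors whose
// queues emptied in the same pass, then remove them. Returns the
// removed actor ids.
fn ack_events(
    queues: &mut HashMap<String, Vec<u64>>, // actor id -> pending event indices
    acked_through: u64,
) -> Vec<String> {
    let mut emptied = Vec::new();
    for (actor, pending) in queues.iter_mut() {
        pending.retain(|&idx| idx > acked_through);
        if pending.is_empty() {
            emptied.push(actor.clone());
        }
    }
    // Only iterates the (usually tiny) set that emptied, not all queues.
    for actor in &emptied {
        queues.remove(actor);
    }
    emptied
}

fn main() {
    let mut q = HashMap::new();
    q.insert("a".to_string(), vec![1u64, 2]);
    q.insert("b".to_string(), vec![5u64]);
    let removed = ack_events(&mut q, 3);
    println!("removed {removed:?}, {} queue(s) remain", q.len());
}
```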
---

`.agent/notes/v2-metadata-delay-bug.md`
# Bug: v2 actor dispatch requires ~5s delay after metadata refresh

## Reproduce

```bash
# 1. Start engine with the force-v2 hack (see below)
rm -rf ~/.local/share/rivet-engine/db
cargo run --bin rivet-engine -- start

# 2. Start test-envoy
RIVET_ENDPOINT=http://127.0.0.1:6420 RIVET_TOKEN=dev RIVET_NAMESPACE=default \
RIVET_POOL_NAME=test-envoy AUTOSTART_ENVOY=0 AUTOSTART_SERVER=1 \
AUTOCONFIGURE_SERVERLESS=0 cargo run -p rivet-test-envoy

# 3. In another terminal, run this:
NS="repro-$(date +%s)"
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
http://localhost:6420/namespaces -d "{\"name\":\"$NS\",\"display_name\":\"$NS\"}"
curl -s -X PUT -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/runner-configs/test-envoy?namespace=$NS" \
-d '{"datacenters":{"default":{"serverless":{"url":"http://localhost:5051/api/rivet","request_lifespan":300,"max_concurrent_actors":10000,"slots_per_runner":1,"min_runners":0,"max_runners":10000}}}}'
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/runner-configs/test-envoy/refresh-metadata?namespace=$NS" -d '{}'

# THIS FAILS (no delay):
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/actors?namespace=$NS" \
-d "{\"name\":\"test\",\"key\":\"k-$(date +%s)\",\"runner_name_selector\":\"test-envoy\",\"crash_policy\":\"sleep\"}" \
| python3 -c "import json,sys; a=json.load(sys.stdin)['actor']['actor_id']; print(a)" \
| xargs -I{} curl -s --max-time 12 -H "X-Rivet-Token: dev" -H "X-Rivet-Target: actor" -H "X-Rivet-Actor: {}" http://localhost:6420/ping
# Expected: actor_ready_timeout

# THIS WORKS (5s delay):
sleep 5
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
"http://localhost:6420/actors?namespace=$NS" \
-d "{\"name\":\"test\",\"key\":\"k2-$(date +%s)\",\"runner_name_selector\":\"test-envoy\",\"crash_policy\":\"sleep\"}" \
| python3 -c "import json,sys; a=json.load(sys.stdin)['actor']['actor_id']; print(a)" \
| xargs -I{} curl -s --max-time 12 -H "X-Rivet-Token: dev" -H "X-Rivet-Target: actor" -H "X-Rivet-Actor: {}" http://localhost:6420/ping
# Expected: 200 with JSON body
```

## Symptom

Actor is created (200), envoy receives CommandStartActor, actor starts in ~10ms, EventActorStateUpdate{Running} is sent back via WS, but the guard returns `actor_ready_timeout` after 10 seconds. The actor never becomes connectable.

## Root cause

After `refresh-metadata` stores `envoyProtocolVersion` in the DB, the runner pool workflow (`pegboard_runner_pool`) needs to restart its serverless connection cycle to use v2 POST instead of v1 GET. This takes ~2-5 seconds because:

1. The `pegboard_runner_pool_metadata_poller` workflow runs on a polling interval
2. The `pegboard_serverless_conn` workflow needs to cycle its existing connections
3. The `pegboard_runner_pool` workflow reads the updated config and spawns new v2 connections

Until this happens, the engine dispatches via v1 GET SSE which doesn't deliver the start payload to the envoy.

## Code locations

### Force-v2 hack (temporary)
`engine/packages/pegboard/src/workflows/actor/runtime.rs` line ~268:
```rust
// Changed from: if pool.and_then(|p| p.protocol_version).is_some()
// To force v2 for all serverless pools:
if pool.as_ref().and_then(|p| p.protocol_version).is_some() || for_serverless {
```

### Where protocol_version is stored
`engine/packages/pegboard/src/workflows/runner_pool_metadata_poller.rs` line ~214:
```rust
if let Some(protocol_version) = metadata.envoy_protocol_version {
tx.write(&protocol_version_key, protocol_version)?;
}
```

### Where protocol_version is read for v1→v2 migration decision
`engine/packages/pegboard/src/workflows/actor/runtime.rs` in `allocate_actor_v2`:
```rust
let pool_res = ctx.op(crate::ops::runner_config::get::Input { ... }).await?;
// ...
if pool.and_then(|p| p.protocol_version).is_some() {
return Ok(AllocateActorOutputV2 { status: AllocateActorStatus::MigrateToV2, ... });
}
```

### Where runner config is cached (may need invalidation)
`engine/packages/pegboard/src/ops/runner_config/get.rs` - reads ProtocolVersionKey from DB

### Where v1 (GET) vs v2 (POST) connection is made
- v1: `engine/packages/pegboard/src/workflows/serverless/conn.rs` line ~301: `client.get(endpoint_url)`
- v2: `engine/packages/pegboard-outbound/src/lib.rs` line ~316: `client.post(endpoint_url).body(payload)`

## Fix needed

After `refresh-metadata` stores `envoyProtocolVersion`, the runner pool should immediately use v2 POST without waiting for the metadata poller cycle. Options:
1. Signal the runner pool workflow to restart connections when metadata changes
2. Make the `refresh-metadata` API synchronously update the runner pool state
3. Have the serverless conn workflow check protocol_version before each connection attempt instead of relying on the metadata poller cycle
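Option 3 in miniature, as a hedged sketch (`Dispatch` and `choose_dispatch` are illustrative names, not engine code): the v1-vs-v2 decision is made from the stored `protocol_version` at each connection attempt, so a fresh DB write takes effect on the very next attempt instead of after a poller cycle.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Dispatch {
    V1Get,  // v1 serverless GET/SSE path
    V2Post, // v2 POST path
}

// Hypothetical per-attempt read: decide from the protocol_version as
// stored in the DB right now, not from a value cached when the
// connection cycle started.
fn choose_dispatch(protocol_version: Option<u32>) -> Dispatch {
    if protocol_version.is_some() {
        Dispatch::V2Post
    } else {
        Dispatch::V1Get
    }
}

fn main() {
    // Before refresh-metadata lands: no version stored -> v1 GET.
    println!("{:?}", choose_dispatch(None));
    // Right after the DB write, the next attempt sees it -> v2 POST.
    println!("{:?}", choose_dispatch(Some(2)));
}
```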