
Commit 4c94230

feat: US-001 - Define SqliteKv trait in rivetkit-sqlite-native (#4584)
1 parent 91b199b commit 4c94230

File tree

292 files changed (+41836, -12010 lines)


.agent/notes/driver-test-status.md

Lines changed: 30 additions & 0 deletions
# Driver Test Suite Status

## What works
- rivet-envoy-client (Rust) fully functional
- rivetkit-native NAPI module, TSFN callbacks, envoy lifecycle all work
- Standalone test: create actor + ping = 22-32ms (both test-envoy and native)
- Gateway query path (getOrCreate) works: 112ms
- E2E actor test passes (HTTP ping + WS echo)
- Driver test suite restored (2282 tests), type-checks, loads

## Blocker: engine actor2 workflow doesn't process Running event

### Evidence
- Fresh namespace, fresh pool config, generation 1
- Actor starts in 11ms, Running event sent immediately
- Guard times out after 10s with `actor_ready_timeout`
- The Running event goes: envoy WS → pegboard-envoy → actor_event_demuxer → signal to actor2 workflow
- But the actor2 workflow never marks the actor as connectable

### Why test-envoy works but EngineActorDriver doesn't
- test-envoy uses a PERSISTENT envoy on a PERSISTENT pool
- The pool existed before the engine restarted, so the actor workflow may be v1 (not actor2)
- v1 actors process events through the serverless/conn SSE path, which works
- The force-v2 change routes ALL new serverless actors to actor2, where events aren't processed

### Root cause
The engine's v2 actor workflow (`pegboard_actor2`) receives the `Events` signal from `pegboard-envoy`'s `actor_event_demuxer`, but it does not correctly transition to the connectable state. The guard polls `connectable_ts` in the DB, which is never set.

### Fix needed (engine side)
Check `engine/packages/pegboard/src/workflows/actor2/mod.rs`, specifically how `process_signal` handles the `Events` signal with `EventActorStateUpdate{Running}`. It should set `connectable_ts` in the DB and transition to `Transition::Running`.
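A minimal sketch of the missing state transition, using simplified stand-in types (the real workflow code in `actor2/mod.rs` operates on workflow signals and a DB transaction, not an in-memory struct): when a Running state update arrives, `connectable_ts` must be recorded so the guard's poll can succeed.

```rust
// Hedged sketch only: ActorState, ActorRow, and process_state_update are
// illustrative stand-ins, not the real engine types.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum ActorState { Starting, Running, Stopped }

struct ActorRow { connectable_ts: Option<u64>, state: ActorState }

fn process_state_update(row: &mut ActorRow, new_state: ActorState, now_ms: u64) {
    if new_state == ActorState::Running && row.connectable_ts.is_none() {
        // This is the write the current actor2 workflow appears to miss:
        // the guard polls connectable_ts and times out while it is None.
        row.connectable_ts = Some(now_ms);
    }
    row.state = new_state;
}

fn main() {
    let mut row = ActorRow { connectable_ts: None, state: ActorState::Starting };
    process_state_update(&mut row, ActorState::Running, 1_700_000_000_000);
    assert_eq!(row.connectable_ts, Some(1_700_000_000_000));
    assert_eq!(row.state, ActorState::Running);
}
```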

.agent/notes/native-bridge-bugs.md

Lines changed: 67 additions & 0 deletions
# Native Bridge Bugs

## Status
- **Actions**: PASS (increment, getCount, state persistence all work)
- **WebSocket**: FAIL (client-side connect timeout)
- **SQLite**: FAIL (sqlite3_open_v2 code 14 - CANTOPEN)

## Test command (actions - works)
```bash
cd rivetkit-typescript/packages/rivetkit
npx tsx tests/standalone-native-test.mts
```
Requires engine + test-envoy running, default namespace with metadata refreshed.

## WebSocket Bug

### Symptom
Client SDK `handle.connect()` times out. The server side works fully: envoy receives `ToEnvoyWebSocketOpen`, the wrapper fires `config.websocket()`, `EngineActorDriver.#envoyWebSocket` attaches listeners, the open event dispatches, and the actor sends a 128-byte message back. Envoy sends `ToRivetWebSocketOpen` AND `ToRivetWebSocketMessage`. But the client-side WS never opens.

### Root cause
The engine's guard/gateway receives `ToRivetWebSocketOpen` from the envoy but does NOT complete the client-side WS upgrade. This is likely a guard bug with v2 actors: the guard's WS proxy code path may not handle the v2 tunnel response correctly.

### Evidence
- Envoy sends `ToRivetWebSocketOpen{canHibernate: false}` at timestamp X ✓
- Envoy sends `ToRivetWebSocketMessage{128 bytes}` immediately after ✓
- Engine log: `websocket failed: Connection reset without closing handshake` for the gateway WS
- Client-side WS closes without ever receiving the open event

### NOT a rivetkit-native issue
The server-side flow (TSFN, EventTarget ws, ws.send via WebSocketSender, actor framework) all works correctly. The bug is in how the engine's guard handles v2 actor WS connections.

### Additional issue: message_index conflict
The outgoing task in `actor.rs` (line ~459) sends `ToRivetWebSocketMessage` with a hardcoded `message_index: 0`. But `send_actor_message` also sends messages starting at index 0. The guard may see duplicate indices and drop messages. The message index needs to be coordinated between both paths.
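One way to coordinate the index, sketched with a shared atomic counter (the `MessageIndex` type is hypothetical; the real fix would live in the per-WebSocket state shared by the outgoing task and `send_actor_message`):

```rust
// Hedged sketch: both send paths draw indices from one shared counter
// so the guard never sees duplicates.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Clone)]
struct MessageIndex(Arc<AtomicU64>);

impl MessageIndex {
    fn new() -> Self { MessageIndex(Arc::new(AtomicU64::new(0))) }
    // Each send path calls this instead of hardcoding 0.
    fn next(&self) -> u64 { self.0.fetch_add(1, Ordering::SeqCst) }
}

fn main() {
    let idx = MessageIndex::new();
    let outgoing_task = idx.clone();
    let send_actor_message = idx;
    // Indices are unique and monotonic across both paths.
    assert_eq!(outgoing_task.next(), 0);
    assert_eq!(send_actor_message.next(), 1);
    assert_eq!(outgoing_task.next(), 2);
}
```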
### Reproduce
```bash
cd rivetkit-typescript/packages/rivetkit
npx tsx tests/standalone-native-test.mts
```
Actions pass (3/3); WebSocket fails with a connect timeout. Check Rust logs with `RIVET_LOG_LEVEL=debug`.

### Code locations
- `engine/packages/guard-core/src/proxy_service.rs` lines 1548-1554 - CustomServe WS handler
- `engine/packages/guard-core/src/proxy_service.rs` line 927 - handle_websocket_upgrade
- `engine/sdks/rust/envoy-client/src/actor.rs` line ~459 - outgoing task with hardcoded message_index
- The guard's CustomServe handler (from the routing fn) should proxy ToRivetWebSocketOpen back to the client but doesn't complete the upgrade

## SQLite Bug

### Symptom
`sqlite3_open_v2 failed with code 14` (SQLITE_CANTOPEN)

### Root cause
The native SQLite VFS (`rivetkit-native/src/database.rs`) creates an `EnvoyKv` adapter that routes KV operations through the `EnvoyHandle`. But the VFS registration or database open may fail because:
1. The actor isn't ready when the DB tries to open
2. The VFS name conflicts
3. The KV batch_get returns an unexpected data format

### What to investigate
- Add logging to the `EnvoyKv` trait methods in `rivetkit-native/src/database.rs`
- Check if `open_database_from_envoy` is called at the right time
- Verify the envoy handle's KV methods work for the actor

### Code locations
- `rivetkit-native/src/database.rs` - EnvoyKv impl + open_database_from_envoy
- `rivetkit-typescript/packages/sqlite-native/src/vfs.rs` - KvVfs::register + open_database
- `src/drivers/engine/actor-driver.ts` line ~570 - getNativeSqliteProvider
Lines changed: 121 additions & 0 deletions
# Rust Envoy Client: Known Issues

Audit of `engine/sdks/rust/envoy-client/src/` performed 2026-04-07.

---

## Behavioral Bugs

### B1: `destroy_actor` bypasses engine protocol -- FIXED

**File:** `handle.rs:60-65`, `envoy.rs:239-241`, `actor.rs:192-195`

The TS version sends `ActorIntentStop` and waits for the engine to issue a `CommandStopActor` with `StopActorReason::Destroy`. The Rust version sent `ToActor::Destroy` directly to the actor, force-killing it locally without engine confirmation.

**Fix applied:** `destroy_actor` now sends an `ActorIntentStop` event (matching TS behavior). Removed the `DestroyActor` message variant and `ToActor::Destroy`.

### B2: Graceful shutdown force-kills after 1 second -- FIXED

**File:** `envoy.rs:409-416`

`handle_shutdown` spawned a task that slept 1 second and then sent `Stop`, which dropped all actor channels. Actors got no `on_actor_stop` callback.

**Fix applied:** Now polls actor handle closure with a deadline from `serverlessDrainGracePeriod` in `ProtocolMetadata` (falls back to 30s). All actors get a chance to stop cleanly.

---

## Performance: Fixed

### P1: `ws_send` clones entire outbound message -- FIXED

**File:** `connection.rs:213-223`

Took `&protocol::ToRivet`, then called `message.clone()` because `wrap_latest` needed ownership.

**Fix applied:** Changed the signature to take `protocol::ToRivet` by value. All callers construct fresh values inline (except `send_actor_message`, which clones for potential buffering).

### P2: Stringify allocates even when debug logging is off -- FIXED

**File:** `connection.rs:160, 214`, `actor.rs:897`

`tracing::debug!(data = stringify_to_envoy(&decoded), ...)` eagerly evaluated the stringify function regardless of log level.

**Fix applied:** Gated behind `tracing::enabled!(tracing::Level::DEBUG)`.
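A stdlib-only sketch of the P2 pattern, with a plain atomic flag standing in for `tracing::enabled!` (the real code checks the tracing subscriber's level filter) and a stub `stringify_to_envoy` standing in for the real serializer:

```rust
// Hedged sketch: the expensive stringify only runs when debug output is on.
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static DEBUG_ENABLED: AtomicBool = AtomicBool::new(false);
static STRINGIFY_CALLS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the expensive serializer; counts its invocations.
fn stringify_to_envoy(msg: &str) -> String {
    STRINGIFY_CALLS.fetch_add(1, Ordering::SeqCst);
    format!("<envoy:{msg}>")
}

// Gate the allocation behind the enabled check, like the applied fix.
fn debug_log(msg: &str) {
    if DEBUG_ENABLED.load(Ordering::SeqCst) {
        let rendered = stringify_to_envoy(msg);
        println!("DEBUG {rendered}");
    }
}

fn main() {
    debug_log("ping"); // disabled: stringify never runs
    assert_eq!(STRINGIFY_CALLS.load(Ordering::SeqCst), 0);
    DEBUG_ENABLED.store(true, Ordering::SeqCst);
    debug_log("pong"); // enabled: stringify runs once
    assert_eq!(STRINGIFY_CALLS.load(Ordering::SeqCst), 1);
}
```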
### P3: `handle_commands` clones config, hibernating_requests, preloaded_kv -- FIXED

**File:** `commands.rs:18-23`

`val.config.clone()`, `val.hibernating_requests.clone()`, and `val.preloaded_kv.clone()` when the fields could be moved.

**Fix applied:** Only clone `val.config.name` for the `ActorEntry`, then move the rest into `create_actor`.

### P4: `_config` field in `ActorContext` is unused, forces a clone -- FIXED

**File:** `actor.rs:81, 129`

`_config: config.clone()` stored a full `ActorConfig` that was never read.

**Fix applied:** Removed the `_config` field. `config` is passed directly to `on_actor_start` without cloning.

### P5: `kv_put` clones keys and values -- FIXED

**File:** `handle.rs:202-203`

`entries.iter().map(|(k, _)| k.clone()).collect()` when `entries` is owned and could be consumed.

**Fix applied:** `let (keys, values): (Vec<_>, Vec<_>) = entries.into_iter().unzip();`
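A minimal demonstration of the P5 fix in isolation (the key/value types here are illustrative, not the real KV API): consuming an owned `Vec` of pairs with `into_iter().unzip()` moves the keys and values out instead of cloning them.

```rust
// `unzip` consumes each tuple, so no `clone()` is needed.
fn split_entries(entries: Vec<(String, Vec<u8>)>) -> (Vec<String>, Vec<Vec<u8>>) {
    entries.into_iter().unzip()
}

fn main() {
    let entries = vec![
        ("k1".to_string(), b"v1".to_vec()),
        ("k2".to_string(), b"v2".to_vec()),
    ];
    let (keys, values) = split_entries(entries);
    assert_eq!(keys, vec!["k1".to_string(), "k2".to_string()]);
    assert_eq!(values, vec![b"v1".to_vec(), b"v2".to_vec()]);
}
```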
### P6: `parse_list_response` clones values -- FIXED

**File:** `handle.rs:358-366`

Keys were consumed via `into_iter()`, but values were indexed and cloned.

**Fix applied:** `resp.keys.into_iter().zip(resp.values).collect()`

---

## Performance: Not Worth the Effort

### P7: BufferMap uses `HashMap<String, T>` with string key allocation

**File:** `utils.rs:122-172`

`cyrb53()` returns a hex `String` on every lookup. Used on hot paths (tunnel message dispatch). However, the inputs are only 8 bytes total, producing ~13-char strings: tiny allocations served from thread-local caches.

### P8: Redundant inner `Arc` on `ws_tx` and `protocol_metadata`

**File:** `context.rs:15-16`

`SharedContext` is already behind `Arc<SharedContext>`. The inner `Arc` wrappers add one extra indirection and refcount. Negligible impact.

### P9: `tokio::sync::Mutex` vs `std::sync::Mutex`

**File:** `context.rs:15-16`

Neither lock is held across `.await`. `protocol_metadata` is a clear-cut candidate for `std::sync::Mutex`. `ws_tx` holds the lock during the `serde_bare` encode, making `tokio::sync::Mutex` defensible.

### P10: O(n*m) key matching in `kv_get`

**File:** `handle.rs:107-119`

Nested loop over requested keys and response keys. Real quadratic complexity, but n is typically small for KV gets.
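If P10 ever did become worth fixing, the nested loop could be replaced with a one-pass `HashMap` index, turning O(n*m) into O(n + m). The function below is a hypothetical sketch with placeholder byte-vector types, not the real protocol types:

```rust
use std::collections::HashMap;

// Index the response once, then look each requested key up in O(1).
fn match_values(
    requested: &[Vec<u8>],
    resp_keys: Vec<Vec<u8>>,
    resp_values: Vec<Vec<u8>>,
) -> Vec<Option<Vec<u8>>> {
    let mut by_key: HashMap<Vec<u8>, Vec<u8>> =
        resp_keys.into_iter().zip(resp_values).collect();
    requested.iter().map(|k| by_key.remove(k)).collect()
}

fn main() {
    let requested = vec![b"a".to_vec(), b"missing".to_vec(), b"b".to_vec()];
    let values = match_values(
        &requested,
        vec![b"b".to_vec(), b"a".to_vec()],
        vec![b"2".to_vec(), b"1".to_vec()],
    );
    assert_eq!(values, vec![Some(b"1".to_vec()), None, Some(b"2".to_vec())]);
}
```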
### P11: Double actor lookups in tunnel.rs

**File:** `tunnel.rs:45-59, 112-148`

`get_actor` is called twice (once to check existence, once to use). The borrow checker prevents a naive fix, since `get_actor(&self)` borrows all of `ctx`.

### P12: Headers cloned from HashableMap instead of moved

**File:** `actor.rs:309`, `tunnel.rs:140`

`req.headers.iter().map(|(k, v)| (k.clone(), v.clone())).collect()` when `req` is owned. Could use `into_iter()`.

### P13: `handle_ack_events` iterates checkpoints twice

**File:** `events.rs:37-67`

First pass retains events; second pass checks for actor removal. Could collect removals in the first pass.
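The single-pass idea from P13 can be sketched as follows; `Event` and the ack rule are simplified placeholders, since the real checkpoint structure in `events.rs` is richer:

```rust
// Hedged sketch: collect removals during the same `retain` pass that
// drops acked events, instead of iterating a second time.
#[derive(Debug, PartialEq)]
struct Event { actor_id: u32, index: u64 }

fn ack_events(events: &mut Vec<Event>, acked_up_to: u64) -> Vec<u32> {
    let mut removed_actors = Vec::new();
    events.retain(|e| {
        if e.index <= acked_up_to {
            removed_actors.push(e.actor_id); // record removal as we go
            false
        } else {
            true
        }
    });
    removed_actors
}

fn main() {
    let mut events = vec![
        Event { actor_id: 1, index: 3 },
        Event { actor_id: 2, index: 7 },
    ];
    let removed = ack_events(&mut events, 5);
    assert_eq!(removed, vec![1]);
    assert_eq!(events.len(), 1);
}
```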
Lines changed: 97 additions & 0 deletions
# Bug: v2 actor dispatch requires ~5s delay after metadata refresh

## Reproduce

```bash
# 1. Start engine with the force-v2 hack (see below)
rm -rf ~/.local/share/rivet-engine/db
cargo run --bin rivet-engine -- start

# 2. Start test-envoy
RIVET_ENDPOINT=http://127.0.0.1:6420 RIVET_TOKEN=dev RIVET_NAMESPACE=default \
  RIVET_POOL_NAME=test-envoy AUTOSTART_ENVOY=0 AUTOSTART_SERVER=1 \
  AUTOCONFIGURE_SERVERLESS=0 cargo run -p rivet-test-envoy

# 3. In another terminal, run this:
NS="repro-$(date +%s)"
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  http://localhost:6420/namespaces -d "{\"name\":\"$NS\",\"display_name\":\"$NS\"}"
curl -s -X PUT -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  "http://localhost:6420/runner-configs/test-envoy?namespace=$NS" \
  -d '{"datacenters":{"default":{"serverless":{"url":"http://localhost:5051/api/rivet","request_lifespan":300,"max_concurrent_actors":10000,"slots_per_runner":1,"min_runners":0,"max_runners":10000}}}}'
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  "http://localhost:6420/runner-configs/test-envoy/refresh-metadata?namespace=$NS" -d '{}'

# THIS FAILS (no delay):
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  "http://localhost:6420/actors?namespace=$NS" \
  -d "{\"name\":\"test\",\"key\":\"k-$(date +%s)\",\"runner_name_selector\":\"test-envoy\",\"crash_policy\":\"sleep\"}" \
  | python3 -c "import json,sys; a=json.load(sys.stdin)['actor']['actor_id']; print(a)" \
  | xargs -I{} curl -s --max-time 12 -H "X-Rivet-Token: dev" -H "X-Rivet-Target: actor" -H "X-Rivet-Actor: {}" http://localhost:6420/ping
# Expected: actor_ready_timeout

# THIS WORKS (5s delay):
sleep 5
curl -s -X POST -H "Authorization: Bearer dev" -H "Content-Type: application/json" \
  "http://localhost:6420/actors?namespace=$NS" \
  -d "{\"name\":\"test\",\"key\":\"k2-$(date +%s)\",\"runner_name_selector\":\"test-envoy\",\"crash_policy\":\"sleep\"}" \
  | python3 -c "import json,sys; a=json.load(sys.stdin)['actor']['actor_id']; print(a)" \
  | xargs -I{} curl -s --max-time 12 -H "X-Rivet-Token: dev" -H "X-Rivet-Target: actor" -H "X-Rivet-Actor: {}" http://localhost:6420/ping
# Expected: 200 with JSON body
```
## Symptom

The actor is created (200), the envoy receives CommandStartActor, the actor starts in ~10ms, and EventActorStateUpdate{Running} is sent back via WS, but the guard returns `actor_ready_timeout` after 10 seconds. The actor never becomes connectable.

## Root cause

After `refresh-metadata` stores `envoyProtocolVersion` in the DB, the runner pool workflow (`pegboard_runner_pool`) needs to restart its serverless connection cycle to use v2 POST instead of v1 GET. This takes ~2-5 seconds because:

1. The `pegboard_runner_pool_metadata_poller` workflow runs on a polling interval
2. The `pegboard_serverless_conn` workflow needs to cycle its existing connections
3. The `pegboard_runner_pool` workflow reads the updated config and spawns new v2 connections

Until this happens, the engine dispatches via v1 GET SSE, which doesn't deliver the start payload to the envoy.

## Code locations

### Force-v2 hack (temporary)
`engine/packages/pegboard/src/workflows/actor/runtime.rs` line ~268:
```rust
// Changed from: if pool.and_then(|p| p.protocol_version).is_some()
// To force v2 for all serverless pools:
if pool.as_ref().and_then(|p| p.protocol_version).is_some() || for_serverless {
```

### Where protocol_version is stored
`engine/packages/pegboard/src/workflows/runner_pool_metadata_poller.rs` line ~214:
```rust
if let Some(protocol_version) = metadata.envoy_protocol_version {
    tx.write(&protocol_version_key, protocol_version)?;
}
```

### Where protocol_version is read for the v1→v2 migration decision
`engine/packages/pegboard/src/workflows/actor/runtime.rs` in `allocate_actor_v2`:
```rust
let pool_res = ctx.op(crate::ops::runner_config::get::Input { ... }).await?;
// ...
if pool.and_then(|p| p.protocol_version).is_some() {
    return Ok(AllocateActorOutputV2 { status: AllocateActorStatus::MigrateToV2, ... });
}
```

### Where runner config is cached (may need invalidation)
`engine/packages/pegboard/src/ops/runner_config/get.rs` - reads ProtocolVersionKey from DB

### Where v1 (GET) vs v2 (POST) connection is made
- v1: `engine/packages/pegboard/src/workflows/serverless/conn.rs` line ~301: `client.get(endpoint_url)`
- v2: `engine/packages/pegboard-outbound/src/lib.rs` line ~316: `client.post(endpoint_url).body(payload)`

## Fix needed

After `refresh-metadata` stores `envoyProtocolVersion`, the runner pool should immediately use v2 POST without waiting for the metadata poller cycle. Either:
1. Signal the runner pool workflow to restart connections when metadata changes
2. Make the `refresh-metadata` API synchronously update the runner pool state
3. Have the serverless conn workflow check protocol_version before each connection attempt instead of relying on the metadata poller cycle
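Option 3 can be sketched as re-reading the stored protocol version on every connection attempt rather than caching the decision for the poller cycle. Everything here is illustrative: the lookup closure stands in for the real `runner_config::get` op, and `Dispatch` is a hypothetical enum, not an engine type.

```rust
// Hedged sketch: decide v1 GET vs v2 POST at connection time, so a
// refresh-metadata that writes envoyProtocolVersion takes effect on
// the very next attempt instead of ~5s later.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Dispatch { V1Get, V2Post }

fn choose_dispatch(lookup_protocol_version: impl Fn() -> Option<u32>) -> Dispatch {
    match lookup_protocol_version() {
        Some(_) => Dispatch::V2Post, // version stored => pool supports v2
        None => Dispatch::V1Get,     // no version yet => legacy SSE path
    }
}

fn main() {
    use std::cell::Cell;
    let stored: Cell<Option<u32>> = Cell::new(None);
    assert_eq!(choose_dispatch(|| stored.get()), Dispatch::V1Get);
    stored.set(Some(2)); // refresh-metadata writes the version
    assert_eq!(choose_dispatch(|| stored.get()), Dispatch::V2Post);
}
```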
