Skip to content

Commit 28a55d1

Browse files
NathanFlurryclaude
andcommitted
fix: resolve Rust CI dead code and biome formatting issues
Gate test-only `NEXT_STAGE_ID` and `next_stage_id()` with `#[cfg(test)]` in sqlite-native vfs.rs. Apply biome formatting to native.ts and rivetkit-napi index.d.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0092050 commit 28a55d1

File tree

144 files changed

+8849
-4058763
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

144 files changed

+8849
-4058763
lines changed
Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
# Production Review Checklist
2+
3+
Consolidated from deep review (2026-04-19) + existing notes. Verified against actual code 2026-04-19.
4+
5+
---
6+
7+
## CRITICAL — Data Corruption / Crashes
8+
9+
- [ ] **C1: Connection hibernation encoding mismatch**`gateway_id`/`request_id` are fixed 4-byte in TS BARE v4 (`bare.readFixedData(bc, 4)`) but variable-length `Vec<u8>` in Rust serde_bare (length-prefixed). Wire format incompatibility confirmed. Actors persisted by TS and loaded by Rust (or vice versa) get corrupted connection metadata. Fix: change Rust to `[u8; 4]` with custom serde. (`rivetkit-core/src/actor/connection.rs:58-69`)
10+
11+
- [ ] **C2: Missing on_state_change idle wait during shutdown** — Action dispatch waits for `on_state_change` idle (`action.rs:98`), but sleep and destroy shutdown do not. In-flight `on_state_change` callback can race with final `save_state`. Fix: add `wait_for_on_state_change_idle().await` with deadline after `set_started(false)` in both paths. (`rivetkit-core/src/actor/lifecycle.rs:215` sleep, `:303` destroy)
12+
13+
- [ ] **C3: NAPI string leaking via Box::leak()**`leak_str()` in `parse_bridge_rivet_error` leaks every unique error group/code/message as `&'static str`. Bounded by error message uniqueness in practice (group/code are finite, but message can include user context). (`rivetkit-napi/src/actor_factory.rs:889-903`)
14+
15+
---
16+
17+
## HIGH — Real Issues Worth Fixing
18+
19+
- [ ] **H1: Scheduled event panic not caught**`run` handler is wrapped in `catch_unwind`, but scheduled event dispatch (`invoke_action_by_name`) is not. Low practical risk since actions go through serialization boundaries, but a defensive gap. (`rivetkit-core/src/actor/schedule.rs:199-264`)
20+
21+
- [ ] **H2: Action timeout/size enforcement in wrong layer** — TS `native.ts` enforces `withTimeout()` and message size for HTTP actions. Rust `handle_fetch` bypasses these. Different execution paths (not double enforcement), but HTTP path lacks Rust-side enforcement. Should consolidate into Rust.
22+
23+
- [ ] **H3: Mutex\<HashMap\> violations (5 instances)** — CLAUDE.md forbids this. Replace with `scc::HashMap` (preferred) or `DashMap`. Locations: `rivetkit-core/src/actor/queue.rs:105` (completion_waiters), `client/src/connection.rs:70` (in_flight_rpcs), `client/src/connection.rs:72` (event_subscriptions), `rivetkit-sqlite/src/vfs.rs:1632` (stores), `rivetkit-sqlite/src/vfs.rs:1633` (op_log)
24+
25+
---
26+
27+
## MEDIUM — Pre-existing TS Issues (Not Regressions)
28+
29+
These existed before the Rust migration. Tracked here for visibility but are not caused by the migration.
30+
31+
- [ ] **M1: Traces exceed KV value limits**`DEFAULT_MAX_CHUNK_BYTES = 1MB`, KV max value = 128KB. (`rivetkit-typescript/packages/traces/src/traces.ts:63`)
32+
33+
- [ ] **M2: SQLite VFS unsplit putBatch/deleteBatch** — Can exceed 128 entries and/or 976KB payload. (`rivetkit-typescript/packages/sqlite-vfs/src/vfs.ts:856,908,979`)
34+
35+
- [ ] **M3: Workflow persistence unsplit write arrays**`storage.flush` builds unbounded writes, calls `driver.batch(writes)` once. (`rivetkit-typescript/packages/workflow-engine/src/storage.ts:270,346`)
36+
37+
- [ ] **M4: Workflow flush clears dirty flags before write success** — If batch fails, dirty markers lost. (`rivetkit-typescript/packages/workflow-engine/src/storage.ts:296,308`)
38+
39+
- [ ] **M5: State persistence can exceed batch limits**`savePersistInner` aggregates actor + all changed connections into one batch. (`rivetkit-typescript/packages/rivetkit/src/actor/instance/state-manager.ts:422,503`)
40+
41+
- [ ] **M6: Queue batch delete can exceed limits** — Removes all selected messages in one `kvBatchDelete(keys)`. (`rivetkit-typescript/packages/rivetkit/src/actor/instance/queue-manager.ts:520,530`)
42+
43+
- [ ] **M7: Traces write queue poison after KV failure**`writeChain` promise chain has no rejection recovery. (`rivetkit-typescript/packages/traces/src/traces.ts:545,767`)
44+
45+
- [ ] **M8: Queue metadata mutates before storage write** — Enqueue increments `nextId`/`size` before `kvBatchPut`. If write fails, in-memory metadata drifts. (`rivetkit-typescript/packages/rivetkit/src/actor/instance/queue-manager.ts:163,168,523`)
46+
47+
- [ ] **M9: Connection cleanup swallows KV delete failures** — Stale connection KV may remain. (`rivetkit-typescript/packages/rivetkit/src/actor/instance/connection-manager.ts:372,379`)
48+
49+
- [ ] **M10: Cloudflare driver KV divergence** — No engine-equivalent limit validation. (`rivetkit-typescript/packages/cloudflare-workers/src/actor-kv.ts:14`)
50+
51+
- [ ] **M11: v2 actor dispatch requires ~5s delay after metadata refresh** — Engine-side issue. (`v2-metadata-delay-bug.md`)
52+
53+
---
54+
55+
## LOW — Code Quality / Cleanup
56+
57+
- [ ] **L1: BARE codec extraction**~380 lines of hand-rolled BARE across `registry.rs` (~257 lines) and `client/src/protocol/codec.rs` (~123 lines). Should be replaced by generated protocol crate.
58+
59+
- [ ] **L2: registry.rs is 3668 lines** — Biggest file by far. Needs splitting.
60+
61+
- [ ] **L3: Metrics registry panic**`expect()` on prometheus gauge creation. Should fallback to no-op. (`rivetkit-core/src/actor/metrics.rs:62-77`)
62+
63+
- [ ] **L4: Response map orphaned entries (NAPI)** — Minor: on error paths, response_id entry not cleaned up from map. Cleaned on actor stop. (`rivetkit-napi/src/bridge_actor.rs:200-223`)
64+
65+
- [ ] **L5: Unused serde derives on protocol structs**`registry.rs` protocol types derive Serialize/Deserialize but use hand-rolled BARE.
66+
67+
- [ ] **L6: _is_restoring_hibernatable unused**`registry.rs` accepts but ignores this param.
68+
69+
---
70+
71+
## SEPARATE EFFORTS (not blocking ship)
72+
73+
- [ ] **S1: Workflow replay refactor** — 6 action items in `workflow-replay-review.md`.
74+
75+
- [ ] **S2: Rust client parity** — Full spec in `.agent/specs/rust-client-parity.md`.
76+
77+
- [ ] **S3: WASM shell shebang** — Blocks agentOS host tool shims. (`.agent/todo/wasm-shell-shebang.md`)
78+
79+
- [ ] **S4: Native bridge bugs (engine-side)** — WebSocket guard + message_index conflict. (`native-bridge-bugs.md`)
80+
81+
---
82+
83+
## REMOVED — Verified as Not Issues
84+
85+
Items from original checklist that were verified as bullshit or already fixed:
86+
87+
- ~~Ready state vs connection restore race~~ — OVERSTATED. Microsecond window, alarms gated by `started` flag.
88+
- ~~Queue completion waiter leak~~ — BULLSHIT. Rust drop semantics clean up when Arc is dropped.
89+
- ~~Unbounded HTTP body size~~ — OVERSTATED. Envoy/engine enforce limits upstream.
90+
- ~~BARE-only encoding~~ — ALREADY FIXED. Accepts json/cbor/bare.
91+
- ~~Error metadata dropped~~ — ALREADY FIXED. Metadata field exists and is passed through.
92+
- ~~Action timeout double enforcement~~ — BULLSHIT. Different execution paths, not overlapping.
93+
- ~~Lock poisoning pattern~~ — BULLSHIT. Standard Rust practice with `.expect()`.
94+
- ~~State lock held across I/O~~ — BULLSHIT. Data cloned first, lock released before I/O.
95+
- ~~SQLite startup cache leak~~ — BULLSHIT. Cleanup exists in on_actor_stop.
96+
- ~~WebSocket callback accumulation~~ — BULLSHIT. Callbacks are replaced via `configure_*_callback(Some(...))`, not accumulated.
97+
- ~~Inspector DB access~~ — BULLSHIT. No raw SQL in inspector.
98+
- ~~Raw WS outgoing size~~ — BULLSHIT. Enforced at handler level.
99+
- ~~Unbounded tokio::spawn~~ — BULLSHIT. Tracked via keep_awake counters.
100+
- ~~Error format changed~~ — SAME AS TS. Internal bridge format, not external.
101+
- ~~Queue send() returns Promise~~ — SAME AS TS. Always was async.
102+
- ~~Error visibility forced~~ — SAME AS TS. Pre-existing normalization.
103+
- ~~Queue complete() double call~~ — Expected behavior, not breaking.
104+
- ~~Negative queue timeout~~ — Stricter validation, unlikely to break real code.
105+
- ~~SQLite schema version cached~~ — Required by design, not a bug.
106+
- ~~Connection state write-through proxy~~ — Unclear claim, unverifiable.
107+
- ~~WebSocket setEventCallback~~ — Internal API, handled by adapter.
108+
- Code quality items (actor key file, Request/Response file, rename callbacks, rename FlatActorConfig, context.rs issues, #[allow(dead_code)], move kv.rs/sqlite.rs) — Moved to `production-review-complaints.md`.
109+
110+
---
111+
112+
## VERIFIED OK
113+
114+
- Architecture layering: CLEAN
115+
- Actor state BARE encoding v4: compatible
116+
- Queue message/metadata BARE encoding: compatible
117+
- KV key layout (prefixes [1]-[7]): identical
118+
- SQLite v1 chunk storage (4096-byte chunks): compatible
119+
- BARE codec overflow/underflow protection: correct
120+
- WebSocket init/reconnect/close: correct
121+
- Authentication (bearer token on inspector): enforced
122+
- SQL injection: parameterized queries, read-only enforcement
123+
- Envoy client bugs B1/B2: FIXED
124+
- Envoy client perf P1-P6: FIXED
125+
- Driver test suite: all fast+slow tests PASS (excluding agent-os, cross-backend-vfs)
Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,87 @@
1+
# Production Review Complaints
2+
3+
Tracking issues and complaints about the rivetkit Rust implementation for production readiness.
4+
5+
Verified 2026-04-19. Fixed items removed.
6+
7+
---
8+
9+
## TS/NAPI Cleanup & Routing (fix first)
10+
11+
28. **Unify HTTP routing in core** — Framework routes are split across two layers with no clean boundary. Rust `handle_fetch` owns `/metrics` and `/inspector/*`. TS `on_request` callback owns `/action/*` and `/queue/*` via regex matching in `maybeHandleNativeActionRequest` and `maybeHandleNativeQueueRequest`. Path parsing happens twice (Rust prefix checks, then TS regex). The `on_request` callback became a fallback router instead of a user handler. Fix: core should own all framework routes (`/metrics`, `/inspector/*`, `/action/*`, `/queue/*`), and only delegate unmatched paths to the user's `on_request` callback.
12+
13+
25. **Move HTTP action/queue dispatch from TS to core** — TS `native.ts` owns HTTP action dispatch (`maybeHandleNativeActionRequest`, ~lines 2656-2871) and queue dispatch (`maybeHandleNativeQueueRequest`, ~lines 2873-3041) with routing, timeout enforcement, message size limits, and response encoding. Core already dispatches actions via WebSocket (`ActionInvoker::dispatch()` in `action.rs`). Move HTTP routing + dispatch + timeout + size checks into core's `handle_fetch()`. Schema validation stays in TS (pre-validated before calling core). Fixes checklist item H2 and enables Rust runtime parity.
14+
15+
10. **Action timeout/size enforcement lives in TS instead of Rust**`native.ts` enforces `withTimeout()` and `maxIncomingMessageSize`/`maxOutgoingMessageSize` for HTTP actions. Rust `handle_fetch` in `registry.rs` bypasses these checks entirely. WebSocket path enforces them in Rust. Consolidate into Rust.
16+
17+
27. **Action execution should not be serialized** — Rust core serializes actions with `tokio::sync::Mutex<()>` (`action.rs:60`, `context.rs:770-772`). The TS NAPI bridge added a matching `AsyncMutex` per actor (`native.ts`, commit `00920501a`). The original TS runtime had NO serialization — `invokeActionByName` called the handler directly, allowing concurrent actions per actor via the JS event loop. This is a behavioral regression: read-heavy actors that relied on concurrent action execution now serialize unnecessarily. Remove the action lock from core and the `AsyncMutex` from the native bridge.
18+
19+
13. **Delete `openDatabaseFromEnvoy` and its supporting caches**`rivetkit-typescript/packages/rivetkit-napi/src/database.rs:189-221` plus the `sqlite_startup_map` and `sqlite_schema_version_map` on `JsEnvoyHandle` (`src/envoy_handle.rs:32-33, 55-68`) and the matching insert/remove sites in `src/bridge_actor.rs:27-30, 44-45, 84-99, 143-148`. Verified: zero callers in `rivetkit-typescript/packages/rivetkit/`. The production path goes through `ActorContext::sql()` which already has the schema version + startup data via `RegistryCallbacks::on_actor_start`.
20+
21+
14. **Delete `BridgeCallbacks` JSON-envelope path**`rivetkit-typescript/packages/rivetkit-napi/src/bridge_actor.rs` (entire file) plus `start_envoy_sync_js` / `start_envoy_js` entry points in `src/lib.rs:80-156` and the `wrapper.js` adapter layer (`startEnvoySync`/`startEnvoy`/`wrapHandle` ~lines 36-174). Production uses `NapiActorFactory` + `CoreRegistry` via direct rivetkit-core callbacks, not this JSON-envelope bridge. ~700 lines of Rust + ~490 lines of JS removable.
22+
23+
15. **Delete standalone `SqliteDb` wrapper**`rivetkit-typescript/packages/rivetkit-napi/src/sqlite_db.rs`. Verified: production sql access goes through `JsNativeDatabase` via `ctx.sql()`, not this class.
24+
25+
16. **Delete `JsEnvoyHandle::start_serverless` method**`rivetkit-typescript/packages/rivetkit-napi/src/envoy_handle.rs:378-387`. Verified dead: serverless support was removed from the TypeScript routing stack and `Runtime.startServerless()` in `rivetkit/runtime/index.ts:117` already throws `removedLegacyRoutingError`. The NAPI method is unreachable.
26+
27+
17. **Drop the `wrapper.js` adapter layer once items 13-14 land**`rivetkit-typescript/packages/rivetkit-napi/wrapper.js` exists to translate JSON envelopes back into `EnvoyConfig` callbacks for the dead BridgeCallbacks path. After deletion, rivetkit can import `index.js` directly and the wrapper module disappears.
28+
29+
24. **Fix `Box::leak` in NAPI error handling**`actor_factory.rs:890,897` leaks strings and the `RivetErrorSchema` struct itself via `Box::leak`. Fix: change `RivetErrorSchema` fields from `&'static str` to `Cow<'static, str>` in the `rivet_error` crate, then use `Cow::Owned(...)` instead of `leak_str(...)`. Only 2 call sites, both in `parse_bridge_rivet_error`.
30+
31+
---
32+
33+
## Core Architecture
34+
35+
5. **`context.rs` passes `ActorConfig::default()` to Queue and ConnectionManager**`build()` receives a `config` param but ignores it for Queue (line 145) and ConnectionManager (line 152) and SleepController. Possible bug: these subsystems get default timeouts instead of the actor's configured values.
36+
37+
6. **`sleep()` spawns fire-and-forget task with untracked JoinHandle**`context.rs:286-297`. Spawned task persists connections and requests sleep. Not tracked, can be orphaned on destroy.
38+
39+
7. **`Default` impl creates empty context with `actor_id: ""`**`context.rs:997-1001`. Footgun for any code calling `ActorContext::default()`.
40+
41+
11. **`registry.rs` is 3668 lines** — Now the biggest file by far. Needs splitting.
42+
43+
18. **Review all `tokio::spawn` and replace with JoinSets** — Audit every `tokio::spawn` in rivetkit-core and rivetkit-sqlite for untracked fire-and-forget tasks. Replace with `JoinSet` so shutdown can abort and join all spawned tasks cleanly. Ensure JoinSets are cancelled/aborted on actor completion (sleep, destroy) so no orphaned tasks outlive the actor.
44+
45+
26. **Merge `active_instances` and `stopping_instances` maps** — Registry tracks actors across 4 concurrent maps (`starting_instances`, `active_instances`, `stopping_instances`, `pending_stops`). `active_instances` and `stopping_instances` both store `ActiveActorInstance` (same type). Merge into a single `SccHashMap<String, ActorInstanceState>` with an enum `{ Active(ActiveActorInstance), Stopping(ActiveActorInstance) }`. Eliminates the multi-map lookup in `active_actor()` which currently searches both maps sequentially. `starting_instances` (Arc\<Notify\>) and `pending_stops` (PendingStop) have different value types and should stay separate. (`rivetkit-core/src/registry.rs:78-81`)
46+
47+
25b. **Remove `ActorContext::new_runtime`, make `build` pub(crate)**`new_runtime` is a misleading name ("runtime" isn't a concept in the system). It's just the fully-configured constructor vs the test-only `new`/`new_with_kv` convenience constructors. Delete `new_runtime`, make `build()` `pub(crate)`, and have callers use `build()` directly. (`rivetkit-core/src/actor/context.rs:110-128`)
48+
49+
---
50+
51+
## Wire Compatibility
52+
53+
23. **`gateway_id`/`request_id` must be `[u8; 4]`, not `Vec<u8>`** — Runner protocol BARE schema defines `type GatewayId data[4]` and `type RequestId data[4]` (fixed 4-byte). Rust `PersistedConnection` uses `Vec<u8>` which serializes with a length prefix, breaking wire compatibility with TS. Fix: change to `[u8; 4]` with fixed-size serde. This is NOT the same as the engine `Id` type (which is 19 bytes). (`rivetkit-core/src/actor/connection.rs:58-69`, `engine/sdks/schemas/runner-protocol/v7.bare:8-9`)
54+
55+
12. **Use native `Id` type from engine instead of `Vec<u8>` for IDs** — Connection `gateway_id`/`request_id` and other IDs use `Vec<u8>` instead of the engine's native `Id` type. Should switch to the proper type.
56+
57+
---
58+
59+
## Code Quality
60+
61+
1. **Actor key ser/de should be in its own file** — Currently in `types.rs` alongside unrelated types. Move to `utils/key.rs`.
62+
63+
2. **Request and Response structs need their own file** — Currently in `actor/callbacks.rs` (364 lines, 19 structs). Move to a dedicated file.
64+
65+
3. **Rename `callbacks` to `lifecycle_hooks`**`actor/callbacks.rs` should be `actor/lifecycle_hooks.rs`.
66+
67+
4. **Rename `FlatActorConfig` to `ActorConfigInput`** — Add doc comment: "Sparse, serialization-friendly actor configuration. All fields are optional with millisecond integers instead of Duration. Used at runtime boundaries (NAPI, config files). Convert to ActorConfig via ActorConfig::from_input()." Rename `from_flat()` to `from_input()`.
68+
69+
8. **Remove all `#[allow(dead_code)]`** — 57 instances across rivetkit-core. All decorated methods are actually called from external modules. Attributes are unnecessary cargo-cult suppressions. Safe to remove all.
70+
71+
9. **Move `kv.rs` and `sqlite.rs` out of top-level `src/`** — They're actor subsystems. Move to `src/actor/kv.rs` and `src/actor/sqlite.rs`.
72+
73+
---
74+
75+
## Safety & Correctness
76+
77+
19. **Review inspector security** — General audit of inspector endpoints in `registry.rs:704-900`. Check auth is enforced on all paths, no unintended state mutations, and that the TS and Rust inspector surfaces match.
78+
79+
20. **No panics unless absolutely necessary** — rivetkit-core, rivetkit, and rivetkit-napi should never panic. There are ~146 `.expect("lock poisoned")` calls that should be replaced with non-poisoning locks (e.g. `parking_lot::RwLock`/`Mutex`) or proper error propagation. Audit all `unwrap()`, `expect()`, and `panic!()` calls across these three crates and eliminate them.
80+
81+
22. **Standardize error handling with rivetkit-core** — Investigate whether errors across rivetkit-core, rivetkit, and rivetkit-napi are consistently using `RivetError` with proper group/code/message. Look for places using raw `anyhow!()` or string errors that should be structured `RivetError` types instead.
82+
83+
---
84+
85+
## Investigation
86+
87+
21. **Investigate v1 vs v2 SQLite wiring** — Need to understand how v1 and v2 VFS are dispatched, whether both paths are correctly wired through rivetkit-core, and if there are any gaps in the v1-to-v2 migration path.

0 commit comments

Comments
 (0)