Skip to content

Commit aeb5da6

Browse files
committed
DO NOT MERGE: serverless restart race condition
1 parent a20fe59 commit aeb5da6

3 files changed

Lines changed: 2075 additions & 0 deletions

File tree

Lines changed: 276 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,276 @@
1+
# Engine Restart Serverless Investigation
2+
3+
## Scope
4+
5+
This documents the local reproduction work for serverless Rivet Actor access after restarting `rivet-engine`.
6+
7+
The harness runs:
8+
9+
- `tests/driver/engine-restart-serverless.ts`
10+
- `tests/fixtures/engine-restart-serverless-runtime.ts`
11+
12+
The runtime is a normal Node process, not Vitest. It listens on a port with `serverless.basePath = "/api/rivet"`. The engine is started as a real `rivet-engine` binary and configured with a serverless runner config pointing at that runtime.
13+
14+
## Main Finding
15+
16+
The bug is not SQLite-specific.
17+
18+
After `rivet-engine` restarts, there is a short window where gateway traffic to already-existing warmed actors can hang. This affects:
19+
20+
- action calls through the client
21+
- HTTP actor requests through `/gateway/{actor}/request/...`
22+
- raw actor WebSockets through `/gateway/{actor}/websocket/...`
23+
24+
New actor keys work immediately while warmed existing keys can hang, so this looks like stale or not-yet-settled gateway/serverless actor routing state for existing actors.
25+
26+
## Timing Window
27+
28+
Timing is measured from the moment engine `/health` returns after restart.
29+
30+
Observed local window:
31+
32+
- Unsafe: `0-2400ms`
33+
- Flaky edge: roughly `2500-3250ms`
34+
- Local minimum to trust: `3000ms`
35+
- Conservative diagnostic delay: `5000ms`
36+
37+
This is not a crisp timer. The edge moves run-to-run.
38+
39+
## Baseline Actor Action Results
40+
41+
These probes use the same existing actor key and a new key after engine restart.
42+
43+
No heartbeat, immediate probe:
44+
45+
- same-handle `getCount`: timed out
46+
- same-handle `tick`: timed out
47+
- fresh handle with same key `getCount`: timed out
48+
- new key `tick`: succeeded
49+
50+
This reproduced the "actor bricked" symptom:
51+
52+
```text
53+
bricked actor symptom reproduced. mode=idle failedPostRestartProbes=3 before=0
54+
```
55+
56+
No heartbeat, delayed probe:
57+
58+
- same existing key succeeded
59+
- new key succeeded
60+
61+
This means the actor is not permanently broken in the local harness. It is unreachable through the existing-key gateway path during the post-restart race window.
62+
63+
## Heartbeat Results
64+
65+
The runtime supports `RIVETKIT_HEARTBEAT_MODE=none|sqlite|kv`.
66+
67+
Immediate post-restart probes still reproduced the brick with:
68+
69+
- no heartbeat
70+
- SQLite heartbeat
71+
- KV heartbeat
72+
73+
Both SQLite and KV actor-originated heartbeat traffic could recover after restart while gateway/client traffic to the same warmed existing actor key still hung.
74+
75+
Conclusion: heartbeat success does not prove gateway readiness, and SQLite is not special.
76+
77+
## HTTP Gateway Health Sweep
78+
79+
The actor exposes an `onRequest` health endpoint at:
80+
81+
```text
82+
/gateway/{actor}/request/health
83+
```
84+
85+
The sweep warms one existing actor key per delay before restart, then probes each key once after engine `/health` returns. This avoids an early failed probe poisoning later delay measurements.
86+
87+
Coarse sweep:
88+
89+
```text
90+
0ms: timeout
91+
250ms: timeout
92+
500ms: timeout
93+
1000ms: timeout
94+
2000ms: timeout
95+
3000ms: 200 OK
96+
5000ms: 200 OK
97+
8000ms: 200 OK
98+
12000ms: 200 OK
99+
```
100+
101+
Narrow sweeps:
102+
103+
```text
104+
2000ms: timeout
105+
2250ms: timeout
106+
2500ms: timeout in one run, success in another
107+
2750ms: success in one run
108+
3000ms: success
109+
3250ms: success
110+
```
111+
112+
Low repeat:
113+
114+
```text
115+
1000ms: timeout
116+
1500ms: timeout
117+
1800ms: timeout
118+
2000ms: timeout
119+
2200ms: timeout
120+
2400ms: timeout
121+
```
122+
123+
HTTP conclusion: gateway HTTP actor requests become reliably usable around `3s` locally, with `5s` as the safer diagnostic delay.
124+
125+
## WebSocket Ping/Pong Sweep
126+
127+
The actor exposes `onWebSocket` ping/pong at:
128+
129+
```text
130+
/gateway/{actor}/websocket/ping
131+
```
132+
133+
The client opens the WebSocket with Rivet gateway subprotocols:
134+
135+
```text
136+
rivet
137+
rivet_encoding.bare
138+
```
139+
140+
Then it sends:
141+
142+
```json
143+
{"type":"ping","sentAt":...}
144+
```
145+
146+
And waits for:
147+
148+
```json
149+
{"type":"pong","sentAt":...}
150+
```
151+
152+
Coarse sweep:
153+
154+
```text
155+
0ms: timeout
156+
250ms: timeout
157+
500ms: timeout
158+
1000ms: timeout
159+
2000ms: timeout
160+
3000ms: pong
161+
5000ms: pong
162+
8000ms: pong
163+
12000ms: pong
164+
```
165+
166+
Narrow sweeps:
167+
168+
```text
169+
2000ms: timeout
170+
2250ms: timeout
171+
2500ms: timeout
172+
2750ms: timeout in one run
173+
3000ms: timeout in one run, success in another
174+
3250ms: pong
175+
```
176+
177+
Edge repeat:
178+
179+
```text
180+
2800ms: pong
181+
3000ms: pong
182+
3200ms: pong
183+
3400ms: pong
184+
3600ms: pong
185+
```
186+
187+
Low repeat:
188+
189+
```text
190+
2200ms: timeout
191+
2400ms: timeout
192+
2600ms: timeout
193+
2800ms: pong
194+
```
195+
196+
WebSocket conclusion: raw actor WebSocket ping/pong has the same post-restart readiness window as HTTP gateway traffic. It is not action-specific.
197+
198+
## Commit During Restart
199+
200+
The harness can coordinate an actor action that starts a SQLite transaction, signals the driver, then attempts `COMMIT` after the engine is stopped.
201+
202+
Immediate post-restart probes after this failed commit reproduced the same shape:
203+
204+
- failed in-flight operation
205+
- same existing key probes timed out
206+
- new key succeeded
207+
208+
Delayed post-restart probes passed.
209+
210+
This is consistent with the gateway/serverless actor routing race, not durable SQLite session poisoning.
211+
212+
## Important Corrections
213+
214+
Earlier, it looked like a SQLite heartbeat caused later gateway probes to pass. That was wrong. A no-heartbeat delayed control also passed.
215+
216+
The actual variable was delay after engine health, not SQLite activity.
217+
218+
## Useful Logs
219+
220+
Representative local logs:
221+
222+
```text
223+
/tmp/rivet-idle-none-immediate.log
224+
/tmp/rivet-idle-kv-immediate.log
225+
/tmp/rivet-idle-sqlite-immediate.log
226+
/tmp/rivet-idle-none-after.log
227+
/tmp/rivet-idle-kv-after.log
228+
/tmp/rivet-idle-sqlite-after.log
229+
/tmp/rivet-commit-none-immediate.log
230+
/tmp/rivet-commit-none-after.log
231+
/tmp/rivet-gateway-health-sweep.log
232+
/tmp/rivet-gateway-health-sweep-narrow.log
233+
/tmp/rivet-gateway-health-sweep-edge.log
234+
/tmp/rivet-gateway-health-sweep-low.log
235+
/tmp/rivet-gateway-websocket-sweep.log
236+
/tmp/rivet-gateway-websocket-sweep-narrow.log
237+
/tmp/rivet-gateway-websocket-sweep-edge.log
238+
/tmp/rivet-gateway-websocket-sweep-low.log
239+
```
240+
241+
## Commands
242+
243+
Action/client repro:
244+
245+
```bash
246+
RIVETKIT_ENGINE_RESTART_MODE=idle \
247+
RIVETKIT_HEARTBEAT_MODE=none \
248+
RIVETKIT_POST_RESTART_PROBE_TIMING=immediate \
249+
node --import tsx tests/driver/engine-restart-serverless.ts
250+
```
251+
252+
HTTP health sweep:
253+
254+
```bash
255+
RIVETKIT_ENGINE_RESTART_MODE=idle \
256+
RIVETKIT_HEARTBEAT_MODE=none \
257+
RIVETKIT_GATEWAY_HEALTH_DELAYS_MS=0,250,500,1000,2000,3000,5000,8000,12000 \
258+
node --import tsx tests/driver/engine-restart-serverless.ts
259+
```
260+
261+
WebSocket ping/pong sweep:
262+
263+
```bash
264+
RIVETKIT_ENGINE_RESTART_MODE=idle \
265+
RIVETKIT_HEARTBEAT_MODE=none \
266+
RIVETKIT_GATEWAY_WEBSOCKET_DELAYS_MS=0,250,500,1000,2000,3000,5000,8000,12000 \
267+
node --import tsx tests/driver/engine-restart-serverless.ts
268+
```
269+
270+
## Current Theory
271+
272+
Engine `/health` returning does not mean gateway/serverless routing for previously warmed existing actors is fully settled.
273+
274+
Requests in the first few seconds can attach to stale or incomplete actor routing/request state. Those requests hang until the caller timeout. A new key succeeds because it follows a fresh actor allocation path instead of the stale existing-key path.
275+
276+
The likely fix area is the gateway/serverless/pegboard-envoy readiness and stale actor state path after engine restart, especially around existing actor generation, in-flight wake, stopped state, and pending request routing.

0 commit comments

Comments
 (0)