Skip to content

Commit 0dc1a65

Browse files
Sayan-claude
andauthored
OOM and process telemetry (#254)
### Why Customers running browsers on Kernel currently get no signal when something inside the VM dies. Chrome renderer gets OOM-killed by the kernel? Silent. Mutter segfaults? Silent. The session just degrades and the customer has to guess. This PR adds two always-on event types that flow through the existing `EventStream` → SSE/S2 pipeline so they show up on the dashboard like any other browser event. ### Two sources, no overlap Different failures only surface on different channels, so we need both: | Failure | kmsg | supervisord | | --- | :-: | :-: | | Chrome renderer subprocess OOM | ✓ | ✗ (not supervised) | | Chromium parent OOM | ✓ | ✓ | | Chromium segfault / V8 abort | ✗ | ✓ | | `kernel-images-api` panic | ✗ | ✓ (on respawn) | No de-dup if both fire for the same kill — that overlap itself is a signal (RAM exhaustion vs. process bug). ### Design choices **Public schema is decoupled from impl.** `phase: startup|running|gave_up` instead of supervisord's `RUNNING|STARTING|BACKOFF`; `Source.Event: "service.crashed"` instead of `"supervisord.process_exited"`. If we ever swap supervisord for systemd-style, the contract holds. **OOM payload answers the customer's actual question.** "Chrome died at 2GB RSS" tells you nothing; "chrome held 4.8GB on a 2GB system with 17MB free, top tasks were chrome/mutter/sshd/systemd" tells you it was a chrome memory leak, not an infra problem. We parse the kernel's full Mem-Info + Tasks-state dump (atomically present in kmsg before the kill line) for total/free memory and the top-5 processes by RSS. No race, no separate `/proc/meminfo` read. **Custom state machine on `euank/go-kmsg-parser` rather than `cadvisor/oomparser`.** cadvisor's parser is solid but doesn't extract `rss_kb`, hard-codes `/dev/kmsg` open-from-start (which would replay the ring buffer on every API restart), and drags `klog`. Building on the line-level parser directly lets us `SeekEnd()` on open, extract the full payload, and stay in our logger. **Shim is a separate binary talking to the API over HTTP.** Supervisord's eventlistener model requires a separate process; the API already exposes `/telemetry/events`; running on a localhost trusted boundary so no auth needed. Adding a unix socket would gain nothing. **`startretries=999999` on the shim.** Supervisord doesn't emit events about its own eventlisteners, so if the shim ever entered FATAL state we'd lose all `service_crashed` telemetry silently. The shim is tiny and side-effect-free; effectively infinite restarts is the right safety bias. ### Known limitations (intentional) - **API-self-crash isn't captured** — when `kernel-images-api` dies, the shim's POST fails and the event is lost. Unikraft Cloud's VM-level monitoring catches it at the platform layer. Buffering inside the shim would close the gap but isn't worth the scope here. - **`process_name` truncates at 15 chars** — fundamental kernel `TASK_COMM_LEN` limit. Documented. - **Page size hard-coded to 4 KiB** — correct on x86_64 Unikraft; would be wrong on ARM 16K/64K. Documented. ### Test plan - [x] State machine unit tests: canonical 5.x dump, legacy 4.x (no Mem-Info), top-N capping, sequential kills, abandoned section, watchdog discriminates noise vs recognised lines - [x] Round-trip: stub kmsg source → Monitor → EventStream → JSON - [x] Shim: byte-level regression test for `RESULT 2\nOK` (no trailing newline; previous version deadlocked the listener after one event), phase mapping for RUNNING/STARTING/BACKOFF, unknown-state skip - [x] `go vet`, `go test -race` (full suite minus e2e), cross-platform build end to end tests | Scenario | Mechanism | Expected event | | --- | --- | --- | | Service crash mid-running | `docker exec ... kill -KILL $(pgrep -f /opt/chrome-for-testing/chromium)` | `service_crashed` with `phase=running` | | Service exhausts restart retries | install a flaky supervisord program (`sleep 0.1; exit 1`, `startretries=3`) and `supervisorctl start` it | `service_crashed` with `phase=gave_up`, no `pid` | | Clean stop suppressed (negative) | `docker exec ... supervisorctl stop chromium` | no event (only SSE keepalive) | | Synthetic OOM dump | inject canonical kmsg lines via `echo "<6>$line" > /dev/kmsg` (opener, `CPU/PID/Comm`, Mem-Info, Tasks-state, `oom-kill:constraint=...`, `Killed process`) | one `system_oom_kill` with `constraint=none`, `mem_total_kb`, top-1 task `chromium`, `trigger_process_name=chromium` | | Real cgroup OOM | `docker run --memory 512m ...` then run a python memory-hog inside | `system_oom_kill` with `constraint=memcg`; `mem_total_kb`/`mem_free_kb` omitted (memcg dumps skip global Mem-Info); `top_tasks` names are single tokens | | Real Linux 6.x VM | `echo 1 > /proc/sys/kernel/sysrq; echo f > /proc/sysrq-trigger` | `system_oom_kill` visible in `/var/log/supervisord/kernel-images-api` | Full reproduction steps for each row live in [`server/lib/sysmon/README.md`](server/lib/sysmon/README.md). <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Introduces always-on system telemetry and a new supervisord eventlistener in browser images; failures are mostly best-effort (dropped POSTs, optional kmsg) but mis-parsing or shim protocol bugs could affect crash/OOM visibility. > > **Overview** > Adds **always-on VM failure telemetry** on the existing `EventStream` → SSE/S2 path: **`system_oom_kill`** (kernel OOM via `/dev/kmsg`) and **`service_crashed`** (unexpected supervisord exits / restart give-up). > > **In-process:** `lib/sysmon` tails kmsg (with `SeekEnd` on start), parses full OOM dumps (Mem-Info, Tasks state, constraint, trigger vs victim), and publishes `system_oom_kill`. Wired from `cmd/api/main`; a failed kmsg open is logged and the API keeps running. > > **Sidecar:** `cmd/supervisord-shim` is a supervisord eventlistener that maps `PROCESS_STATE_EXITED` (`expected=0`) and `PROCESS_STATE_FATAL` to `service_crashed` phases (`startup` / `running` / `gave_up`) and **POSTs** to `POST /telemetry/events` on localhost. It always ACKs supervisord (`RESULT 2\nOK` without a trailing newline). > > **Images & schema:** Headful/headless Dockerfiles build/install the shim and ship `supervisord-shim.conf` (larger event buffer, very high `startretries`). OpenAPI/oapi gain the new event types and payloads; `go-kmsg-parser` is a new dependency. Unit tests cover the kmsg state machine, shim protocol, and end-to-end publish. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 5cdbd63. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Sayan- <1415138+Sayan-@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 3ca0415 commit 0dc1a65

18 files changed

Lines changed: 2554 additions & 326 deletions

File tree

images/chromium-headful/Dockerfile

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,12 @@ RUN --mount=type=cache,target=/root/.cache/go-build,id=$CACHEIDPREFIX-go-build \
3333
GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} \
3434
go build -ldflags="-s -w" -o /out/wrapper ./cmd/wrapper
3535

36+
# Build supervisord eventlistener shim
37+
RUN --mount=type=cache,target=/root/.cache/go-build,id=$CACHEIDPREFIX-go-build \
38+
--mount=type=cache,target=/go/pkg/mod,id=$CACHEIDPREFIX-go-pkg-mod \
39+
GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} \
40+
go build -ldflags="-s -w" -o /out/kernel-images-supervisord-shim ./cmd/supervisord-shim
41+
3642
# webrtc client
3743
FROM node:22-bullseye-slim AS client
3844
WORKDIR /src
@@ -378,6 +384,7 @@ RUN chmod +x /usr/local/bin/init-envoy.sh
378384
# copy the kernel-images API binary built in the builder stage
379385
COPY --from=server-builder /out/kernel-images-api /usr/local/bin/kernel-images-api
380386
COPY --from=server-builder /out/chromium-launcher /usr/local/bin/chromium-launcher
387+
COPY --from=server-builder /out/kernel-images-supervisord-shim /usr/local/bin/kernel-images-supervisord-shim
381388
COPY --from=server-builder /out/wrapper /wrapper
382389

383390
# Copy and compile the Playwright daemon
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
[eventlistener:supervisord-shim]
2+
command=/usr/local/bin/kernel-images-supervisord-shim
3+
events=PROCESS_STATE_EXITED,PROCESS_STATE_FATAL
4+
; buffer_size defaults to 10 which overflows when several supervised
5+
; services flap in quick succession. Bump it so a burst of crashes
6+
; doesn't cause supervisord to drop events before the shim drains them.
7+
buffer_size=100
8+
autostart=true
9+
autorestart=true
10+
; Effectively infinite restart attempts. Supervisord does not emit
11+
; events about its own eventlisteners, so if the shim ever enters FATAL
12+
; state we lose all service_crashed telemetry with no signal. The shim
13+
; is tiny and side-effect-free; a transient failure should never reach
14+
; the default startretries=3 cap.
15+
startretries=999999
16+
stderr_logfile=/var/log/supervisord/supervisord-shim
17+
; stdout is the eventlistener protocol channel; do not redirect.

images/chromium-headless/image/Dockerfile

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,12 @@ RUN --mount=type=cache,target=/root/.cache/go-build,id=$CACHEIDPREFIX-go-build \
3434
GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} \
3535
go build -ldflags="-s -w" -o /out/wrapper ./cmd/wrapper
3636

37+
# Build supervisord eventlistener shim
38+
RUN --mount=type=cache,target=/root/.cache/go-build,id=$CACHEIDPREFIX-go-build \
39+
--mount=type=cache,target=/go/pkg/mod,id=$CACHEIDPREFIX-go-pkg-mod \
40+
GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} \
41+
go build -ldflags="-s -w" -o /out/kernel-images-supervisord-shim ./cmd/supervisord-shim
42+
3743
FROM docker.io/ubuntu:22.04 AS ffmpeg-downloader
3844

3945
# Allow cross-compilation when building with BuildKit platforms
@@ -256,6 +262,7 @@ RUN chmod +x /usr/local/bin/bake-certs.sh && /usr/local/bin/bake-certs.sh && rm
256262
# Copy the kernel-images API binary built in the builder stage
257263
COPY --from=server-builder /out/kernel-images-api /usr/local/bin/kernel-images-api
258264
COPY --from=server-builder /out/chromium-launcher /usr/local/bin/chromium-launcher
265+
COPY --from=server-builder /out/kernel-images-supervisord-shim /usr/local/bin/kernel-images-supervisord-shim
259266

260267
# Copy and compile the Playwright daemon
261268
COPY server/runtime/playwright-daemon.ts /tmp/playwright-daemon.ts
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
[eventlistener:supervisord-shim]
2+
command=/usr/local/bin/kernel-images-supervisord-shim
3+
events=PROCESS_STATE_EXITED,PROCESS_STATE_FATAL
4+
; buffer_size defaults to 10 which overflows when several supervised
5+
; services flap in quick succession. Bump it so a burst of crashes
6+
; doesn't cause supervisord to drop events before the shim drains them.
7+
buffer_size=100
8+
autostart=true
9+
autorestart=true
10+
; Effectively infinite restart attempts. Supervisord does not emit
11+
; events about its own eventlisteners, so if the shim ever enters FATAL
12+
; state we lose all service_crashed telemetry with no signal. The shim
13+
; is tiny and side-effect-free; a transient failure should never reach
14+
; the default startretries=3 cap.
15+
startretries=999999
16+
stderr_logfile=/var/log/supervisord/supervisord-shim
17+
; stdout is the eventlistener protocol channel; do not redirect.

server/cmd/api/main.go

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,7 @@ import (
3030
oapi "github.com/kernel/kernel-images/server/lib/oapi"
3131
"github.com/kernel/kernel-images/server/lib/recorder"
3232
"github.com/kernel/kernel-images/server/lib/scaletozero"
33+
"github.com/kernel/kernel-images/server/lib/sysmon"
3334
"github.com/kernel/kernel-images/server/lib/telemetry"
3435
)
3536

@@ -103,6 +104,14 @@ func main() {
103104
}
104105
telemetrySession := telemetry.NewTelemetrySession(eventStream)
105106

107+
// VM-internal failure telemetry. OOM kills come from /dev/kmsg here;
108+
// service_crashed events arrive via POST /telemetry/events from the
109+
// supervisord-shim child process. Failure to open /dev/kmsg is not
110+
// fatal — the rest of the API should stay usable without CAP_SYSLOG.
111+
if err := sysmon.New(eventStream, slogger).Start(ctx); err != nil {
112+
slogger.Error("sysmon: kmsg OOM monitor disabled", "err", err)
113+
}
114+
106115
// Optional S2 storage sink.
107116
var s2Writer *events.S2StorageWriter
108117
if config.S2Basin != "" && config.S2AccessToken != "" && config.S2Stream != "" {
Lines changed: 256 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,256 @@
1+
// Command supervisord-shim is a tiny supervisord eventlistener that
2+
// translates PROCESS_STATE_EXITED (expected=0) and PROCESS_STATE_FATAL
3+
// events into BrowserServiceCrashedEvent payloads and POSTs them to the
4+
// local kernel-images-api telemetry endpoint.
5+
//
6+
// All schema-mapping and event publishing logic lives here; lib/sysmon
7+
// does not handle supervisord events. Keeping the shim as the sole owner
8+
// of the supervisord protocol means lib/sysmon stays single-purpose
9+
// (kmsg only).
10+
//
11+
// Wire protocol per supervisord docs (http://supervisord.org/events.html):
12+
//
13+
// stdout: "READY\n"
14+
// stdin: header line ("ver:3.0 ... eventname:PROCESS_STATE_EXITED len:54\n")
15+
// stdin: payload of `len` bytes (no trailing newline)
16+
// stdout: "RESULT 2\nOK" (always; ACK regardless of downstream success)
17+
//
18+
// The result frame intentionally has NO trailing newline: supervisord
19+
// reads exactly the declared number of bytes after the header newline,
20+
// and a trailing newline would leak into the buffer and corrupt the
21+
// subsequent READY token, deadlocking the listener after one event.
22+
//
23+
// We always ACK with OK so supervisord doesn't quarantine us when the
24+
// downstream HTTP target is briefly unavailable. The events are
25+
// best-effort; if the API is down, we drop and log.
26+
//
27+
// All logging goes to stderr — stdout is the supervisord protocol channel.
28+
package main
29+
30+
import (
31+
"bufio"
32+
"bytes"
33+
"context"
34+
"fmt"
35+
"io"
36+
"log"
37+
"net/http"
38+
"os"
39+
"os/signal"
40+
"strconv"
41+
"strings"
42+
"syscall"
43+
"time"
44+
45+
oapi "github.com/kernel/kernel-images/server/lib/oapi"
46+
)
47+
48+
const (
49+
defaultAPIBaseURL = "http://127.0.0.1:10001"
50+
httpTimeout = 2 * time.Second
51+
)
52+
53+
func main() {
54+
log.SetOutput(os.Stderr)
55+
log.SetFlags(log.LstdFlags | log.Lmicroseconds)
56+
57+
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
58+
defer stop()
59+
go func() {
60+
<-ctx.Done()
61+
_ = os.Stdin.Close()
62+
}()
63+
64+
baseURL := os.Getenv("KERNEL_IMAGES_API_BASE_URL")
65+
if baseURL == "" {
66+
baseURL = defaultAPIBaseURL
67+
}
68+
69+
client, err := oapi.NewClientWithResponses(baseURL, oapi.WithHTTPClient(&http.Client{Timeout: httpTimeout}))
70+
if err != nil {
71+
log.Fatalf("init oapi client: %v", err)
72+
}
73+
74+
in := bufio.NewReader(os.Stdin)
75+
out := bufio.NewWriter(os.Stdout)
76+
77+
for {
78+
if _, err := out.WriteString("READY\n"); err != nil {
79+
log.Fatalf("write READY: %v", err)
80+
}
81+
if err := out.Flush(); err != nil {
82+
log.Fatalf("flush READY: %v", err)
83+
}
84+
85+
header, payload, err := readEvent(in)
86+
if err != nil {
87+
if err == io.EOF {
88+
return
89+
}
90+
log.Fatalf("read event: %v", err)
91+
}
92+
93+
// Try to publish but always ACK supervisord.
94+
ev, ok := mapEvent(header, payload)
95+
switch {
96+
case ok:
97+
if perr := publish(ctx, client, ev); perr != nil {
98+
log.Printf("publish telemetry event: %v", perr)
99+
}
100+
case isCrashEvent(header["eventname"]):
101+
// We subscribed to this event type but couldn't map it.
102+
// Most likely cause: supervisord emitted a from_state we
103+
// don't have a public phase for. Logging means a future
104+
// supervisord behavior change shows up in stderr instead
105+
// of silent telemetry loss.
106+
log.Printf("skipped crash event: eventname=%q from_state=%q processname=%q expected=%q",
107+
header["eventname"], payload["from_state"], payload["processname"], payload["expected"])
108+
}
109+
110+
if err := writeResultOK(out); err != nil {
111+
log.Fatalf("write RESULT: %v", err)
112+
}
113+
}
114+
}
115+
116+
// writeResultOK ACKs a single event. See the file header for why the
117+
// frame body has no trailing newline.
118+
func writeResultOK(out *bufio.Writer) error {
119+
if _, err := out.WriteString("RESULT 2\nOK"); err != nil {
120+
return err
121+
}
122+
return out.Flush()
123+
}
124+
125+
// readEvent reads one supervisord event: a header line followed by a
126+
// payload of declared length.
127+
func readEvent(in *bufio.Reader) (map[string]string, map[string]string, error) {
128+
headerLine, err := in.ReadString('\n')
129+
if err != nil {
130+
return nil, nil, err
131+
}
132+
header := parseFields(strings.TrimRight(headerLine, "\n"))
133+
134+
lenStr, ok := header["len"]
135+
if !ok {
136+
return nil, nil, fmt.Errorf("missing len in header: %q", headerLine)
137+
}
138+
n, err := strconv.Atoi(lenStr)
139+
if err != nil {
140+
return nil, nil, fmt.Errorf("invalid len %q: %w", lenStr, err)
141+
}
142+
143+
buf := make([]byte, n)
144+
if _, err := io.ReadFull(in, buf); err != nil {
145+
return nil, nil, fmt.Errorf("read payload: %w", err)
146+
}
147+
payload := parseFields(string(buf))
148+
return header, payload, nil
149+
}
150+
151+
// parseFields parses supervisord's "key:value key:value" tokenization.
152+
// Values are split on the first colon; supervisord does not escape colons
153+
// in values, but in practice the values we care about (process names,
154+
// states, ints) never contain them.
155+
func parseFields(s string) map[string]string {
156+
out := make(map[string]string)
157+
for _, tok := range strings.Fields(s) {
158+
i := strings.IndexByte(tok, ':')
159+
if i < 0 {
160+
continue
161+
}
162+
out[tok[:i]] = tok[i+1:]
163+
}
164+
return out
165+
}
166+
167+
// phaseForExited maps the supervisord state a process exited from to the
168+
// public lifecycle phase. EXITED in supervisord always originates from
169+
// RUNNING (post-startsecs); STARTING-during-startsecs-violation routes
170+
// through BACKOFF→FATAL, not EXITED. We still defend against STARTING
171+
// here in case a future supervisord version changes the state machine,
172+
// and we treat anything else as "unknown" so the caller logs and skips
173+
// rather than inventing a phase.
174+
func phaseForExited(fromState string) (oapi.BrowserServiceCrashedEventDataPhase, bool) {
175+
switch fromState {
176+
case "RUNNING":
177+
return oapi.BrowserServiceCrashedEventDataPhaseRunning, true
178+
case "STARTING":
179+
return oapi.BrowserServiceCrashedEventDataPhaseStartup, true
180+
default:
181+
return "", false
182+
}
183+
}
184+
185+
// isCrashEvent reports whether the supervisord eventname is one we
186+
// subscribed to. Used by the main loop to log when a target event was
187+
// dropped instead of silently skipping it.
188+
func isCrashEvent(eventName string) bool {
189+
return eventName == "PROCESS_STATE_EXITED" || eventName == "PROCESS_STATE_FATAL"
190+
}
191+
192+
// mapEvent decides whether to publish and constructs the event payload.
193+
// Returns ok=false for events we deliberately skip (intentional stops,
194+
// non-crash event types, or unknown lifecycle transitions).
195+
func mapEvent(header, payload map[string]string) (oapi.PublishEventRequest, bool) {
196+
var phase oapi.BrowserServiceCrashedEventDataPhase
197+
switch header["eventname"] {
198+
case "PROCESS_STATE_EXITED":
199+
// expected=0 means the exit was not in `exitcodes` — i.e. a
200+
// crash. expected=1 means clean shutdown (operator-initiated
201+
// stop, or a configured exit code). Skip the latter.
202+
if payload["expected"] != "0" {
203+
return oapi.PublishEventRequest{}, false
204+
}
205+
p, ok := phaseForExited(payload["from_state"])
206+
if !ok {
207+
return oapi.PublishEventRequest{}, false
208+
}
209+
phase = p
210+
case "PROCESS_STATE_FATAL":
211+
// FATAL is reached exclusively by the BACKOFF→FATAL edge after
212+
// supervisord exhausts startretries. The from_state is always
213+
// BACKOFF here, and the semantic is "gave up trying to start".
214+
phase = oapi.BrowserServiceCrashedEventDataPhaseGaveUp
215+
default:
216+
return oapi.PublishEventRequest{}, false
217+
}
218+
219+
name := payload["processname"]
220+
if name == "" {
221+
return oapi.PublishEventRequest{}, false
222+
}
223+
224+
data := oapi.BrowserServiceCrashedEventData{
225+
ServiceName: name,
226+
Phase: phase,
227+
}
228+
if pidStr := payload["pid"]; pidStr != "" {
229+
if pid, err := strconv.Atoi(pidStr); err == nil {
230+
data.Pid = &pid
231+
}
232+
}
233+
234+
category := oapi.PublishEventRequestCategory(oapi.TelemetryEventCategorySystem)
235+
sourceEvent := "service.crashed"
236+
return oapi.PublishEventRequest{
237+
Type: string(oapi.ServiceCrashed),
238+
Category: &category,
239+
Source: &oapi.BrowserEventSource{
240+
Kind: oapi.LocalProcess,
241+
Event: &sourceEvent,
242+
},
243+
Data: data,
244+
}, true
245+
}
246+
247+
func publish(ctx context.Context, client *oapi.ClientWithResponses, body oapi.PublishEventRequest) error {
248+
resp, err := client.PublishTelemetryEventWithResponse(ctx, body)
249+
if err != nil {
250+
return err
251+
}
252+
if resp.StatusCode() >= 300 {
253+
return fmt.Errorf("status %d: %s", resp.StatusCode(), bytes.TrimSpace(resp.Body))
254+
}
255+
return nil
256+
}

0 commit comments

Comments
 (0)