Skip to content

Commit 2279cd0

Browse files
authored
CDP telemetry fixes (#251)
## Overview ### Bug When chromium restarted mid-session (`PATCH /chromium/flags`, `/configure`, etc., or a crash + supervisord autorestart), the CDP proxy emitted `cdp_disconnect.reason = "client_close"` instead of `"upstream_changed"`. Live repro confirmed: ~1.6s after the WS opened, reason was `client_close` even though chromium clearly went away. ### Root cause A goroutine watched `mgr.Subscribe()` for the new chromium URL and stamped `reasonOverride = upstream_changed` before triggering pump teardown. But the upstream's TCP EOF (chromium dying) reaches the pump *before* the new "DevTools listening on" log line is emitted (~5–8 s gap). So `cleanup` always ran first, saw `reasonOverride == nil`, and fell through to `client_close`. ### Fix - `wsproxy.Pump`'s `onClose` callback now receives a `PumpExitCause` (`client` / `upstream` / `context`). - Deleted the URL-watcher goroutine + `reasonOverride` in `devtoolsproxy`. Reason resolution moved into cleanup as a small helper, `resolveDisconnectReason`, which: - returns `client_close` / `context_cancelled` immediately from the cause, - on upstream cause, checks `mgr.Current()` and otherwise waits up to `restartConfirmWait` (10 s) on `urlCh` for a new URL — new URL ⇒ `upstream_changed`, timeout ⇒ `upstream_error`. ### Tests - `TestResolveDisconnectReason` — 6 table cases covering all four reasons (sub-100 ms). - `TestWebSocketProxyHandler_EmitsUpstreamChangedOnMidStreamRestart` — integration through the handler with `restartConfirmWait` shortened. - Existing dial-time `upstream_error` + client_close tests continue to pass. ### Verification Repro'd live against the headful image. Before fix: `reason=client_close, duration_ms=1620`. After fix: `reason=upstream_changed, duration_ms=10388`. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes CDP WebSocket teardown and telemetry classification on Chromium restarts; well-tested but affects automation clients that rely on disconnect reasons and session timing. > > **Overview** > Fixes **CDP disconnect telemetry** when Chromium restarts mid-session: `cdp_disconnect` was often emitted as `client_close` because upstream EOF arrived before the new DevTools URL was published. > > **`wsproxy.Pump`** now passes a **`PumpExitCause`** (`client` / `upstream` / `context`) into cleanup. **`devtoolsproxy`** drops the URL-watcher `reasonOverride` path; on upstream exit it **`resolveDisconnectReason`** polls `UpstreamManager.Current()` for up to **`restartConfirmWait`** (10s) to choose **`upstream_changed`** vs **`upstream_error`**. When the upstream URL changes while a session is still open, the watcher **closes the upstream WebSocket** so the pump exits with the right cause. **`cdp_disconnect`** uses a pinned **`disconnectedAt`** for **`duration_ms`** and event timestamp so metrics are not inflated by the restart poll. > > New unit/integration tests cover reason resolution, mid-stream restart, and proactive kick on URL change. The rest of the diff is mostly **import ordering**, **gofmt**, and comment/whitespace cleanup across API, events, policy, and e2e helpers. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit f43d049. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Sayan- <1415138+Sayan-@users.noreply.github.com>
1 parent 0dc1a65 commit 2279cd0

26 files changed

Lines changed: 433 additions & 140 deletions

File tree

server/cmd/api/api/api.go

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,14 +12,14 @@ import (
1212

1313
"github.com/kernel/kernel-images/server/lib/cdpmonitor"
1414
"github.com/kernel/kernel-images/server/lib/devtoolsproxy"
15-
"github.com/kernel/kernel-images/server/lib/telemetry"
1615
"github.com/kernel/kernel-images/server/lib/events"
1716
"github.com/kernel/kernel-images/server/lib/logger"
1817
"github.com/kernel/kernel-images/server/lib/nekoclient"
1918
oapi "github.com/kernel/kernel-images/server/lib/oapi"
2019
"github.com/kernel/kernel-images/server/lib/policy"
2120
"github.com/kernel/kernel-images/server/lib/recorder"
2221
"github.com/kernel/kernel-images/server/lib/scaletozero"
22+
"github.com/kernel/kernel-images/server/lib/telemetry"
2323
)
2424

2525
type cdpMonitorController interface {
@@ -99,7 +99,7 @@ func New(
9999
stz scaletozero.PinnedController,
100100
nekoAuthClient *nekoclient.AuthClient,
101101
telemetrySession *telemetry.TelemetrySession,
102-
eventStream *events.EventStream,
102+
eventStream *events.EventStream,
103103
displayNum int,
104104
) (*ApiService, error) {
105105
switch {

server/cmd/api/api/api_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,12 +12,12 @@ import (
1212
"log/slog"
1313

1414
"github.com/kernel/kernel-images/server/lib/devtoolsproxy"
15-
"github.com/kernel/kernel-images/server/lib/telemetry"
1615
"github.com/kernel/kernel-images/server/lib/events"
1716
"github.com/kernel/kernel-images/server/lib/nekoclient"
1817
oapi "github.com/kernel/kernel-images/server/lib/oapi"
1918
"github.com/kernel/kernel-images/server/lib/recorder"
2019
"github.com/kernel/kernel-images/server/lib/scaletozero"
20+
"github.com/kernel/kernel-images/server/lib/telemetry"
2121
"github.com/stretchr/testify/assert"
2222
"github.com/stretchr/testify/require"
2323
)

server/cmd/api/api/computer_test.go

Lines changed: 13 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -262,7 +262,7 @@ func TestGaussianDelay_WelfordVelocityVariance(t *testing.T) {
262262
for i := range distances {
263263
t_norm := float64(i) / float64(steps)
264264
base := 5.0 + 15.0*math.Sin(t_norm*math.Pi) // 5-20px, peaked in middle
265-
distances[i] = base + rng.Float64()*3.0 // small random variation
265+
distances[i] = base + rng.Float64()*3.0 // small random variation
266266
}
267267

268268
// Gaussian delays → velocity variance
@@ -302,27 +302,27 @@ func TestClampPoints(t *testing.T) {
302302
expected [][2]int
303303
}{
304304
{
305-
name: "no clamping needed",
306-
points: [][2]int{{10, 20}, {50, 50}, {100, 80}},
307-
w: 200, h: 200,
305+
name: "no clamping needed",
306+
points: [][2]int{{10, 20}, {50, 50}, {100, 80}},
307+
w: 200, h: 200,
308308
expected: [][2]int{{10, 20}, {50, 50}, {100, 80}},
309309
},
310310
{
311-
name: "clamp negative x and y",
312-
points: [][2]int{{-10, -20}, {50, 50}},
313-
w: 200, h: 200,
311+
name: "clamp negative x and y",
312+
points: [][2]int{{-10, -20}, {50, 50}},
313+
w: 200, h: 200,
314314
expected: [][2]int{{0, 0}, {50, 50}},
315315
},
316316
{
317-
name: "clamp exceeding screen bounds",
318-
points: [][2]int{{50, 50}, {250, 300}},
319-
w: 200, h: 200,
317+
name: "clamp exceeding screen bounds",
318+
points: [][2]int{{50, 50}, {250, 300}},
319+
w: 200, h: 200,
320320
expected: [][2]int{{50, 50}, {199, 199}},
321321
},
322322
{
323-
name: "clamp both directions",
324-
points: [][2]int{{-5, 250}, {300, -10}, {100, 100}},
325-
w: 200, h: 200,
323+
name: "clamp both directions",
324+
points: [][2]int{{-5, 250}, {300, -10}, {100, 100}},
325+
w: 200, h: 200,
326326
expected: [][2]int{{0, 199}, {199, 0}, {100, 100}},
327327
},
328328
}

server/cmd/api/api/display.go

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,11 @@ import (
1212
"strings"
1313
"time"
1414

15-
nekooapi "github.com/m1k1o/neko/server/lib/oapi"
1615
"github.com/kernel/kernel-images/server/lib/cdpclient"
1716
"github.com/kernel/kernel-images/server/lib/logger"
1817
oapi "github.com/kernel/kernel-images/server/lib/oapi"
1918
"github.com/kernel/kernel-images/server/lib/recorder"
19+
nekooapi "github.com/m1k1o/neko/server/lib/oapi"
2020
)
2121

2222
// PatchDisplay updates the display configuration. When require_idle
@@ -655,4 +655,3 @@ func (s *ApiService) setResolutionViaNeko(ctx context.Context, width, height, re
655655
log.Info("successfully changed resolution via Neko API", "width", width, "height", height, "refresh_rate", refreshRate)
656656
return nil
657657
}
658-

server/cmd/api/api/fs.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,11 +16,11 @@ import (
1616
"os/user"
1717

1818
"github.com/fsnotify/fsnotify"
19-
"github.com/nrednav/cuid2"
2019
"github.com/kernel/kernel-images/server/lib/logger"
2120
oapi "github.com/kernel/kernel-images/server/lib/oapi"
2221
"github.com/kernel/kernel-images/server/lib/ziputil"
2322
"github.com/kernel/kernel-images/server/lib/zstdutil"
23+
"github.com/nrednav/cuid2"
2424
)
2525

2626
// fsWatch represents an in-memory directory watch.

server/cmd/api/api/process.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,10 @@ import (
2323
"github.com/coder/websocket"
2424
"github.com/creack/pty"
2525
"github.com/google/uuid"
26-
openapi_types "github.com/oapi-codegen/runtime/types"
2726
"github.com/kernel/kernel-images/server/lib/logger"
2827
oapi "github.com/kernel/kernel-images/server/lib/oapi"
2928
"github.com/kernel/kernel-images/server/lib/ptyio"
29+
openapi_types "github.com/oapi-codegen/runtime/types"
3030
)
3131

3232
type processHandle struct {

server/cmd/api/api/process_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,9 @@ import (
1313
"time"
1414

1515
"github.com/google/uuid"
16-
openapi_types "github.com/oapi-codegen/runtime/types"
1716
oapi "github.com/kernel/kernel-images/server/lib/oapi"
1817
"github.com/kernel/kernel-images/server/lib/scaletozero"
18+
openapi_types "github.com/oapi-codegen/runtime/types"
1919
"github.com/stretchr/testify/require"
2020
)
2121

server/cmd/chromium-launcher/main.go

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -135,7 +135,6 @@ func main() {
135135
}
136136
}
137137

138-
139138
// execLookPath helps satisfy syscall.Exec's requirement to pass an absolute path.
140139
func execLookPath(file string) (string, error) {
141140
if strings.ContainsRune(file, os.PathSeparator) {

server/cmd/chromium-launcher/main_test.go

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,4 +34,3 @@ func TestExecLookPath(t *testing.T) {
3434
t.Fatalf("execLookPath PATH search failed: p=%q err=%v", p, err)
3535
}
3636
}
37-

server/cmd/shell/main.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,8 +17,8 @@ import (
1717
"time"
1818

1919
"github.com/google/uuid"
20-
openapi_types "github.com/oapi-codegen/runtime/types"
2120
oapi "github.com/kernel/kernel-images/server/lib/oapi"
21+
openapi_types "github.com/oapi-codegen/runtime/types"
2222
"golang.org/x/term"
2323
)
2424

0 commit comments

Comments
 (0)