Skip to content

feat(telemetry): bound scans with deadlines, ship log tail in heartbeat#122

Closed
ashishkurmi wants to merge 1 commit into
mainfrom
akurmi/feat/scan-deadline-and-heartbeat-logs
Closed

feat(telemetry): bound scans with deadlines, ship log tail in heartbeat#122
ashishkurmi wants to merge 1 commit into
mainfrom
akurmi/feat/scan-deadline-and-heartbeat-logs

Conversation

@ashishkurmi
Copy link
Copy Markdown
Member

Summary

Four coordinated changes that cap how long any single scan can run, surface where it's stuck, and prevent the unreaped-child subprocess hang that bricked customer deployments (Coveo Antigravity, RocketMortgage node_scan).

This is intended as a starting point for review — happy to slice into smaller PRs if you'd rather merge the executor fix first and iterate.

1. Subprocess process-group kill on context cancel

internal/executor/{executor,executor_unix,executor_windows}.go

exec.CommandContext only SIGKILLs the immediate child PID on timeout. Electron --version and npm/yarn/pnpm ls fork grandchildren that inherit stdout/stderr; with those FDs still held, cmd.Wait() blocks forever even though the deadline fired. Real.Run / RunWithTimeout / RunInDir now use Setpgid + cmd.Cancel that kills the whole process group, plus cmd.WaitDelay = 2s as belt-and-braces. Unix-only; Windows stub deferred (JobObjects + CREATE_BREAKAWAY_FROM_JOB is a larger change and Windows hosts are less exposed).

Why this matters: RocketMortgage device MKT7127XJF was averaging 3.6 minutes per node project with a code-level 30s timeout. The only way that math works is RunInDir's timeout returning but cmd.Wait() blocking on pipes held by an npm grandchild.

Regression test: internal/executor/executor_unix_test.go — fake binary that backgrounds a 60s sleep; without the fix the call hangs ~60s, with it returns in ~2s. Verified passing locally.

2. Global scan deadline

internal/telemetry/scan_deadline.go, telemetry.go

`telemetry.Run` now wraps its scan body in `context.WithTimeout` (60m default). Override via `STEPSEC_MAX_SCAN_DURATION="45m"` / `"2h"`, or `"0"` to disable.

The telemetry upload uses the outer `runCtx` (no deadline) so partial scan data still ships when the deadline trips mid-scan. Heartbeat and phase-post goroutines are also bound to `runCtx` so they outlive the deadline through the upload phase — the final `telemetry_upload` snapshot still reaches the backend.

3. Per-phase deadlines

internal/telemetry/phase_deadline.go, telemetry.go

New `startPhase` / `endPhase` helpers wrap each `tracker.Start` / `tracker.Finish` pair with a per-phase context. Budgets per phase (`phaseBudgets` map): `device_info` 30s, `ide_scan` 2m, `ai_tools_scan` 5m, `node_scan` 15m, `python_scan` 10m, `telemetry_upload` 10m, default 5m. On deadline, `endPhase` logs a warning and lets the next phase start — partial scan beats no scan.

Numbers were eyeballed off the heartbeat data, not from a rigorous percentile analysis — happy to tune up/down per phase. All 10 phase sites in `telemetry.Run` were refactored to use the helper. Diff at each site is mechanical: `ctx → phaseCtx` and the Start/Finish lines bracketed.

4. Gzipped log tail in heartbeat

internal/telemetry/logcapture.go, `log_tail_emitter.go`, `phase_tracker.go`, `telemetry.go`

`LogCapture` is now backed by a 1 MB ring buffer instead of an unbounded `bytes.Buffer` — the old design would OOM a multi-hour stuck run.

New `logTailEmitter` attaches `LogTailGzipBase64` to `RunStatusInfo` snapshots: last 256 KB of stderr, gzipped (~10× smaller on the wire), throttled to once every 2 minutes so phase-boundary bursts can't spam the backend. Heartbeats, phase-boundary posts, and the final upload all pick this up transparently via `postPhase`.

Backend follow-up required (separate repo, not in this PR): the `/telemetry/run-status` handler needs to accept the new `status_info.log_tail_gzip_b64` field and surface it on the device dashboard.


What's NOT in this PR

  • Lowering `maxNodeProjects = 1000` — kept for a separate decision; adding a demo gif #1 makes the per-project ceiling actually enforceable, which may resolve most of the pain on its own.
  • The `current_phase = "null"` cosmetic bug in heartbeat data — separate small fix.
  • Windows JobObject-based killgroup equivalent — separate, larger piece of work.
  • Backend changes to ingest `log_tail_gzip_b64` — separate repo.

Test plan

  • `go test ./...` passes on darwin/arm64
  • `go vet ./...` clean
  • Cross-builds clean for linux + windows
  • New regression test in `executor_unix_test.go` proves the kill-group fix works (2s instead of ~60s hang)
  • New tests in `log_tail_emitter_test.go` cover throttle window, ring-buffer wraparound, nil safety
  • Manual smoke: run `send-telemetry` end-to-end on a real macOS host, confirm log tail appears in backend status_info
  • Verify on a host with Antigravity installed that scan completes within budget instead of hanging (the original failure mode)

🤖 Generated with Claude Code

Four changes that together cap how long any single scan can run, surface
where it's stuck, and prevent the unreaped-child subprocess hang that
manifested on customer machines (Coveo Antigravity, RocketMortgage node).

1. Subprocess process-group kill on context cancel
   internal/executor/executor.go, executor_unix.go, executor_windows.go
   exec.CommandContext only SIGKILLs the immediate child PID on timeout.
   Electron --version and npm/yarn/pnpm ls fork grandchildren that
   inherit stdout/stderr; with those FDs still held, cmd.Wait() blocks
   forever even though the deadline fired. Real.Run/RunWithTimeout/
   RunInDir now use Setpgid + cmd.Cancel that kills the whole group,
   plus cmd.WaitDelay=2s as belt-and-braces. Unix-specific; Windows
   stub deferred (JobObjects).
   Regression test: executor_unix_test.go.

2. Global scan deadline
   internal/telemetry/scan_deadline.go, telemetry.go
   telemetry.Run now wraps its scan-body context in WithTimeout
   (60m default; override via STEPSEC_MAX_SCAN_DURATION="45m"/"2h",
   or "0" to disable). Telemetry upload uses the outer runCtx so
   partial data ships even when the scan deadline trips mid-run.
   Heartbeat + phase-post goroutines bound to runCtx so they
   outlive the deadline through the upload.

3. Per-phase deadlines
   internal/telemetry/phase_deadline.go, telemetry.go
   New startPhase/endPhase helpers wrap each tracker.Start/Finish
   pair with a per-phase context (budgets per phase in phaseBudgets;
   node_scan 15m, python_scan 10m, etc; default 5m). On deadline,
   endPhase warns and lets the next phase start instead of failing
   the whole scan. All 10 phase sites refactored.

4. Gzipped log tail in heartbeat
   internal/telemetry/logcapture.go, log_tail_emitter.go,
   phase_tracker.go, telemetry.go
   LogCapture is now a 1 MB ring buffer (was unbounded — memory risk
   on multi-hour stuck runs). New logTailEmitter attaches a
   gzip+base64 tail of the last 256 KB of stderr to RunStatusInfo,
   throttled to once every 2 minutes so phase-boundary bursts can't
   spam the backend. Heartbeats and final upload pick it up
   transparently via postPhase.

Test coverage: ring-buffer wraparound, throttle window, nil-safety,
subprocess-kill-on-deadline. All packages: go test ./... clean on
darwin/arm64, cross-builds clean for linux + windows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant