feat(telemetry): bound scans with deadlines, ship log tail in heartbeat by ashishkurmi · Pull Request #122 · step-security/dev-machine-guard

ashishkurmi · 2026-05-28T21:39:58Z

Summary

Four coordinated changes that cap how long any single scan can run, surface where it's stuck, and prevent the unreaped-child subprocess hang that bricked customer deployments (Coveo Antigravity, RocketMortgage node_scan).

This is intended as a starting point for review — happy to slice into smaller PRs if you'd rather merge the executor fix first and iterate.

1. Subprocess process-group kill on context cancel

internal/executor/{executor,executor_unix,executor_windows}.go

exec.CommandContext only SIGKILLs the immediate child PID on timeout. Electron --version and npm/yarn/pnpm ls fork grandchildren that inherit stdout/stderr; with those FDs still held, cmd.Wait() blocks forever even though the deadline fired. Real.Run / RunWithTimeout / RunInDir now use Setpgid + cmd.Cancel that kills the whole process group, plus cmd.WaitDelay = 2s as belt-and-braces. Unix-only; Windows stub deferred (JobObjects + CREATE_BREAKAWAY_FROM_JOB is a larger change and Windows hosts are less exposed).

Why this matters: RocketMortgage device MKT7127XJF was averaging 3.6 minutes per node project with a code-level 30s timeout. The only way that math works is RunInDir's timeout returning but cmd.Wait() blocking on pipes held by an npm grandchild.

Regression test: internal/executor/executor_unix_test.go — fake binary that backgrounds a 60s sleep; without the fix the call hangs ~60s, with it returns in ~2s. Verified passing locally.

2. Global scan deadline

internal/telemetry/scan_deadline.go, telemetry.go

`telemetry.Run` now wraps its scan body in `context.WithTimeout` (60m default). Override via `STEPSEC_MAX_SCAN_DURATION="45m"` / `"2h"`, or `"0"` to disable.

The telemetry upload uses the outer `runCtx` (no deadline) so partial scan data still ships when the deadline trips mid-scan. Heartbeat and phase-post goroutines are also bound to `runCtx` so they outlive the deadline through the upload phase — the final `telemetry_upload` snapshot still reaches the backend.

3. Per-phase deadlines

internal/telemetry/phase_deadline.go, telemetry.go

New `startPhase` / `endPhase` helpers wrap each `tracker.Start` / `tracker.Finish` pair with a per-phase context. Budgets per phase (`phaseBudgets` map): `device_info` 30s, `ide_scan` 2m, `ai_tools_scan` 5m, `node_scan` 15m, `python_scan` 10m, `telemetry_upload` 10m, default 5m. On deadline, `endPhase` logs a warning and lets the next phase start — partial scan beats no scan.

Numbers were eyeballed off the heartbeat data, not from a rigorous percentile analysis — happy to tune up/down per phase. All 10 phase sites in `telemetry.Run` were refactored to use the helper. Diff at each site is mechanical: `ctx → phaseCtx` and the Start/Finish lines bracketed.

4. Gzipped log tail in heartbeat

internal/telemetry/logcapture.go, `log_tail_emitter.go`, `phase_tracker.go`, `telemetry.go`

`LogCapture` is now backed by a 1 MB ring buffer instead of an unbounded `bytes.Buffer` — the old design would OOM a multi-hour stuck run.

New `logTailEmitter` attaches `LogTailGzipBase64` to `RunStatusInfo` snapshots: last 256 KB of stderr, gzipped (~10× smaller on the wire), throttled to once every 2 minutes so phase-boundary bursts can't spam the backend. Heartbeats, phase-boundary posts, and the final upload all pick this up transparently via `postPhase`.

Backend follow-up required (separate repo, not in this PR): the `/telemetry/run-status` handler needs to accept the new `status_info.log_tail_gzip_b64` field and surface it on the device dashboard.

What's NOT in this PR

Lowering `maxNodeProjects = 1000` — kept for a separate decision; adding a demo gif #1 makes the per-project ceiling actually enforceable, which may resolve most of the pain on its own.
The `current_phase = "null"` cosmetic bug in heartbeat data — separate small fix.
Windows JobObject-based killgroup equivalent — separate, larger piece of work.
Backend changes to ingest `log_tail_gzip_b64` — separate repo.

Test plan

`go test ./...` passes on darwin/arm64
`go vet ./...` clean
Cross-builds clean for linux + windows
New regression test in `executor_unix_test.go` proves the kill-group fix works (2s instead of ~60s hang)
New tests in `log_tail_emitter_test.go` cover throttle window, ring-buffer wraparound, nil safety
Manual smoke: run `send-telemetry` end-to-end on a real macOS host, confirm log tail appears in backend status_info
Verify on a host with Antigravity installed that scan completes within budget instead of hanging (the original failure mode)

🤖 Generated with Claude Code

Four changes that together cap how long any single scan can run, surface where it's stuck, and prevent the unreaped-child subprocess hang that manifested on customer machines (Coveo Antigravity, RocketMortgage node). 1. Subprocess process-group kill on context cancel internal/executor/executor.go, executor_unix.go, executor_windows.go exec.CommandContext only SIGKILLs the immediate child PID on timeout. Electron --version and npm/yarn/pnpm ls fork grandchildren that inherit stdout/stderr; with those FDs still held, cmd.Wait() blocks forever even though the deadline fired. Real.Run/RunWithTimeout/ RunInDir now use Setpgid + cmd.Cancel that kills the whole group, plus cmd.WaitDelay=2s as belt-and-braces. Unix-specific; Windows stub deferred (JobObjects). Regression test: executor_unix_test.go. 2. Global scan deadline internal/telemetry/scan_deadline.go, telemetry.go telemetry.Run now wraps its scan-body context in WithTimeout (60m default; override via STEPSEC_MAX_SCAN_DURATION="45m"/"2h", or "0" to disable). Telemetry upload uses the outer runCtx so partial data ships even when the scan deadline trips mid-run. Heartbeat + phase-post goroutines bound to runCtx so they outlive the deadline through the upload. 3. Per-phase deadlines internal/telemetry/phase_deadline.go, telemetry.go New startPhase/endPhase helpers wrap each tracker.Start/Finish pair with a per-phase context (budgets per phase in phaseBudgets; node_scan 15m, python_scan 10m, etc; default 5m). On deadline, endPhase warns and lets the next phase start instead of failing the whole scan. All 10 phase sites refactored. 4. Gzipped log tail in heartbeat internal/telemetry/logcapture.go, log_tail_emitter.go, phase_tracker.go, telemetry.go LogCapture is now a 1 MB ring buffer (was unbounded — memory risk on multi-hour stuck runs). New logTailEmitter attaches a gzip+base64 tail of the last 256 KB of stderr to RunStatusInfo, throttled to once every 2 minutes so phase-boundary bursts can't spam the backend. Heartbeats and final upload pick it up transparently via postPhase. Test coverage: ring-buffer wraparound, throttle window, nil-safety, subprocess-kill-on-deadline. All packages: go test ./... clean on darwin/arm64, cross-builds clean for linux + windows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ashishkurmi closed this May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(telemetry): bound scans with deadlines, ship log tail in heartbeat#122

feat(telemetry): bound scans with deadlines, ship log tail in heartbeat#122
ashishkurmi wants to merge 1 commit into
mainfrom
akurmi/feat/scan-deadline-and-heartbeat-logs

ashishkurmi commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ashishkurmi commented May 28, 2026

Summary

1. Subprocess process-group kill on context cancel

2. Global scan deadline

3. Per-phase deadlines

4. Gzipped log tail in heartbeat

What's NOT in this PR

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant