feat(telemetry): bound scans with deadlines, ship log tail in heartbeat#122
Closed
ashishkurmi wants to merge 1 commit into
Closed
feat(telemetry): bound scans with deadlines, ship log tail in heartbeat#122ashishkurmi wants to merge 1 commit into
ashishkurmi wants to merge 1 commit into
Conversation
Four changes that together cap how long any single scan can run, surface where it's stuck, and prevent the unreaped-child subprocess hang that manifested on customer machines (Coveo Antigravity, RocketMortgage node). 1. Subprocess process-group kill on context cancel internal/executor/executor.go, executor_unix.go, executor_windows.go exec.CommandContext only SIGKILLs the immediate child PID on timeout. Electron --version and npm/yarn/pnpm ls fork grandchildren that inherit stdout/stderr; with those FDs still held, cmd.Wait() blocks forever even though the deadline fired. Real.Run/RunWithTimeout/ RunInDir now use Setpgid + cmd.Cancel that kills the whole group, plus cmd.WaitDelay=2s as belt-and-braces. Unix-specific; Windows stub deferred (JobObjects). Regression test: executor_unix_test.go. 2. Global scan deadline internal/telemetry/scan_deadline.go, telemetry.go telemetry.Run now wraps its scan-body context in WithTimeout (60m default; override via STEPSEC_MAX_SCAN_DURATION="45m"/"2h", or "0" to disable). Telemetry upload uses the outer runCtx so partial data ships even when the scan deadline trips mid-run. Heartbeat + phase-post goroutines bound to runCtx so they outlive the deadline through the upload. 3. Per-phase deadlines internal/telemetry/phase_deadline.go, telemetry.go New startPhase/endPhase helpers wrap each tracker.Start/Finish pair with a per-phase context (budgets per phase in phaseBudgets; node_scan 15m, python_scan 10m, etc; default 5m). On deadline, endPhase warns and lets the next phase start instead of failing the whole scan. All 10 phase sites refactored. 4. Gzipped log tail in heartbeat internal/telemetry/logcapture.go, log_tail_emitter.go, phase_tracker.go, telemetry.go LogCapture is now a 1 MB ring buffer (was unbounded — memory risk on multi-hour stuck runs). New logTailEmitter attaches a gzip+base64 tail of the last 256 KB of stderr to RunStatusInfo, throttled to once every 2 minutes so phase-boundary bursts can't spam the backend. Heartbeats and final upload pick it up transparently via postPhase. Test coverage: ring-buffer wraparound, throttle window, nil-safety, subprocess-kill-on-deadline. All packages: go test ./... clean on darwin/arm64, cross-builds clean for linux + windows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Four coordinated changes that cap how long any single scan can run, surface where it's stuck, and prevent the unreaped-child subprocess hang that bricked customer deployments (Coveo Antigravity, RocketMortgage node_scan).
This is intended as a starting point for review — happy to slice into smaller PRs if you'd rather merge the executor fix first and iterate.
1. Subprocess process-group kill on context cancel
internal/executor/{executor,executor_unix,executor_windows}.goexec.CommandContextonly SIGKILLs the immediate child PID on timeout. Electron--versionandnpm/yarn/pnpm lsfork grandchildren that inherit stdout/stderr; with those FDs still held,cmd.Wait()blocks forever even though the deadline fired.Real.Run/RunWithTimeout/RunInDirnow useSetpgid+cmd.Cancelthat kills the whole process group, pluscmd.WaitDelay = 2sas belt-and-braces. Unix-only; Windows stub deferred (JobObjects + CREATE_BREAKAWAY_FROM_JOB is a larger change and Windows hosts are less exposed).Why this matters: RocketMortgage device
MKT7127XJFwas averaging 3.6 minutes per node project with a code-level 30s timeout. The only way that math works isRunInDir's timeout returning butcmd.Wait()blocking on pipes held by an npm grandchild.Regression test:
internal/executor/executor_unix_test.go— fake binary that backgrounds a 60s sleep; without the fix the call hangs ~60s, with it returns in ~2s. Verified passing locally.2. Global scan deadline
internal/telemetry/scan_deadline.go,telemetry.go`telemetry.Run` now wraps its scan body in `context.WithTimeout` (60m default). Override via `STEPSEC_MAX_SCAN_DURATION="45m"` / `"2h"`, or `"0"` to disable.
The telemetry upload uses the outer `runCtx` (no deadline) so partial scan data still ships when the deadline trips mid-scan. Heartbeat and phase-post goroutines are also bound to `runCtx` so they outlive the deadline through the upload phase — the final `telemetry_upload` snapshot still reaches the backend.
3. Per-phase deadlines
internal/telemetry/phase_deadline.go,telemetry.goNew `startPhase` / `endPhase` helpers wrap each `tracker.Start` / `tracker.Finish` pair with a per-phase context. Budgets per phase (`phaseBudgets` map): `device_info` 30s, `ide_scan` 2m, `ai_tools_scan` 5m, `node_scan` 15m, `python_scan` 10m, `telemetry_upload` 10m, default 5m. On deadline, `endPhase` logs a warning and lets the next phase start — partial scan beats no scan.
Numbers were eyeballed off the heartbeat data, not from a rigorous percentile analysis — happy to tune up/down per phase. All 10 phase sites in `telemetry.Run` were refactored to use the helper. Diff at each site is mechanical: `ctx → phaseCtx` and the Start/Finish lines bracketed.
4. Gzipped log tail in heartbeat
internal/telemetry/logcapture.go, `log_tail_emitter.go`, `phase_tracker.go`, `telemetry.go``LogCapture` is now backed by a 1 MB ring buffer instead of an unbounded `bytes.Buffer` — the old design would OOM a multi-hour stuck run.
New `logTailEmitter` attaches `LogTailGzipBase64` to `RunStatusInfo` snapshots: last 256 KB of stderr, gzipped (~10× smaller on the wire), throttled to once every 2 minutes so phase-boundary bursts can't spam the backend. Heartbeats, phase-boundary posts, and the final upload all pick this up transparently via `postPhase`.
Backend follow-up required (separate repo, not in this PR): the `/telemetry/run-status` handler needs to accept the new `status_info.log_tail_gzip_b64` field and surface it on the device dashboard.
What's NOT in this PR
Test plan
🤖 Generated with Claude Code