fix(builder): add gRPC keepalive, stream timeout, and reconnection#1675

Open
angerman wants to merge 11 commits into NixOS:master from angerman:fix/builder-grpc-keepalive

Conversation

@angerman
Contributor

Summary

The gRPC bidirectional stream between hydra-builder and the queue runner silently dies under certain conditions. When this happens:

  • The builder process keeps running but receives no new work
  • The queue runner shows the builder's job slots as occupied indefinitely (`sinceLastPing` growing to 50,000+ seconds)
  • The only fix is a manual builder restart

Root cause

  1. No HTTP/2 keep-alive on the tonic channel — dead TCP connections go undetected
  2. No timeout on `stream.next().await` — the builder blocks forever waiting for a message that will never arrive
  3. No reconnection — on stream error the builder exits instead of reconnecting

Changes

  1. HTTP/2 keep-alive (10s interval, 20s timeout, active while idle) on all `Channel::builder` paths — detects dead TCP connections at the transport layer
  2. 90-second receive timeout on the stream read loop — the queue runner sends pings every 30s, so 90s without any message reliably indicates a dead connection
  3. Reconnection loop in main.rs — transient stream errors now trigger a 5s backoff + reconnect instead of a process exit. Only an API version mismatch is fatal. A minimal sketch of all three changes follows.
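
Roughly, with hypothetical helpers (`open_stream`, `handle_message`, `is_fatal`) and a placeholder queue-runner URI standing in for the real connection setup:

```rust
use std::time::Duration;
use futures::StreamExt;
use tonic::transport::Channel;

// (1) HTTP/2 keep-alive on the tonic channel; values from this PR.
let channel = Channel::builder("http://queue-runner:50051".parse()?)
    .http2_keep_alive_interval(Duration::from_secs(10))
    .keep_alive_timeout(Duration::from_secs(20))
    .keep_alive_while_idle(true)
    .connect()
    .await?;

// (3) Reconnection loop: transient failures back off 5s and retry.
loop {
    let mut stream = open_stream(&channel).await?; // hypothetical helper
    loop {
        // (2) 90s receive timeout: the queue runner pings every 30s,
        // so 90s of silence means the connection is dead.
        match tokio::time::timeout(Duration::from_secs(90), stream.next()).await {
            Ok(Some(Ok(msg))) => handle_message(msg).await?, // hypothetical
            Ok(Some(Err(status))) if is_fatal(&status) => return Err(status.into()),
            // stream error, clean close, or timeout: treat as transient
            _ => break,
        }
    }
    tokio::time::sleep(Duration::from_secs(5)).await;
}
```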

Testing

Deployed to 10 darwin builders connected to a queue runner via IPv6. Previously these builders would go silent for 10-14 hours before manual intervention. With this patch, dead connections should be detected within ~90s and the builder should auto-reconnect within ~95s.

Fixes #1674

angerman force-pushed the fix/builder-grpc-keepalive branch 2 times, most recently from e37e758 to 4efb4cd on April 20, 2026 17:45
angerman and others added 11 commits April 24, 2026 08:14
The gRPC bidirectional stream between the builder and queue runner
silently dies under certain conditions (network interruptions during
long input-fetching phases, idle TCP timeouts).  When this happens
the builder process keeps running but stops receiving work, and the
queue runner never reclaims the occupied job slots.

Three fixes:

1. Enable HTTP/2 keep-alive on the tonic channel (10s interval, 20s
   timeout).  This sends HTTP/2 PING frames at the transport layer,
   detecting dead TCP connections that application-level pings miss.

2. Add a 90-second receive timeout on the stream read loop.  The
   queue runner sends pings every 30s, so 90s without any message
   is a reliable signal that the connection is dead.

3. Add a reconnection loop in main.  On transient stream errors the
   builder now waits 5s and reconnects instead of exiting.  Only
   fatal errors (API version mismatch) still cause an exit.
   GC roots are NOT cleared on transient reconnects to protect
   outputs being uploaded by background tasks.

Fixes NixOS#1674
Two features that improve builder throughput and resilience:

## Per-phase timeouts

Add --import-timeout (default 3600s) and --upload-timeout (default
1800s) CLI options.  The import and upload phases are now wrapped in
tokio::time::timeout(), failing with ImportFailure / UploadFailure
(both retryable) when exceeded.  Set to 0 to disable.  Previously,
these phases had no timeout and could hang indefinitely.
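
A sketch of the wrapping, assuming an anyhow-style error type; the helper name is hypothetical, and 0 disables the limit as described:

```rust
use std::future::Future;
use std::time::Duration;

// Hypothetical helper: wrap a phase future in an optional timeout.
async fn with_timeout<T, F>(secs: u64, phase: &str, fut: F) -> anyhow::Result<T>
where
    F: Future<Output = anyhow::Result<T>>,
{
    if secs == 0 {
        return fut.await; // 0 disables the timeout, per the CLI semantics
    }
    match tokio::time::timeout(Duration::from_secs(secs), fut).await {
        Ok(res) => res,
        Err(_) => anyhow::bail!("{phase} phase exceeded {secs}s timeout"),
    }
}

// Usage (names hypothetical):
// with_timeout(args.import_timeout, "import", import_inputs(&job)).await?;
```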

## Build slot pipelining

After `nix build` succeeds, the builder immediately frees its build
slot and signals the queue runner via a new SlotFreedMessage on the
gRPC tunnel stream.  The output upload continues in a background
task.  This allows the builder to accept new work while uploading,
significantly improving utilisation for builds with large outputs.

Protocol change: BuilderRequest gains a `slot_freed` oneof variant
(field 3).  The queue runner moves the job from `jobs` to
`freed_jobs` on receipt, decrements the slot counter, and triggers
dispatch.  CompleteBuild/fail_step fall back to `freed_jobs` when
the job is no longer in `jobs`.

Safeguards:
- GC roots survive into the background upload task via BuildPhaseResult
- Upload concurrency bounded by a semaphore (2× max_jobs)
- remove_machine drains freed_jobs and fails orphaned builds
- Retry strategy bounded to 20 attempts (~10 min)
- take_freed_job increments per-machine nr_steps_done for correct stats
- jobs_in_last_30s uses saturating subtract to prevent counter underflow
- Proto change is backwards-compatible (old builders work without SlotFreed)
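
A rough sketch of the hand-off; the generated message types, `tx` (gRPC tunnel sender), and `upload_semaphore` are loose stand-ins for the actual code, not verbatim:

```rust
// Build finished: the slot is freed before the upload starts.
let outputs = run_nix_build(&job).await?; // hypothetical

// Tell the queue runner the slot is free (new oneof variant, field 3).
tx.send(BuilderRequest {
    message: Some(builder_request::Message::SlotFreed(SlotFreedMessage {})),
})
.await?;

// Upload continues in the background, bounded to 2 × max_jobs permits.
let permit = upload_semaphore.clone().acquire_owned().await?;
tokio::spawn(async move {
    let _permit = permit; // held for the duration of the upload
    if let Err(e) = upload_outputs(outputs).await {
        tracing::warn!("background upload failed: {e}");
    }
});
```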
…etch factor

Add a --prefetch-factor CLI option (default 2) that lets builders
accept more builds than max_jobs.  Extra builds import their inputs
immediately but wait on a build semaphore before running `nix build`.
This keeps builders saturated: while N builds run, up to N more can
be importing inputs in parallel.  When a build finishes, the next one
starts instantly with inputs already local.

Without prefetch:  [import][build][import][build]...
With prefetch (2x): [import][build      ][build      ]...
                            [import     ] ← overlapped

Protocol: JoinMessage gains a `prefetch_slots` field (proto field 21).
The queue runner adds this to max_jobs for capacity checks.  Old
runners ignore the unknown field and dispatch up to max_jobs only.

Builder: A tokio::sync::Semaphore with max_jobs permits gates the
actual `nix build` phase.  Builds report WaitingForLocalSlot status
while waiting for the semaphore, then Building once acquired.
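
A sketch of the gating, with hypothetical import/status helpers around the semaphore described above:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// max_jobs permits gate the `nix build` phase; imports run before a
// permit is acquired, so prefetched builds overlap their imports.
let build_slots = Arc::new(Semaphore::new(max_jobs));

import_inputs(&job).await?; // hypothetical: runs without a permit
report_status(&job, "WaitingForLocalSlot").await?; // hypothetical
let _permit = build_slots.acquire().await?;
report_status(&job, "Building").await?;
run_nix_build(&job).await?;
```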
Log import progress so operators can see what's happening during the
"Sending inputs" phase instead of just a duration:

- Total closure size at start (N sources + M derivations to fetch)
- Progress per chunk (imported X/N sources, Y/M drvs)
- Remaining requisites count after output-inclusive refetch
- Slow substitution warnings (>5s per path)
- Streaming batch start/finish messages
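
For illustration, the slow-substitution warning from the list above might look like this (threshold from the list; `ensure_path` assumed to be the async daemon call):

```rust
use std::time::{Duration, Instant};

let started = Instant::now();
ensure_path(&path).await?; // hypothetical call into nix's daemon
if started.elapsed() > Duration::from_secs(5) {
    // Warn about paths that substituted slowly (>5s).
    tracing::warn!("slow substitution: {path} took {:?}", started.elapsed());
}
```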
Store individual phase durations (import_time_ms, build_time_ms,
upload_time_ms) in BuildSteps alongside the existing overhead column.
This enables the Perl frontend to show a phase breakdown for
completed builds:

  Import: 18m 22s | Build: 2m 15s | Upload: 1m 45s

DB changes:
- upgrade-86.sql: ALTER TABLE BuildSteps ADD COLUMN for 3 new fields
- hydra.sql: updated CREATE TABLE + fixed stale busy column comments

Proto changes:
- StepUpdate gains progress_current/progress_total fields (proto3
  default 0 — backward compatible with old builders)

Queue runner:
- finish_build_step writes the 3 timing columns from BuildTimings
- overhead continues to be written for backward compatibility

Builder:
- Sends StepUpdate with progress_current/progress_total during
  import_requisites, after each chunk of paths is imported

Perl frontend:
- build.tt shows phase breakdown under the duration for completed
  steps that have the new timing data
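
The builder-side progress report might look roughly like this; the message construction is assumed from the proto description, and proto3 zero defaults keep old runners compatible:

```rust
// After each imported chunk: report progress_current/progress_total.
tx.send(BuilderRequest {
    message: Some(builder_request::Message::Update(StepUpdate {
        progress_current: imported_so_far as u64,
        progress_total: total_paths as u64,
        ..Default::default()
    })),
})
.await?;
```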
substitute_paths now uses buffer_unordered(10) instead of a
sequential for loop.  With 586 paths in a closure, this allows up
to 10 concurrent ensure_path calls to nix's daemon.  Nix's flock
on store paths prevents redundant downloads — concurrent calls for
the same path serialize at the lock, not duplicate work.

Also fixes the silent error discard: substitution failures are now
logged at INFO level with counts ("234/586 succeeded, 352 failed")
and individual failures at DEBUG level, so operators can see whether
peernix/substituters are working.
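
A sketch of the concurrent substitution plus the logging, with `ensure_path` assumed to be the async daemon call (a later commit in this PR drops the concurrency from 10 to 3):

```rust
use futures::{stream, StreamExt};

// Up to 10 ensure_path calls in flight; nix's per-path flock makes
// concurrent requests for the same path serialize, not duplicate work.
let results: Vec<anyhow::Result<()>> = stream::iter(paths)
    .map(|path| async move {
        ensure_path(&path).await.map_err(|e| {
            tracing::debug!("substitution failed for {path}: {e}");
            e
        })
    })
    .buffer_unordered(10)
    .collect()
    .await;

let ok = results.iter().filter(|r| r.is_ok()).count();
tracing::info!("{ok}/{} substitutions succeeded, {} failed",
    results.len(), results.len() - ok);
```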
…ild slot'

This status (busy=35, WaitingForLocalSlot) now means the builder
has finished importing inputs and is waiting for a build semaphore
permit.  The old label was misleading.
buffer_unordered(10) × 8 concurrent builds = 80+ spawn_blocking
threads, approaching tokio's pool limit and starving the gRPC ping
generator.  Reduce to 3: still 3x faster than sequential, but safe
(8×3=24 blocking threads).
`buildproducts.path` is a full filesystem path, not a bare store path:
Hydra's `$out/nix-support/hydra-build-products` lets a product point at
a sub-path of an output (e.g. `doc manual $doc/share/doc/nix/manual
index.html`), so the column legitimately contains a trailing `/...`
after the `<hash>-<name>`.

`OwnedBuildProduct::parse_paths` fed the whole string to
`StoreDir::parse`, which died with `invalid store path / symbol at 50`
on any such value. That wedged every cached-build fast path touching a
product with a subpath — notably the `nix-manual-2.31.4/share/doc/nix/manual`
products from `roots.nix` builds, leaving ~200 builds per eval stuck
as 'unfinished' even though every output was already in the store.

Even had the parse succeeded, `BuildProduct::from_db` was hard-coding
`relative_path: "".into()`, so the sibling constructors
(`from_grpc`, `from_shared`) and `from_db` disagreed on what the field
meant.

Fix by:
- dropping the `StorePath` generic + `parse_paths` from
  `OwnedBuildProduct` (path stays as `Option<String>` — it's opaque
  to the db crate), and
- routing `BuildProduct::from_db` through the same
  `RelativeStorePath::from_path` splitter `from_grpc`/`from_shared`
  already use. The db-side write path calls `RelativeStorePath::print`,
  so `from_path` is now its exact inverse.

Adds unit tests covering the subpath product, a bare store path, and
the print/from_path round-trip.
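
A sketch of the splitter's contract (the real `RelativeStorePath::from_path` signature is assumed; only the split/round-trip behaviour is taken from the message above):

```rust
// Split a full product path into the store path and the trailing
// relative part; `print` re-joins them, so the two round-trip.
fn from_path(full: &str) -> (&str, &str) {
    let rest = &full["/nix/store/".len()..];
    match rest.find('/') {
        Some(i) => {
            let cut = "/nix/store/".len() + i;
            (&full[..cut], &full[cut + 1..])
        }
        None => (full, ""), // bare store path: no relative part
    }
}

assert_eq!(
    from_path("/nix/store/abc-nix-manual-2.31.4/share/doc/nix/manual"),
    ("/nix/store/abc-nix-manual-2.31.4", "share/doc/nix/manual")
);
```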
Address CI failures and review feedback:

- Add #[allow(clippy::unwrap_used, clippy::expect_used)] to the test
  module — the crate-level deny triggers on test helpers using unwrap().

- Replace collect::<anyhow::Result<_>>()? with filter_map + warn log
  for build product parsing.  A single corrupt DB row (e.g. legacy
  product with a non-store path) should not abort the entire cached
  build — log and skip instead, consistent with how malformed build
  outputs are already skipped via `let Ok(mut res) ... else { continue }`
  on the line above.

- Fix import ordering (rustfmt).
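
The skip-and-log pattern described above, sketched with assumed names:

```rust
// One corrupt DB row no longer aborts the cached build: log and skip.
let products: Vec<BuildProduct> = rows
    .into_iter()
    .filter_map(|row| match BuildProduct::from_db(row) {
        Ok(p) => Some(p),
        Err(e) => {
            tracing::warn!("skipping malformed build product: {e}");
            None
        }
    })
    .collect();
```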
The existing tests covered RelativeStorePath::from_path (which was
already working), but not the from_db path where the actual bug lived.

Add tests that exercise from_db directly:
- subpath product (/nix/store/hash-name/share/doc/nix/manual)
- bare store path (/nix/store/hash-name)
- None path (NULL in database)
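
One of the new cases might look like this (the row helper and field names are hypothetical; the key assertion is the split):

```rust
#[test]
fn from_db_handles_subpath_product() {
    let p = BuildProduct::from_db(row_with_path(
        "/nix/store/abc123-nix-manual-2.31.4/share/doc/nix/manual",
    ))
    .unwrap();
    assert_eq!(p.relative_path, "share/doc/nix/manual");
}
```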
angerman force-pushed the fix/builder-grpc-keepalive branch from 65015b1 to 4c4d6fe on April 24, 2026 06:17