fix(builder): add gRPC keepalive, stream timeout, and reconnection #1675

Open — angerman wants to merge 11 commits into NixOS:master
Conversation
Force-pushed from e37e758 to 4efb4cd
The gRPC bidirectional stream between the builder and queue runner silently dies under certain conditions (network interruptions during long input-fetching phases, idle TCP timeouts). When this happens the builder process keeps running but stops receiving work, and the queue runner never reclaims the occupied job slots.

Three fixes:

1. Enable HTTP/2 keep-alive on the tonic channel (10s interval, 20s timeout). This sends HTTP/2 PING frames at the transport layer, detecting dead TCP connections that application-level pings miss.
2. Add a 90-second receive timeout on the stream read loop. The queue runner sends pings every 30s, so 90s without any message is a reliable signal that the connection is dead.
3. Add a reconnection loop in main. On transient stream errors the builder now waits 5s and reconnects instead of exiting. Only fatal errors (API version mismatch) still cause an exit. GC roots are NOT cleared on transient reconnects, to protect outputs being uploaded by background tasks.

Fixes NixOS#1674
Two features that improve builder throughput and resilience:

## Per-phase timeouts

Add --import-timeout (default 3600s) and --upload-timeout (default 1800s) CLI options. The import and upload phases are now wrapped in tokio::time::timeout(), failing with ImportFailure / UploadFailure (both retryable) when exceeded. Set to 0 to disable. Previously, these phases had no timeout and could hang indefinitely.

## Build slot pipelining

After `nix build` succeeds, the builder immediately frees its build slot and signals the queue runner via a new SlotFreedMessage on the gRPC tunnel stream. The output upload continues in a background task. This allows the builder to accept new work while uploading, significantly improving utilisation for builds with large outputs.

Protocol change: BuilderRequest gains a `slot_freed` oneof variant (field 3). The queue runner moves the job from `jobs` to `freed_jobs` on receipt, decrements the slot counter, and triggers dispatch. CompleteBuild/fail_step fall back to `freed_jobs` when the job is no longer in `jobs`.

Safeguards:

- GC roots survive into the background upload task via BuildPhaseResult
- Upload concurrency bounded by a semaphore (2× max_jobs)
- remove_machine drains freed_jobs and fails orphaned builds
- Retry strategy bounded to 20 attempts (~10 min)
- take_freed_job increments per-machine nr_steps_done for correct stats
- jobs_in_last_30s uses saturating subtract to prevent counter underflow
- Proto change is backwards-compatible (old builders work without SlotFreed)
…etch factor
Add a --prefetch-factor CLI option (default 2) that lets builders
accept more builds than max_jobs. Extra builds import their inputs
immediately but wait on a build semaphore before running `nix build`.
This keeps builders saturated: while N builds run, up to N more can
be importing inputs in parallel. When a build finishes, the next one
starts instantly with inputs already local.
```
Without prefetch:    [import][build][import][build]...
With prefetch (2x):  [import][build  ][build  ]...
                             [import ]   <- overlapped
```
Protocol: JoinMessage gains a `prefetch_slots` field (proto field 21).
The queue runner adds this to max_jobs for capacity checks. Old
runners ignore the unknown field and dispatch up to max_jobs only.
Builder: A tokio::sync::Semaphore with max_jobs permits gates the
actual `nix build` phase. Builds report WaitingForLocalSlot status
while waiting for the semaphore, then Building once acquired.
Log import progress so operators can see what's happening during the "Sending inputs" phase instead of just a duration:

- Total closure size at start (N sources + M derivations to fetch)
- Progress per chunk (imported X/N sources, Y/M drvs)
- Remaining requisites count after output-inclusive refetch
- Slow substitution warnings (>5s per path)
- Streaming batch start/finish messages
Store individual phase durations (import_time_ms, build_time_ms, upload_time_ms) in BuildSteps alongside the existing overhead column. This enables the Perl frontend to show a phase breakdown for completed builds:

    Import: 18m 22s | Build: 2m 15s | Upload: 1m 45s

DB changes:
- upgrade-86.sql: ALTER TABLE BuildSteps ADD COLUMN for 3 new fields
- hydra.sql: updated CREATE TABLE + fixed stale busy column comments

Proto changes:
- StepUpdate gains progress_current/progress_total fields (proto3 default 0, backward compatible with old builders)

Queue runner:
- finish_build_step writes the 3 timing columns from BuildTimings
- overhead continues to be written for backward compatibility

Builder:
- Sends StepUpdate with progress_current/progress_total during import_requisites, after each chunk of paths is imported

Perl frontend:
- build.tt shows phase breakdown under the duration for completed steps that have the new timing data
substitute_paths now uses buffer_unordered(10) instead of a
sequential for loop. With 586 paths in a closure, this allows up
to 10 concurrent ensure_path calls to nix's daemon. Nix's flock
on store paths prevents redundant downloads — concurrent calls for
the same path serialize at the lock, not duplicate work.
Also fixes the silent error discard: substitution failures are now
logged at INFO level with counts ("234/586 succeeded, 352 failed")
and individual failures at DEBUG level, so operators can see whether
peernix/substituters are working.
…ild slot'

This status (busy=35, WaitingForLocalSlot) now means the builder has finished importing inputs and is waiting for a build semaphore permit. The old label was misleading.
buffer_unordered(10) × 8 concurrent builds = 80+ spawn_blocking threads, approaching tokio's pool limit and starving the gRPC ping generator. Reduce to 3: still 3x faster than sequential, but safe (8×3=24 blocking threads).
`buildproducts.path` is a full filesystem path, not a bare store path: Hydra's `$out/nix-support/hydra-build-products` lets a product point at a sub-path of an output (e.g. `doc manual $doc/share/doc/nix/manual index.html`), so the column legitimately contains a trailing `/...` after the `<hash>-<name>`.

`OwnedBuildProduct::parse_paths` fed the whole string to `StoreDir::parse`, which died with `invalid store path / symbol at 50` on any such value. That wedged every cached-build fast path touching a product with a subpath, notably the `nix-manual-2.31.4/share/doc/nix/manual` products from `roots.nix` builds, leaving ~200 builds per eval stuck as 'unfinished' even though every output was already in the store. Even had the parse succeeded, `BuildProduct::from_db` was hard-coding `relative_path: "".into()`, so the sibling constructors (`from_grpc`, `from_shared`) and `from_db` disagreed on what the field meant.

Fix by:
- dropping the `StorePath` generic + `parse_paths` from `OwnedBuildProduct` (path stays as `Option<String>`; it's opaque to the db crate), and
- routing `BuildProduct::from_db` through the same `RelativeStorePath::from_path` splitter `from_grpc`/`from_shared` already use.

The db-side write path calls `RelativeStorePath::print`, so `from_path` is now its exact inverse. Adds unit tests covering the subpath product, a bare store path, and the print/from_path round-trip.
Address CI failures and review feedback:
- Add #[allow(clippy::unwrap_used, clippy::expect_used)] to the test
module — the crate-level deny triggers on test helpers using unwrap().
- Replace collect::<anyhow::Result<_>>()? with filter_map + warn log
for build product parsing. A single corrupt DB row (e.g. legacy
product with a non-store path) should not abort the entire cached
build — log and skip instead, consistent with how malformed build
outputs are already skipped via `let Ok(mut res) ... else { continue }`
on the line above.
- Fix import ordering (rustfmt).
The existing tests covered RelativeStorePath::from_path (which was already working), but not the from_db path where the actual bug lived. Add tests that exercise from_db directly:

- subpath product (/nix/store/hash-name/share/doc/nix/manual)
- bare store path (/nix/store/hash-name)
- None path (NULL in database)
Force-pushed from 65015b1 to 4c4d6fe
## Summary

The gRPC bidirectional stream between `hydra-builder` and the queue runner silently dies under certain conditions. When this happens, the builder process keeps running but stops receiving work, while the queue runner's view of the connection goes stale (`sinceLastPing` growing to 50,000+ seconds).

## Root cause

The builder blocks forever on `stream.next().await`, waiting for a message that will never arrive.

## Changes

- `Channel::builder` paths: HTTP/2 keep-alive detects dead TCP connections at the transport layer.
- A 90-second receive timeout on the stream read loop.
- `main.rs`: transient stream errors now trigger a 5s backoff + reconnect instead of process exit. Only an API version mismatch is fatal.

## Testing

Deployed to 10 darwin builders connected to a queue runner via IPv6. Previously these builders would go silent for 10-14 hours before manual intervention. With this patch, dead connections should be detected within ~90s and the builder should auto-reconnect within ~95s.

Fixes #1674