fix(builder): add gRPC keepalive, stream timeout, and reconnection#1675

Open
angerman wants to merge 11 commits into NixOS:master from angerman:fix/builder-grpc-keepalive

Conversation

@angerman
Contributor

Summary

The gRPC bidirectional stream between hydra-builder and the queue runner silently dies under certain conditions. When this happens:

  • The builder process keeps running but receives no new work
  • The queue runner shows the builder's job slots as occupied indefinitely (`sinceLastPing` growing to 50,000+ seconds)
  • The only fix is a manual builder restart

Root cause

  1. No HTTP/2 keep-alive on the tonic channel — dead TCP connections go undetected
  2. No timeout on `stream.next().await` — the builder blocks forever waiting for a message that will never arrive
  3. No reconnection — on stream error the builder exits instead of reconnecting

Changes

  1. HTTP/2 keep-alive (10s interval, 20s timeout, active while idle) on all `Channel::builder` paths — detects dead TCP connections at the transport layer
  2. 90-second receive timeout on the stream read loop — the queue runner sends pings every 30s, so 90s without any message reliably indicates a dead connection
  3. Reconnection loop in main.rs — transient stream errors now trigger a 5s backoff + reconnect instead of a process exit. Only an API version mismatch is fatal. A minimal sketch of all three changes follows.
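
Roughly, with hypothetical helpers (`open_stream`, `handle_message`, `is_fatal`) and a placeholder queue-runner URI standing in for the real connection setup:

```rust
use std::time::Duration;
use futures::StreamExt;
use tonic::transport::Channel;

// (1) HTTP/2 keep-alive on the tonic channel; values from this PR.
let channel = Channel::builder("http://queue-runner:50051".parse()?)
    .http2_keep_alive_interval(Duration::from_secs(10))
    .keep_alive_timeout(Duration::from_secs(20))
    .keep_alive_while_idle(true)
    .connect()
    .await?;

// (3) Reconnection loop: transient failures back off 5s and retry.
loop {
    let mut stream = open_stream(&channel).await?; // hypothetical helper
    loop {
        // (2) 90s receive timeout: the queue runner pings every 30s,
        // so 90s of silence means the connection is dead.
        match tokio::time::timeout(Duration::from_secs(90), stream.next()).await {
            Ok(Some(Ok(msg))) => handle_message(msg).await?, // hypothetical
            Ok(Some(Err(status))) if is_fatal(&status) => return Err(status.into()),
            // stream error, clean close, or timeout: treat as transient
            _ => break,
        }
    }
    tokio::time::sleep(Duration::from_secs(5)).await;
}
```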

Testing

Deployed to 10 darwin builders connected to a queue runner via IPv6. Previously these builders would go silent for 10-14 hours before manual intervention. With this patch, dead connections should be detected within ~90s and the builder should auto-reconnect within ~95s.

Fixes #1674

angerman force-pushed the fix/builder-grpc-keepalive branch 2 times, most recently from e37e758 to 4efb4cd on April 20, 2026 17:45
angerman and others added 11 commits April 24, 2026 08:14
The gRPC bidirectional stream between the builder and queue runner
silently dies under certain conditions (network interruptions during
long input-fetching phases, idle TCP timeouts).  When this happens
the builder process keeps running but stops receiving work, and the
queue runner never reclaims the occupied job slots.

Three fixes:

1. Enable HTTP/2 keep-alive on the tonic channel (10s interval, 20s
   timeout).  This sends HTTP/2 PING frames at the transport layer,
   detecting dead TCP connections that application-level pings miss.

2. Add a 90-second receive timeout on the stream read loop.  The
   queue runner sends pings every 30s, so 90s without any message
   is a reliable signal that the connection is dead.

3. Add a reconnection loop in main.  On transient stream errors the
   builder now waits 5s and reconnects instead of exiting.  Only
   fatal errors (API version mismatch) still cause an exit.
   GC roots are NOT cleared on transient reconnects to protect
   outputs being uploaded by background tasks.

Fixes NixOS#1674
Two features that improve builder throughput and resilience:

## Per-phase timeouts

Add --import-timeout (default 3600s) and --upload-timeout (default
1800s) CLI options.  The import and upload phases are now wrapped in
tokio::time::timeout(), failing with ImportFailure / UploadFailure
(both retryable) when exceeded.  Set to 0 to disable.  Previously,
these phases had no timeout and could hang indefinitely.
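
A sketch of the wrapping, assuming an anyhow-style error type; the helper name is hypothetical, and 0 disables the limit as described:

```rust
use std::future::Future;
use std::time::Duration;

// Hypothetical helper: wrap a phase future in an optional timeout.
async fn with_timeout<T, F>(secs: u64, phase: &str, fut: F) -> anyhow::Result<T>
where
    F: Future<Output = anyhow::Result<T>>,
{
    if secs == 0 {
        return fut.await; // 0 disables the timeout, per the CLI semantics
    }
    match tokio::time::timeout(Duration::from_secs(secs), fut).await {
        Ok(res) => res,
        Err(_) => anyhow::bail!("{phase} phase exceeded {secs}s timeout"),
    }
}

// Usage (names hypothetical):
// with_timeout(args.import_timeout, "import", import_inputs(&job)).await?;
```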

## Build slot pipelining

After `nix build` succeeds, the builder immediately frees its build
slot and signals the queue runner via a new SlotFreedMessage on the
gRPC tunnel stream.  The output upload continues in a background
task.  This allows the builder to accept new work while uploading,
significantly improving utilisation for builds with large outputs.

Protocol change: BuilderRequest gains a `slot_freed` oneof variant
(field 3).  The queue runner moves the job from `jobs` to
`freed_jobs` on receipt, decrements the slot counter, and triggers
dispatch.  CompleteBuild/fail_step fall back to `freed_jobs` when
the job is no longer in `jobs`.

Safeguards:
- GC roots survive into the background upload task via BuildPhaseResult
- Upload concurrency bounded by a semaphore (2× max_jobs)
- remove_machine drains freed_jobs and fails orphaned builds
- Retry strategy bounded to 20 attempts (~10 min)
- take_freed_job increments per-machine nr_steps_done for correct stats
- jobs_in_last_30s uses saturating subtract to prevent counter underflow
- Proto change is backwards-compatible (old builders work without SlotFreed)
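
A rough sketch of the hand-off; the generated message types, `tx` (gRPC tunnel sender), and `upload_semaphore` are loose stand-ins for the actual code, not verbatim:

```rust
// Build finished: the slot is freed before the upload starts.
let outputs = run_nix_build(&job).await?; // hypothetical

// Tell the queue runner the slot is free (new oneof variant, field 3).
tx.send(BuilderRequest {
    message: Some(builder_request::Message::SlotFreed(SlotFreedMessage {})),
})
.await?;

// Upload continues in the background, bounded to 2 × max_jobs permits.
let permit = upload_semaphore.clone().acquire_owned().await?;
tokio::spawn(async move {
    let _permit = permit; // held for the duration of the upload
    if let Err(e) = upload_outputs(outputs).await {
        tracing::warn!("background upload failed: {e}");
    }
});
```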
…etch factor

Add a --prefetch-factor CLI option (default 2) that lets builders
accept more builds than max_jobs.  Extra builds import their inputs
immediately but wait on a build semaphore before running `nix build`.
This keeps builders saturated: while N builds run, up to N more can
be importing inputs in parallel.  When a build finishes, the next one
starts instantly with inputs already local.

Without prefetch:  [import][build][import][build]...
With prefetch (2x): [import][build      ][build      ]...
                            [import     ] ← overlapped

Protocol: JoinMessage gains a `prefetch_slots` field (proto field 21).
The queue runner adds this to max_jobs for capacity checks.  Old
runners ignore the unknown field and dispatch up to max_jobs only.

Builder: A tokio::sync::Semaphore with max_jobs permits gates the
actual `nix build` phase.  Builds report WaitingForLocalSlot status
while waiting for the semaphore, then Building once acquired.
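
A sketch of the gating, with hypothetical import/status helpers around the semaphore described above:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// max_jobs permits gate the `nix build` phase; imports run before a
// permit is acquired, so prefetched builds overlap their imports.
let build_slots = Arc::new(Semaphore::new(max_jobs));

import_inputs(&job).await?; // hypothetical: runs without a permit
report_status(&job, "WaitingForLocalSlot").await?; // hypothetical
let _permit = build_slots.acquire().await?;
report_status(&job, "Building").await?;
run_nix_build(&job).await?;
```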
Log import progress so operators can see what's happening during the
"Sending inputs" phase instead of just a duration:

- Total closure size at start (N sources + M derivations to fetch)
- Progress per chunk (imported X/N sources, Y/M drvs)
- Remaining requisites count after output-inclusive refetch
- Slow substitution warnings (>5s per path)
- Streaming batch start/finish messages
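
For illustration, the slow-substitution warning from the list above might look like this (threshold from the list; `ensure_path` assumed to be the async daemon call):

```rust
use std::time::{Duration, Instant};

let started = Instant::now();
ensure_path(&path).await?; // hypothetical call into nix's daemon
if started.elapsed() > Duration::from_secs(5) {
    // Warn about paths that substituted slowly (>5s).
    tracing::warn!("slow substitution: {path} took {:?}", started.elapsed());
}
```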
Store individual phase durations (import_time_ms, build_time_ms,
upload_time_ms) in BuildSteps alongside the existing overhead column.
This enables the Perl frontend to show a phase breakdown for
completed builds:

  Import: 18m 22s | Build: 2m 15s | Upload: 1m 45s

DB changes:
- upgrade-86.sql: ALTER TABLE BuildSteps ADD COLUMN for 3 new fields
- hydra.sql: updated CREATE TABLE + fixed stale busy column comments

Proto changes:
- StepUpdate gains progress_current/progress_total fields (proto3
  default 0 — backward compatible with old builders)

Queue runner:
- finish_build_step writes the 3 timing columns from BuildTimings
- overhead continues to be written for backward compatibility

Builder:
- Sends StepUpdate with progress_current/progress_total during
  import_requisites, after each chunk of paths is imported

Perl frontend:
- build.tt shows phase breakdown under the duration for completed
  steps that have the new timing data
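
The builder-side progress report might look roughly like this; the message construction is assumed from the proto description, and proto3 zero defaults keep old runners compatible:

```rust
// After each imported chunk: report progress_current/progress_total.
tx.send(BuilderRequest {
    message: Some(builder_request::Message::Update(StepUpdate {
        progress_current: imported_so_far as u64,
        progress_total: total_paths as u64,
        ..Default::default()
    })),
})
.await?;
```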
substitute_paths now uses buffer_unordered(10) instead of a
sequential for loop.  With 586 paths in a closure, this allows up
to 10 concurrent ensure_path calls to nix's daemon.  Nix's flock
on store paths prevents redundant downloads — concurrent calls for
the same path serialize at the lock, not duplicate work.

Also fixes the silent error discard: substitution failures are now
logged at INFO level with counts ("234/586 succeeded, 352 failed")
and individual failures at DEBUG level, so operators can see whether
peernix/substituters are working.
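
A sketch of the concurrent substitution plus the logging, with `ensure_path` assumed to be the async daemon call (a later commit in this PR drops the concurrency from 10 to 3):

```rust
use futures::{stream, StreamExt};

// Up to 10 ensure_path calls in flight; nix's per-path flock makes
// concurrent requests for the same path serialize, not duplicate work.
let results: Vec<anyhow::Result<()>> = stream::iter(paths)
    .map(|path| async move {
        ensure_path(&path).await.map_err(|e| {
            tracing::debug!("substitution failed for {path}: {e}");
            e
        })
    })
    .buffer_unordered(10)
    .collect()
    .await;

let ok = results.iter().filter(|r| r.is_ok()).count();
tracing::info!("{ok}/{} substitutions succeeded, {} failed",
    results.len(), results.len() - ok);
```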
…ild slot'

This status (busy=35, WaitingForLocalSlot) now means the builder
has finished importing inputs and is waiting for a build semaphore
permit.  The old label was misleading.
buffer_unordered(10) × 8 concurrent builds = 80+ spawn_blocking
threads, approaching tokio's pool limit and starving the gRPC ping
generator.  Reduce to 3: still 3x faster than sequential, but safe
(8×3=24 blocking threads).
`buildproducts.path` is a full filesystem path, not a bare store path:
Hydra's `$out/nix-support/hydra-build-products` lets a product point at
a sub-path of an output (e.g. `doc manual $doc/share/doc/nix/manual
index.html`), so the column legitimately contains a trailing `/...`
after the `<hash>-<name>`.

`OwnedBuildProduct::parse_paths` fed the whole string to
`StoreDir::parse`, which died with `invalid store path / symbol at 50`
on any such value. That wedged every cached-build fast path touching a
product with a subpath — notably the `nix-manual-2.31.4/share/doc/nix/manual`
products from `roots.nix` builds, leaving ~200 builds per eval stuck
as 'unfinished' even though every output was already in the store.

Even had the parse succeeded, `BuildProduct::from_db` was hard-coding
`relative_path: "".into()`, so the sibling constructors
(`from_grpc`, `from_shared`) and `from_db` disagreed on what the field
meant.

Fix by:
- dropping the `StorePath` generic + `parse_paths` from
  `OwnedBuildProduct` (path stays as `Option<String>` — it's opaque
  to the db crate), and
- routing `BuildProduct::from_db` through the same
  `RelativeStorePath::from_path` splitter `from_grpc`/`from_shared`
  already use. The db-side write path calls `RelativeStorePath::print`,
  so `from_path` is now its exact inverse.

Adds unit tests covering the subpath product, a bare store path, and
the print/from_path round-trip.
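
A sketch of the splitter's contract (the real `RelativeStorePath::from_path` signature is assumed; only the split/round-trip behaviour is taken from the message above):

```rust
// Split a full product path into the store path and the trailing
// relative part; `print` re-joins them, so the two round-trip.
fn from_path(full: &str) -> (&str, &str) {
    let rest = &full["/nix/store/".len()..];
    match rest.find('/') {
        Some(i) => {
            let cut = "/nix/store/".len() + i;
            (&full[..cut], &full[cut + 1..])
        }
        None => (full, ""), // bare store path: no relative part
    }
}

assert_eq!(
    from_path("/nix/store/abc-nix-manual-2.31.4/share/doc/nix/manual"),
    ("/nix/store/abc-nix-manual-2.31.4", "share/doc/nix/manual")
);
```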
Address CI failures and review feedback:

- Add #[allow(clippy::unwrap_used, clippy::expect_used)] to the test
  module — the crate-level deny triggers on test helpers using unwrap().

- Replace collect::<anyhow::Result<_>>()? with filter_map + warn log
  for build product parsing.  A single corrupt DB row (e.g. legacy
  product with a non-store path) should not abort the entire cached
  build — log and skip instead, consistent with how malformed build
  outputs are already skipped via `let Ok(mut res) ... else { continue }`
  on the line above.

- Fix import ordering (rustfmt).
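
The skip-and-log pattern described above, sketched with assumed names:

```rust
// One corrupt DB row no longer aborts the cached build: log and skip.
let products: Vec<BuildProduct> = rows
    .into_iter()
    .filter_map(|row| match BuildProduct::from_db(row) {
        Ok(p) => Some(p),
        Err(e) => {
            tracing::warn!("skipping malformed build product: {e}");
            None
        }
    })
    .collect();
```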
The existing tests covered RelativeStorePath::from_path (which was
already working), but not the from_db path where the actual bug lived.

Add tests that exercise from_db directly:
- subpath product (/nix/store/hash-name/share/doc/nix/manual)
- bare store path (/nix/store/hash-name)
- None path (NULL in database)
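
One of the new cases might look like this (the row helper and field names are hypothetical; the key assertion is the split):

```rust
#[test]
fn from_db_handles_subpath_product() {
    let p = BuildProduct::from_db(row_with_path(
        "/nix/store/abc123-nix-manual-2.31.4/share/doc/nix/manual",
    ))
    .unwrap();
    assert_eq!(p.relative_path, "share/doc/nix/manual");
}
```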
angerman force-pushed the fix/builder-grpc-keepalive branch from 65015b1 to 4c4d6fe on April 24, 2026 06:17