[v1.2.0-rc1 Feature] introduce safe warm workers#2050

Closed
MarcusSorealheis wants to merge 23 commits into
TraceMachina:mainfrom
MarcusSorealheis:remote-persistent-safe

@MarcusSorealheis MarcusSorealheis commented Nov 15, 2025

Member

Description

Implements Copy-on-Write (COW) isolation for warm worker pools to prevent cross-tenant state contamination in multi-tenant RBE deployments.

Fixes #2049

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Locally. Needs to be tested by the community.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)

@MarcusSorealheis

Member Author

@palfrey if you can ascertain why these tests have been so flaky (all musl-related), that would be greatly appreciated.

Create OCI image / Publish image (pull_request): Cancelled after 30m
Create OCI image / Publish nativelink-worker-init (pull_request): Cancelled after 30m
Nix / rbe-toolchain (pull_request): Cancelled after 45m

@MarcusSorealheis MarcusSorealheis changed the title from "Introduce safe warm workers" to "[v1.0.0 Feature] introduce safe warm workers" Nov 21, 2025
@MarcusSorealheis MarcusSorealheis marked this pull request as draft November 21, 2025 17:59
@MarcusSorealheis

Member Author

/build-image

@github-actions


Image built and pushed!

ghcr.io/TraceMachina/nativelink:3866934

MarcusSorealheis and others added 11 commits May 3, 2026 05:58
* origin/main: (44 commits)
  Release NativeLink v1.0.0-rc2 (TraceMachina#2170)
  Add boolean and optional data size shellexpands (TraceMachina#2172)
  Log NotFound as info, not error (TraceMachina#2171)
  Add Max Concurrent Writes (TraceMachina#2156)
  Fix integer overflow in compression_store.rs data retrieval logic (TraceMachina#2151)
  Add logs for stall detection (TraceMachina#2155)
  Dummy streams should be pending, not empty (TraceMachina#2154)
  Add Max action executing timeouts to scheduler (TraceMachina#2153)
  fix metrics (TraceMachina#2097)
  Add GRPC timeouts and other improvements to detect dead connections (TraceMachina#2152)
  Allows setting environment variables from the environment (TraceMachina#2143)
  Add Max Upload timeout to CAS (TraceMachina#2150)
  Advise the kernel to drop page cache (TraceMachina#2149)
  Add tracing to hyper-util (TraceMachina#2132)
  fix(deps): update rust crate toml to v1 (TraceMachina#2147)
  fix(deps): update module github.com/go-git/go-git/v5 to v5.16.5 [security] (TraceMachina#2138)
  Fix Max Inflight Workers job acceptance (TraceMachina#2142)
  Replace Fred with redis-rs (TraceMachina#2076)
  No workers logging (TraceMachina#2137)
  Make update_with_whole_file logging default to trace (TraceMachina#2131)
  ...

# Conflicts:
#	Cargo.lock

Demonstrates the before/after difference of the COW isolation contract
in nativelink-crio-worker-pool/src/isolation.rs:

- "before" path: a single warm worker shared across tenants leaks
  tenant A's on-disk state into tenant B's job (the bug this PR fixes).
- "after" path: each job gets a per-job overlay; reads fall through to
  the shared warmed template, writes are scoped to the job, cleanup
  discards the upper layer and leaves the template intact.

Self-contained pure-Java test (no junit, no CRI-O, no root). Wired into
docker/java/Dockerfile as a build-time gate, runnable locally via
scripts/test_isolation.sh.
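
As a rough illustration of the before/after contract described above, here is
an in-memory Rust sketch. It is not the OverlayFsMount code from
src/isolation.rs: the `JobOverlay` type, the two helper functions, and the
paths `cache/jdk` and `secret.txt` are hypothetical names chosen for this
example, and a map stands in for the on-disk layers.

```rust
use std::collections::BTreeMap;

/// Shared, warmed template: plays the role of the read-only lower layer.
type Template = BTreeMap<String, String>;

/// Hypothetical per-job overlay: writes land in `upper`, reads fall through to `lower`.
struct JobOverlay<'a> {
    lower: &'a Template,
    upper: BTreeMap<String, String>,
}

impl<'a> JobOverlay<'a> {
    fn new(lower: &'a Template) -> Self {
        Self { lower, upper: BTreeMap::new() }
    }

    /// Writes are scoped to the job and never touch the template.
    fn write(&mut self, path: &str, data: &str) {
        self.upper.insert(path.to_string(), data.to_string());
    }

    /// Reads prefer the job's upper layer, then fall through to the template.
    fn read(&self, path: &str) -> Option<&str> {
        self.upper
            .get(path)
            .or_else(|| self.lower.get(path))
            .map(String::as_str)
    }
}

/// "Before": one mutable worker directory is reused across tenants.
fn tenant_b_view_shared() -> Option<String> {
    let mut worker = Template::new();
    worker.insert("cache/jdk".into(), "warm".into());
    worker.insert("secret.txt".into(), "tenant-a-token".into()); // tenant A's job
    worker.get("secret.txt").cloned() // tenant B's job reads the same path
}

/// "After": each job gets its own overlay on top of the shared template.
/// Returns what tenant B reads at the secret path, what it reads from the
/// warmed cache, and how many entries the template holds afterwards.
fn tenant_b_view_isolated() -> (Option<String>, Option<String>, usize) {
    let mut template = Template::new();
    template.insert("cache/jdk".into(), "warm".into());
    {
        let mut job_a = JobOverlay::new(&template);
        job_a.write("secret.txt", "tenant-a-token");
        // job_a is dropped here, which models cleanup discarding the upper layer.
    }
    let job_b = JobOverlay::new(&template);
    (
        job_b.read("secret.txt").map(str::to_string),
        job_b.read("cache/jdk").map(str::to_string),
        template.len(),
    )
}

fn main() {
    println!("shared worker, tenant B reads: {:?}", tenant_b_view_shared());
    let (secret, cache, entries) = tenant_b_view_isolated();
    println!("overlay, tenant B reads secret: {secret:?}, cache: {cache:?}, template entries: {entries}");
}
```

The model leaves out deletions (whiteouts) and anything that needs a real
mount; it only captures the three properties listed above.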

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ULE.bazel.lock

- nativelink-config/Cargo.toml: collapse the two [features] tables
  (warm-worker-pools from this branch + dev-schema from main) into one.
- nativelink-{util,service,worker}/BUILD.bazel: align @crates//:hyper
  references to 1.7.0, matching what Cargo.lock actually resolves and
  what main is on. The 1.8.1 references were a stale PR-side bump.
- MODULE.bazel.lock: refresh via `bazel mod deps --lockfile_mode=update`
  so the rules_rust crate_universe extension hash matches the merged
  Cargo.lock - this was the cause of the redis_store_tester CI failure.

bazel test //... is green locally (87/87).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Repo's .rustfmt.toml sets imports_granularity = "Module" (nightly),
which forbids the grouped `use runtime::{a::A, b::B, *}` form. Splits
into one `use` per module path, matching what `cargo +nightly fmt`
produces. This is the fix for the pre-commit-run failure on the PR.
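
For reference, a small sketch of the two import shapes. It uses std paths
rather than the repository's `runtime` module, so the concrete imports and the
`render` helper are assumptions for illustration only.

```rust
// Grouped form that `imports_granularity = "Module"` rewrites:
//     use std::{collections::BTreeMap, fmt::Write};
// Module-granularity form: one `use` per module path.
use std::collections::BTreeMap;
use std::fmt::Write;

/// Renders a map as `key=value` lines (hypothetical helper for this example).
fn render(entries: &BTreeMap<&str, u32>) -> String {
    let mut out = String::new();
    for (key, value) in entries {
        // `writeln!` on a String needs `std::fmt::Write` in scope.
        writeln!(out, "{key}={value}").expect("writing to a String cannot fail");
    }
    out
}

fn main() {
    let mut entries = BTreeMap::new();
    entries.insert("jvm", 2);
    entries.insert("swift", 1);
    print!("{}", render(&entries));
}
```

Items that share a module path may still be grouped (for example
`use std::collections::{BTreeMap, BTreeSet};`); only nesting across module
paths is split.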

Also picks up the warm-worker-pools.mdx doc improvements: a copy-paste
"Quick start" with isolation enabled by default, a Verify section that
points at scripts/test_isolation.sh, and tighter CRI-O-vs-Docker
language in Prerequisites and Worker Images. Adds "containerd" to the
Vale vocabulary and reflows two "repo" -> "repository" usages so the
Vale step stays green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Swift hits the same cold-start tax as the JVM - the Swift compiler has
a heavy front-end and resolves Foundation/Dispatch out of its module
cache on every invocation. The warm-pool model captures that cache in
the COW template's lower layer, so every cloned job inherits a hot
module cache without re-resolving stdlib modules.

This change:
- nativelink-config + nativelink-crio-worker-pool: add `Swift` as a
  first-class variant of the Language enum (next to Jvm/NodeJs/Custom).
- docker/swift/: new image based on swift:6.0-jammy, pre-compiles
  SwiftWarmup.swift at template-creation time and primes the module
  cache under /opt/warmup/module-cache so the lower layer carries it.
- docker/swift/warmup/: SwiftWarmup.swift exercises Foundation
  (Codable round-trip), math, and collections; swift-warmup.sh runs
  the binary then -typecheck to warm the parser; prime-swift-cache.sh
  pre-resolves Foundation/Dispatch/FoundationNetworking imports.
- examples/swift-pool.json5: drop-in pool config with COW isolation
  enabled by default.
- docs: extend "Use warm pools for" with Swift, add Swift row to the
  language-mappings table, and add an "Other languages" subsection
  under Quick start that links to the example configs.
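
A minimal sketch of what a first-class variant buys: every exhaustive `match`
over the enum must now handle `Swift`. The enum below is a hypothetical mirror,
not the definition in nativelink-config; the `Custom(String)` payload and the
`primed_cache_dir` method are assumptions, and only the Swift cache path comes
from the text above.

```rust
/// Hypothetical mirror of the pool `Language` enum; the real type may
/// carry different payloads or attributes.
#[derive(Debug)]
enum Language {
    Jvm,
    NodeJs,
    Swift,
    Custom(String), // payload is an assumption
}

impl Language {
    /// Directory primed at template-creation time, where this PR names one.
    fn primed_cache_dir(&self) -> Option<&'static str> {
        match self {
            Language::Swift => Some("/opt/warmup/module-cache"),
            // The PR text does not name cache paths for the other variants.
            Language::Jvm | Language::NodeJs | Language::Custom(_) => None,
        }
    }
}

fn main() {
    let languages = [
        Language::Jvm,
        Language::NodeJs,
        Language::Swift,
        Language::Custom("other".to_string()),
    ];
    for language in languages {
        println!("{:?} -> {:?}", language, language.primed_cache_dir());
    }
}
```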

bazel test //... still 87/87.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the Swift work for the Node.js / TypeScript pool:

- docker/typescript/warmup/WarmWorkerIsolationTest.ts: TypeScript port
  of the Java WarmWorkerIsolationTest. Pure-Node model of OverlayFsMount
  from src/isolation.rs - 7 assertions covering both the leak ("before")
  and the COW isolation ("after"). No CRI-O, no root, no junit.
- scripts/test_isolation_node.sh: local runner that npm-installs
  typescript@5.3 + @types/node, compiles, and runs the test.
- docker/typescript/Dockerfile: add @types/node and a build-time gate
  that compiles + runs the isolation test, so a broken COW contract
  fails the image build (same pattern as docker/java/Dockerfile).
- examples/typescript-pool.json5: standalone TypeScript pool example
  with COW isolation enabled by default (parallel to swift-pool.json5).
  Docs language-mappings table now points TS at the standalone file
  instead of the combined java-typescript example.
- scripts/README.md + warm-worker-pools.mdx: surface the second test
  runner so users can pick whichever language their team prefers.

Both isolation tests pass locally (7/7 each).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…BUILD

Matches the suppression pattern already used in nativelink-proto/BUILD.bazel
for the same @@toolchains_protoc++ label references.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TypeScript: tsc --types flag doesn't find globally-installed @types/node;
use --typeRoots pointing to npm's global prefix instead.

Swift: @main attribute is incompatible with top-level code mode (single-file
compilation without -parse-as-library). Replace with an explicit top-level
SwiftWarmup.main() call, which is idiomatic for single-file Swift binaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When attic.uc1.scdev.nativelink.net drops a connection mid-transfer (HTTP
200 + curl error), Nix fails the entire build instead of building from
source. Setting fallback = true in nix.conf tells Nix to build the
derivation locally whenever all substituters fail, which recovers these
transient cache-miss failures without any code change.
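
The decision described above can be modelled roughly as follows. This is a
simplified sketch, not Nix's implementation: `Outcome` and `realise` are
hypothetical names, and the `bool` next to each cache only marks whether a
download from a cache that advertises the path actually completed.

```rust
/// Result of trying to obtain one derivation's output.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Substituted(String), // fetched from the named cache
    BuiltLocally,
    Failed,
}

/// Try each substituter in order; if every transfer fails, build locally
/// only when `fallback` is set, otherwise report failure.
fn realise(substituters: &[(&str, bool)], fallback: bool) -> Outcome {
    for (name, transfer_ok) in substituters {
        if *transfer_ok {
            return Outcome::Substituted((*name).to_string());
        }
    }
    if fallback {
        Outcome::BuiltLocally
    } else {
        Outcome::Failed
    }
}

fn main() {
    // The host comes from the text above; `false` models the dropped transfer.
    let caches = [("attic.uc1.scdev.nativelink.net", false)];
    println!("fallback = false: {:?}", realise(&caches, false));
    println!("fallback = true:  {:?}", realise(&caches, true));
}
```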

Applies to pre-commit-checks, lre, and all other workflows sharing the
prepare-nix composite action.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MarcusSorealheis

Member Author

/build-image

@github-actions

github-actions Bot commented May 3, 2026


Image built and pushed!

ghcr.io/TraceMachina/nativelink:20aad88

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remote Persistent Workers Present Security Risk

1 participant