[v1.2.0-rc1 Feature] introduce safe warm workers #2050
Closed
MarcusSorealheis wants to merge 23 commits into
98a4b68 to b3e4746
a379a41 to bb0ee8e
bb0ee8e to 43e6d51
43e6d51 to 728670e
728670e to fba2c05
fba2c05 to 8281391
8281391 to 94924a0
27e3274 to 77debd7
77debd7 to 2873846
b1478a7 to 7003259
@palfrey if you can ascertain why these tests have been so flaky (all musl-related), that would be greatly appreciated.
/build-image
Image built and pushed!
* origin/main: (44 commits)
  Release NativeLink v1.0.0-rc2 (TraceMachina#2170)
  Add boolean and optional data size shellexpands (TraceMachina#2172)
  Log NotFound as info, not error (TraceMachina#2171)
  Add Max Concurrent Writes (TraceMachina#2156)
  Fix integer overflow in compression_store.rs data retrieval logic (TraceMachina#2151)
  Add logs for stall detection (TraceMachina#2155)
  Dummy streams should be pending, not empty (TraceMachina#2154)
  Add Max action executing timeouts to scheduler (TraceMachina#2153)
  fix metrics (TraceMachina#2097)
  Add GRPC timeouts and other improvements to detect dead connections (TraceMachina#2152)
  Allows setting environment variables from the environment (TraceMachina#2143)
  Add Max Upload timeout to CAS (TraceMachina#2150)
  Advise the kernel to drop page cache (TraceMachina#2149)
  Add tracing to hyper-util (TraceMachina#2132)
  fix(deps): update rust crate toml to v1 (TraceMachina#2147)
  fix(deps): update module github.com/go-git/go-git/v5 to v5.16.5 [security] (TraceMachina#2138)
  Fix Max Inflight Workers job acceptance (TraceMachina#2142)
  Replace Fred with redis-rs (TraceMachina#2076)
  No workers logging (TraceMachina#2137)
  Make update_with_whole_file logging default to trace (TraceMachina#2131)
  ...
# Conflicts:
# Cargo.lock
Demonstrates the before/after difference of the COW isolation contract in
nativelink-crio-worker-pool/src/isolation.rs:
- "before" path: a single warm worker shared across tenants leaks tenant
A's on-disk state into tenant B's job (the bug this PR fixes).
- "after" path: each job gets a per-job overlay; reads fall through to the
shared warmed template, writes are scoped to the job, cleanup discards
the upper layer and leaves the template intact.
Self-contained pure-Java test (no junit, no CRI-O, no root). Wired into
docker/java/Dockerfile as a build-time gate, runnable locally via
scripts/test_isolation.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
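The before/after contract described in this commit can be sketched as a small in-memory model. This is not the code in src/isolation.rs or the Java test: the type names `Template`, `SharedWorker`, and `JobOverlay`, the file paths, and the use of hash maps in place of a real overlay mount are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the warmed template (the read-only lower layer).
#[derive(Debug, Clone, Default)]
pub struct Template {
    files: HashMap<String, String>,
}

impl Template {
    pub fn with_file(mut self, path: &str, contents: &str) -> Self {
        self.files.insert(path.to_string(), contents.to_string());
        self
    }

    pub fn read(&self, path: &str) -> Option<&str> {
        self.files.get(path).map(String::as_str)
    }
}

/// "Before" path: one mutable worker state reused by every tenant.
#[derive(Debug, Default)]
pub struct SharedWorker {
    files: HashMap<String, String>,
}

impl SharedWorker {
    pub fn write(&mut self, path: &str, contents: &str) {
        self.files.insert(path.to_string(), contents.to_string());
    }

    pub fn read(&self, path: &str) -> Option<&str> {
        self.files.get(path).map(String::as_str)
    }
}

/// "After" path: a per-job upper layer on top of the shared template.
#[derive(Debug)]
pub struct JobOverlay<'a> {
    lower: &'a Template,
    upper: HashMap<String, String>,
}

impl<'a> JobOverlay<'a> {
    pub fn new(lower: &'a Template) -> Self {
        JobOverlay { lower, upper: HashMap::new() }
    }

    /// Writes land only in this job's upper layer.
    pub fn write(&mut self, path: &str, contents: &str) {
        self.upper.insert(path.to_string(), contents.to_string());
    }

    /// Reads check the upper layer first, then fall through to the template.
    pub fn read(&self, path: &str) -> Option<&str> {
        self.upper
            .get(path)
            .or_else(|| self.lower.files.get(path))
            .map(String::as_str)
    }

    /// Cleanup consumes the overlay and drops the upper layer; the template is untouched.
    pub fn cleanup(self) {}
}

/// Returns what tenant B can read of tenant A's file: (shared worker, per-job overlay).
pub fn tenant_b_view() -> (Option<String>, Option<String>) {
    let mut shared = SharedWorker::default();
    shared.write("/work/secret", "tenant-a");
    let before = shared.read("/work/secret").map(str::to_string);

    let template = Template::default().with_file("/opt/warmup/cache", "hot");
    let mut job_a = JobOverlay::new(&template);
    job_a.write("/work/secret", "tenant-a");
    job_a.cleanup();
    let job_b = JobOverlay::new(&template);
    let after = job_b.read("/work/secret").map(str::to_string);

    (before, after)
}
```

In this model `tenant_b_view()` sees tenant A's file through the shared worker but not through a fresh overlay, and a write that shadows a template path disappears once the overlay is cleaned up.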
…ULE.bazel.lock
- nativelink-config/Cargo.toml: collapse the two [features] tables
(warm-worker-pools from this branch + dev-schema from main) into one.
- nativelink-{util,service,worker}/BUILD.bazel: align @crates//:hyper
references to 1.7.0, matching what Cargo.lock actually resolves and
what main is on. The 1.8.1 references were a stale PR-side bump.
- MODULE.bazel.lock: refresh via `bazel mod deps --lockfile_mode=update`
so the rules_rust crate_universe extension hash matches the merged
Cargo.lock - this was the cause of the redis_store_tester CI failure.
bazel test //... is green locally (87/87).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repo's .rustfmt.toml sets imports_granularity = "Module" (nightly),
which forbids the grouped `use runtime::{a::A, b::B, *}` form. Splits
into one `use` per module path, matching what `cargo +nightly fmt`
produces. This is the fix for the pre-commit-run failure on the PR.
Also picks up the warm-worker-pools.mdx doc improvements: a copy-paste
"Quick start" with isolation enabled by default, a Verify section that
points at scripts/test_isolation.sh, and tighter CRI-O-vs-Docker
language in Prerequisites and Worker Images. Adds "containerd" to the
Vale vocabulary and reflows two "repo" -> "repository" usages so the
Vale step stays green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
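The import split described in the commit above can be illustrated with standard-library paths. The `std` modules and the `render_max` helper are placeholders chosen so the snippet compiles on its own; they are not the PR's `runtime` imports.

```rust
// Grouped form, which rustfmt rewrites under imports_granularity = "Module":
// use std::{cmp::*, collections::BTreeMap, fmt::Write};

// Module-granularity form: one `use` per module path.
use std::cmp::*;
use std::collections::BTreeMap;
use std::fmt::Write;

/// Hypothetical helper that touches all three imports.
pub fn render_max(values: &[u32]) -> String {
    let mut counts: BTreeMap<u32, usize> = BTreeMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    // `max` is the free function brought in by the glob import of std::cmp.
    let top = values.iter().copied().fold(0, max);
    let mut out = String::new();
    write!(out, "max={} distinct={}", top, counts.len())
        .expect("writing to a String does not fail");
    out
}
```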
Swift hits the same cold-start tax as the JVM - the Swift compiler has a
heavy front-end and resolves Foundation/Dispatch out of its module cache
on every invocation. The warm-pool model captures that cache in the COW
template's lower layer, so every cloned job inherits a hot module cache
without re-resolving stdlib modules.
This change:
- nativelink-config + nativelink-crio-worker-pool: add `Swift` as a
first-class variant of the Language enum (next to Jvm/NodeJs/Custom).
- docker/swift/: new image based on swift:6.0-jammy, pre-compiles
SwiftWarmup.swift at template-creation time and primes the module cache
under /opt/warmup/module-cache so the lower layer carries it.
- docker/swift/warmup/: SwiftWarmup.swift exercises Foundation (Codable
round-trip), math, and collections; swift-warmup.sh runs the binary then
-typecheck to warm the parser; prime-swift-cache.sh pre-resolves
Foundation/Dispatch/FoundationNetworking imports.
- examples/swift-pool.json5: drop-in pool config with COW isolation
enabled by default.
- docs: extend "Use warm pools for" with Swift, add Swift row to the
language-mappings table, and add an "Other languages" subsection under
Quick start that links to the example configs.
bazel test //... still 87/87.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
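A rough sketch of what a first-class `Swift` variant buys over routing Swift through `Custom`: every `match` on the enum has to handle it explicitly. The variant names come from the commit message; the `Custom(String)` payload, the `from_config` and `warm_cache_dir` helpers, and the accepted config strings are assumptions, not the definitions in nativelink-config.

```rust
/// Illustrative enum with the four variants named in the commit message.
/// The String payload on `Custom` is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Language {
    Jvm,
    NodeJs,
    Swift,
    Custom(String),
}

impl Language {
    /// Hypothetical helper: maps a config string to a variant.
    pub fn from_config(name: &str) -> Language {
        match name.to_ascii_lowercase().as_str() {
            "jvm" => Language::Jvm,
            "nodejs" => Language::NodeJs,
            "swift" => Language::Swift,
            _ => Language::Custom(name.to_string()),
        }
    }

    /// Hypothetical helper: the pre-warmed cache directory, where the commit names one.
    /// Only the Swift path appears in the commit message; the rest return None here.
    pub fn warm_cache_dir(&self) -> Option<&'static str> {
        match self {
            Language::Swift => Some("/opt/warmup/module-cache"),
            Language::Jvm | Language::NodeJs | Language::Custom(_) => None,
        }
    }
}
```

Because the second `match` has no wildcard arm, adding another variant later would fail to compile until that function is updated.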
Mirrors the Swift work for the Node.js / TypeScript pool:
- docker/typescript/warmup/WarmWorkerIsolationTest.ts: TypeScript port
of the Java WarmWorkerIsolationTest. Pure-Node model of OverlayFsMount
from src/isolation.rs - 7 assertions covering both the leak ("before")
and the COW isolation ("after"). No CRI-O, no root, no junit.
- scripts/test_isolation_node.sh: local runner that npm-installs
typescript@5.3 + @types/node, compiles, and runs the test.
- docker/typescript/Dockerfile: add @types/node and a build-time gate
that compiles + runs the isolation test, so a broken COW contract
fails the image build (same pattern as docker/java/Dockerfile).
- examples/typescript-pool.json5: standalone TypeScript pool example
with COW isolation enabled by default (parallel to swift-pool.json5).
Docs language-mappings table now points TS at the standalone file
instead of the combined java-typescript example.
- scripts/README.md + warm-worker-pools.mdx: surface the second test
runner so users can pick whichever language their team prefers.
Both isolation tests pass locally (7/7 each).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…BUILD
Matches the suppression pattern already used in
nativelink-proto/BUILD.bazel for the same @@toolchains_protoc++ label
references.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TypeScript: tsc --types flag doesn't find globally-installed @types/node;
use --typeRoots pointing to npm's global prefix instead.
Swift: @main attribute is incompatible with top-level code mode
(single-file compilation without -parse-as-library). Replace with an
explicit top-level SwiftWarmup.main() call, which is idiomatic for
single-file Swift binaries.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When attic.uc1.scdev.nativelink.net drops a connection mid-transfer
(HTTP 200 + curl error), Nix fails the entire build instead of building
from source. Setting fallback = true in nix.conf tells Nix to build the
derivation locally whenever all substituters fail, which recovers these
transient cache-miss failures without any code change.
Applies to pre-commit-checks, lre, and all other workflows sharing the
prepare-nix composite action.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/build-image
Image built and pushed! |
Description
Implements Copy-on-Write (COW) isolation for warm worker pools to prevent cross-tenant state contamination in multi-tenant RBE deployments.
Fixes #2049
Type of change
How Has This Been Tested?
Locally. Needs to be tested by the community.
Checklist
- bazel test //... passes locally
- git amend (see some docs)