
[DO NOT MERGE] Temporary CAS updates and fixes. #2301

Open
amankrx wants to merge 20 commits into TraceMachina:main from amankrx:temp-cas-bulk-changes


@amankrx
Collaborator

@amankrx amankrx commented May 5, 2026

Description

This PR combines several of my pending PRs so that I can build a Docker image from them and keep track of which PRs still need to be merged.
Fixes # (issue)

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)


@amankrx
Collaborator Author

amankrx commented May 5, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 5, 2026

/build-image

@github-actions

github-actions Bot commented May 5, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:749e1b4

@github-actions

github-actions Bot commented May 5, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:749e1b4

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:8266b7d

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:8266b7d

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:4f8e090

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:4f8e090

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:c461440

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:c461440

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:065daaa

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:1a6cc2a

@github-actions

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:1a6cc2a

amankrx added 2 commits May 7, 2026 12:28
The populating-digests dedup is a net loss for multi-GB blobs: the
leader's pull from the slow store reliably exceeds LEADER_WAIT_TIMEOUT,
every concurrent follower falls through to its own slow-store read
anyway, and populating the fast tier with the huge blob evicts a
large number of smaller, more-useful entries.

Add a configurable size threshold (FastSlowSpec::bypass_dedup_threshold_bytes,
default 256 MiB when unset) at and above which get_part streams
straight from the slow store and skips the populate. Below the
threshold the existing leader/follower path is unchanged.

Tests: blobs at and above the threshold drive one slow-store call
per concurrent reader and leave the fast tier empty; blobs below the
threshold collapse to a single slow-store call and populate fast.
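
The size check described in this commit can be sketched as a small decision function. This is a minimal illustration and not the code from the PR: `ReadPath`, `effective_threshold`, and `choose_read_path` are hypothetical names, and treating a configured value of `0` as "unset" is an assumption.

```rust
/// 256 MiB default quoted in the commit message.
const DEFAULT_BYPASS_DEDUP_THRESHOLD_BYTES: u64 = 256 * 1024 * 1024;

/// Which read path a blob of a given size takes (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum ReadPath {
    /// Stream straight from the slow store and skip populating the fast tier.
    BypassDedup,
    /// Existing leader/follower path that populates the fast tier.
    LeaderFollower,
}

/// Resolve the configured threshold. Assumption: 0 means "unset".
fn effective_threshold(configured: u64) -> u64 {
    if configured == 0 {
        DEFAULT_BYPASS_DEDUP_THRESHOLD_BYTES
    } else {
        configured
    }
}

/// Blobs at or above the threshold bypass the dedup path.
fn choose_read_path(blob_size: u64, configured_threshold: u64) -> ReadPath {
    if blob_size >= effective_threshold(configured_threshold) {
        ReadPath::BypassDedup
    } else {
        ReadPath::LeaderFollower
    }
}
```

The comparison is inclusive, matching the "at and above" wording: a blob of exactly 256 MiB bypasses, one byte less takes the leader/follower path.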

The default StoreDriver::check_health performs a full update_oneshot +
has + get_part_unchunked roundtrip. Under any meaningful slow-store
load (e.g. concurrent multi-GB blob reads saturating the network) that
roundtrip queues behind production traffic and easily exceeds the
per-indicator budget, causing /status to return 503 even when the
pod is otherwise functional. Kubelet treats those as probe failures
and eventually restarts the pod, dropping every connected client.

Override check_health on both registered indicators with a probe that
shares no path with production traffic:

  * GcsStore: a single object_exists() call against a fixed
    never-existing path. Metadata roundtrip only — independent of
    body-transfer bandwidth and the upload buffer pool.
  * FilesystemStore: a stat() of the configured content_path. A
    single syscall, microseconds on a healthy mount; bounded with a
    timeout so a hung NFS / EBS mount cannot wedge the indicator.

Both paths are bounded by a 2 s ceiling so they remain well inside
the HealthServer per-indicator budget.
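
The idea of a cheap probe bounded by a hard ceiling can be sketched with the standard library alone. This is an illustration under stated assumptions: the real stores are async, whereas this sketch uses a helper thread and a channel; `Health`, `bounded_probe`, and `stat_probe` are hypothetical names, not the PR's API.

```rust
use std::fs;
use std::path::Path;
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Simplified health result (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Health {
    Ok,
    Failed(String),
}

/// 2 s ceiling quoted in the commit message.
const PROBE_CEILING: Duration = Duration::from_secs(2);

/// Run `probe` on a helper thread and stop waiting after `ceiling`.
/// A probe that hangs leaves its thread behind, but the caller is not blocked.
fn bounded_probe<F>(ceiling: Duration, probe: F) -> Health
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(probe());
    });
    match rx.recv_timeout(ceiling) {
        Ok(Ok(())) => Health::Ok,
        Ok(Err(message)) => Health::Failed(message),
        Err(RecvTimeoutError::Timeout) => {
            Health::Failed(format!("probe exceeded {:?} ceiling", ceiling))
        }
        Err(RecvTimeoutError::Disconnected) => {
            Health::Failed("probe thread exited without reporting".to_string())
        }
    }
}

/// Filesystem-style probe: one metadata lookup on the content path.
fn stat_probe(content_path: &Path) -> Health {
    let path = content_path.to_path_buf();
    bounded_probe(PROBE_CEILING, move || {
        fs::metadata(&path).map(|_| ()).map_err(|e| e.to_string())
    })
}
```

Calling `stat_probe(Path::new("."))` on a healthy mount returns `Health::Ok`, while a probe that sleeps past the ceiling is reported as `Failed` without blocking the caller any longer than the ceiling.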

@amankrx
Collaborator Author

amankrx commented May 7, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 7, 2026

/build-image

@github-actions

github-actions Bot commented May 7, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:cc6dd6f

@github-actions

github-actions Bot commented May 7, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:cc6dd6f

The previous 2-second PING ceiling was tight enough that a routine
Redis BGSAVE fork would push every RedisStore indicator over the line
simultaneously: on an 11 GB production master under load we observe
fork-induced pauses around 3 seconds, and with three RedisStore
indicators (AC, CAS, scheduler) all PING-ing through the same
connection pool, all three return Failed in lockstep — surfacing as a
503 on /status and a kubelet probe-failure event even though the
Redis service is otherwise healthy.

Verified by capturing /status response bodies during a flap window:

  [{"namespace": ".../SCHEDULER_STORE/RedisStore",
    "status": {"Failed": {"message":
      "RedisStore::check_health: PING exceeded 2 s timeout"}}},
   {"namespace": ".../CAS_REDIS_STORE/RedisStore",
    "status": {"Failed": {"message":
      "RedisStore::check_health: PING exceeded 2 s timeout"}}},
   {"namespace": ".../AC_REDIS_STORE/RedisStore",
    "status": {"Failed": {"message":
      "RedisStore::check_health: PING exceeded 2 s timeout"}}}]

The HealthServer's per-indicator wrapper budget is 5 s
(DEFAULT_HEALTH_CHECK_TIMEOUT_SECONDS), so 4 s leaves a small
safety margin while comfortably absorbing the BGSAVE worst case we
have observed in practice.
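
The timing argument above can be checked with a few lines of arithmetic. The three durations come from the commit message; the constant and function names are hypothetical, and the strict comparison in `ping_passes` is an assumption about how the ceiling is applied.

```rust
use std::time::Duration;

/// Values quoted in the commit message.
const OLD_PING_TIMEOUT: Duration = Duration::from_secs(2);
const NEW_PING_TIMEOUT: Duration = Duration::from_secs(4);
const HEALTH_WRAPPER_BUDGET: Duration = Duration::from_secs(5);

/// True when a PING that takes `observed` finishes before `ceiling`.
fn ping_passes(observed: Duration, ceiling: Duration) -> bool {
    observed < ceiling
}

/// Slack between the inner PING ceiling and the outer wrapper budget.
/// Returns None if the ceiling would exceed the wrapper budget.
fn safety_margin(ceiling: Duration) -> Option<Duration> {
    HEALTH_WRAPPER_BUDGET.checked_sub(ceiling)
}
```

With a fork-induced pause of about 3 seconds, `ping_passes` is false under the old 2 s ceiling and true under the new 4 s ceiling, and `safety_margin(NEW_PING_TIMEOUT)` leaves 1 second before the 5 s wrapper budget is reached.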

@amankrx
Collaborator Author

amankrx commented May 7, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 7, 2026

/build-image

@github-actions

github-actions Bot commented May 7, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:7a89d6b

@github-actions

github-actions Bot commented May 7, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:7a89d6b

amankrx and others added 3 commits May 9, 2026 01:49
Production worker pods were OOMKilled (exit 137) after the
ConnectionManager's `available_connections: usize` counter underflowed
to ~u64::MAX while `waiting_connections` climbed unbounded. The manual
decrement-on-issue / increment-on-Dropped accounting balances on paper,
but a leak path was occasionally missing a `Dropped` delivery during
tonic transport errors and task aborts.

Switch to `Arc<Semaphore>` with `OwnedSemaphorePermit` on the Connection.
RAII makes leakage structurally impossible: every Drop path (panic,
abort, dropped oneshot receiver, transport error) releases the permit
exactly once.

Adds 3 integration tests covering the request/acquire/release cycle, an
aborted-caller-future cleanup scenario, and the MAX_CONCURRENT ceiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
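
The RAII argument in this commit can be illustrated without tokio. The sketch below is a std-only stand-in for `Arc<Semaphore>` plus `OwnedSemaphorePermit`, not the PR's code: `Pool` and `Permit` are hypothetical names, and `try_acquire` is non-blocking for brevity where the real semaphore would wait.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Minimal counting pool (hypothetical stand-in for a semaphore).
struct Pool {
    available: Arc<AtomicUsize>,
}

/// RAII permit: gives its slot back exactly once, when dropped.
struct Permit {
    available: Arc<AtomicUsize>,
}

impl Pool {
    fn new(max_concurrent: usize) -> Self {
        Pool { available: Arc::new(AtomicUsize::new(max_concurrent)) }
    }

    /// Take a slot if one is free. `checked_sub` makes underflow impossible:
    /// at zero the update is rejected instead of wrapping around.
    fn try_acquire(&self) -> Option<Permit> {
        self.available
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .ok()
            .map(|_| Permit { available: Arc::clone(&self.available) })
    }

    fn available(&self) -> usize {
        self.available.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        // Runs on every exit path of the holder, including unwinding.
        self.available.fetch_add(1, Ordering::AcqRel);
    }
}
```

Because the release lives in `Drop`, there is no separate "Dropped" message that a transport error or task abort could fail to deliver; the slot returns whenever the `Permit` value goes away.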

`RedisSubscription::Drop` previously dropped the `watch::Receiver`
*before* taking the `subscribed_keys` write lock, then decided whether
to remove the publisher entry based on `receiver_count() == 0`. Two
concurrent drops on subscriptions sharing a publisher (e.g. multiple
`WaitExecution` clients on the same operation_id) could both decrement
their counts before either took the lock, then race for it: the loser
saw the entry already removed and emitted a spurious "Key … was not
found in subscribed keys" error. Worse, if a fresh `subscribe(same_key)`
interleaved between the two drops, the second drop could remove the
freshly-inserted publisher and silently strand its subscribers.

Acquire the write lock *first*, evaluate "count == 1 with my receiver
still alive", remove the entry under the lock if so, then drop the
receiver. The lock now serialises both the count read and the map
mutation, closing both race windows. Demote the absence log from
`error!` to `warn!`: with the fix, that path now indicates a genuine
unexpected mutation outside the lock, not the race noise.

Adds 4 regression tests covering single-drop silence, drop-one-of-two
preserving the publisher, 200-iteration concurrent-drop race, and
resubscribe-after-drop creating a fresh publisher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
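
The lock-first ordering described in this commit can be sketched with standard-library types. This is an illustration under stated assumptions, not the PR's code: `Registry`, `Subscription`, and `subscribe` are hypothetical, and an `Arc<()>` stands in for the watch channel, so `Arc::strong_count` minus the map's own copy plays the role of `receiver_count()`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for a watch channel: the map holds one copy, each subscriber one.
type Publisher = Arc<()>;

struct Registry {
    subscribed_keys: RwLock<HashMap<String, Publisher>>,
}

impl Registry {
    fn new() -> Arc<Self> {
        Arc::new(Registry { subscribed_keys: RwLock::new(HashMap::new()) })
    }

    fn publisher_count(&self) -> usize {
        self.subscribed_keys.read().unwrap_or_else(|e| e.into_inner()).len()
    }
}

struct Subscription {
    key: String,
    receiver: Option<Publisher>,
    registry: Arc<Registry>,
}

fn subscribe(registry: &Arc<Registry>, key: &str) -> Subscription {
    let mut keys = registry.subscribed_keys.write().unwrap_or_else(|e| e.into_inner());
    let publisher = keys.entry(key.to_string()).or_insert_with(|| Arc::new(()));
    Subscription {
        key: key.to_string(),
        receiver: Some(Arc::clone(&*publisher)),
        registry: Arc::clone(registry),
    }
}

impl Drop for Subscription {
    fn drop(&mut self) {
        // 1. Take the write lock first, so the count read and the map
        //    mutation happen as one serialized step.
        let mut keys = self.registry.subscribed_keys.write().unwrap_or_else(|e| e.into_inner());
        // 2. Our receiver is still alive: map copy + ours == 2 means we are last.
        let is_last = keys
            .get(&self.key)
            .map_or(false, |publisher| Arc::strong_count(publisher) == 2);
        if is_last {
            keys.remove(&self.key);
        }
        // 3. Only now release our receiver, still under the lock.
        self.receiver.take();
    }
}
```

Since every change to the count happens while the write lock is held, two concurrent drops cannot both observe "not last", and a fresh `subscribe` on the same key cannot slip in between the count check and the removal.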

Brings in upstream changes since the last sync, including:
  - feb6a15 Bound CAS leader-wait + per-blob batch deadline (TraceMachina#2298)
  - 43ab01d Add expiry to completed redis actions (TraceMachina#2315)
  - 6cdcf8e fix RBE CI for hermetic LLVM (TraceMachina#2314)
  - f5846df Migrate to hermetic llvm (TraceMachina#2312)

Conflicts resolved (all kept the local superset where the local change
extended an upstream one):
  - nativelink-store/src/fast_slow_store.rs: kept HEAD's
    `huge_blob_dedup_bypasses` / `fast_store_stale_map_falls_through`
    metrics and the `DEFAULT_BYPASS_DEDUP_THRESHOLD_BYTES` const
    alongside upstream's `LEADER_WAIT_TIMEOUT` / `leader_wait_timeouts`
  - nativelink-store/src/filesystem_store.rs: kept HEAD's path_type=Temp
    bookkeeping inside the ENOENT branch on top of upstream's
    debug-demote of the rename failure
  - nativelink-store/src/redis_store.rs: kept HEAD's 4s PING_TIMEOUT
    and richer doc comment
  - nativelink-store/tests/{fast_slow_store_test,redis_store_test}.rs:
    concatenated both branches' independent test additions; merged
    `use` lists; updated `test_search_by_index_skips_int_from_cursor_read`
    to expect the local FT.AGGREGATE TIMEOUT clause; added
    `bypass_dedup_threshold_bytes: 0` to upstream's new FastSlowSpec
    literal so it satisfies the field added locally.

All 30 test binaries across nativelink-store and nativelink-util pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@amankrx
Collaborator Author

amankrx commented May 9, 2026

/build-image nativelink-worker-init

@amankrx
Collaborator Author

amankrx commented May 9, 2026

/build-image

@github-actions

github-actions Bot commented May 9, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:b3d5473

@github-actions

github-actions Bot commented May 9, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:b3d5473


Labels

None yet

Projects

None yet

1 participant