Skip to content

release-26.2: mmaprototype: fix nil-pointer panic when StorePool sees stores MMA does not#170939

Merged
trunk-io[bot] merged 4 commits into
cockroachdb:release-26.2from
tbg:blathers/backport-release-26.2-170796
May 27, 2026
Merged

release-26.2: mmaprototype: fix nil-pointer panic when StorePool sees stores MMA does not#170939
trunk-io[bot] merged 4 commits into
cockroachdb:release-26.2from
tbg:blathers/backport-release-26.2-170796

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented May 26, 2026

Backport 4/4 commits from #170796 on behalf of @tbg.


Fixes #170703.

The MMA prototype crashed with a nil-pointer dereference in computeMeansForStoreSet whenever the legacy allocator handed it a candidate slice (or an existing storeID) that MMA's clusterState had not yet seen. This is routine during startup: StorePool and MMA's clusterState are populated by separate gossip-driven paths, so the two views diverge until MMA's SetStore callbacks catch up. The asymmetry is already acknowledged for updateStoreStatuses, which logs and skips unknown stores; BuildMMARebalanceAdvisor was missing the same guard, so it segfaulted instead.

The fix filters unknown stores at the right architectural layer — the integration boundary in BuildMMARebalanceAdvisor — rather than papering over the symptom inside the internal helper. The internal computeMeansForStoreSet precondition (every storeID must be known to the load provider) is also now documented and defended by an assertTruef that panics in test builds and logs+skips in production, so future internal misuse fails loudly without re-introducing the production segfault.

This issue became more common after MMA was enabled by default in v26.3 (#169411).

Epic: none

Release note: None


Release justification: bug fix for default-off multi-metric allocator

@tbg tbg requested review from a team as code owners May 26, 2026 17:33
@blathers-crl blathers-crl Bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels May 26, 2026
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 26, 2026

😎 Merged successfully - details.

@blathers-crl blathers-crl Bot requested a review from pav-kv May 26, 2026 17:33
@blathers-crl
Copy link
Copy Markdown

blathers-crl Bot commented May 26, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl Bot added backport Label PR's that are backports to older release branches T-kv KV Team labels May 26, 2026
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

tbg added 4 commits May 27, 2026 11:15
A subsequent commit needs to log and assert from inside
(*allocatorState).BuildMMARebalanceAdvisor with proper call-site log
tags (e.g. mmaid). Today the function does not take a ctx, so the
existing fallback assertion has to use context.Background(), which
loses those tags.

Add ctx as the first parameter of (*allocatorState).BuildMMARebalanceAdvisor
and the corresponding interface declarations on the mmaprototype.Allocator
and mmaintegration.mmaState interfaces. Plumb the ctx that AllocatorSync
already has on hand into the underlying call. Replace the
context.Background() literal in the existing fallback assertion.

No behavior change.

Release note: None
Add a test that pins the current, broken behavior described in cockroachdb#170703:
(*allocatorState).BuildMMARebalanceAdvisor panics with a nil pointer
dereference if any storeID it is given (either `existing` or anything
in `cands`) is not yet known to MMA's clusterState.

This happens during startup, when the legacy allocator builds candidate
lists from StorePool's gossip-driven view before MMA has been notified
of the corresponding stores via SetStore — see the call path in cockroachdb#170703
from bestRebalanceTarget through AllocatorSync.BuildMMARebalanceAdvisor
into computeMeansForStoreSet, which dereferences the nil *storeLoad
returned by clusterState.getStoreReportedLoad for unknown stores.

The assertions use require.Panics so the test passes on this commit.
The next commit fixes the panic and flips them to require.NotPanics
plus a behavior check, so the red/green pair is visible in a single
diff.

Informs cockroachdb#170703.

Release note: None
Fix the nil-pointer panic from cockroachdb#170703 at the right architectural
layer: the integration boundary where MMA's clusterState meets the
legacy allocator's StorePool view.

The two views are kept in sync independently — StorePool from gossip
callbacks, MMA's clusterState from explicit SetStore / ProcessStoreLoadMsg
calls. During startup (and any time gossip races ahead of MMA), the
candidate slice handed to BuildMMARebalanceAdvisor can include storeIDs
MMA has never heard of. computeMeansForStoreSet then nil-derefs the
*storeLoad returned by getStoreReportedLoad for those unknown stores.

Filter at the entry point instead:

  - If `existing` is unknown to MMA, return NoopMMARebalanceAdvisor and
    log at V(2). MMA has no load history for the source store, so it
    cannot judge whether candidates are more overloaded than existing.
  - Drop unknown storeIDs from `cands` before computing means.

This matches the asymmetry already acknowledged in updateStoreStatuses
(cluster_state.go), which logs and skips unknown stores rather than
panicking.

In addition, tighten the contract of the internal helper
computeMeansForStoreSet: its precondition that every storeID be known
to the loadProvider is now documented, and a defensive assertTruef
inside the loop catches future violations — panicking in test builds
so the bug surfaces immediately, logging and skipping in production
so we never reintroduce the segfault. A divide-by-zero guard handles
the (now unreachable from BuildMMARebalanceAdvisor) case where every
store was filtered.

Flip the red test added in the previous commit to assert the new
behavior: unknown cand → silently dropped, unknown existing →
disabled (no-op) advisor.

Fixes cockroachdb#170703.

Release note: None
Replace the copy-on-write loop in BuildMMARebalanceAdvisor with
slices.IndexFunc + slices.DeleteFunc, per review on cockroachdb#170796. The
previous loop was hard to read: control flow after `cands = filtered`
was unclear, the `+1` capacity hint was opaque, and mutating a slice
mid-`range` is an uncommon pattern.

The new shape preserves the no-allocation property in the steady state
(IndexFunc walks once, returns -1). In the filtered path it makes one
copy and lets DeleteFunc compact; DeleteFunc leaves cap > len, so the
subsequent append(cands, existing) reuses the residual capacity
without a re-alloc and the +1 hint is no longer needed.

Add cluster_state.notHasStore as a one-line negation of hasStore so
the slices helpers can take a method value instead of an inline
closure.

Release note: None
@tbg tbg force-pushed the blathers/backport-release-26.2-170796 branch from ee8a43e to 808a893 Compare May 27, 2026 09:15
@tbg
Copy link
Copy Markdown
Member Author

tbg commented May 27, 2026

@trunk-io trunk-io Bot merged commit 11377f8 into cockroachdb:release-26.2 May 27, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. T-kv KV Team target-release-26.2.2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants