DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717
Draft
DLWoodruff wants to merge 3 commits into
Draft
DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717DLWoodruff wants to merge 3 commits into
DLWoodruff wants to merge 3 commits into
Conversation
A spurious shutdown is firing on every xhatter rank despite no rank
having written 1.0 to the SHUTDOWN buffer; replacing
Allreduce(LOR) with Allreduce(SUM) returns a stable-by-pattern
nonzero (~69), and the input local_val has been verified zero on
the xhatter ranks themselves. Four hypotheses remain:
(i) self.mpicomm has wider membership than the xhatter cylinder
(ii) buffer memory underneath local_val is not 0 when MPI reads it
(iii) the Allreduce reducer path is malfunctioning
(iv) duplicate rank participation in self.mpicomm
This patch packs four diagnostic axes into a single Allreduce call:
1. an Allgather of (world_rk, cyl_rk, local_int) - shows exactly
which world ranks participate and what each one contributed
2. parallel SUM, MAX, LOR reductions - MAX distinguishes "many
small contributions" from "few large ones," LOR confirms
observed call-site behavior
3. a rank-sum sanity reduction (each rank contributes its
mpicomm rank), expected to equal n*(n-1)/2; mismatch flags
a corrupt SUM reducer
4. a comparison between the Allgather-summed values and the
Allreduce(SUM) result; divergence isolates the bug to the
reducer path
Output is printed on cyl_rk == 0 with the call counter, class
name, host, pid, comm name, and world-rank min/max/count/unique;
on any anomaly it also lists nonzero rows and the full participant
list. One Allreduce call now does 4 reductions + 1 Allgather; cost
is dominated by run-launch overhead in the target environment, so
the extra collectives are acceptable.
Revert before merging to main.
Reading the output (greps):
# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head
# 2. What does each cylinder think its comm size is?
# (one printer per cylinder; size should be that cylinder's rank count)
# If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u
# 3. World-rank membership of each cylinder.
# count==unique is required; unique<count means duplicate participation (iv).
# The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log
# 4. Sanity check that SUM works at all on this comm.
# rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
# SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head
# 5. The main reduce result vs. the gather-truth.
# sum == gather_sum is required. Divergence pins the bug to the
# reducer path (gather honest, reduce wrong).
# max>1 means at least one rank contributed >1 (not boolean) -
# hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log
# 6. Which ranks actually contributed nonzero (only printed on anomaly).
# Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log
# 7. Full participant list (only printed on anomaly).
# Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #717 +/- ##
=======================================
Coverage 71.44% 71.44%
=======================================
Files 154 154
Lines 19463 19509 +46
=======================================
+ Hits 13905 13939 +34
- Misses 5558 5570 +12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Initial trigger fired on any nonzero reduce result, which caught legitimate shutdown signals (sum=lor=1, gather_sum=1, all consistent) and would flood logs on every cylinder finalization. Real-bug signature is invariant-violating, not just nonzero: - rank_sum != expected_rank_sum -> SUM broken on this comm - sum != gather_sum -> reducer disagrees with gather - unique != count -> duplicate world ranks - max > 1 -> non-boolean input on some rank Verified: 0 false positives across ~440k allreduce_or calls in a sizes 3-scen 3-rank xhatshuffle+lagrangian run (sizes_cylinders.py --num-scens 3 --xhatshuffle --lagrangian).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Do not merge. Diagnostic patch for a spurious-shutdown bug under investigation. Revert before merging anything else from this branch.
What this does
Replaces
SPBase.allreduce_orwith an instrumented variant that, on every call, does four collectives instead of one and dumps a one-block diagnostic to stdout fromcyl_rk == 0of the cylinder'smpicomm.Originally the function was:
Now each call also does:
(world_rk, cyl_rk, local_int)— shows exactly which world ranks participate and what each one packed.mpicommrank; expected sum isn*(n-1)/2. Mismatch flags a corrupt SUM reducer.gather_sum(sum of locally-reported values from the Allgather) vsreduce_sum(the Allreduce SUM). Divergence isolates the bug to the reducer path.Hypotheses being tested (in priority order)
self.mpicommhas wider membership than the cylinder it should — i.e., includes ranks from another cylinder. Detected ifmpicomm sizeexceeds the cylinder's rank count, or ifworld_ranksincludes ranks outside the cylinder's slice.local_valis not 0 when MPI reads it. Detected ifgather_sum > 0(some rank reports nonzero) ANDmax > 0(the rank's value was nonzero from MPI's perspective).gather_sum == 0butreduce_sum != 0, OR ifrank_sum != expected_rank_sum.self.mpicomm. Detected ifworld_ranks: unique < count.Reading the output
Operator-friendly greps:
Cost / overhead
One Allreduce call now does 4 reductions + 1 Allgather. In the target environment the run-launch overhead dominates the per-call cost, so this is acceptable. Output is only printed on
cyl_rk == 0(one line block per cylinder per call), so log volume is bounded.Followups
🤖 Generated with Claude Code