DEBUG: instrument SPBase.allreduce_or to localize LOR_bug by DLWoodruff · Pull Request #717 · Pyomo/mpi-sppy

DLWoodruff · 2026-05-20T23:06:06Z

Do not merge. Diagnostic patch for a spurious-shutdown bug under investigation. Revert before merging anything else from this branch.

What this does

Replaces SPBase.allreduce_or with an instrumented variant that, on every call, runs five collectives (four reductions plus one Allgather) instead of one and prints a one-block diagnostic to stdout from the cyl_rk == 0 rank of the cylinder's mpicomm.

Originally the function was:

def allreduce_or(self, val):
    local_val = np.array([val], dtype='int8')
    global_val = np.zeros(1, dtype='int8')
    self.mpicomm.Allreduce(local_val, global_val, op=MPI.LOR)
    return bool(global_val[0])

Now each call also does:

Allgather of (world_rk, cyl_rk, local_int) — shows exactly which world ranks participate and what each one packed.
SUM, MAX, LOR reductions in parallel — MAX distinguishes "many small" from "few large"; LOR confirms call-site behavior.
Rank-sum sanity — each rank contributes its mpicomm rank; expected sum is n*(n-1)/2. Mismatch flags a corrupt SUM reducer.
Cross-check: gather_sum (sum of locally-reported values from the Allgather) vs reduce_sum (the Allreduce SUM). Divergence isolates the bug to the reducer path.
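The checks above can be illustrated without MPI by treating each list entry as one rank's contribution. This is a minimal sketch, not the patch itself: `diagnose` and its dictionary keys are hypothetical names, plain Python `sum`/`max`/`any` stand in for the Allreduce SUM/MAX/LOR results, and cyl_rk is assumed to be the rank within mpicomm.

```python
def diagnose(rows, reduce_sum=None):
    """Compute the diagnostic quantities for one simulated allreduce_or call.

    rows: list of (world_rk, cyl_rk, local_int), one tuple per rank, shaped
          like the result of the Allgather described above.
    reduce_sum: value reported by Allreduce(SUM); defaults to the gathered
          sum, which models a healthy reducer.
    """
    n = len(rows)
    values = [local_int for _, _, local_int in rows]
    gather_sum = sum(values)
    if reduce_sum is None:
        reduce_sum = gather_sum
    return {
        "size": n,
        "gather_sum": gather_sum,          # truth as reported by each rank
        "sum": reduce_sum,                 # what the SUM reducer claims
        "max": max(values),
        "lor": any(v != 0 for v in values),
        "rank_sum": sum(cyl_rk for _, cyl_rk, _ in rows),
        "expected_rank_sum": n * (n - 1) // 2,
    }


# Three ranks, all contributing 0: every invariant holds.
healthy = diagnose([(3, 0, 0), (4, 1, 0), (5, 2, 0)])

# Same inputs, but the reducer reports a nonzero sum (illustrative value).
suspect = diagnose([(3, 0, 0), (4, 1, 0), (5, 2, 0)], reduce_sum=69)
```

In `suspect`, `gather_sum` is 0 while `sum` is not, which is the divergence the cross-check is meant to expose.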

Hypotheses being tested (in priority order)

(i) self.mpicomm has wider membership than the cylinder it should, i.e., includes ranks from another cylinder. Detected if mpicomm size exceeds the cylinder's rank count, or if world_ranks includes ranks outside the cylinder's slice.
(ii) Buffer memory underneath local_val is not 0 when MPI reads it. Detected if gather_sum > 0 (some rank reports nonzero) AND max > 0 (the rank's value was nonzero from MPI's perspective).
(iii) The Allreduce reducer path is malfunctioning. Detected if gather_sum == 0 but reduce_sum != 0, OR if rank_sum != expected_rank_sum.
(iv) Duplicate rank participation in self.mpicomm. Detected if world_ranks: unique < count.
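The two membership hypotheses (wider-than-expected membership and duplicate participation) reduce to set arithmetic on the gathered world ranks. A rough sketch under stated assumptions: `check_membership` is a hypothetical helper, and the caller is assumed to know which slice of world ranks the cylinder should occupy.

```python
def check_membership(world_ranks, expected_slice):
    """Compare the gathered world ranks against the cylinder's expected slice.

    world_ranks: world rank reported by every member of the comm (list).
    expected_slice: iterable of world ranks that should form the cylinder.
    """
    expected = set(expected_slice)
    unique = set(world_ranks)
    return {
        "count": len(world_ranks),
        "unique": len(unique),
        # Ranks that belong to some other cylinder: wider membership.
        "outside": sorted(unique - expected),
        # count - unique > 0 means a rank participated more than once.
        "duplicates": len(world_ranks) - len(unique),
        # Comm size exceeds the cylinder's rank count.
        "oversized": len(world_ranks) > len(expected),
    }


ok = check_membership([3, 4, 5], range(3, 6))
wide = check_membership([3, 4, 5, 6], range(3, 6))
dup = check_membership([3, 4, 4], range(3, 6))
```

Here `wide["outside"]` lists the foreign rank and `dup["unique"] < dup["count"]` mirrors the duplicate-participation condition.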

Reading the output

Operator-friendly greps:

# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head

# 2. What does each cylinder think its comm size is?
#    (one printer per cylinder; size should be that cylinder's rank count)
#    If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u

# 3. World-rank membership of each cylinder.
#    count==unique is required; unique<count means duplicate participation (iv).
#    The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log

# 4. Sanity check that SUM works at all on this comm.
#    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
#    SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head

# 5. The main reduce result vs. the gather-truth.
#    sum == gather_sum is required. Divergence pins the bug to the
#    reducer path (gather honest, reduce wrong).
#    max>1 means at least one rank contributed >1 (not boolean) -
#    hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log

# 6. Which ranks actually contributed nonzero (only printed on anomaly).
#    Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log

# 7. Full participant list (only printed on anomaly).
#    Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log
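For logs too large to eyeball, the same seven patterns can be tallied in one pass. A minimal Python sketch that assumes nothing about the log format beyond the substrings the greps above already search for; `PATTERNS`, its keys, and `summarize` are hypothetical names, not part of the patch.

```python
import re

# One regex per grep above, in the same order.
PATTERNS = {
    "any_output": r"^\[LOR_bug",
    "comm_size": r"mpicomm size=",
    "membership": r"world_ranks:",
    "rank_sum": r"rank_sum=",
    "reduce_vs_gather": r"reductions:|gather:",
    "nonzero_rows": r"nonzero: world_rk",
    "full_list": r"ALL world_ranks:",
}


def summarize(lines):
    """Count how many lines match each pattern (a line may match several)."""
    compiled = {name: re.compile(pat) for name, pat in PATTERNS.items()}
    counts = dict.fromkeys(PATTERNS, 0)
    for line in lines:
        for name, rx in compiled.items():
            if rx.search(line):
                counts[name] += 1
    return counts
```

Passing an open file works, since file objects iterate over lines: `summarize(open("out.log"))`. As with grep, 'world_ranks:' also matches the 'ALL world_ranks:' lines, so a nonzero `nonzero_rows` or `full_list` count is the quickest sign that an anomaly block was printed.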

Cost / overhead

Each allreduce_or call now performs 4 reductions + 1 Allgather. In the target environment, run-launch overhead dominates the per-call cost, so this is acceptable. Output is printed only on cyl_rk == 0 (one block of lines per cylinder per call), so log volume grows with the number of cylinders rather than the number of ranks.

Followups

Once the hypothesis is pinned, fix in a separate PR.
Revert this file change before any non-diagnostic merge to main.

🤖 Generated with Claude Code

A spurious shutdown is firing on every xhatter rank despite no rank having written 1.0 to the SHUTDOWN buffer; replacing Allreduce(LOR) with Allreduce(SUM) returns a stable-by-pattern nonzero (~69), and the input local_val has been verified zero on the xhatter ranks themselves.

Four hypotheses remain:

(i) self.mpicomm has wider membership than the xhatter cylinder
(ii) buffer memory underneath local_val is not 0 when MPI reads it
(iii) the Allreduce reducer path is malfunctioning
(iv) duplicate rank participation in self.mpicomm

This patch packs four diagnostic axes into a single Allreduce call:

1. an Allgather of (world_rk, cyl_rk, local_int) - shows exactly which world ranks participate and what each one contributed
2. parallel SUM, MAX, LOR reductions - MAX distinguishes "many small contributions" from "few large ones," LOR confirms observed call-site behavior
3. a rank-sum sanity reduction (each rank contributes its mpicomm rank), expected to equal n*(n-1)/2; mismatch flags a corrupt SUM reducer
4. a comparison between the Allgather-summed values and the Allreduce(SUM) result; divergence isolates the bug to the reducer path

Output is printed on cyl_rk == 0 with the call counter, class name, host, pid, comm name, and world-rank min/max/count/unique; on any anomaly it also lists nonzero rows and the full participant list.

One Allreduce call now does 4 reductions + 1 Allgather; cost is dominated by run-launch overhead in the target environment, so the extra collectives are acceptable. Revert before merging to main.

Reading the output (greps):

# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head

# 2. What does each cylinder think its comm size is?
#    (one printer per cylinder; size should be that cylinder's rank count)
#    If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u

# 3. World-rank membership of each cylinder.
#    count==unique is required; unique<count means duplicate participation (iv).
#    The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log

# 4. Sanity check that SUM works at all on this comm.
#    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
#    SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head

# 5. The main reduce result vs. the gather-truth.
#    sum == gather_sum is required. Divergence pins the bug to the
#    reducer path (gather honest, reduce wrong).
#    max>1 means at least one rank contributed >1 (not boolean) -
#    hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log

# 6. Which ranks actually contributed nonzero (only printed on anomaly).
#    Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log

# 7. Full participant list (only printed on anomaly).
#    Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-20T23:16:22Z

Codecov Report

❌ Patch coverage is 84.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.44%. Comparing base (df879ae) to head (3d38b25).

Files with missing lines	Patch %	Lines
mpisppy/spbase.py	84.00%	8 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #717   +/-   ##
=======================================
  Coverage   71.44%   71.44%           
=======================================
  Files         154      154           
  Lines       19463    19509   +46     
=======================================
+ Hits        13905    13939   +34     
- Misses       5558     5570   +12

Initial trigger fired on any nonzero reduce result, which caught legitimate shutdown signals (sum=lor=1, gather_sum=1, all consistent) and would flood logs on every cylinder finalization. Real-bug signature is invariant-violating, not just nonzero:

- rank_sum != expected_rank_sum -> SUM broken on this comm
- sum != gather_sum -> reducer disagrees with gather
- unique != count -> duplicate world ranks
- max > 1 -> non-boolean input on some rank

Verified: 0 false positives across ~440k allreduce_or calls in a sizes 3-scen 3-rank xhatshuffle+lagrangian run (sizes_cylinders.py --num-scens 3 --xhatshuffle --lagrangian).
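The refined trigger described in this commit message can be sketched as a pure function over the diagnostic values. This is an illustration under assumptions, not the code in the patch: `invariant_violations` is a hypothetical name, and the input dictionaries use made-up but self-consistent numbers for a 3-rank comm.

```python
def invariant_violations(d):
    """Return the list of violated invariants; an empty list means no anomaly.

    d: dict with keys rank_sum, expected_rank_sum, sum, gather_sum,
       unique, count, max.
    """
    checks = [
        (d["rank_sum"] != d["expected_rank_sum"], "SUM broken on this comm"),
        (d["sum"] != d["gather_sum"], "reducer disagrees with gather"),
        (d["unique"] != d["count"], "duplicate world ranks"),
        (d["max"] > 1, "non-boolean input on some rank"),
    ]
    return [msg for failed, msg in checks if failed]


# A legitimate shutdown: nonzero result, but every invariant holds.
legit_shutdown = dict(rank_sum=3, expected_rank_sum=3, sum=1, gather_sum=1,
                      unique=3, count=3, max=1)

# Illustrative spurious case: all ranks report 0, the reducer does not.
spurious = dict(legit_shutdown, sum=69, gather_sum=0, max=0)
```

`invariant_violations(legit_shutdown)` is empty, so a trigger built on it stays silent on ordinary finalization, while `spurious` reports the reducer/gather disagreement.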

DLWoodruff added 2 commits May 20, 2026 16:21

Merge remote-tracking branch 'upstream/main' into LOR_bug

9475da1

DLWoodruff wants to merge 3 commits into Pyomo:main from DLWoodruff:LOR_bug
