Skip to content

DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717

Draft
DLWoodruff wants to merge 3 commits into
Pyomo:mainfrom
DLWoodruff:LOR_bug
Draft

DEBUG: instrument SPBase.allreduce_or to localize LOR_bug#717
DLWoodruff wants to merge 3 commits into
Pyomo:mainfrom
DLWoodruff:LOR_bug

Conversation

@DLWoodruff
Copy link
Copy Markdown
Collaborator

Do not merge. Diagnostic patch for a spurious-shutdown bug under investigation. Revert before merging anything else from this branch.

What this does

Replaces SPBase.allreduce_or with an instrumented variant that, on every call, does four collectives instead of one and dumps a one-block diagnostic to stdout from cyl_rk == 0 of the cylinder's mpicomm.

Originally the function was:

def allreduce_or(self, val):
    local_val = np.array([val], dtype='int8')
    global_val = np.zeros(1, dtype='int8')
    self.mpicomm.Allreduce(local_val, global_val, op=MPI.LOR)
    return bool(global_val[0])

Now each call also does:

  1. Allgather of (world_rk, cyl_rk, local_int) — shows exactly which world ranks participate and what each one packed.
  2. SUM, MAX, LOR reductions in parallel — MAX distinguishes "many small" from "few large"; LOR confirms call-site behavior.
  3. Rank-sum sanity — each rank contributes its mpicomm rank; expected sum is n*(n-1)/2. Mismatch flags a corrupt SUM reducer.
  4. Cross-check: gather_sum (sum of locally-reported values from the Allgather) vs reduce_sum (the Allreduce SUM). Divergence isolates the bug to the reducer path.

Hypotheses being tested (in priority order)

  1. self.mpicomm has wider membership than the cylinder it should — i.e., includes ranks from another cylinder. Detected if mpicomm size exceeds the cylinder's rank count, or if world_ranks includes ranks outside the cylinder's slice.
  2. Buffer memory underneath local_val is not 0 when MPI reads it. Detected if gather_sum > 0 (some rank reports nonzero) AND max > 0 (the rank's value was nonzero from MPI's perspective).
  3. The Allreduce reducer path is malfunctioning. Detected if gather_sum == 0 but reduce_sum != 0, OR if rank_sum != expected_rank_sum.
  4. Duplicate rank participation in self.mpicomm. Detected if world_ranks: unique < count.

Reading the output

Operator-friendly greps:

# 1. Did anything print at all?
grep '^\[LOR_bug' out.log | head

# 2. What does each cylinder think its comm size is?
#    (one printer per cylinder; size should be that cylinder's rank count)
#    If you see size=150 here, hypothesis (i) is the bug.
grep 'mpicomm size=' out.log | sort -u

# 3. World-rank membership of each cylinder.
#    count==unique is required; unique<count means duplicate participation (iv).
#    The range [min..max] tells you which slice of WORLD is in this comm.
grep 'world_ranks:' out.log

# 4. Sanity check that SUM works at all on this comm.
#    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
#    SUM reducer itself is broken on this comm - hypothesis (iii).
grep 'rank_sum=' out.log | head

# 5. The main reduce result vs. the gather-truth.
#    sum == gather_sum is required. Divergence pins the bug to the
#    reducer path (gather honest, reduce wrong).
#    max>1 means at least one rank contributed >1 (not boolean) -
#    hypothesis (ii) on that rank specifically.
grep -E 'reductions:|gather:' out.log

# 6. Which ranks actually contributed nonzero (only printed on anomaly).
#    Tells you world_rk + cyl_rk of every guilty contributor.
grep 'nonzero: world_rk' out.log

# 7. Full participant list (only printed on anomaly).
#    Use this if comm membership is suspicious.
grep 'ALL world_ranks:' out.log

Cost / overhead

One Allreduce call now does 4 reductions + 1 Allgather. In the target environment the run-launch overhead dominates the per-call cost, so this is acceptable. Output is only printed on cyl_rk == 0 (one line block per cylinder per call), so log volume is bounded.

Followups

  • Once the hypothesis is pinned, fix in a separate PR.
  • Revert this file change before any non-diagnostic merge to main.

🤖 Generated with Claude Code

A spurious shutdown is firing on every xhatter rank despite no rank
having written 1.0 to the SHUTDOWN buffer; replacing
Allreduce(LOR) with Allreduce(SUM) returns a stable-by-pattern
nonzero (~69), and the input local_val has been verified zero on
the xhatter ranks themselves. Four hypotheses remain:

  (i)   self.mpicomm has wider membership than the xhatter cylinder
  (ii)  buffer memory underneath local_val is not 0 when MPI reads it
  (iii) the Allreduce reducer path is malfunctioning
  (iv)  duplicate rank participation in self.mpicomm

This patch packs four diagnostic axes into a single Allreduce call:

  1. an Allgather of (world_rk, cyl_rk, local_int) - shows exactly
     which world ranks participate and what each one contributed
  2. parallel SUM, MAX, LOR reductions - MAX distinguishes "many
     small contributions" from "few large ones," LOR confirms
     observed call-site behavior
  3. a rank-sum sanity reduction (each rank contributes its
     mpicomm rank), expected to equal n*(n-1)/2; mismatch flags
     a corrupt SUM reducer
  4. a comparison between the Allgather-summed values and the
     Allreduce(SUM) result; divergence isolates the bug to the
     reducer path

Output is printed on cyl_rk == 0 with the call counter, class
name, host, pid, comm name, and world-rank min/max/count/unique;
on any anomaly it also lists nonzero rows and the full participant
list. One Allreduce call now does 4 reductions + 1 Allgather; cost
is dominated by run-launch overhead in the target environment, so
the extra collectives are acceptable.

Revert before merging to main.

Reading the output (greps):

  # 1. Did anything print at all?
  grep '^\[LOR_bug' out.log | head

  # 2. What does each cylinder think its comm size is?
  #    (one printer per cylinder; size should be that cylinder's rank count)
  #    If you see size=150 here, hypothesis (i) is the bug.
  grep 'mpicomm size=' out.log | sort -u

  # 3. World-rank membership of each cylinder.
  #    count==unique is required; unique<count means duplicate participation (iv).
  #    The range [min..max] tells you which slice of WORLD is in this comm.
  grep 'world_ranks:' out.log

  # 4. Sanity check that SUM works at all on this comm.
  #    rank_sum should equal expected_rank_sum (n*(n-1)/2). If not, the
  #    SUM reducer itself is broken on this comm - hypothesis (iii).
  grep 'rank_sum=' out.log | head

  # 5. The main reduce result vs. the gather-truth.
  #    sum == gather_sum is required. Divergence pins the bug to the
  #    reducer path (gather honest, reduce wrong).
  #    max>1 means at least one rank contributed >1 (not boolean) -
  #    hypothesis (ii) on that rank specifically.
  grep -E 'reductions:|gather:' out.log

  # 6. Which ranks actually contributed nonzero (only printed on anomaly).
  #    Tells you world_rk + cyl_rk of every guilty contributor.
  grep 'nonzero: world_rk' out.log

  # 7. Full participant list (only printed on anomaly).
  #    Use this if comm membership is suspicious.
  grep 'ALL world_ranks:' out.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 84.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.44%. Comparing base (df879ae) to head (3d38b25).

Files with missing lines Patch % Lines
mpisppy/spbase.py 84.00% 8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #717   +/-   ##
=======================================
  Coverage   71.44%   71.44%           
=======================================
  Files         154      154           
  Lines       19463    19509   +46     
=======================================
+ Hits        13905    13939   +34     
- Misses       5558     5570   +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Initial trigger fired on any nonzero reduce result, which caught
legitimate shutdown signals (sum=lor=1, gather_sum=1, all
consistent) and would flood logs on every cylinder finalization.
Real-bug signature is invariant-violating, not just nonzero:

  - rank_sum != expected_rank_sum  -> SUM broken on this comm
  - sum != gather_sum              -> reducer disagrees with gather
  - unique != count                -> duplicate world ranks
  - max > 1                        -> non-boolean input on some rank

Verified: 0 false positives across ~440k allreduce_or calls in a
sizes 3-scen 3-rank xhatshuffle+lagrangian run (sizes_cylinders.py
--num-scens 3 --xhatshuffle --lagrangian).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant