Skip to content

Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings#33

Merged
Lightheartdevs merged 4 commits into
mainfrom
add-minimax-m2.7-nvfp4-microbench-2026-05-31
Jun 1, 2026
Merged

Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings#33
Lightheartdevs merged 4 commits into
mainfrom
add-minimax-m2.7-nvfp4-microbench-2026-05-31

Conversation

@Lightheartdevs

@Lightheartdevs Lightheartdevs commented May 31, 2026

Copy link
Copy Markdown
Contributor

Adds MiniMax-M2.7-NVFP4 (230B-A10B MoE, vLLM tensor-parallel TP=2) as a fifth model in the 397B vs Step-3.7-Flash microbench entry, full 12-family MMBT microbench at N=5 (60 cells).

Re-cherry-picked cleanly onto main (the original commit landed on a branch after PR #32 merged, so it was never in an open PR). Now audit-corrected.

Two findings

1. A temperature serving-trap, not a capability gap. At the bench default temp=0.3, MiniMax-NVFP4 runs the agent loop then degenerates into a final-turn repetition loop to max_tokens14/19 coding cells ran away (p1_testwrite 10/10). At the card's temp=1.0/top_p=0.95/top_k=40, the same cells/cap are 0/10 runaway; all 60 cells 58 done_signal, 0 runaway. Clean same-cell A/B → sampling, not the model or the cap (identically-capped 397B: 0 runaways in 260 cells).

2. An "exhaustive completer" temperament. Aggregate 7/12 (35/60) — ties the band, complementary to 397B: p2 analysis 20/20, but scope-constrained coding 0/5 (edits files it was told to leave alone) and p3_market ctx-exhaustion. Self-inflicted over-delivery, the opposite of 397B's surgical restraint.

GPU power — claim corrected (audit fix)

The continuous power telemetry for this run was not reliably captured (logger output missing/untagged), so the earlier-cited TP=2 figures (~896W median / 1089W peak / 64% / "crosses 1000W") are withdrawn as unverifiable — the only surviving data (per-cell nvidia-smi snapshots, N=60: combined median 612W, 0/60 over 1000W) contradicts them and itself undersamples decode peaks. Replaced with the architectural claim (TP=2 loads both GPUs simultaneously vs pipeline alternation; snapshots confirm balanced ~313W/~300W per-GPU draw) plus an explicit data-loss caveat. Registered as held in claims.yaml.

Registration (repo contract)

  • manifest.jsonminimax_m2.7 block (model/engine/serving/sampling/run-inventory) + results + caveats
  • 3 claims.yaml entries (hw.minimax.temp-serving-trap, .exhaustive-completer, .tp2-power.held)
  • tooling (BENCH_TEMP/TOP_P/TOP_K env, top_p/top_k in request, reasoning_content retention)

Caveats (in-doc)

  • Sampling deviation: MiniMax alone runs temp=1.0 (off the temp=0.3 cohort) — justified, documented per-run.
  • N=5 vs comparators' N=10/N=1; 131072 ctx vs 262144.
  • receipt.json records temperature but not top_p/top_k (full profile in doc + manifest).

🤖 Generated with Claude Code

User Name and others added 4 commits May 31, 2026 19:19
…mpleter findings

- harness.py: add --top-p/--top-k plumbing + retain reasoning_content across turns
  (interleaved-thinking models); run_microbench.sh: BENCH_TEMP/TOP_P/TOP_K env overrides
  (default temp=0.3 unchanged for other models).
- Finding 1: MiniMax-NVFP4 runs away (repetition loop -> model_exceeded_max_tokens) at the
  bench default temp=0.3 (74% of coding cells; testwrite 10/10); at the model-card
  temp=1.0/top_p=0.95/top_k=40 the SAME cells on the SAME 131072 cap are 0/10 runaway,
  58/60 done_signal. Clean A/B => sampling, not capability, not cap.
- Finding 2: exhaustive-completer temperament. Aggregate 7/12 (35/60). p2 analysis 20/20;
  scope-constrained coding 0/5 (testwrite+refactor edit files told to leave alone);
  p3_market ctx-exhaustion. Complementary to 397B's surgical restraint.
- Power: TP=2 heaviest simultaneous draw (median ~896W active, peak 1089W=91% cap, both
  GPUs >400W 64% of samples) vs 397B pipeline 670W/985W never >1000W.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (audit fixes)

Audit findings addressed:
- GPU power: the cited TP=2 figures (~896W median / 1089W peak / 64% /
  "crosses 1000W") are UNVERIFIABLE — continuous power telemetry for this run
  was not captured (logger output missing/untagged), and the only surviving data
  (per-cell nvidia-smi snapshots, N=60: combined median 612W, 0/60 over 1000W)
  contradicts them while itself undersampling decode peaks. Withdrawn across
  findings-minimax-m2.7.md, findings.md, and README; replaced with the
  architectural claim + an explicit data-loss caveat. Registered as held
  (hw.minimax.tp2-power.held).
- token-range floor corrected (58k-106k, not 71k-106k; testwrite vs bugfix).
- 397B runaway denominator corrected (260 cells, was 240).

Registration (repo contract): manifest.json gains a minimax_m2.7 block
(model/engine/serving/sampling/run-inventory) + results entries + caveats;
claims.yaml gains hw.minimax.temp-serving-trap, .exhaustive-completer, and
.tp2-power.held.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…redo

The new qwen3.6-27b-fp8 entry shows FP8 serving is stable (113/119 clean),
so the 'Q8/FP8 both serving failures' framing was superseded. Narrow to
Q8-serving-specific and point forward to the FP8 entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Lightheartdevs Lightheartdevs merged commit b939a61 into main Jun 1, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant