Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings by Lightheartdevs · Pull Request #33 · Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Lightheartdevs · 2026-05-31T23:20:17Z

Adds MiniMax-M2.7-NVFP4 (230B-A10B MoE, vLLM tensor-parallel TP=2) as a fifth model in the 397B vs Step-3.7-Flash microbench entry, full 12-family MMBT microbench at N=5 (60 cells).

Re-cherry-picked cleanly onto main (the original commit landed on a branch after PR #32 merged, so it was never in an open PR). Now audit-corrected.

Two findings

1. A temperature serving-trap, not a capability gap. At the bench default temp=0.3, MiniMax-NVFP4 runs the agent loop then degenerates into a final-turn repetition loop to max_tokens — 14/19 coding cells ran away (p1_testwrite 10/10). At the card's temp=1.0/top_p=0.95/top_k=40, the same cells/cap are 0/10 runaway; all 60 cells 58 done_signal, 0 runaway. Clean same-cell A/B → sampling, not the model or the cap (identically-capped 397B: 0 runaways in 260 cells).

2. An "exhaustive completer" temperament. Aggregate 7/12 (35/60) — ties the band, complementary to 397B: p2 analysis 20/20, but scope-constrained coding 0/5 (edits files it was told to leave alone) and p3_market ctx-exhaustion. Self-inflicted over-delivery, the opposite of 397B's surgical restraint.

GPU power — claim corrected (audit fix)

The continuous power telemetry for this run was not reliably captured (logger output missing/untagged), so the earlier-cited TP=2 figures (~896W median / 1089W peak / 64% / "crosses 1000W") are withdrawn as unverifiable — the only surviving data (per-cell nvidia-smi snapshots, N=60: combined median 612W, 0/60 over 1000W) contradicts them and itself undersamples decode peaks. Replaced with the architectural claim (TP=2 loads both GPUs simultaneously vs pipeline alternation; snapshots confirm balanced ~313W/~300W per-GPU draw) plus an explicit data-loss caveat. Registered as held in claims.yaml.

Registration (repo contract)

manifest.json — minimax_m2.7 block (model/engine/serving/sampling/run-inventory) + results + caveats
3 claims.yaml entries (hw.minimax.temp-serving-trap, .exhaustive-completer, .tp2-power.held)
tooling (BENCH_TEMP/TOP_P/TOP_K env, top_p/top_k in request, reasoning_content retention)

Caveats (in-doc)

Sampling deviation: MiniMax alone runs temp=1.0 (off the temp=0.3 cohort) — justified, documented per-run.
N=5 vs comparators' N=10/N=1; 131072 ctx vs 262144.
receipt.json records temperature but not top_p/top_k (full profile in doc + manifest).

🤖 Generated with Claude Code

…mpleter findings - harness.py: add --top-p/--top-k plumbing + retain reasoning_content across turns (interleaved-thinking models); run_microbench.sh: BENCH_TEMP/TOP_P/TOP_K env overrides (default temp=0.3 unchanged for other models). - Finding 1: MiniMax-NVFP4 runs away (repetition loop -> model_exceeded_max_tokens) at the bench default temp=0.3 (74% of coding cells; testwrite 10/10); at the model-card temp=1.0/top_p=0.95/top_k=40 the SAME cells on the SAME 131072 cap are 0/10 runaway, 58/60 done_signal. Clean A/B => sampling, not capability, not cap. - Finding 2: exhaustive-completer temperament. Aggregate 7/12 (35/60). p2 analysis 20/20; scope-constrained coding 0/5 (testwrite+refactor edit files told to leave alone); p3_market ctx-exhaustion. Complementary to 397B's surgical restraint. - Power: TP=2 heaviest simultaneous draw (median ~896W active, peak 1089W=91% cap, both GPUs >400W 64% of samples) vs 397B pipeline 670W/985W never >1000W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s (audit fixes) Audit findings addressed: - GPU power: the cited TP=2 figures (~896W median / 1089W peak / 64% / "crosses 1000W") are UNVERIFIABLE — continuous power telemetry for this run was not captured (logger output missing/untagged), and the only surviving data (per-cell nvidia-smi snapshots, N=60: combined median 612W, 0/60 over 1000W) contradicts them while itself undersampling decode peaks. Withdrawn across findings-minimax-m2.7.md, findings.md, and README; replaced with the architectural claim + an explicit data-loss caveat. Registered as held (hw.minimax.tp2-power.held). - token-range floor corrected (58k-106k, not 71k-106k; testwrite vs bugfix). - 397B runaway denominator corrected (260 cells, was 240). Registration (repo contract): manifest.json gains a minimax_m2.7 block (model/engine/serving/sampling/run-inventory) + results entries + caveats; claims.yaml gains hw.minimax.temp-serving-trap, .exhaustive-completer, and .tp2-power.held. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…redo The new qwen3.6-27b-fp8 entry shows FP8 serving is stable (113/119 clean), so the 'Q8/FP8 both serving failures' framing was superseded. Narrow to Q8-serving-specific and point forward to the FP8 entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

User Name and others added 4 commits May 31, 2026 19:19

397B README: forward-link the FP8 redo + microbench-index taxonomy note

c84cf75

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lightheartdevs mentioned this pull request Jun 1, 2026

docs: microbench cross-tree index + synthesis-doc currency (org fixes) #37

Merged

Lightheartdevs merged commit b939a61 into main Jun 1, 2026
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings#33

Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings#33
Lightheartdevs merged 4 commits into
mainfrom
add-minimax-m2.7-nvfp4-microbench-2026-05-31

Lightheartdevs commented May 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Lightheartdevs commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Two findings

GPU power — claim corrected (audit fix)

Registration (repo contract)

Caveats (in-doc)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Lightheartdevs commented May 31, 2026 •

edited

Loading