Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings#33
Merged
Lightheartdevs merged 4 commits intoJun 1, 2026
Conversation
…mpleter findings - harness.py: add --top-p/--top-k plumbing + retain reasoning_content across turns (interleaved-thinking models); run_microbench.sh: BENCH_TEMP/TOP_P/TOP_K env overrides (default temp=0.3 unchanged for other models). - Finding 1: MiniMax-NVFP4 runs away (repetition loop -> model_exceeded_max_tokens) at the bench default temp=0.3 (74% of coding cells; testwrite 10/10); at the model-card temp=1.0/top_p=0.95/top_k=40 the SAME cells on the SAME 131072 cap are 0/10 runaway, 58/60 done_signal. Clean A/B => sampling, not capability, not cap. - Finding 2: exhaustive-completer temperament. Aggregate 7/12 (35/60). p2 analysis 20/20; scope-constrained coding 0/5 (testwrite+refactor edit files told to leave alone); p3_market ctx-exhaustion. Complementary to 397B's surgical restraint. - Power: TP=2 heaviest simultaneous draw (median ~896W active, peak 1089W=91% cap, both GPUs >400W 64% of samples) vs 397B pipeline 670W/985W never >1000W. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s (audit fixes) Audit findings addressed: - GPU power: the cited TP=2 figures (~896W median / 1089W peak / 64% / "crosses 1000W") are UNVERIFIABLE — continuous power telemetry for this run was not captured (logger output missing/untagged), and the only surviving data (per-cell nvidia-smi snapshots, N=60: combined median 612W, 0/60 over 1000W) contradicts them while itself undersampling decode peaks. Withdrawn across findings-minimax-m2.7.md, findings.md, and README; replaced with the architectural claim + an explicit data-loss caveat. Registered as held (hw.minimax.tp2-power.held). - token-range floor corrected (58k-106k, not 71k-106k; testwrite vs bugfix). - 397B runaway denominator corrected (260 cells, was 240). Registration (repo contract): manifest.json gains a minimax_m2.7 block (model/engine/serving/sampling/run-inventory) + results entries + caveats; claims.yaml gains hw.minimax.temp-serving-trap, .exhaustive-completer, and .tp2-power.held. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…redo The new qwen3.6-27b-fp8 entry shows FP8 serving is stable (113/119 clean), so the 'Q8/FP8 both serving failures' framing was superseded. Narrow to Q8-serving-specific and point forward to the FP8 entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds MiniMax-M2.7-NVFP4 (230B-A10B MoE, vLLM tensor-parallel TP=2) as a fifth model in the 397B vs Step-3.7-Flash microbench entry, full 12-family MMBT microbench at N=5 (60 cells).
Two findings
1. A temperature serving-trap, not a capability gap. At the bench default
temp=0.3, MiniMax-NVFP4 runs the agent loop then degenerates into a final-turn repetition loop tomax_tokens— 14/19 coding cells ran away (p1_testwrite 10/10). At the card'stemp=1.0/top_p=0.95/top_k=40, the same cells/cap are 0/10 runaway; all 60 cells 58 done_signal, 0 runaway. Clean same-cell A/B → sampling, not the model or the cap (identically-capped 397B: 0 runaways in 260 cells).2. An "exhaustive completer" temperament. Aggregate 7/12 (35/60) — ties the band, complementary to 397B: p2 analysis 20/20, but scope-constrained coding 0/5 (edits files it was told to leave alone) and p3_market ctx-exhaustion. Self-inflicted over-delivery, the opposite of 397B's surgical restraint.
GPU power — claim corrected (audit fix)
The continuous power telemetry for this run was not reliably captured (logger output missing/untagged), so the earlier-cited TP=2 figures (~896W median / 1089W peak / 64% / "crosses 1000W") are withdrawn as unverifiable — the only surviving data (per-cell
nvidia-smisnapshots, N=60: combined median 612W, 0/60 over 1000W) contradicts them and itself undersamples decode peaks. Replaced with the architectural claim (TP=2 loads both GPUs simultaneously vs pipeline alternation; snapshots confirm balanced ~313W/~300W per-GPU draw) plus an explicit data-loss caveat. Registered asheldin claims.yaml.Registration (repo contract)
manifest.json—minimax_m2.7block (model/engine/serving/sampling/run-inventory) + results + caveatsclaims.yamlentries (hw.minimax.temp-serving-trap,.exhaustive-completer,.tp2-power.held)BENCH_TEMP/TOP_P/TOP_Kenv, top_p/top_k in request, reasoning_content retention)Caveats (in-doc)
🤖 Generated with Claude Code