Skip to content

Commit b939a61

Browse files
Merge PR #33: MiniMax-M2.7-NVFP4 microbench (N=5, TP=2)
Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings
2 parents fd27f37 + c84cf75 commit b939a61

7 files changed

Lines changed: 302 additions & 12 deletions

File tree

claims.yaml

Lines changed: 48 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@
1919
# entry once it's been published.
2020

2121
schema_version: 0.1
22-
last_updated: "2026-05-28"
22+
last_updated: "2026-05-31"
2323

2424
claims:
2525

@@ -337,6 +337,53 @@ claims:
337337
caveats:
338338
- "Pairwise quality study found one 27B CI-fix grader regression that the binary grader missed."
339339

340+
# ─── MiniMax-M2.7-NVFP4 (hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings-minimax-m2.7.md) ───
341+
342+
- id: hw.minimax.temp-serving-trap
343+
text: >
344+
MiniMax-M2.7-NVFP4 on vLLM TP=2 degenerates at the bench's default
345+
temperature=0.3: it runs the agent loop correctly then enters a final-turn
346+
repetition loop to max_tokens (14/19 coding cells ran away, p1_testwrite
347+
10/10). At the model card's temperature=1.0/top_p=0.95/top_k=40 the same
348+
cells on the same 131072 cap produce 0/10 runaways; all 60 cells finish
349+
58 done_signal / 0 runaway. The runaway is a sampling trap, not the model
350+
and not the token cap (identically-capped 397B: 0 runaways in 260 cells).
351+
status: provisional
352+
scope: MiniMax-M2.7-NVFP4, vLLM TP=2, sm_120, clean same-cell A/B (temp 0.3 vs 1.0)
353+
evidence: "hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings-minimax-m2.7.md § The serving trap"
354+
caveats:
355+
- "Single rig; receipt.json records temperature but not top_p/top_k (full sampling profile in the doc + manifest)."
356+
- "NVFP4 on sm_120 is a maturity-rough surface; --enable-expert-parallel not used."
357+
358+
- id: hw.minimax.exhaustive-completer
359+
text: >
360+
MiniMax-M2.7 is an "exhaustive completer": it aces open analysis (p2
361+
extract/ci/hallucination/triage a perfect 20/20) but fails scope-constrained
362+
coding (p1_testwrite 0/5, p1_refactor 0/5 on *_unchanged scope violations —
363+
it edits files it was told to leave alone) and exhausts context on p3_market
364+
(2 cells HTTP-400). Aggregate 35/60 (7/12) ties the field's band, but the
365+
per-family shape is complementary to 397B's surgical restraint, not a tie.
366+
status: provisional
367+
scope: MiniMax-M2.7-NVFP4, vLLM TP=2, N=5, temp=1.0 (card sampling)
368+
evidence: "hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings-minimax-m2.7.md § Scorecard + Qualitative"
369+
caveats:
370+
- "N=5 vs comparators' N=10/N=1; treat aggregate as +/-1 family."
371+
- "p1_testwrite/refactor share the bench-wide starter-code task-design caveat, but MiniMax's failures are substantive scope violations, not the grader artifact."
372+
373+
- id: hw.minimax.tp2-power.held
374+
text: >
375+
The TP=2-vs-pipeline simultaneous-power-draw comparison for MiniMax-M2.7 is
376+
held. Continuous per-sample power telemetry for the MiniMax run was not
377+
reliably captured; only instantaneous per-cell nvidia-smi snapshots survive
378+
(N=60: combined median 612W, balanced ~313W/~300W per GPU, 0/60 over 1000W),
379+
and those undersample active-decode peaks. Earlier-cited figures
380+
(~896W median / 1089W peak / 64% / "crosses 1000W") are withdrawn as
381+
unverifiable. The architectural expectation (TP computes each layer on both
382+
GPUs simultaneously, vs pipeline alternation) stands; a quantified peak does not.
383+
status: held
384+
scope: MiniMax-M2.7-NVFP4, vLLM TP=2, power telemetry
385+
evidence: "hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/findings-minimax-m2.7.md § GPU power"
386+
340387
# ─── Retracted ────────────────────────────────────────────────────────────────
341388

342389
retracted:

hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/README.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@ Step-3.7-Flash-NVFP4 entry on the same box.
66

77
**N=10** — ten replicates per cell, both arms (240 cells, all `done_signal`; phase-1 graded with the fixed `phase1_grade.py`).
88

9+
> **Where this lives:** this is a **model-behavior** study (the 12-family agentic microbench), physically filed under `hardware-tests/` only because it needed the dual-Blackwell rig. For every 12-family microbench across both trees, see [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).
10+
911
## TL;DR
1012

1113
*This entry is methodological, not "which model won." The two results that survive scrutiny lead; the "scale ties" observation is real but the most caveated, so it's demoted.*
@@ -15,12 +17,16 @@ Step-3.7-Flash-NVFP4 entry on the same box.
1517
- **③ Aggregate ties ~7–8/12 across 397B / Flash / 27B-Q4 / Coder-Q4 — but read as *suggestive*.** Two confounds keep this from being a scaling law: **cross-quant** (397B at Q3 vs ~11B-active at FP4 — not a clean scale axis) and **N-asymmetry** (only 397B is N=10; comparators are N=1, which this very entry proves misreads cells).
1618
- **Failure temperament tracks lineage, not size:** 397B + 27B *stall* (never over-generate); Coder-Next + Flash *run away*. Zero max_tokens runaways across all 240 397B cells.
1719
- ⚠️ 27B/Coder **phase-1 reference cells are quarantined** pending [issue #29] (same grader bug this entry fixed); their p2/p3 cells are unaffected and used in the cross-model comparison.
18-
- **Cross-model uses clean Q4/AWQ refs** for 27B/Coder; fresh Q8/FP8 runs excluded as serving failures (documented, not faked).
20+
- **Cross-model uses clean Q4/AWQ refs** for 27B/Coder; the *first* fresh Q8/FP8 attempts were excluded as serving failures (documented, not faked). **Update (2026-05-31):** a clean **FP8** redo of 27B since succeeded — full entry at [`../qwen3.6-27b-fp8-microbench-2026-05-31/`](../qwen3.6-27b-fp8-microbench-2026-05-31/); the failure was **Q8-serving-specific**, not 27B-on-this-rig.
1921
- **GPU power:** combined both-GPU draw never within 5% of the 1200W cap (median 670W, max 985W=82%); GPU0 leads GPU1 — pipeline alternation. The pair never hits full power together.
2022
- The substance is qualitative — **read [QUALITATIVE.md](QUALITATIVE.md).**
2123

2224
## Files
2325
- [findings.md](findings.md) — N=10 scorecard + headline findings + power + cross-model qualitative.
26+
- [findings-minimax-m2.7.md](findings-minimax-m2.7.md)**MiniMax-M2.7-NVFP4 (N=5, vLLM TP=2):** the
27+
temp=0.3→1.0 serving-trap (0 runaways at spec vs 74% at default), the "exhaustive completer" temperament
28+
(p2 analysis 20/20, scope-constrained coding 0/5, p3_market ctx-exhaustion), and the TP=2 power topology
29+
(balanced both-GPU draw; continuous-sample peak not captured this run — see the doc's data caveat).
2430
- [findings-n10.md](findings-n10.md) — auto-generated replicate-stability table (flags small-N flips) + finish-reason audit.
2531
- [power-analysis.md](power-analysis.md) — dual-GPU power percentiles, pipeline asymmetry, %-of-cap.
2632
- [QUALITATIVE.md](QUALITATIVE.md) — behavioral analysis beyond pass/fail (token economy, packaging,
Lines changed: 143 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,143 @@
1+
# MiniMax-M2.7-NVFP4 on 2× RTX PRO 6000 Blackwell — microbench N=5 (+ the temp serving-trap)
2+
3+
MiniMax-M2.7 (230B-A10B MoE), served as **NVFP4 on vLLM tensor-parallel (TP=2)**, run through the MMBT
4+
12-family agentic microbench. Added to the [397B vs Step-3.7-Flash entry](findings.md) as a fifth model.
5+
6+
**N=5** per cell (60 cells). Comparators in the main entry are N=10 (397B) / N=1 (Step, 27B-Q4, Coder-Q4) —
7+
so MiniMax is **N=5, an asymmetry to read with the same caution this entry already documents for small N.**
8+
9+
## TL;DR — two findings, the first is the bigger one
10+
11+
1. **A serving trap, not a capability gap (the headline).** Run at the bench's cross-model default
12+
**`temperature=0.3`**, MiniMax-NVFP4 looked *broken* on coding: it ran the agent loop correctly for
13+
~29 iterations, then on the **final text turn entered a repetition loop that generated tens of
14+
thousands of tokens to the `max_tokens` cap** (`finish_reason: model_exceeded_max_tokens`). Across the
15+
two coding families the temp=0.3 run reached before it was stopped: **14/19 cells ran away (74%)**
16+
`p1_bugfix` 4/9, `p1_testwrite` **10/10**. Re-run at MiniMax's **model-card sampling
17+
(`temperature=1.0, top_p=0.95, top_k=40`)**, the *same cells on the same 131072 cap* produce **0/10
18+
runaways** — clean `done_signal`. Across all **60** N=5 cells: **58 `done_signal`, 0 runaways**, 2
19+
context-exhaustion. The clean A/B on the identical cell/cap proves the runaway was **sampling**, not
20+
the model and not a stingy token cap. *Greedy-ish decode on a reasoning model is a documented
21+
repetition-loop trap; MiniMax's card mandates temp=1.0.*
22+
23+
2. **MiniMax is an "exhaustive completer" — a genuinely double-edged temperament.** Its instinct is to do
24+
the maximum: fix every bug it sees, write a full test suite + audit/ADR/CHANGELOG docs, research
25+
exhaustively. This **dominates open analysis** and **sinks scope-constrained edits** (see scorecard).
26+
It is the mirror image of 397B's surgical restraint.
27+
28+
## The serving trap — before/after on the same cells
29+
30+
| family | temp=0.3 (broken) | temp=1.0 (spec) |
31+
|---|---|---|
32+
| p1_bugfix | 5 `done_signal` / **4 runaway** (44%) | **5 `done_signal` / 0 runaway** |
33+
| p1_testwrite | 0 / **10 runaway (100%)** | **5 `done_signal` / 0 runaway** |
34+
| all 60 cells (spec) || **58 `done_signal`, 0 runaway, 2 ctx-exhaustion** |
35+
36+
A temp=0.3-runaway cell generated **58k–106k tokens in a single final turn** (the testwrite cells cluster
37+
at 71k–106k; two bugfix cells dip to ~59k/75k). That is degenerate repetition, not legitimate output —
38+
and the identically-capped 397B (also 131072, also temp 0.3-equivalent path) had **0 runaways in 260
39+
cells**, so the cap is not the cause.
40+
41+
## Scorecard (N=5; PASS = grader verdict, majority ≥3/5 marks the family ✓)
42+
43+
| family | PASS/5 | avg iters | dominant fail-reason |
44+
|---|---|---|---|
45+
| p2_extract | **5/5**| 11 ||
46+
| p2_ci | **5/5**| 41 ||
47+
| p2_hallucination | **5/5**| 15 ||
48+
| p2_triage | **5/5**| 13 ||
49+
| p3_business | **5/5**| 18 ||
50+
| p3_doc | **4/5**| 19 | 1× word-limit |
51+
| p3_pm | **3/5**| 14 ||
52+
| p1_bugfix | 2/5 | 131 | `ruff_no_regression` ×3 (lint nits in its own new tests) |
53+
| p3_writing | 1/5 | 30 | length/quality |
54+
| p1_testwrite | 0/5 | 86 | `logalyzer_unchanged` ×5 (**scope violation**) |
55+
| p1_refactor | 0/5 | 75 | `tests_unchanged` ×5, `non_output_files_unchanged` ×2 (**scope violation**) |
56+
| p3_market | 0/5 | 51 |**ctx-exhaustion (HTTP 400)** + 3 fail |
57+
58+
**Aggregate: 7/12 families pass majority; 35/60 cells (58%).** This lands in the same **~7–8/12 band** as
59+
397B (8/12 no-think), Step-3.7-Flash, 27B-Q4 and Coder-Q4 — MiniMax **aggregate-ties the field, but the
60+
per-family *shape* is distinctive** (and that shape, not the tie, is the finding).
61+
62+
## Qualitative — the "exhaustive completer" pattern (every claim cited to graded cells)
63+
64+
- **Open analysis = pure upside. p2 is a perfect 20/20** (extract/ci/hallucination/triage all 5/5), fast
65+
(~11–41 iters). When there is no scope or length constraint, thoroughness only helps. Best-in-class here.
66+
- **Scope-constrained coding = real liability, not a grader artifact. `p1_testwrite` 0/5 and `p1_refactor`
67+
0/5**, failing on **substantive** criteria: the test-writing task says *"add tests, leave the production
68+
code unchanged"* → MiniMax rewrites the production code anyway (`logalyzer_unchanged` fails 5/5); the
69+
refactor task says *"only touch the output package"* → it modifies tests and out-of-scope files
70+
(`tests_unchanged` 5/5, `non_output_files_unchanged` 2/5). It **cannot resist improving everything**,
71+
which is exactly what these guardrails forbid. This is the **opposite of 397B**, whose surgical restraint
72+
*passes* `p1_refactor`.
73+
- **`p1_bugfix` is the in-between case (2/5):** it fixes the planted bugs correctly (the O(n²) `load()`
74+
measured 11.9s→0.57–0.63s by the grader; the `collections.Iterable` import removed; 69–82 tests pass)
75+
but trips `ruff_no_regression` (2→3, 2→5) on **unused-import nits in its own newly-written tests**
76+
a lint technicality. *Here* the binary score understates it; on testwrite/refactor it does not.
77+
- **Open-ended sinks exhaust context. `p3_market` 0/5**, with **2 cells hitting the 131072 ceiling
78+
outright (HTTP 400)** — it keeps issuing research tool-calls until the conversation won't fit. The p3
79+
analog of over-delivery.
80+
- **Structured synthesis is fine**`p3_business` 5/5, `p3_doc` 4/5, `p3_pm` 3/5. Where the deliverable
81+
is well-bounded, the thoroughness lands.
82+
- **Failure texture is its own category:** not 397B's quiet *stall* (omission), not Coder/Flash's
83+
*runaway* (over-generation cutoff) — MiniMax's misses are **self-inflicted: scope violations and context
84+
exhaustion from doing too much.** And it is **expensive** — 75–131 iters/cell on p1 (the high end of the
85+
field) vs ~11–18 on the analysis cells it aces.
86+
87+
**One-line verdict:** complementary to 397B. MiniMax aces analysis where 397B is average; 397B respects
88+
guardrails where MiniMax bulldozes them. A task-class strengths finding, not a scaling-law tie.
89+
90+
## GPU power — TP=2 loads both GPUs in balance (no quantified peak this run)
91+
92+
> **Data caveat (important):** continuous power sampling for the MiniMax run was **not reliably captured**
93+
> — the per-sample logger output for this run is missing/untagged, so the active-decode median/peak
94+
> percentiles are **not available**. An earlier draft of this doc cited specific figures
95+
> (~896W median / 1089W peak / 64%-of-samples / "crosses 1000W"); those are **unverifiable against any
96+
> committed data and are withdrawn.** The only surviving power data is the per-cell instantaneous
97+
> `receipt.json` `nvidia-smi` snapshot (one sample per cell), which is **not** a decode-peak measurement.
98+
99+
What the **60 receipt snapshots** show (instantaneous, N=60 cells): combined draw **median 612W, max 703W**,
100+
balanced per-GPU (**GPU0 ~313W / GPU1 ~300W median**), and **0/60 above 1000W**. These confirm TP=2's
101+
*balance* — both GPUs draw nearly equally, the signature of tensor-parallel splitting each layer across
102+
both cards — but because the snapshots are single instantaneous samples (likely caught outside sustained
103+
decode), they **undersample peak draw and cannot be compared** to 397B's continuously-sampled pipeline
104+
figures (median 670W / max 985W over 3,868 samples).
105+
106+
**The defensible claim is architectural, not a measured peak:** tensor-parallel makes both GPUs compute
107+
each layer *simultaneously* (so they fire together), whereas pipeline-parallel *alternates* them (GPU0
108+
leads, combined draw staggered). MiniMax's balanced per-GPU snapshots are consistent with that; a
109+
quantified simultaneous-peak comparison to 397B would need a re-run with the per-sample logger fixed.
110+
111+
## Caveats (read these with the numbers)
112+
113+
- **Sampling deviation:** MiniMax ran at its **card-specified `temp=1.0/top_p=0.95/top_k=40`**, *not* the
114+
bench's cross-model `temp=0.3`. This is a deliberate, documented per-model deviation — at temp=0.3 the
115+
model is off-spec and degenerates (finding #1), so a temp=0.3 score would be meaningless. The receipt
116+
schema records `temperature` but **not** `top_p`/`top_k`; the full profile is recorded here and in the
117+
launch command.
118+
- **N=5 vs comparators' N=10/N=1** — directional. The per-family rates above (esp. the 0/5 and 5/5 ones)
119+
are stable enough to characterize, but treat the aggregate as ±1 family.
120+
- **131072 context** (vs Step/27B/Coder at 262144) — a real cap asymmetry, footnoted; it did not cause the
121+
runaways (those were sampling) but it is where `p3_market` exhausted context. MiniMax's native max is ~196k.
122+
- **NVFP4 on Blackwell SM120** is a maturity-rough surface (vLLM has open repetition/kernel issues for
123+
MiniMax-NVFP4); we did **not** use `--enable-expert-parallel` (the named trigger in vLLM #31856).
124+
- This build emits **no separate `<think>`/`reasoning_content`** despite MiniMax-M2 being an
125+
interleaved-thinking model — so the interleaved-thinking history requirement was moot here (the harness
126+
retains `reasoning_content` defensively regardless).
127+
128+
## Reproduce
129+
```bash
130+
# Serve (vLLM TP=2, NVFP4). minimax_m2 parsers; no expert-parallel.
131+
docker run -d --name vllm-minimax --gpus all --shm-size 16g -e NCCL_P2P_DISABLE=1 \
132+
-v $HOME/models:/models:ro -p 127.0.0.1:8001:8000 vllm/vllm-openai:latest \
133+
--model /models/nvidia-MiniMax-M2.7-NVFP4 --served-model-name minimax-m2.7 \
134+
--tensor-parallel-size 2 --gpu-memory-utilization 0.92 --quantization modelopt \
135+
--trust-remote-code --max-model-len 131072 --disable-custom-all-reduce \
136+
--tool-call-parser minimax_m2 --reasoning-parser minimax_m2 --enable-auto-tool-choice \
137+
--host 0.0.0.0 --port 8000
138+
139+
# Run at the model card's sampling (BENCH_* env overrides the default temp=0.3):
140+
BENCH_TEMP=1.0 BENCH_TOP_P=0.95 BENCH_TOP_K=40 \
141+
bash tooling/scripts/run_microbench.sh minimax-m2.7 8001 minimax-m2.7-spec 5 "" "" 131072
142+
bash tooling/scripts/grade_microbench.sh minimax-m2.7-spec
143+
```

0 commit comments

Comments
 (0)