Skip to content

Bench autopilot tooling + 397B N=10 + GPU power analysis + unified cross-model#31

Merged
Lightheartdevs merged 8 commits into
mainfrom
mmbt-bench-autopilot-2026-05-30
May 30, 2026
Merged

Bench autopilot tooling + 397B N=10 + GPU power analysis + unified cross-model#31
Lightheartdevs merged 8 commits into
mainfrom
mmbt-bench-autopilot-2026-05-30

Conversation

@Lightheartdevs
Copy link
Copy Markdown
Contributor

Builds on the merged N=3 entry (#30). Adds a self-healing bench harness, extends 397B to N=10, adds GPU power analysis, and unifies the cross-model comparison on clean data.

Tooling (autonomous bench harness + /mmbt-bench skill)

  • bench_autopilot.py — self-healing supervisor (endpoint restart, stuck-cell kill, idempotent resume, model presets, --publish, Pushover, heartbeat).
  • bench_dashboard.py — color grid + --oneline/--watch/--html/--json/--trend/--flips/sparkline.
  • bench_report.py — auto-generates the findings doc (scorecard + replicate-stability + redistribution + finish-reason audit).
  • bench_power.py + gpu_power_logger.shNEW dual-GPU telemetry logger + analyzer.
  • skill_draft/SKILL.md → installed as /mmbt-bench.

397B-A17B N=10 (240 cells, all done_signal)

  • no-think 82/120, think 72/120 — same 7–8/12 band as Step-3.7-Flash, 27B-Q4, Coder-Next-Q4. Scale doesn't move the aggregate (now at N=10).
  • Thinking net −10: helps p3_market (8→10) but craters p3_doc (9→2) and p3_pm (5→0). Per-task call, not a default.
  • N=10 overturns small-N luck: p3_market no-think flips 1/3 (N=3) → 8/10 (auto-flagged in the stability table). The headline methodological result.
  • Failure temperament tracks lineage, not size: 397B + 27B stall; Coder-Next + Flash run away.

Cross-model (Q4 refs, honest)

27B/Coder columns use the clean Q4/AWQ runs (microbench-2026-04-28). Fresh Q8/FP8 attempts excluded as serving failures (35B: 36/36 HTTP-400; 27B-Q8: 23/36 runaway) — documented, not faked.

GPU power — "are both GPUs ever near full power?" No (3,868 paired samples)

  • Combined both-GPU: median 670W (56% of 1200W cap), max 985W (82%), 0 samples within 5% of full.
  • GPU0 leads GPU1 (+27W) = pipeline-parallel alternation; decode 339W/GPU vs CPU-tool phases 125W/GPU.
  • The "both at 600W" worst case doesn't occur for single-stream agentic inference under pipeline parallelism. (MiniMax-M2.7 TP=2 contrast pending in a follow-up commit.)

🤖 Generated with Claude Code

User Name and others added 7 commits May 30, 2026 04:20
…run harness)

Self-healing supervisor (bench_autopilot.py) that drives the MMBT microbench to a
target N: keeps the llama.cpp endpoint alive, runs both reasoning arms idempotently,
kills truly-hung cells, grades + summarizes, emits live status.json, Pushover alerts,
per-cell timing/tok-s, resume-safety heartbeat. Visual dashboard (bench_dashboard.py):
color grid tasks x reps per arm, progress/ETA/health, --oneline/--watch/--html,
pass-rate vs N=10 baseline, sparkline, event feed. SKILL.md wraps it as /mmbt-bench.
(.v1.py keep the pre-enrichment versions.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m 82)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…trend/--flips (dashboard); bench_report.py findings generator

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h_html.py), refresh skill

bench_report.py: reconciliation guard (autopilot counts ungraded cells in denom,
report excludes; converge at clean COMPLETE). Verified scorecard/finish-reason/
redistribution match hand-recounts + published findings (0 ref mismatches) — see
HARDENING-NOTES.md. bench_html.py: auto-refresh dark/mobile dashboard. SKILL refreshed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
397B-A17B microbench extended to N=10 both arms (240 cells, all done_signal):
no-think 82/120, think 72/120 — same 7-8/12 band as Step-3.7-Flash, 27B-Q4,
Coder-Next-Q4 (scale doesn't move the aggregate, confirmed at N=10). Thinking is
net -10: helps p3_market (8->10) but craters p3_doc (9->2) and p3_pm (5->0). N=10
overturns small-N luck — p3_market no-think flips 1/3 (N=3) -> 8/10 (auto-flagged
in findings-n10.md stability table).

Cross-model: 27B/Coder columns use the clean Q4/AWQ runs (microbench-2026-04-28);
fresh Q8/FP8 attempts excluded as serving failures (35B 36/36 HTTP-400, 27B-Q8
23/36 runaway). Failure temperament tracks lineage not size: 397B+27B stall,
Coder+Flash run away.

Power (gpu_power_logger.sh + new bench_power.py, 3868 paired samples): combined
both-GPU median 670W / max 985W = 82% of 1200W cap, 0 samples within 5% of full;
GPU0 leads GPU1 (+27W) = pipeline-parallel alternation; decode 339W/GPU vs
CPU-tool phases 125W/GPU. The pair never approaches full power together.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR audit caught the entry was internally inconsistent: findings.md was N=10 (82/120,
72/120) but README/QUALITATIVE/manifest still showed N=3 (23/36, 22/36, "net -1").
Updated all three to the N=10 numbers + the sharpened thinking finding (net -10:
market 8->10, doc 9->2, pm 5->0), added the GPU power + cross-model + Q4-refs notes,
and listed the new findings-n10.md / power-analysis.md files. Scorecard re-verified
against grade.json (82/120, 72/120 exact). No content claims changed — only stale
counts corrected to match the audited ground truth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…utter in a public entry)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Lightheartdevs
Copy link
Copy Markdown
Contributor Author

Self-audit (against grade.json ground truth)

Audited the PR end-to-end. Findings:

  • Scorecard accurate — recounted all 240 cells from grade.json: no-think 82/120, think 72/120, every per-task cell matches findings.md exactly.
  • Power numbers reproducebench_power.py re-run matches power-analysis.md (max 985W=82% of cap, 0 samples within 5% of full; tiny mean drift is just more samples since commit).
  • 🔴 Fixed staleness (the one real issue): README, QUALITATIVE.md, and manifest.json still carried N=3 numbers (23/36, 22/36, "net −1") while findings.md was N=10 — the entry was internally inconsistent. All three updated to N=10 (567ed15), with the sharpened finding (thinking net −10: market 8→10, doc 9→2, pm 5→0), plus the power + Q4-cross-model notes and the new file listings. No claims changed — only stale counts corrected.
  • Tooling — all 5 scripts compile; reproduce commands match real flags. Removed the .v1.py pre-enrichment backups (26d8ce3) — git history preserves them; clutter in a public entry.
  • No leaks — no run-logs, no broken-Q8 cells, no agent-pilot data in the diff.

Verdict: accurate and complete for everything through 397B N=10 + power + Q4 cross-model. MiniMax-M2.7 is deliberately not in this PR — it smoke-passed but the N=10 run 400-errored on every agentic cell (multi-turn serving-format issue); being debugged separately, will land as a follow-up so this PR isn't blocked.

…7B + Coder-Next)

The N=1 ref columns undersold data we already have: benchmarks/microbench-phase-b-
2026-05-02 ran Qwen3.6-27B-AWQ at N=10 in BOTH thinking and no-think arms (+ Coder-Next
N=10) on the P3 differential cells — a structural match to this entry's 397B study.

Headline now spans the size range: thinking is net-negative for BOTH the 397B giant
(72 vs 82) AND the 27B (75% vs 86.8% ship), via the SAME p3_doc word-limit loop
(397B 9->2 PASS; 27B-thinking wall_killed ~40%). Added a per-cell P3 ship-rate table
(Coder/27B-think/27B-nothink/397B-nothink/397B-think, all N=10) and confirmed failure
temperament clusters by lineage not size: Qwen-derived (397B,27B) stall; Coder-Next +
Flash run away. P1 cross-harness caveat noted (phase-b 27B-think P1 used older sha).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Lightheartdevs Lightheartdevs merged commit 4b628a4 into main May 30, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant