Bench autopilot tooling + 397B N=10 + GPU power analysis + unified cross-model #31
Merged
…run harness)
Self-healing supervisor (bench_autopilot.py) that drives the MMBT microbench to a target N: keeps the llama.cpp endpoint alive, runs both reasoning arms idempotently, kills truly hung cells, grades + summarizes, and emits live status.json, Pushover alerts, per-cell timing/tok-s, and a resume-safety heartbeat.
Visual dashboard (bench_dashboard.py): color grid of tasks x reps per arm, progress/ETA/health, --oneline/--watch/--html, pass rate vs the N=10 baseline, sparkline, event feed.
SKILL.md wraps it as /mmbt-bench. (The .v1.py files keep the pre-enrichment versions.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
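The idempotent-resume behaviour described in this commit can be pictured with a minimal sketch. It is not taken from bench_autopilot.py: the cell tuple layout, the `pending_cells` helper, and the task ids are hypothetical, and the only figure borrowed from the PR is the 240-cell total (2 arms x 12 tasks x N=10).

```python
from itertools import product

def pending_cells(tasks, arms, target_n, done):
    """Return (arm, task, rep) cells that still need to run.

    `done` is a set of cells already completed, so calling this again
    after a crash or restart never schedules a finished cell twice.
    """
    all_cells = product(arms, tasks, range(1, target_n + 1))
    return [cell for cell in all_cells if cell not in done]

# Hypothetical task ids; the real harness uses its own task names.
tasks = [f"task_{i:02d}" for i in range(12)]
arms = ["no-think", "think"]

# Pretend two cells finished before a restart.
done = {("no-think", "task_00", 1), ("think", "task_03", 2)}
todo = pending_cells(tasks, arms, 10, done)
print(len(todo))
```

Keeping the plan a pure function of the target N and the completed set is one simple way to make a restart safe: the supervisor can recompute it at any time without extra bookkeeping.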
…m 82) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…trend/--flips (dashboard); bench_report.py findings generator Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h_html.py), refresh skill
bench_report.py: reconciliation guard (autopilot counts ungraded cells in the denominator, the report excludes them; the two converge at a clean COMPLETE). Verified that scorecard/finish-reason/redistribution match hand-recounts + published findings (0 ref mismatches); see HARDENING-NOTES.md.
bench_html.py: auto-refresh dark/mobile dashboard. SKILL refreshed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
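A rough sketch of what such a reconciliation guard might look like follows. The record shape (a dict with a `grade` key that is `None` until graded) and all function names are assumptions for illustration, not the actual bench_report.py code.

```python
def autopilot_denominator(cells):
    # Assumption: the autopilot counts every finished cell, graded or not.
    return len(cells)

def report_denominator(cells):
    # Assumption: the report only counts cells that already have a grade.
    return sum(1 for cell in cells if cell["grade"] is not None)

def reconcile(cells, complete):
    """Compare the two denominators; they must agree once the run is COMPLETE."""
    autopilot = autopilot_denominator(cells)
    report = report_denominator(cells)
    ungraded = autopilot - report
    if complete and ungraded:
        raise ValueError(f"{ungraded} ungraded cell(s) in a COMPLETE run")
    return {"autopilot": autopilot, "report": report, "ungraded": ungraded}

# Hypothetical mid-run snapshot: one cell finished but not yet graded.
mid_run = [{"grade": "PASS"}, {"grade": "FAIL"}, {"grade": None}]
print(reconcile(mid_run, complete=False))
```

During a run the two counts are allowed to differ by the number of ungraded cells; the guard only treats a difference as an error once the run claims to be complete.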
397B-A17B microbench extended to N=10 both arms (240 cells, all done_signal): no-think 82/120, think 72/120 — same 7-8/12 band as Step-3.7-Flash, 27B-Q4, Coder-Next-Q4 (scale doesn't move the aggregate, confirmed at N=10). Thinking is net -10: helps p3_market (8->10) but craters p3_doc (9->2) and p3_pm (5->0). N=10 overturns small-N luck — p3_market no-think flips 1/3 (N=3) -> 8/10 (auto-flagged in findings-n10.md stability table). Cross-model: 27B/Coder columns use the clean Q4/AWQ runs (microbench-2026-04-28); fresh Q8/FP8 attempts excluded as serving failures (35B 36/36 HTTP-400, 27B-Q8 23/36 runaway). Failure temperament tracks lineage not size: 397B+27B stall, Coder+Flash run away. Power (gpu_power_logger.sh + new bench_power.py, 3868 paired samples): combined both-GPU median 670W / max 985W = 82% of 1200W cap, 0 samples within 5% of full; GPU0 leads GPU1 (+27W) = pipeline-parallel alternation; decode 339W/GPU vs CPU-tool phases 125W/GPU. The pair never approaches full power together. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR audit caught the entry was internally inconsistent: findings.md was N=10 (82/120, 72/120) but README/QUALITATIVE/manifest still showed N=3 (23/36, 22/36, "net -1"). Updated all three to the N=10 numbers + the sharpened thinking finding (net -10: market 8->10, doc 9->2, pm 5->0), added the GPU power + cross-model + Q4-refs notes, and listed the new findings-n10.md / power-analysis.md files. Scorecard re-verified against grade.json (82/120, 72/120 exact). No content claims changed — only stale counts corrected to match the audited ground truth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
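The kind of check this audit describes can be sketched as below. The record schema (`arm`, `passed`) is hypothetical and not the real grade.json layout; the only values taken from the PR are the recounted totals (82/120, 72/120) and the stale N=3 counts (23/36, 22/36).

```python
from collections import defaultdict

def tally(records):
    """Count (passes, total) per arm from per-cell grade records."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        counts[rec["arm"]][0] += int(rec["passed"])
        counts[rec["arm"]][1] += 1
    return {arm: tuple(pair) for arm, pair in counts.items()}

def stale_entries(published, recount):
    """Return arms whose published (passes, total) differ from the recount."""
    return {
        arm: (published.get(arm), truth)
        for arm, truth in recount.items()
        if published.get(arm) != truth
    }

# Toy records in the assumed schema, just to show the tally shape.
toy = [
    {"arm": "no-think", "passed": True},
    {"arm": "no-think", "passed": False},
    {"arm": "think", "passed": True},
]
print(tally(toy))

# Numbers quoted in this PR: the audited recount vs the stale N=3 counts.
recount = {"no-think": (82, 120), "think": (72, 120)}
readme = {"no-think": (23, 36), "think": (22, 36)}
print(stale_entries(readme, recount))
```

Running the same comparison over every file that quotes the scorecard is one way to catch a document that was left at the older N.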
…utter in a public entry) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-audit (against grade.json ground truth)
Audited the PR end-to-end. Findings:
Verdict: accurate and complete for everything through 397B N=10 + power + Q4 cross-model. MiniMax-M2.7 is deliberately not in this PR: it smoke-passed, but the N=10 run 400-errored on every agentic cell (multi-turn serving-format issue). It is being debugged separately and will land as a follow-up so this PR isn't blocked.
…7B + Coder-Next)
The N=1 ref columns undersold data we already have: benchmarks/microbench-phase-b-2026-05-02 ran Qwen3.6-27B-AWQ at N=10 in BOTH thinking and no-think arms (+ Coder-Next N=10) on the P3 differential cells, a structural match to this entry's 397B study. Headline now spans the size range: thinking is net-negative for BOTH the 397B giant (72 vs 82) AND the 27B (75% vs 86.8% ship), via the SAME p3_doc word-limit loop (397B 9->2 PASS; 27B-thinking wall_killed ~40%). Added a per-cell P3 ship-rate table (Coder/27B-think/27B-nothink/397B-nothink/397B-think, all N=10) and confirmed that failure temperament clusters by lineage, not size: Qwen-derived (397B, 27B) stall; Coder-Next + Flash run away. P1 cross-harness caveat noted (phase-b 27B-think P1 used an older sha).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds on the merged N=3 entry (#30). Adds a self-healing bench harness, extends 397B to N=10, adds GPU power analysis, and unifies the cross-model comparison on clean data.
Tooling (autonomous bench harness + /mmbt-bench skill)
- bench_autopilot.py: self-healing supervisor (endpoint restart, stuck-cell kill, idempotent resume, model presets, --publish, Pushover, heartbeat).
- bench_dashboard.py: color grid + --oneline/--watch/--html/--json/--trend/--flips/sparkline.
- bench_report.py: auto-generates the findings doc (scorecard + replicate-stability + redistribution + finish-reason audit).
- bench_power.py + gpu_power_logger.sh: NEW dual-GPU telemetry logger + analyzer.
- skill_draft/SKILL.md → installed as /mmbt-bench.

397B-A17B N=10 (240 cells, all done_signal)
- Scorecard: no-think 82/120, think 72/120.
- Thinking is net -10: it helps p3_market (8→10) but craters p3_doc (9→2) and p3_pm (5→0). Per-task call, not a default.
- N=10 overturns small-N luck: p3_market no-think flips 1/3 (N=3) → 8/10 (auto-flagged in the stability table). The headline methodological result.

Cross-model (Q4 refs, honest)
- 27B/Coder columns use the clean Q4/AWQ runs (microbench-2026-04-28). Fresh Q8/FP8 attempts are excluded as serving failures (35B: 36/36 HTTP-400; 27B-Q8: 23/36 runaway); they are documented, not faked.

GPU power: "are both GPUs ever near full power?" No (3,868 paired samples)
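The headroom question can be answered with a few lines of arithmetic over paired per-GPU samples. The sketch below is an illustration, not the bench_power.py implementation: the `summarize` helper and its output keys are hypothetical, the three sample pairs are invented so that the combined median and maximum match the 670 W and 985 W quoted in this PR, and "near full" is assumed to mean at least 95% of the 1200 W combined cap.

```python
from statistics import median

CAP_W = 1200.0  # combined two-GPU power cap quoted in this PR

def summarize(pairs, cap_w=CAP_W, near_full=0.95):
    """Summarize paired (gpu0_watts, gpu1_watts) samples against the cap."""
    combined = [g0 + g1 for g0, g1 in pairs]
    leads = [g0 - g1 for g0, g1 in pairs]
    return {
        "median_w": median(combined),
        "max_w": max(combined),
        "max_pct_of_cap": round(100 * max(combined) / cap_w),
        # Assumption: "within 5% of full" means combined draw >= 95% of cap.
        "near_full_samples": sum(1 for w in combined if w >= near_full * cap_w),
        "median_gpu0_lead_w": median(leads),
    }

# Invented placeholder samples, NOT logger output.
samples = [(350.0, 320.0), (500.0, 485.0), (130.0, 120.0)]
print(summarize(samples))
```

With a 1200 W cap the near-full threshold is 1140 W, so a 985 W peak (82% of cap) leaves the count of near-full samples at zero, which is the shape of the result reported above.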
🤖 Generated with Claude Code