[N=3] Qwen3.5-397B-A17B firmed results + P2a guard (#28 merged without it) #30

Merged: Lightheartdevs merged 3 commits into main from qwen3.5-397b-n3-results-2026-05-29 on May 29, 2026

@Lightheartdevs

Contributor

Draft / WIP. Stacked on #28 (the N=1 + tooling/grader-fix PR) — base is that branch, so this PR's diff is only the N=3 delta. Merge after #28.

Firms the 397B entry from N=1 → N=3, graded with the fixed phase1_grade.py from #28.

Status

  • No-think arm complete (N=3): 23/36 = 8/12 by majority — N=1 headline holds.
  • Think arm running — column fills on completion.

What N=3 already shows (no-think)

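Replicates expose single-draw luck in the N=1 verdicts: two flip under majority vote (p3_market, p3_pm) and a third (p3_doc) keeps its verdict but is not unanimous: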

  • p3_market: N=1 ✓ → 1/3 (high variance, hand-graded citation dimension).
  • p3_pm: N=1 ✗ → 2/3 (intermittent under-recall, not systematic).
  • p3_doc: N=1 ✓ → 2/3 (occasional 700-word-limit trip).

Zero-variance (all 3 reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business — and the consistent fails (testwrite/refactor/writing). The grader-fix-sensitive cells (bugfix 3/3, testwrite 0/3, refactor 0/3) are perfectly consistent → the P1 fix is stable.
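
As a cross-check on the arithmetic above, here is a minimal sketch that recomputes the no-think totals from the per-cell pass counts stated in this PR. The cell keys use the PR's own shorthand; the function names are hypothetical and this is not the logic of phase1_grade.py.

```python
# Passes out of 3 replicates per cell (no-think arm), as reported in this PR.
# Keys follow the PR's shorthand; they may not match the real cell IDs.
NO_THINK_PASSES = {
    "p1_bugfix": 3,
    "testwrite": 0,
    "refactor": 0,
    "p2_extract": 3,
    "ci": 3,
    "hallucination": 3,
    "triage": 3,
    "p3_business": 3,
    "p3_market": 1,
    "p3_pm": 2,
    "p3_doc": 2,
    "writing": 0,
}
REPS = 3


def raw_total(passes):
    """Total passing replicates across all cells."""
    return sum(passes.values())


def majority_total(passes, reps=REPS):
    """Number of cells that pass in a strict majority of replicates."""
    return sum(1 for n in passes.values() if 2 * n > reps)


def zero_variance(passes, reps=REPS):
    """Cells where every replicate agreed (all pass or all fail)."""
    return sorted(name for name, n in passes.items() if n in (0, reps))


print(f"raw: {raw_total(NO_THINK_PASSES)}/{len(NO_THINK_PASSES) * REPS}")
print(f"majority: {majority_total(NO_THINK_PASSES)}/{len(NO_THINK_PASSES)}")
print("zero-variance:", ", ".join(zero_variance(NO_THINK_PASSES)))
```

The nine zero-variance cells contribute 18 passes and the three high-variance cells contribute 5, which gives the reported 23/36; under majority vote the p3_market and p3_pm flips cancel, so the 8/12 headline is unchanged.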

Remaining before un-drafting

  • Think N=3 column + think-vs-no-think at N=3 (does "thinking net −1" hold, or was the p3_doc flip also draw-luck?).
  • Think-arm per-cell variance (esp. p3_market runaway behavior across reps).
  • Fold N=3 into findings.md / README / manifest; retire the "N=1 provisional" framing.

🤖 Generated with Claude Code

User Name and others added 2 commits May 29, 2026 18:55
Run names + the idempotent skip are keyed by LABEL only, so running --thinking
off then on under one label silently skipped the second arm as already-complete.
Mirror the reasoning-effort guard: require the mode encoded in the label
(nothink / think), reject otherwise with a corrected example. Closes the last
review item; this was deferred while the N=3 run was executing the script.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
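
The guard described in this commit lives in run_microbench.sh; the following is only a Python sketch of the same idea, not the script's code. The function name, the separator set, and the example labels are assumptions made for illustration.

```python
import re

MODE_TOKENS = {"think", "nothink"}


def check_label_mode(label: str, thinking: str) -> None:
    """Raise ValueError unless the label encodes the requested thinking mode.

    Hypothetical helper: the real guard is a shell check in run_microbench.sh.
    """
    if thinking not in ("on", "off"):
        raise ValueError(f"--thinking must be 'on' or 'off', got {thinking!r}")

    # Compare whole tokens, because "nothink" contains "think" as a substring.
    parts = [p for p in re.split(r"[-_.]", label) if p]
    tokens = {p.lower() for p in parts}
    required = "think" if thinking == "on" else "nothink"
    forbidden = "nothink" if thinking == "on" else "think"

    if required not in tokens or forbidden in tokens:
        # Build a corrected example: drop any mode token, append the right one.
        base = [p for p in parts if p.lower() not in MODE_TOKENS]
        suggestion = "-".join(base + [required])
        raise ValueError(
            f"label {label!r} does not encode --thinking {thinking}; "
            f"try e.g. {suggestion!r}"
        )


# Usage: the first call is accepted, the second is rejected with a suggestion.
check_label_mode("397b-n3-nothink", "off")
try:
    check_label_mode("397b-n3", "on")
except ValueError as err:
    print(err)
```

Matching on whole tokens rather than substrings is the one design point worth noting: a plain substring test for "think" would accept a nothink label for the think arm, which is the exact collision the guard is meant to prevent.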
Both arms complete at N=3 (graded with the fixed phase1_grade.py). no-think 23/36,
think 22/36. Retire the N=1 "provisional" framing; findings.md / README / manifest /
QUALITATIVE now carry the N=3 numbers.

Key N=3 corrections to the N=1 story:
- Thinking is net -1 but NOT inert (N=1's read): it redistributes — stabilizes
  p3_market (1/3->3/3, zero runaways) while hurting p3_pm (2/3->0/3) and p3_doc
  (2/3->1/3). Changes WHERE 397B succeeds, not how often.
- N=3 exposes single-draw luck: no-think p3_market (N1 PASS->1/3), p3_pm
  (N1 FAIL->2/3), p3_doc (PASS->2/3) are high-variance; mid-tier + bugfix are 3/3.
- Runaway resistance confirmed across all 72 cells (incl. p3_market think 3/3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
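
The "net -1" redistribution can be checked with a few lines of arithmetic. This sketch uses only the three cells whose think and no-think counts are quoted in this PR; the variable names are illustrative.

```python
# Passes out of 3 for the cells that differ between arms, as quoted in the PR.
NO_THINK = {"p3_market": 1, "p3_pm": 2, "p3_doc": 2}
THINK = {"p3_market": 3, "p3_pm": 0, "p3_doc": 1}

# Per-cell change when thinking is enabled.
deltas = {cell: THINK[cell] - NO_THINK[cell] for cell in NO_THINK}
net = sum(deltas.values())

for cell, d in deltas.items():
    print(f"{cell}: {d:+d}")
print(f"net: {net:+d}")

# Reported arm totals; consistent with the net delta if no other cell changed.
NO_THINK_TOTAL, THINK_TOTAL = 23, 22
print("totals consistent:", NO_THINK_TOTAL + net == THINK_TOTAL)
```

The reported totals (23/36 vs 22/36) are consistent with the other nine cells being unchanged between arms, though this PR text does not list their think-arm counts individually.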
@Lightheartdevs Lightheartdevs force-pushed the qwen3.5-397b-n3-results-2026-05-29 branch from 0acba1b to 6a640dc May 29, 2026 22:59
@Lightheartdevs Lightheartdevs marked this pull request as ready for review May 29, 2026 22:59
@Lightheartdevs Lightheartdevs changed the base branch from add-qwen3.5-397b-microbench-2026-05-29 to main May 29, 2026 23:01
@Lightheartdevs Lightheartdevs changed the title from "[N=3] Qwen3.5-397B-A17B microbench — firmed results (WIP)" to "[N=3] Qwen3.5-397B-A17B firmed results + P2a guard (#28 merged without it)" May 29, 2026
@Lightheartdevs

Contributor Author

Retargeted the base to main (was stacked on #28's branch). Heads-up: #28 merged at 500c423, i.e. without the P2a --thinking guard (7ba3650); the merge landed a moment before that push made it in. P2a is carried here instead (it's in this PR's history), so merging #30 brings both the guard and the N=3 results into main. Diff is clean: run_microbench.sh (P2a) + the 4 entry docs folded to N=3.

Self-audit against the grade/summary data caught that "all 72 cells done_signal /
never ran away" was false: 69/72 done_signal. The 3 non-clean exits are all
no-think — p3_market v2/v3 hit the 500-iter stuck threshold and p3_pm v1
model_stopped. Zero max_tokens/length runaways IS true (the Flash failure mode),
but the pathology is stalling, not runaway. Reframed across findings.md / README /
QUALITATIVE / manifest into a sharper finding: 397B stalls quietly rather than
running away, and thinking clears the market stall (think p3_market 3/3 done_signal
vs no-think 2-stuck/1-pass). Also retired the last stale N=1 status line in
QUALITATIVE.md. Scorecard totals (23/36, 22/36) re-verified against grade.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Lightheartdevs

Contributor Author

Self-audit (against the grade/summary data)

Verified the published claims against ground truth. One real correction:

  • 🔴 Fixed an overclaim: "all 72 cells done_signal / never ran away" was false — 69/72 done_signal. The 3 non-clean exits are all no-think: p3_market v2/v3 hit the 500-iter stuck threshold and p3_pm v1 model_stopped. Zero max_tokens/length runaways is still true (that's the Flash failure mode), so reframed as: 397B's pathology is stalling, not runaway — and thinking clears the market stall (think p3_market 3/3 done_signal vs no-think 2-stuck/1-pass). Corrected in findings/README/QUALITATIVE/manifest (4809534).
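
The corrected exit-status claim can be reproduced with a small tally. This is a sketch built from the numbers quoted above, not from the actual grade/summary files; the "stuck" label is an assumed name for hitting the 500-iter threshold, and the tuple keys are illustrative.

```python
from collections import Counter

TOTAL_CELLS = 72  # 12 tasks x 3 replicates x 2 arms

# The three non-clean exits named in the self-audit, all in the no-think arm.
NON_CLEAN = {
    ("no-think", "p3_market", "v2"): "stuck",
    ("no-think", "p3_market", "v3"): "stuck",
    ("no-think", "p3_pm", "v1"): "model_stopped",
}

counts = Counter(NON_CLEAN.values())
counts["done_signal"] = TOTAL_CELLS - len(NON_CLEAN)

# Runaways are max_tokens/length exits; Counter returns 0 for absent keys.
runaways = counts["max_tokens"] + counts["length"]

print(f"done_signal: {counts['done_signal']}/{TOTAL_CELLS}")
print(f"stalls (stuck + model_stopped): {counts['stuck'] + counts['model_stopped']}")
print(f"runaways: {runaways}")
```

Keeping stalls and runaways as separate exit reasons is what lets both statements hold at once: 69/72 clean exits, and zero max_tokens/length runaways.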

Verified accurate (no change): scorecard vs grade.json (23/36, 22/36 and every cell match); testwrite-think cov=99 / 153 passing, fails on logalyzer_unchanged; bugfix ruff 2→0, bench 11.2s→0.537s; redistribution deltas (market +2 / pm −2 / doc −1); internal-doc consistency; links resolve; no run-logs/agent-pilot in the diff; P2a guard present and carried to main.

@Lightheartdevs Lightheartdevs merged commit aea2585 into main May 29, 2026
1 check passed
@Lightheartdevs Lightheartdevs deleted the qwen3.5-397b-n3-results-2026-05-29 branch May 29, 2026 23:37