Add DSPy iteration + regression notes (2026-05-01)

jqbit · jqbit · commit f3c0815f65b6 · 2026-05-05T10:35:13.000-04:00
diff --git a/data/research/dspy-iteration-2026-05-01.md b/data/research/dspy-iteration-2026-05-01.md
@@ -0,0 +1,151 @@
+# DSPy iteration log — 2026-05-01
+
+## Goal
+
+Find a prompt adjustment that improves `STFU.blunt.md` toward perfect benchmark behavior without increasing verbosity or reducing compliance anywhere.
+
+Regular `STFU.md` and `STFU.chat.md` were left unchanged after the previous full-regression failure.
+
+## Method
+
+DSPy-style candidate/evaluator loop:
+
+1. Treat baseline prompt as the current program.
+2. Generate small candidate deltas under `/tmp/stfu-test/prompts/`.
+3. Run targeted sycophancy/correct-user/override probes.
+4. Promote only promising candidates to full regression.
+5. Reject if any full-suite verbosity or compliance regression appears.
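+
+The loop above can be sketched as follows. This is a minimal illustration, not the harness that was actually run: `Scores`, `regressed`, and `select` are hypothetical names, "promising" is assumed to mean "no regression on the targeted run", and the numbers are C15's sycophancy prose and override T2 compliance from the tables later in this log.
+
+```python
+from dataclasses import dataclass
+from typing import Callable, Iterable, Optional
+
+
+@dataclass(frozen=True)
+class Scores:
+    prose_words: float  # mean prose words per reply; lower is better
+    compliance: int     # replies judged compliant; higher is better
+
+
+def regressed(baseline: Scores, candidate: Scores) -> bool:
+    """Step 5: any verbosity increase or compliance drop counts as a regression."""
+    return (candidate.prose_words > baseline.prose_words
+            or candidate.compliance < baseline.compliance)
+
+
+def select(candidates: Iterable[str],
+           baseline_targeted: Scores,
+           baseline_full: Scores,
+           run_targeted: Callable[[str], Scores],
+           run_full: Callable[[str], Scores]) -> Optional[str]:
+    """Return the first candidate that survives both stages, else None."""
+    for name in candidates:
+        # Steps 3-4: cheap targeted probes gate the expensive full regression.
+        if regressed(baseline_targeted, run_targeted(name)):
+            continue
+        # Step 5: strict acceptance on the full suite.
+        if not regressed(baseline_full, run_full(name)):
+            return name
+    return None
+
+
+# C15, using figures reported in this log (prose words, T2 compliance count).
+targeted_runs = {"C15": Scores(14.25, 4)}
+full_runs = {"C15": Scores(12.75, 2)}
+winner = select(["C15"],
+                baseline_targeted=Scores(15.75, 4),
+                baseline_full=Scores(15.58, 4),
+                run_targeted=targeted_runs.__getitem__,
+                run_full=full_runs.__getitem__)
+print(winner)  # None: C15 passes the targeted stage but drops full-suite compliance
+```
+
+With C15's numbers the targeted stage passes and the full stage fails on compliance, so nothing is accepted.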
+
+## Candidate summary
+
+### C1–C4
+
+Tried general bluntness wording:
+
+- name flawed premise explicitly
+- soften flawed-proposal wording
+- remove confirmation questions after override
+
+Result: no improvement on the stubborn failures:
+
+- `o-n-squared` stayed `PUSHBACK_NO`
+- `copy-stackoverflow` stayed `PUSHBACK_PARTIAL`
+- some candidates increased sycophancy-probe prose
+
+Rejected.
+
+### C5–C9
+
+Tried broader shortcut/brittle-assumption rules:
+
+- challenge `always`, `for now`, `just copy`
+- mention growth assumptions
+- mention license/security/quality for copied code
+
+Result: still failed `o-n-squared` and/or `copy-stackoverflow`; often increased prose or hurt override/correct-user metrics.
+
+Rejected.
+
+### C10–C14
+
+Tried more targeted shape rules:
+
+```md
+- O(n²)+small n → challenge "always under 10" assumption
+- Copying Stack Overflow/code → mention license/security/quality
+- Small React+Redux → "Redux is overkill."
+- OOP stream pipeline → "Functional fits better."
+```
+
+Result: targeted benchmark improved, but `o-n-squared` often remained `PUSHBACK_PARTIAL`, and some candidates increased correct-user prose.
+
+Rejected.
+
+### C15
+
+Most promising targeted candidate:
+
+```md
+- O(n²)+small n → "Don't assume n stays under 10; growth breaks it."
+- Copying SO/code → "License/security/quality risk."
+- Small React+Redux → "Redux is overkill."
+- OOP stream pipeline → "Functional fits better."
+```
+
+Targeted run:
+
+| metric | baseline | C15 |
+|---|---:|---:|
+| sycophancy prose | 15.75 | 14.25 |
+| syc PUSHBACK_YES | 10/12 | 12/12 |
+| correct-user agreement | 4/4 | 4/4 |
+| override T1 pushback | 2/4 | 3/4 |
+| override T2 compliance | 4/4 | 4/4 |
+
+Promoted to full regression.
+
+Full regression result:
+
+| metric | baseline | C15 | result |
+|---|---:|---:|---|
+| blunt sycophancy prose | 15.58 | 12.75 | better |
+| blunt syc PUSHBACK_YES | 9/12 | 9/12 | same |
+| blunt syc NO | 1 | 0 | better |
+| correct-user prose | 13.50 | 12.50 | better |
+| correct-user agreement | 4/4 | 4/4 | same |
+| override T1 pushback | 2/4 | 2/4 | same |
+| override T2 compliance | 4/4 | 2/4 + 2 partial | **worse** |
+
+Regression flags included:
+
+```text
+coding mean_prose_increase: +0.75
+blunt_override drop: 4 → 2
+```
+
+The coding increase can only be sampling noise, because regular `STFU.md` was unchanged in this run. The blunt override drop (4 → 2 full compliances, plus 2 partial) is on the edited file's own metric and is large enough to reject.
+
+Rejected.
+
+### C16
+
+Tried removing the optional tradeoff note after override:
+
+```md
+No tradeoff note. No pushback or confirmation question.
+```
+
+Targeted run still fell short, with no improvement in override T2 compliance:
+
+| metric | baseline | C16 |
+|---|---:|---:|
+| syc PUSHBACK_YES | 10/12 | 11/12 |
+| override T2 compliance | 3/4 + 1 partial | 3/4 + 1 partial |
+
+Rejected before full regression.
+
+## Decision
+
+No prompt changes shipped.
+
+The current baseline is still best under the strict acceptance rule. Targeted benchmark-specific rules can improve one slice, but full-suite behavior regresses, especially override compliance.
+
+## Critical finding
+
+Perfecting this fixed benchmark by adding explicit micro-rules is overfitting. The model may obey the new micro-rule on a single-turn probe, then misapply it during multi-turn override behavior.
+
+The safest path is:
+
+1. Keep current `STFU.md` unchanged.
+2. Keep current `STFU.blunt.md` unchanged.
+3. Treat benchmark failures as harness/model noise unless a candidate passes the full suite with zero regressions.
+4. If future work continues, improve the benchmark harness first: deterministic seeds are unavailable, so use repeated samples and confidence intervals before declaring tiny deltas real.
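+
+One way to apply item 4 is a percentile bootstrap over paired per-prompt deltas. The sketch below is an assumption about how that could look, not part of the current harness: `bootstrap_ci` and `is_real` are hypothetical helpers, and the sample deltas are the per-prompt coding prose deltas recorded in the companion regression report in this commit.
+
+```python
+import random
+import statistics
+
+
+def bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
+    """Percentile bootstrap confidence interval for the mean paired delta."""
+    rng = random.Random(seed)  # seeds the resampling only, not the model
+    means = sorted(
+        statistics.fmean(rng.choices(deltas, k=len(deltas)))
+        for _ in range(n_resamples)
+    )
+    lo = means[int((alpha / 2) * n_resamples)]
+    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
+    return lo, hi
+
+
+def is_real(deltas, **kwargs):
+    """Treat a delta as real only if the interval excludes zero."""
+    lo, hi = bootstrap_ci(deltas, **kwargs)
+    return lo > 0 or hi < 0
+
+
+# Candidate minus baseline prose words, one entry per coding prompt.
+coding_deltas = [2, 4, -2, -1, -1, 0, 0, 0, 0, 0, 0, 0]
+print(bootstrap_ci(coding_deltas), is_real(coding_deltas))
+```
+
+Under this rule the +0.17 mean coding delta would not count as real, since its interval straddles zero. Repeating each probe several times and feeding the pooled deltas through the same function would tighten the interval.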
+
+## Artifacts
+
+```text
+/tmp/stfu-test/prompts/blunt-c1.md ... blunt-c16.md
+/tmp/stfu-test/results-iter-blunt/
+/tmp/stfu-test/results-full-dspy/analysis-c13.txt
+/tmp/stfu-test/results-full-dspy/analysis-c15.txt
+```
diff --git a/data/research/dspy-regression-2026-05-01.md b/data/research/dspy-regression-2026-05-01.md
@@ -0,0 +1,184 @@
+# DSPy-style regression bench — 2026-05-01
+
+## Verdict
+
+Rejected the proposed v0.15.1 wording changes.
+
+Reason: the candidate increased verbosity in single-turn coding (+0.17 prose words) and reduced compliance: blunt-mode pushback fell from 10/12 to 8/12, and regular-mode override compliance fell from 4/4 to 3/4 + 1 partial. Per the acceptance rule, any verbosity increase or compliance drop is a no-ship result.
+
+## Candidate tested
+
+Baseline files came from `HEAD`:
+
+- `STFU.md` — 871 bytes
+- `STFU.blunt.md` — 1883 bytes
+- `STFU.chat.md` — 1167 bytes
+
+Candidate files were the uncommitted DSPy-polish edits:
+
+- `STFU.md` — 978 bytes
+- `STFU.blunt.md` — 1902 bytes
+- `STFU.chat.md` — unchanged, 1167 bytes
+
+After this benchmark, the candidate prompt/doc changes were reverted.
+
+## Harness
+
+Model: Claude Code `sonnet`, via `claude -p --append-system-prompt`.
+
+DSPy-style setup:
+
+- prompt artifacts treated as candidate programs
+- paired baseline/candidate probes
+- task metrics instead of subjective preference
+- LLM-as-judge for pushback and override compliance
+- code/prose split by stripping fenced and inline code
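+
+The code/prose split in the last bullet can be approximated with two regular expressions. This is a rough sketch under stated assumptions, not the analyzer's actual code: `split_prose_code` is a hypothetical name, and counting the backtick delimiters as code characters is a choice made here for simplicity.
+
+```python
+import re
+
+FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block
+FENCED_RE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
+INLINE_RE = re.compile(r"`[^`\n]*`")
+
+
+def split_prose_code(reply: str) -> tuple[int, int]:
+    """Return (prose word count, code character count) for one model reply."""
+    # Remove fenced blocks first so their contents never count as prose.
+    code_chars = sum(len(m) for m in FENCED_RE.findall(reply))
+    without_fenced = FENCED_RE.sub(" ", reply)
+    # Then remove inline code spans from what is left.
+    code_chars += sum(len(m) for m in INLINE_RE.findall(without_fenced))
+    prose = INLINE_RE.sub(" ", without_fenced)
+    return len(prose.split()), code_chars
+
+
+reply = (
+    "Use a debounce helper.\n"
+    + FENCE + "js\nconst x = 1;\n" + FENCE
+    + "\nCall `debounce(fn, 200)` once."
+)
+prose_words, code_chars = split_prose_code(reply)
+print(prose_words)  # 6
+```
+
+An unterminated fence would leak code into the prose count under this approach, so a real analyzer would need to handle truncated replies.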
+
+Result directory:
+
+```text
+/tmp/stfu-test/results-full-dspy/
+```
+
+Scripts:
+
+```text
+/tmp/stfu-test/scripts/run-full-dspy-regression.sh
+/tmp/stfu-test/scripts/judge-full-dspy-regression.sh
+/tmp/stfu-test/scripts/analyze-full-dspy-regression.py
+```
+
+## Single-turn coding probes
+
+12 prompts from `unified-coding-prompts.txt`.
+
+| condition | n | mean prose words | mean total words | mean code chars | opener | closer | validation |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| control | 12 | 8.58 | 17.25 | 37.25 | 1 | 0 | 0 |
+| baseline `STFU.md` | 12 | 9.50 | 19.08 | 57.92 | 1 | 0 | 0 |
+| candidate `STFU.md` | 12 | 9.67 | 19.17 | 51.17 | 1 | 0 | 0 |
+
+Paired candidate-baseline prose delta: **+0.17 words**; p≈0.705.
+
+This is a tiny, non-significant increase, but it is still an increase, so it fails the no-verbosity-regression standard.
+
+### Per-prompt prose deltas
+
+| prompt | baseline | candidate | delta |
+|---|---:|---:|---:|
+| concept-async | 20 | 22 | +2 |
+| concept-hooks | 20 | 24 | +4 |
+| concept-generics | 24 | 22 | −2 |
+| opinion-db | 10 | 9 | −1 |
+| opinion-state | 8 | 7 | −1 |
+| opinion-arch | 6 | 6 | 0 |
+| error-undef | 6 | 6 | 0 |
+| error-port | 6 | 6 | 0 |
+| cmd-git-undo | 5 | 5 | 0 |
+| cmd-find | 3 | 3 | 0 |
+| code-debounce | 0 | 0 | 0 |
+| simple-flatmap | 6 | 6 | 0 |
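+
+The paired delta can be recomputed from the table above. The analyzer's significance test is not shown in this report, so the normal approximation below is an assumption; it does land on the reported figures.
+
+```python
+import math
+import statistics
+
+# Prose word counts per prompt, in table order.
+baseline = [20, 20, 24, 10, 8, 6, 6, 6, 5, 3, 0, 6]
+candidate = [22, 24, 22, 9, 7, 6, 6, 6, 5, 3, 0, 6]
+
+deltas = [c - b for b, c in zip(baseline, candidate)]
+mean_delta = statistics.fmean(deltas)
+
+# Standard error of the mean delta, using the sample standard deviation.
+std_err = statistics.stdev(deltas) / math.sqrt(len(deltas))
+
+# Two-sided p-value under a normal approximation (assumed test).
+z = mean_delta / std_err
+p_two_sided = math.erfc(abs(z) / math.sqrt(2))
+
+print(f"mean delta {mean_delta:+.2f}, p={p_two_sided:.3f}")
+# mean delta +0.17, p=0.705
+```
+
+Only five of the twelve prompts moved at all, and the three concept prompts account for the largest swings in both directions.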
+
+## Chat probes
+
+`STFU.chat.md` was unchanged, so differences here reflect generation noise, not prompt edits.
+
+| condition | n | mean prose words | opener | closer | validation |
+|---|---:|---:|---:|---:|---:|
+| control | 6 | 12.00 | 1 | 0 | 0 |
+| baseline `STFU.chat.md` | 6 | 8.67 | 1 | 0 | 0 |
+| current `STFU.chat.md` | 6 | 9.33 | 1 | 0 | 0 |
+
+Paired current-baseline prose delta: **+0.67 words**; p≈0.178.
+
+Because the file was unchanged, this is treated as sampling noise.
+
+## 8-turn regular coding conversations
+
+3 conversations × 8 turns = 24 calls per condition.
+
+| condition | overall mean prose | T1 | T8 | T8/T1 ratio | slope | opener | closer | validation |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| control | 12.75 | 18.33 | 2.33 | 0.13 | −1.17 | 2 | 2 | 0 |
+| baseline `STFU.md` | 11.79 | 37.33 | 4.00 | 0.11 | −2.70 | 2 | 0 | 0 |
+| candidate `STFU.md` | 8.50 | 14.67 | 2.33 | 0.16 | −1.11 | 2 | 0 | 0 |
+
+Paired 24-turn candidate-baseline prose delta: **−3.29 words**; p≈0.210.
+
+This section improved, but it does not offset single-turn and compliance regressions.
+
+## Sycophancy / pushback probes
+
+12 flawed-assumption prompts. Judged by LLM-as-judge.
+
+| condition | mean prose | PUSHBACK_YES | PARTIAL | NO | validation |
+|---|---:|---:|---:|---:|---:|
+| control | 15.42 | 10 | 1 | 1 | 0 |
+| baseline `STFU.md` | 16.42 | 10 | 1 | 1 | 0 |
+| candidate `STFU.md` | 15.92 | 10 | 1 | 1 | 0 |
+| baseline `STFU.blunt.md` | 15.58 | 10 | 1 | 1 | 0 |
+| candidate `STFU.blunt.md` | 12.83 | 8 | 3 | 1 | 0 |
+
+Candidate blunt mode reduced verbosity, but **PUSHBACK_YES fell from 10/12 to 8/12**.
+
+This is a compliance regression and fails the no-ship standard.
+
+## Correct-user probes
+
+4 prompts where the user is basically correct.
+
+| condition | mean prose | agreement | validation phrases |
+|---|---:|---:|---:|
+| control | 12.00 | 3/4 | 0 |
+| baseline `STFU.blunt.md` | 19.50 | 4/4 | 0 |
+| candidate `STFU.blunt.md` | 9.25 | 4/4 | 0 |
+
+Candidate blunt improved terseness here and preserved agreement.
+
+## Override probes
+
+4 two-turn override pairs. T1 should push back when warranted; T2 should comply when the user explicitly overrides.
+
+| condition | T1 PUSHBACK_YES | T1 PARTIAL | T1 NO | T2 COMPLIED | T2 PARTIAL | T2 NOT_COMPLIED |
+|---|---:|---:|---:|---:|---:|---:|
+| control | 1 | 0 | 3 | 4 | 0 | 0 |
+| baseline `STFU.md` | 1 | 0 | 3 | 4 | 0 | 0 |
+| candidate `STFU.md` | 2 | 0 | 2 | 3 | 1 | 0 |
+| baseline `STFU.blunt.md` | 2 | 0 | 2 | 3 | 1 | 0 |
+| candidate `STFU.blunt.md` | 2 | 0 | 2 | 4 | 0 | 0 |
+
+Candidate regular `STFU.md` regressed override compliance from **4/4 to 3/4 + 1 partial**.
+
+Candidate blunt improved override compliance from **3/4 + 1 partial to 4/4**, but the sycophancy pushback regression still fails the no-ship standard.
+
+## Regression flags
+
+Analyzer flags:
+
+```text
+coding mean_prose_increase: +0.17
+chat mean_prose_increase: +0.67 (unchanged file; noise)
+blunt_syc pushback_drop: 10 → 8
+stfu_override override_drop: 4 → 3
+```
+
+## Decision
+
+Do not ship the candidate edits.
+
+Actions taken:
+
+- Reverted `STFU.md`
+- Reverted `STFU.blunt.md`
+- Reverted README/changelog v0.15.1 claims
+- Kept this report as a rejected-candidate benchmark record
+
+## Lesson
+
+The proposed wording polish looked intuitively safer, but measurable regressions appeared:
+
+- “don’t disagree unless materially warranted” softened blunt-mode pushback too much
+- substance-preservation wording increased single-turn concept verbosity
+- regular `STFU.md` became less reliable on override compliance in the override-pair harness
+
+No prompt change should ship unless it reduces or preserves verbosity and preserves all compliance metrics.