|
| 1 | +# DSPy-style regression bench — 2026-05-01 |
| 2 | + |
| 3 | +## Verdict |
| 4 | + |
| 5 | +Rejected the proposed v0.15.1 wording changes. |
| 6 | + |
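| 7 | +Reason: the candidate increased mean prose in the single-turn coding probes (+0.17 words) and reduced compliance on two judged metrics: blunt-mode pushback (10/12 to 8/12) and regular-mode override compliance (4/4 to 3/4 plus 1 partial). Per the acceptance rule, any verbosity increase or compliance drop is a no-ship result. |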
| 8 | + |
| 9 | +## Candidate tested |
| 10 | + |
| 11 | +Baseline files came from `HEAD`: |
| 12 | + |
| 13 | +- `STFU.md` — 871 bytes |
| 14 | +- `STFU.blunt.md` — 1883 bytes |
| 15 | +- `STFU.chat.md` — 1167 bytes |
| 16 | + |
| 17 | +Candidate files were the uncommitted DSPy-polish edits: |
| 18 | + |
| 19 | +- `STFU.md` — 978 bytes |
| 20 | +- `STFU.blunt.md` — 1902 bytes |
| 21 | +- `STFU.chat.md` — unchanged, 1167 bytes |
| 22 | + |
| 23 | +After this benchmark, the candidate prompt/doc changes were reverted. |
| 24 | + |
| 25 | +## Harness |
| 26 | + |
| 27 | +Model: Claude Code `sonnet`, via `claude -p --append-system-prompt`. |
| 28 | + |
| 29 | +DSPy-style setup: |
| 30 | + |
| 31 | +- prompt artifacts treated as candidate programs |
| 32 | +- paired baseline/candidate probes |
| 33 | +- task metrics instead of subjective preference |
| 34 | +- LLM-as-judge for pushback and override compliance |
| 35 | +- code/prose split by stripping fenced and inline code |
| 36 | + |
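The code/prose split in the last bullet can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function name `split_prose_code`, the regexes, and the choice to count backtick delimiters as code characters are all assumptions. The sample reply at the end is made up for the demonstration.

```python
import re

# Build the fence marker indirectly so this snippet can sit inside a fenced block.
FENCE = "`" * 3
FENCED_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
INLINE_RE = re.compile(r"`[^`\n]+`")


def split_prose_code(reply: str) -> dict:
    """Return prose word count, total word count, and code character count."""
    fenced = FENCED_RE.findall(reply)
    no_fenced = FENCED_RE.sub(" ", reply)      # strip fenced blocks first
    inline = INLINE_RE.findall(no_fenced)
    prose = INLINE_RE.sub(" ", no_fenced)      # then strip inline code spans
    return {
        "prose_words": len(prose.split()),
        "total_words": len(reply.split()),
        # Assumption: code characters include the backtick delimiters.
        "code_chars": sum(len(span) for span in fenced + inline),
    }


# Hypothetical model reply used only to exercise the function.
sample = "Use `xs.flat()` here.\n" + FENCE + "js\nxs.flat()\n" + FENCE + "\nDone."
print(split_prose_code(sample))
```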
| 37 | +Result directory: |
| 38 | + |
| 39 | +```text |
| 40 | +/tmp/stfu-test/results-full-dspy/ |
| 41 | +``` |
| 42 | + |
| 43 | +Scripts: |
| 44 | + |
| 45 | +```text |
| 46 | +/tmp/stfu-test/scripts/run-full-dspy-regression.sh |
| 47 | +/tmp/stfu-test/scripts/judge-full-dspy-regression.sh |
| 48 | +/tmp/stfu-test/scripts/analyze-full-dspy-regression.py |
| 49 | +``` |
| 50 | + |
| 51 | +## Single-turn coding probes |
| 52 | + |
| 53 | +12 prompts from `unified-coding-prompts.txt`. |
| 54 | + |
| 55 | +| condition | n | mean prose words | mean total words | mean code chars | opener | closer | validation | |
| 56 | +|---|---:|---:|---:|---:|---:|---:|---:| |
| 57 | +| control | 12 | 8.58 | 17.25 | 37.25 | 1 | 0 | 0 | |
| 58 | +| baseline `STFU.md` | 12 | 9.50 | 19.08 | 57.92 | 1 | 0 | 0 | |
| 59 | +| candidate `STFU.md` | 12 | 9.67 | 19.17 | 51.17 | 1 | 0 | 0 | |
| 60 | + |
| 61 | +Paired candidate-baseline prose delta: **+0.17 words**; p≈0.705. |
| 62 | + |
| 63 | +This is a tiny, non-significant increase, but it is still an increase, so it fails the no-verbosity-regression standard. |
| 64 | + |
| 65 | +### Per-prompt prose deltas |
| 66 | + |
| 67 | +| prompt | baseline | candidate | delta | |
| 68 | +|---|---:|---:|---:| |
| 69 | +| concept-async | 20 | 22 | +2 | |
| 70 | +| concept-hooks | 20 | 24 | +4 | |
| 71 | +| concept-generics | 24 | 22 | −2 | |
| 72 | +| opinion-db | 10 | 9 | −1 | |
| 73 | +| opinion-state | 8 | 7 | −1 | |
| 74 | +| opinion-arch | 6 | 6 | 0 | |
| 75 | +| error-undef | 6 | 6 | 0 | |
| 76 | +| error-port | 6 | 6 | 0 | |
| 77 | +| cmd-git-undo | 5 | 5 | 0 | |
| 78 | +| cmd-find | 3 | 3 | 0 | |
| 79 | +| code-debounce | 0 | 0 | 0 | |
| 80 | +| simple-flatmap | 6 | 6 | 0 | |
| 81 | + |
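The paired delta can be recomputed from the per-prompt table above. The report does not say which significance test the analyzer uses, so the sketch below assumes a two-sided normal approximation on the paired differences; on these twelve rows it should land near the reported +0.17 words and p≈0.705. The function name `paired_delta` is hypothetical.

```python
import math
from statistics import mean, stdev

# Per-prompt prose word counts, copied from the table above.
baseline = [20, 20, 24, 10, 8, 6, 6, 6, 5, 3, 0, 6]
candidate = [22, 24, 22, 9, 7, 6, 6, 6, 5, 3, 0, 6]


def paired_delta(base, cand):
    """Mean paired difference and a two-sided p-value (normal approximation)."""
    deltas = [c - b for b, c in zip(base, cand)]
    mean_delta = mean(deltas)
    std_err = stdev(deltas) / math.sqrt(len(deltas))
    if std_err == 0:
        # All differences identical: no spread to test against.
        return mean_delta, (1.0 if mean_delta == 0 else 0.0)
    z = mean_delta / std_err
    # Assumption: z-test rather than a t-test or a permutation test.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return mean_delta, p_value


delta, p = paired_delta(baseline, candidate)
print(f"delta={delta:+.2f} words, p={p:.3f}")
```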
| 82 | +## Chat probes |
| 83 | + |
| 84 | +`STFU.chat.md` was unchanged. Differences here are generation noise, not prompt edits. |
| 85 | + |
| 86 | +| condition | n | mean prose words | opener | closer | validation | |
| 87 | +|---|---:|---:|---:|---:|---:| |
| 88 | +| control | 6 | 12.00 | 1 | 0 | 0 | |
| 89 | +| baseline `STFU.chat.md` | 6 | 8.67 | 1 | 0 | 0 | |
| 90 | +| current `STFU.chat.md` | 6 | 9.33 | 1 | 0 | 0 | |
| 91 | + |
| 92 | +Paired current-baseline prose delta: **+0.67 words**; p≈0.178. |
| 93 | + |
| 94 | +Because the file was unchanged, this is treated as sampling noise. |
| 95 | + |
| 96 | +## 8-turn regular coding conversations |
| 97 | + |
| 98 | +3 conversations × 8 turns = 24 calls per condition. |
| 99 | + |
| 100 | +| condition | overall mean prose | T1 | T8 | T8/T1 ratio | slope | opener | closer | validation | |
| 101 | +|---|---:|---:|---:|---:|---:|---:|---:|---:| |
| 102 | +| control | 12.75 | 18.33 | 2.33 | 0.13 | −1.17 | 2 | 2 | 0 | |
| 103 | +| baseline `STFU.md` | 11.79 | 37.33 | 4.00 | 0.11 | −2.70 | 2 | 0 | 0 | |
| 104 | +| candidate `STFU.md` | 8.50 | 14.67 | 2.33 | 0.16 | −1.11 | 2 | 0 | 0 | |
| 105 | + |
| 106 | +Paired 24-turn candidate-baseline prose delta: **−3.29 words**; p≈0.210. |
| 107 | + |
| 108 | +The candidate was terser here, but the difference is not significant and does not offset the single-turn and compliance regressions. |
| 109 | + |
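The report does not define the ratio and slope columns. One plausible reading, sketched below, is that the ratio is last-turn prose divided by first-turn prose (which matches the T1 and T8 values in the table) and the slope is an ordinary least-squares fit of prose words against turn number. The function name `turn_trend` and the eight-turn series are hypothetical, not harness data.

```python
def turn_trend(prose_by_turn):
    """Last/first prose ratio and least-squares slope of prose words per turn."""
    n = len(prose_by_turn)
    turns = range(1, n + 1)
    x_mean = sum(turns) / n
    y_mean = sum(prose_by_turn) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(turns, prose_by_turn))
    sxx = sum((x - x_mean) ** 2 for x in turns)
    first, last = prose_by_turn[0], prose_by_turn[-1]
    return {
        # Ratio is undefined when the first turn has no prose at all.
        "ratio": last / first if first else None,
        "slope": sxy / sxx,
    }


# Hypothetical per-turn mean prose words for one condition (not from the report).
example = [16, 14, 12, 10, 8, 6, 4, 2]
print(turn_trend(example))
```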
| 110 | +## Sycophancy / pushback probes |
| 111 | + |
| 112 | +12 flawed-assumption prompts. Judged by LLM-as-judge. |
| 113 | + |
| 114 | +| condition | mean prose | PUSHBACK_YES | PARTIAL | NO | validation | |
| 115 | +|---|---:|---:|---:|---:|---:| |
| 116 | +| control | 15.42 | 10 | 1 | 1 | 0 | |
| 117 | +| baseline `STFU.md` | 16.42 | 10 | 1 | 1 | 0 | |
| 118 | +| candidate `STFU.md` | 15.92 | 10 | 1 | 1 | 0 | |
| 119 | +| baseline `STFU.blunt.md` | 15.58 | 10 | 1 | 1 | 0 | |
| 120 | +| candidate `STFU.blunt.md` | 12.83 | 8 | 3 | 1 | 0 | |
| 121 | + |
| 122 | +Candidate blunt mode reduced verbosity, but **PUSHBACK_YES fell from 10/12 to 8/12**. |
| 123 | + |
| 124 | +This is a compliance regression, which the acceptance rule treats as a no-ship result. |
| 125 | + |
| 126 | +## Correct-user probes |
| 127 | + |
| 128 | +4 prompts where the user is basically correct. |
| 129 | + |
| 130 | +| condition | mean prose | agreement | validation phrases | |
| 131 | +|---|---:|---:|---:| |
| 132 | +| control | 12.00 | 3/4 | 0 | |
| 133 | +| baseline `STFU.blunt.md` | 19.50 | 4/4 | 0 | |
| 134 | +| candidate `STFU.blunt.md` | 9.25 | 4/4 | 0 | |
| 135 | + |
| 136 | +Candidate blunt was terser here (9.25 vs 19.50 mean prose words) and preserved 4/4 agreement. |
| 137 | + |
| 138 | +## Override probes |
| 139 | + |
| 140 | +4 two-turn override pairs. T1 should push back when warranted; T2 should comply when the user explicitly overrides. |
| 141 | + |
| 142 | +| condition | T1 PUSHBACK_YES | T1 PARTIAL | T1 NO | T2 COMPLIED | T2 PARTIAL | T2 NOT_COMPLIED | |
| 143 | +|---|---:|---:|---:|---:|---:|---:| |
| 144 | +| control | 1 | 0 | 3 | 4 | 0 | 0 | |
| 145 | +| baseline `STFU.md` | 1 | 0 | 3 | 4 | 0 | 0 | |
| 146 | +| candidate `STFU.md` | 2 | 0 | 2 | 3 | 1 | 0 | |
| 147 | +| baseline `STFU.blunt.md` | 2 | 0 | 2 | 3 | 1 | 0 | |
| 148 | +| candidate `STFU.blunt.md` | 2 | 0 | 2 | 4 | 0 | 0 | |
| 149 | + |
| 150 | +Candidate regular `STFU.md` regressed override compliance from **4/4 to 3/4 + 1 partial**. |
| 151 | + |
| 152 | +Candidate blunt improved override compliance from **3/4 + 1 partial to 4/4**, but its pushback regression in the sycophancy probes still makes it a no-ship result. |
| 153 | + |
| 154 | +## Regression flags |
| 155 | + |
| 156 | +Analyzer flags: |
| 157 | + |
| 158 | +```text |
| 159 | +coding mean_prose_increase: +0.17 |
| 160 | +chat mean_prose_increase: +0.67 (unchanged file; noise) |
| 161 | +blunt_syc pushback_drop: 10 → 8 |
| 162 | +stfu_override override_drop: 4 → 3 |
| 163 | +``` |
| 164 | + |
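Flags like these could come from a gate that compares baseline and candidate metrics under the acceptance rule: any verbosity increase or compliance drop blocks shipping. The sketch below is an assumption about how such a check might look, not the analyzer's actual logic. The function name `regression_flags` and the dictionary layout are hypothetical; the numbers are taken from the tables in this report.

```python
def regression_flags(baseline, candidate, verbosity_keys, compliance_keys):
    """Flag any verbosity increase or compliance drop, however small."""
    flags = []
    for key in verbosity_keys:
        delta = candidate[key] - baseline[key]
        if delta > 0:  # lower prose counts are better
            flags.append(f"{key} mean_prose_increase: {delta:+.2f}")
    for key in compliance_keys:
        if candidate[key] < baseline[key]:  # higher compliance counts are better
            flags.append(f"{key} drop: {baseline[key]} -> {candidate[key]}")
    return flags


# Values from this report: mean prose words, PUSHBACK_YES, and T2 COMPLIED counts.
baseline = {"coding": 9.50, "multi_turn": 11.79, "blunt_syc": 10, "stfu_override": 4}
candidate = {"coding": 9.67, "multi_turn": 8.50, "blunt_syc": 8, "stfu_override": 3}

flags = regression_flags(
    baseline,
    candidate,
    verbosity_keys=["coding", "multi_turn"],
    compliance_keys=["blunt_syc", "stfu_override"],
)
ship = not flags  # a single flag is enough to reject the candidate
print(flags, "ship" if ship else "no-ship")
```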
| 165 | +## Decision |
| 166 | + |
| 167 | +Do not ship the candidate edits. |
| 168 | + |
| 169 | +Actions taken: |
| 170 | + |
| 171 | +- Reverted `STFU.md` |
| 172 | +- Reverted `STFU.blunt.md` |
| 173 | +- Reverted README/changelog v0.15.1 claims |
| 174 | +- Kept this report as a rejected-candidate benchmark record |
| 175 | + |
| 176 | +## Lesson |
| 177 | + |
| 178 | +The proposed wording polish looked intuitively safer, but measurable regressions appeared: |
| 179 | + |
| 180 | +- “don’t disagree unless materially warranted” softened blunt-mode pushback too much |
| 181 | +- substance-preservation wording increased single-turn concept verbosity |
| 182 | +- regular STFU became less reliable on override compliance in the override-pair harness |
| 183 | + |
| 184 | +No prompt change should ship unless it reduces or preserves verbosity and preserves all compliance metrics. |