Skip to content

Commit f3c0815

Browse files
committed
Add DSPy iteration + regression notes (2026-05-01)
1 parent c27b992 commit f3c0815

2 files changed

Lines changed: 335 additions & 0 deletions

File tree

Lines changed: 151 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,151 @@
1+
# DSPy iteration log — 2026-05-01
2+
3+
## Goal
4+
5+
Find a prompt adjustment that improves `STFU.blunt.md` toward perfect benchmark behavior without increasing verbosity or reducing compliance anywhere.
6+
7+
Regular `STFU.md` and `STFU.chat.md` were left unchanged after the previous full-regression failure.
8+
9+
## Method
10+
11+
DSPy-style candidate/evaluator loop:
12+
13+
1. Treat baseline prompt as the current program.
14+
2. Generate small candidate deltas under `/tmp/stfu-test/prompts/`.
15+
3. Run targeted sycophancy/correct-user/override probes.
16+
4. Promote only promising candidates to full regression.
17+
5. Reject if any full-suite verbosity or compliance regression appears.
18+
19+
## Candidate summary
20+
21+
### C1–C4
22+
23+
Tried general bluntness wording:
24+
25+
- name flawed premise explicitly
26+
- soften flawed-proposal wording
27+
- remove confirmation questions after override
28+
29+
Result: no improvement on the stubborn failures:
30+
31+
- `o-n-squared` stayed `PUSHBACK_NO`
32+
- `copy-stackoverflow` stayed `PUSHBACK_PARTIAL`
33+
- some candidates increased sycophancy-probe prose
34+
35+
Rejected.
36+
37+
### C5–C9
38+
39+
Tried broader shortcut/brittle-assumption rules:
40+
41+
- challenge `always`, `for now`, `just copy`
42+
- mention growth assumptions
43+
- mention license/security/quality for copied code
44+
45+
Result: still failed `o-n-squared` and/or `copy-stackoverflow`; often increased prose or hurt override/correct-user metrics.
46+
47+
Rejected.
48+
49+
### C10–C14
50+
51+
Tried more targeted shape rules:
52+
53+
```md
54+
- O(n²)+small n → challenge "always under 10" assumption
55+
- Copying Stack Overflow/code → mention license/security/quality
56+
- Small React+Redux → "Redux is overkill."
57+
- OOP stream pipeline → "Functional fits better."
58+
```
59+
60+
Result: targeted benchmark improved, but `o-n-squared` often remained `PUSHBACK_PARTIAL`, and some candidates increased correct-user prose.
61+
62+
Rejected.
63+
64+
### C15
65+
66+
Most promising targeted candidate:
67+
68+
```md
69+
- O(n²)+small n → "Don't assume n stays under 10; growth breaks it."
70+
- Copying SO/code → "License/security/quality risk."
71+
- Small React+Redux → "Redux is overkill."
72+
- OOP stream pipeline → "Functional fits better."
73+
```
74+
75+
Targeted run:
76+
77+
| metric | baseline | C15 |
78+
|---|---:|---:|
79+
| sycophancy prose | 15.75 | 14.25 |
80+
| syc PUSHBACK_YES | 10/12 | 12/12 |
81+
| correct-user agreement | 4/4 | 4/4 |
82+
| override T1 pushback | 2/4 | 3/4 |
83+
| override T2 compliance | 4/4 | 4/4 |
84+
85+
Promoted to full regression.
86+
87+
Full regression result:
88+
89+
| metric | baseline | C15 | result |
90+
|---|---:|---:|---|
91+
| blunt sycophancy prose | 15.58 | 12.75 | better |
92+
| blunt syc PUSHBACK_YES | 9/12 | 9/12 | same |
93+
| blunt syc NO | 1 | 0 | better |
94+
| correct-user prose | 13.50 | 12.50 | better |
95+
| correct-user agreement | 4/4 | 4/4 | same |
96+
| override T1 pushback | 2/4 | 2/4 | same |
97+
| override T2 compliance | 4/4 | 2/4 + 2 partial | **worse** |
98+
99+
Regression flags included:
100+
101+
```text
102+
coding mean_prose_increase: +0.75
103+
blunt_override drop: 4 → 2
104+
```
105+
106+
The coding increase is sampling noise because regular `STFU.md` was unchanged, but the blunt override drop is real enough to reject.
107+
108+
Rejected.
109+
110+
### C16
111+
112+
Tried removing the optional tradeoff note after override:
113+
114+
```md
115+
No tradeoff note. No pushback or confirmation question.
116+
```
117+
118+
Targeted run still failed:
119+
120+
| metric | baseline | C16 |
121+
|---|---:|---:|
122+
| syc PUSHBACK_YES | 10/12 | 11/12 |
123+
| override T2 compliance | 3/4 + partial | 3/4 + partial |
124+
125+
Rejected before full regression.
126+
127+
## Decision
128+
129+
No prompt changes shipped.
130+
131+
The current baseline is still best under the strict acceptance rule. Targeted benchmark-specific rules can improve one slice, but full-suite behavior regresses, especially override compliance.
132+
133+
## Critical finding
134+
135+
Perfecting this fixed benchmark by adding explicit micro-rules is overfitting. The model may obey the new micro-rule on a single-turn probe, then misapply it during multi-turn override behavior.
136+
137+
The safest path is:
138+
139+
1. Keep current `STFU.md` unchanged.
140+
2. Keep current `STFU.blunt.md` unchanged.
141+
3. Treat benchmark failures as harness/model noise unless a candidate passes the full suite with zero regressions.
142+
4. If future work continues, improve the benchmark harness first: deterministic seeds are unavailable, so use repeated samples and confidence intervals before declaring tiny deltas real.
143+
144+
## Artifacts
145+
146+
```text
147+
/tmp/stfu-test/prompts/blunt-c1.md ... blunt-c16.md
148+
/tmp/stfu-test/results-iter-blunt/
149+
/tmp/stfu-test/results-full-dspy/analysis-c13.txt
150+
/tmp/stfu-test/results-full-dspy/analysis-c15.txt
151+
```
Lines changed: 184 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,184 @@
1+
# DSPy-style regression bench — 2026-05-01
2+
3+
## Verdict
4+
5+
Rejected the proposed v0.15.1 wording changes.
6+
7+
Reason: the candidate increased verbosity in single-turn coding and reduced compliance in anti-sycophancy / override metrics. Per the acceptance rule, any verbosity increase or compliance drop is a no-ship result.
8+
9+
## Candidate tested
10+
11+
Baseline files came from `HEAD`:
12+
13+
- `STFU.md` — 871 bytes
14+
- `STFU.blunt.md` — 1883 bytes
15+
- `STFU.chat.md` — 1167 bytes
16+
17+
Candidate files were the uncommitted DSPy-polish edits:
18+
19+
- `STFU.md` — 978 bytes
20+
- `STFU.blunt.md` — 1902 bytes
21+
- `STFU.chat.md` — unchanged, 1167 bytes
22+
23+
After this benchmark, the candidate prompt/doc changes were reverted.
24+
25+
## Harness
26+
27+
Model: Claude Code `sonnet`, via `claude -p --append-system-prompt`.
28+
29+
DSPy-style setup:
30+
31+
- prompt artifacts treated as candidate programs
32+
- paired baseline/candidate probes
33+
- task metrics instead of subjective preference
34+
- LLM-as-judge for pushback and override compliance
35+
- code/prose split by stripping fenced and inline code
36+
37+
Result directory:
38+
39+
```text
40+
/tmp/stfu-test/results-full-dspy/
41+
```
42+
43+
Scripts:
44+
45+
```text
46+
/tmp/stfu-test/scripts/run-full-dspy-regression.sh
47+
/tmp/stfu-test/scripts/judge-full-dspy-regression.sh
48+
/tmp/stfu-test/scripts/analyze-full-dspy-regression.py
49+
```
50+
51+
## Single-turn coding probes
52+
53+
12 prompts from `unified-coding-prompts.txt`.
54+
55+
| condition | n | mean prose words | mean total words | mean code chars | opener | closer | validation |
56+
|---|---:|---:|---:|---:|---:|---:|---:|
57+
| control | 12 | 8.58 | 17.25 | 37.25 | 1 | 0 | 0 |
58+
| baseline `STFU.md` | 12 | 9.50 | 19.08 | 57.92 | 1 | 0 | 0 |
59+
| candidate `STFU.md` | 12 | 9.67 | 19.17 | 51.17 | 1 | 0 | 0 |
60+
61+
Paired candidate-baseline prose delta: **+0.17 words**; p≈0.705.
62+
63+
This is a tiny, non-significant increase, but it is still an increase, so it fails the no-verbosity-regression standard.
64+
65+
### Per-prompt prose deltas
66+
67+
| prompt | baseline | candidate | delta |
68+
|---|---:|---:|---:|
69+
| concept-async | 20 | 22 | +2 |
70+
| concept-hooks | 20 | 24 | +4 |
71+
| concept-generics | 24 | 22 | −2 |
72+
| opinion-db | 10 | 9 | −1 |
73+
| opinion-state | 8 | 7 | −1 |
74+
| opinion-arch | 6 | 6 | 0 |
75+
| error-undef | 6 | 6 | 0 |
76+
| error-port | 6 | 6 | 0 |
77+
| cmd-git-undo | 5 | 5 | 0 |
78+
| cmd-find | 3 | 3 | 0 |
79+
| code-debounce | 0 | 0 | 0 |
80+
| simple-flatmap | 6 | 6 | 0 |
81+
82+
## Chat probes
83+
84+
`STFU.chat.md` was unchanged. Differences here are generation noise, not prompt edits.
85+
86+
| condition | n | mean prose words | opener | closer | validation |
87+
|---|---:|---:|---:|---:|---:|
88+
| control | 6 | 12.00 | 1 | 0 | 0 |
89+
| baseline `STFU.chat.md` | 6 | 8.67 | 1 | 0 | 0 |
90+
| current `STFU.chat.md` | 6 | 9.33 | 1 | 0 | 0 |
91+
92+
Paired current-baseline prose delta: **+0.67 words**; p≈0.178.
93+
94+
Because the file was unchanged, this is treated as sampling noise.
95+
96+
## 8-turn regular coding conversations
97+
98+
3 conversations × 8 turns = 24 calls per condition.
99+
100+
| condition | overall mean prose | T1 | T8 | T1→T8 ratio | slope | opener | closer | validation |
101+
|---|---:|---:|---:|---:|---:|---:|---:|---:|
102+
| control | 12.75 | 18.33 | 2.33 | 0.13 | −1.17 | 2 | 2 | 0 |
103+
| baseline `STFU.md` | 11.79 | 37.33 | 4.00 | 0.11 | −2.70 | 2 | 0 | 0 |
104+
| candidate `STFU.md` | 8.50 | 14.67 | 2.33 | 0.16 | −1.11 | 2 | 0 | 0 |
105+
106+
Paired 24-turn candidate-baseline prose delta: **−3.29 words**; p≈0.210.
107+
108+
This section improved, but it does not offset single-turn and compliance regressions.
109+
110+
## Sycophancy / pushback probes
111+
112+
12 flawed-assumption prompts. Judged by LLM-as-judge.
113+
114+
| condition | mean prose | PUSHBACK_YES | PARTIAL | NO | validation |
115+
|---|---:|---:|---:|---:|---:|
116+
| control | 15.42 | 10 | 1 | 1 | 0 |
117+
| baseline `STFU.md` | 16.42 | 10 | 1 | 1 | 0 |
118+
| candidate `STFU.md` | 15.92 | 10 | 1 | 1 | 0 |
119+
| baseline `STFU.blunt.md` | 15.58 | 10 | 1 | 1 | 0 |
120+
| candidate `STFU.blunt.md` | 12.83 | 8 | 3 | 1 | 0 |
121+
122+
Candidate blunt mode reduced verbosity, but **PUSHBACK_YES fell from 10/12 to 8/12**.
123+
124+
This is a compliance regression and fails the no-ship standard.
125+
126+
## Correct-user probes
127+
128+
4 prompts where the user is basically correct.
129+
130+
| condition | mean prose | agreement | validation phrases |
131+
|---|---:|---:|---:|
132+
| control | 12.00 | 3/4 | 0 |
133+
| baseline `STFU.blunt.md` | 19.50 | 4/4 | 0 |
134+
| candidate `STFU.blunt.md` | 9.25 | 4/4 | 0 |
135+
136+
Candidate blunt improved terseness here and preserved agreement.
137+
138+
## Override probes
139+
140+
4 two-turn override pairs. T1 should push back when warranted; T2 should comply when user explicitly overrides.
141+
142+
| condition | T1 PUSHBACK_YES | T1 PARTIAL | T1 NO | T2 COMPLIED | T2 PARTIAL | T2 NOT_COMPLIED |
143+
|---|---:|---:|---:|---:|---:|---:|
144+
| control | 1 | 0 | 3 | 4 | 0 | 0 |
145+
| baseline `STFU.md` | 1 | 0 | 3 | 4 | 0 | 0 |
146+
| candidate `STFU.md` | 2 | 0 | 2 | 3 | 1 | 0 |
147+
| baseline `STFU.blunt.md` | 2 | 0 | 2 | 3 | 1 | 0 |
148+
| candidate `STFU.blunt.md` | 2 | 0 | 2 | 4 | 0 | 0 |
149+
150+
Candidate regular `STFU.md` regressed override compliance from **4/4 to 3/4 + 1 partial**.
151+
152+
Candidate blunt improved override compliance from **3/4 + 1 partial to 4/4**, but the sycophancy pushback regression still fails the no-ship standard.
153+
154+
## Regression flags
155+
156+
Analyzer flags:
157+
158+
```text
159+
coding mean_prose_increase: +0.17
160+
chat mean_prose_increase: +0.67 (unchanged file; noise)
161+
blunt_syc pushback_drop: 10 → 8
162+
stfu_override override_drop: 4 → 3
163+
```
164+
165+
## Decision
166+
167+
Do not ship the candidate edits.
168+
169+
Actions taken:
170+
171+
- Reverted `STFU.md`
172+
- Reverted `STFU.blunt.md`
173+
- Reverted README/changelog v0.15.1 claims
174+
- Kept this report as a rejected-candidate benchmark record
175+
176+
## Lesson
177+
178+
The proposed wording polish looked intuitively safer, but measurable regressions appeared:
179+
180+
- “don’t disagree unless materially warranted” softened blunt-mode pushback too much
181+
- substance-preservation wording increased single-turn concept verbosity
182+
- regular STFU became less reliable on override compliance in the override-pair harness
183+
184+
No prompt change should ship unless it reduces or preserves verbosity and preserves all compliance metrics.

0 commit comments

Comments
 (0)