-Brief description of the change to `STFU.md`, `STFU.blunt.md`, or other files.
+Brief description of the change to `STFU.md`, `STFU.blunt.md`, docs, or benchmark files.

## Why

-Which prompt-shape, agent, or behaviour this addresses. Reference `BENCHMARKS.md` rows where applicable.
+What failure mode, install path, agent behavior, or documentation gap this addresses. Reference `data/benchmarks.md`, `data/dspy-cross-model-results.md`, or `data/changelog.md` where applicable.

## Bench impact

-If you ran the benchmark with this change, paste the per-agent delta:
+If this changes `STFU.md` or `STFU.blunt.md`, include benchmark or manual before/after evidence:

-| agent | STFU.md v0.13 (current) | this PR | Δ |
-|---|---:|---:|---:|
+| agent/app | current | this PR | Δ / verdict |
+|---|---:|---:|---|
| claude | … | … | … |
| codex | … | … | … |
| … | … | … | … |

-If you didn't run the bench, that's fine — flag it and a maintainer will run it.
+If you did not run a benchmark, say so and explain why.
+
+Docs-only / CI-only PRs can write `N/A — no prompt behavior changed`.

## Verification

-- [ ] `STFU.md` deploys cleanly to documented coding-agent paths (per `data/agent-locations.md`)
-- [ ] Smoke test passes (`claude -p "What's the git command to undo the last commit but keep changes staged?"` returns the bare command)
-- [ ] No regression on a previously-passing prompt (manual spot check is fine)
- **−75.1%** averaged across 8-turn coding conversations
-- No statistically significant decay over 8 turns (slope p=0.28; T1→T8 ratio 0.15)
-- Removed `## Templates` section because it caused engagement-refusal on under-specified prompts (e.g. "TypeError: Cannot read… of undefined" → returned *"Need code or error first."* instead of helping). Compression cost: ~3 pp; reliability gain: substantial.
+Headline results:

-See [`data/benchmarks.md`](data/benchmarks.md) and [`data/changelog.md`](data/changelog.md) for details.
+- **STFU.md v0.13.1:** −82.1% total prose reduction, 100% average compliance (5 agents × 5 prompts).
+- **STFU.md v0.14.3:** −80.0% single-turn prose reduction; −75.1% across 8-turn coding conversations; no significant decay.
The regular `STFU.md` prompt was tested in two DSPy optimization runs; no candidate beat the shipped v0.16.0 prompt on the current metric. `STFU.blunt.md` improved materially in v0.18.0, especially on opencode pushback (0.38→0.81) and cursor correct-user agreement (0.44→0.89).

-Round-2 optimization on a 3-5x larger probe corpus (73 train + 32 held-out per variant), validated **across 5 agent CLIs** (claude, codex, cursor-agent, gemini, opencode) with **independent codex judge** (different model family from generator → eliminates self-bias).
Biggest wins: opencode pushback 0.38→0.81 (+0.43), cursor agree-rate 0.44→0.89 (+0.45), codex prose −37% (p=0.008). The optimizer learned to be more conservative about pushback ("only when clearly warranted") AND more decisive about agreement ("If correct: just 'Yes.' or 'Fine.'").
-
-**STFU.md (regular)**: DSPy round-2 (n=73 train) again **found no improvement** over v0.16.0. Two independent runs confirm v0.16.0 is at a local optimum on this metric. Stays as-is.
-
-See [`data/changelog.md` §[0.18.0]](data/changelog.md) for full per-agent table, statistical analysis, and limitations.
| chat-probe mean prose words (n=6) | 17.7 | **14.3** | **−19%** |
-
-The optimizer discovered a new `Confirm ("right?/correct?/r?") → Yes/No first` shape rule that fixed the v0.15.0 failure mode of over-hedging on legitimately-correct user statements (e.g., "Hash maps offer O(1) average-case lookups, right?"). New "Never open with validation" Style line. Statistical significance caveat: n=10 held-out makes p=0.15 expected for real effects; improvement is **directional and consistent** across all three test sets.
-
-For the **regular `STFU.md`** prompt: DSPy optimization across the same loop **found no improvement** — all 15 candidate variations scored lower than the shipped v0.16.0 seed on training (0.540). The current STFU.md is at a local optimum on this metric. Honest result, kept as-is.
-
-See [`data/changelog.md` §[0.17.0]](data/changelog.md) for full methodology, per-probe breakdown, and limitations.
V2 passed all five pre-committed criteria and shipped as v0.15.0. v0.17.0 supersedes via DSPy-optimized prompt.
+See [`data/benchmarks.md`](data/benchmarks.md), [`data/dspy-cross-model-results.md`](data/dspy-cross-model-results.md), and [`data/changelog.md`](data/changelog.md) for methodology, full tables, caveats, and historical runs.