Commit 4774c0f

docs: refresh benchmarks and add CI checks
1 parent f3c0815 commit 4774c0f

15 files changed

Lines changed: 279 additions & 154 deletions

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 12 additions & 10 deletions
@@ -1,29 +1,31 @@
 ## What this PR changes
 
-Brief description of the change to `STFU.md`, `STFU.blunt.md`, or other files.
+Brief description of the change to `STFU.md`, `STFU.blunt.md`, docs, or benchmark files.
 
 ## Why
 
-Which prompt-shape, agent, or behaviour this addresses. Reference `BENCHMARKS.md` rows where applicable.
+What failure mode, install path, agent behavior, or documentation gap this addresses. Reference `data/benchmarks.md`, `data/dspy-cross-model-results.md`, or `data/changelog.md` where applicable.
 
 ## Bench impact
 
-If you ran the benchmark with this change, paste the per-agent delta:
+If this changes `STFU.md` or `STFU.blunt.md`, include benchmark or manual before/after evidence:
 
-| agent | STFU.md v0.13 (current) | this PR | Δ |
-|---|---:|---:|---:|
+| agent/app | current | this PR | Δ / verdict |
+|---|---:|---:|---|
 | claude ||||
 | codex ||||
 |||||
 
-If you didn't run the bench, that's fine — flag it and a maintainer will run it.
+If you did not run a benchmark, say so and explain why.
+
+Docs-only / CI-only PRs can write `N/A — no prompt behavior changed`.
 
 ## Verification
 
-- [ ] `STFU.md` deploys cleanly to documented coding-agent paths (per `data/agent-locations.md`)
-- [ ] Smoke test passes (`claude -p "What's the git command to undo the last commit but keep changes staged?"` returns the bare command)
-- [ ] No regression on a previously-passing prompt (manual spot check is fine)
+- [ ] Lightweight checks pass (`node --check`, JSON validation, Python compile, Markdown links)
+- [ ] Prompt behavior smoke-tested if prompt files changed
+- [ ] Benchmark results or manual examples included if prompt behavior changed
 
 ## Risk of breaking other agents
 
-Which other agents/prompt-shapes might this rule affect? Anything you'd want extra eyes on?
+Which agents, apps, or prompt shapes might this affect? Anything that needs extra review?

.github/workflows/ci.yml

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+name: CI
+
+on:
+  pull_request:
+  push:
+    branches: [main]
+
+permissions:
+  contents: read
+
+jobs:
+  checks:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: 22
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.x'
+
+      - name: Check JavaScript syntax
+        run: |
+          node --check bench/analyze.js
+          node --check bench/make-charts.js
+
+      - name: Validate JSON data
+        run: |
+          python3 -m json.tool data/benchmarks-summary.json >/dev/null
+          python3 -m json.tool data/benchmarks-matrix.json >/dev/null
+          python3 -m json.tool data/visualizations/charts.json >/dev/null
+
+      - name: Check Python syntax
+        run: python3 -m py_compile bench/dspy/*.py bench/check-md-links.py
+
+      - name: Check Markdown links
+        run: python3 bench/check-md-links.py
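
For contributors who want to mirror these CI steps locally, the following is a minimal sketch of a runner script. It is not part of this commit. The helper names `build_commands` and `run_all` are hypothetical; the file paths are the ones listed in the workflow above, and the sketch assumes it is run from the repository root with `node` available on `PATH`.

```python
import glob
import subprocess
import sys
from typing import List

# Paths copied from the workflow steps above.
JS_FILES = ["bench/analyze.js", "bench/make-charts.js"]
JSON_FILES = [
    "data/benchmarks-summary.json",
    "data/benchmarks-matrix.json",
    "data/visualizations/charts.json",
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Return the CI checks as argv lists, in the same order as the workflow."""
    commands = [["node", "--check", js] for js in JS_FILES]
    commands += [[python, "-m", "json.tool", path] for path in JSON_FILES]
    # The workflow relies on shell globbing; here the glob is expanded in Python.
    py_files = sorted(glob.glob("bench/dspy/*.py")) + ["bench/check-md-links.py"]
    commands.append([python, "-m", "py_compile", *py_files])
    commands.append([python, "bench/check-md-links.py"])
    return commands


def run_all() -> int:
    """Run each check and stop at the first failure, returning its exit code."""
    for cmd in build_commands():
        # stdout is discarded for every step. That matches the >/dev/null on the
        # JSON checks but also hides the link checker's summary line.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        if result.returncode != 0:
            print("FAILED:", " ".join(cmd), file=sys.stderr)
            return result.returncode
    return 0
```

Calling `run_all()` from the repository root returns 0 when every check passes and the failing command's exit code otherwise. Keeping the command list in one function makes it easy to compare against `ci.yml` when the workflow changes.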

CONTRIBUTING.md

Lines changed: 4 additions & 3 deletions
@@ -60,15 +60,16 @@ Please include:
 
 ## Running checks
 
-For docs-only changes, just make sure links still work.
-
-For benchmark/chart changes, run:
+CI runs the lightweight checks below on pushes and pull requests. Run them locally before opening a PR:
 
 ```bash
 node --check bench/analyze.js
 node --check bench/make-charts.js
 python3 -m json.tool data/benchmarks-summary.json >/dev/null
+python3 -m json.tool data/benchmarks-matrix.json >/dev/null
 python3 -m json.tool data/visualizations/charts.json >/dev/null
+python3 -m py_compile bench/dspy/*.py bench/check-md-links.py
+python3 bench/check-md-links.py
 ```
 
 ## Issues

README.md

Lines changed: 11 additions & 58 deletions
@@ -64,69 +64,22 @@ The difference: STFU.md aims for concise output **without** turning the assistan
 
 ## Benchmarks
 
-The current coding-agent prompt is **871 bytes** (v0.14.3).
+Current prompt sizes:
 
-Reference bench (v0.13.1, 5 agents × 5 prompts): **−82.1% total prose reduction** with **100% average compliance**.
+| File | Bytes |
+|---|---:|
+| [`STFU.md`](STFU.md) | 1,345 |
+| [`STFU.blunt.md`](STFU.blunt.md) | 1,640 |
 
-v0.14.3 controlled ablation (Claude Sonnet 4.6, n=12 single-turn + 24 8-turn calls per condition):
-- **−80.0% prose reduction** vs no-prompt baseline (single-turn, paired t-test p<0.0001, Cohen's d=1.79)
-- **−75.1%** averaged across 8-turn coding conversations
-- No statistically significant decay over 8 turns (slope p=0.28; T1→T8 ratio 0.15)
-- Removed `## Templates` section because it caused engagement-refusal on under-specified prompts (e.g. "TypeError: Cannot read… of undefined" → returned *"Need code or error first."* instead of helping). Compression cost: ~3 pp; reliability gain: substantial.
+Headline results:
 
-See [`data/benchmarks.md`](data/benchmarks.md) and [`data/changelog.md`](data/changelog.md) for details.
+- **STFU.md v0.13.1:** −82.1% total prose reduction, 100% average compliance (5 agents × 5 prompts).
+- **STFU.md v0.14.3:** −80.0% single-turn prose reduction; −75.1% across 8-turn coding conversations; no significant decay.
+- **STFU.blunt.md v0.18.0:** DSPy round-2 + 5-agent cross-model validation; avg pushback 0.848, correct-user agreement 0.912, mean prose 11.0 words, validation phrases 0%.
 
-### STFU.blunt.md DSPy round-2 + cross-model validation (v0.18.0)
+The regular `STFU.md` prompt was tested in two DSPy optimization runs; no candidate beat the shipped v0.16.0 prompt on the current metric. `STFU.blunt.md` improved materially in v0.18.0, especially on opencode pushback (0.38→0.81) and cursor correct-user agreement (0.44→0.89).
 
-Round-2 optimization on a 3-5x larger probe corpus (73 train + 32 held-out per variant), validated **across 5 agent CLIs** (claude, codex, cursor-agent, gemini, opencode) with **independent codex judge** (different model family from generator → eliminates self-bias).
-
-Cross-model results vs v0.15.0 and v0.17.0:
-
-| metric (avg across 5 agents) | v0.15.0 | v0.17.0 | **v0.18.0** |
-|---|---:|---:|---:|
-| pushback rate (sycophancy) | 0.746 | 0.750 | **0.848 ★** |
-| correct-user agree rate | 0.890 | 0.820 ⚠ | **0.912 ★** |
-| prose words mean | 13.6 | 13.1 | **11.0 ★** |
-| validation-phrase rate | 0% | 0% | 0% |
-
-Biggest wins: opencode pushback 0.38→0.81 (+0.43), cursor agree-rate 0.44→0.89 (+0.45), codex prose −37% (p=0.008). The optimizer learned to be more conservative about pushback ("only when clearly warranted") AND more decisive about agreement ("If correct: just 'Yes.' or 'Fine.'").
-
-**STFU.md (regular)**: DSPy round-2 (n=73 train) again **found no improvement** over v0.16.0. Two independent runs confirm v0.16.0 is at a local optimum on this metric. Stays as-is.
-
-See [`data/changelog.md` §[0.18.0]](data/changelog.md) for full per-agent table, statistical analysis, and limitations.
-
-### STFU.blunt.md DSPy round-1 (v0.17.0 — historical)
-
-Empirical instruction evolution via [DSPy](https://github.com/stanfordnlp/dspy)-style optimization (custom COPRO-like loop, 5 candidates × 3 rounds = 15 variations evaluated). Probe set: 25 train + 10 held-out + 6 chat-sanity. Multi-objective scalar metric: pushback rate on sycophancy probes, agreement rate on correct-user probes, override compliance, terseness — minus a length penalty to avoid prompt bloat.
-
-Results (Sonnet 4.6):
-
-| metric | shipped (v0.15.0) | DSPy-optimized (v0.17.0) | Δ |
-|---|---:|---:|---|
-| prompt size (bytes) | 1843 | **1479** | **−20%** |
-| training-set score | 0.743 | **0.819** | **+10%** |
-| held-out (n=10) score | 0.471 | **0.658** | **+0.118 (p=0.15)** |
-| chat-probe mean prose words (n=6) | 17.7 | **14.3** | **−19%** |
-
-The optimizer discovered a new `Confirm ("right?/correct?/r?") → Yes/No first` shape rule that fixed the v0.15.0 failure mode of over-hedging on legitimately-correct user statements (e.g., "Hash maps offer O(1) average-case lookups, right?"). New "Never open with validation" Style line. Statistical significance caveat: n=10 held-out makes p=0.15 expected for real effects; improvement is **directional and consistent** across all three test sets.
-
-For the **regular `STFU.md`** prompt: DSPy optimization across the same loop **found no improvement** — all 15 candidate variations scored lower than the shipped v0.16.0 seed on training (0.540). The current STFU.md is at a local optimum on this metric. Honest result, kept as-is.
-
-See [`data/changelog.md` §[0.17.0]](data/changelog.md) for full methodology, per-probe breakdown, and limitations.
-
-### STFU.blunt.md V1→V2 manual ablation (v0.15.0 — historical)
-
-Earlier two-iteration design (V1→V2) before DSPy optimization. Results vs base STFU and no-prompt control:
-
-| metric | control | STFU | BLUNT V2 |
-|---|---:|---:|---:|
-| pushback rate (sycophancy probes) | 5/6 | 4/6 | **5/6** |
-| validation-phrase rate (all 12 prompts) | 1/12 | 0/12 | 0/12 |
-| override compliance (multi-turn) | 2/2 | 2/2 | **2/2** |
-| plain-coding prose words (terseness regression) | 62.2 | 16.0 | 17.2 |
-| correct-user agreement rate (sanity) | 1/2 | 1/2 | 1/2 |
-
-V2 passed all five pre-committed criteria and shipped as v0.15.0. v0.17.0 supersedes via DSPy-optimized prompt.
+See [`data/benchmarks.md`](data/benchmarks.md), [`data/dspy-cross-model-results.md`](data/dspy-cross-model-results.md), and [`data/changelog.md`](data/changelog.md) for methodology, full tables, caveats, and historical runs.
 
 ## Example outputs

bench/check-md-links.py

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+"""Validate local Markdown links in this repository."""
+
+from __future__ import annotations
+
+import re
+import sys
+from pathlib import Path
+from urllib.parse import unquote
+
+ROOT = Path(__file__).resolve().parents[1]
+LINK_RE = re.compile(r"!?\[[^\]]*\]\(([^)]+)\)")
+SKIP_SCHEMES = ("http://", "https://", "mailto:", "tel:")
+
+
+def iter_markdown_lines(path: Path):
+    in_fence = False
+    for lineno, line in enumerate(path.read_text(encoding="utf-8", errors="replace").splitlines(), 1):
+        stripped = line.lstrip()
+        if stripped.startswith("```") or stripped.startswith("~~~"):
+            in_fence = not in_fence
+            continue
+        if not in_fence:
+            yield lineno, line
+
+
+def normalize_target(raw_href: str) -> str | None:
+    href = raw_href.strip()
+    if not href or href.startswith("#") or href.startswith(SKIP_SCHEMES):
+        return None
+
+    target = href.split("#", 1)[0].strip()
+    if not target:
+        return None
+
+    if target.startswith("<") and target.endswith(">"):
+        target = target[1:-1]
+
+    # Drop optional Markdown titles: [x](path.md "title")
+    target = target.split(' "', 1)[0].split(" '", 1)[0]
+    return unquote(target)
+
+
+def main() -> int:
+    errors: list[str] = []
+    checked = 0
+
+    for path in sorted(ROOT.rglob("*.md")):
+        if ".git" in path.parts:
+            continue
+        for lineno, line in iter_markdown_lines(path):
+            for match in LINK_RE.finditer(line):
+                target = normalize_target(match.group(1))
+                if target is None:
+                    continue
+                checked += 1
+                resolved = (path.parent / target).resolve()
+                if not resolved.exists():
+                    rel = path.relative_to(ROOT)
+                    errors.append(f"{rel}:{lineno}: missing link target: {match.group(1)}")
+
+    if errors:
+        print("Broken Markdown links:", file=sys.stderr)
+        for error in errors:
+            print(f" {error}", file=sys.stderr)
+        return 1
+
+    print(f"Checked {checked} local Markdown links.")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
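
To make the normalization rules in `bench/check-md-links.py` easier to review, here is a small sketch that exercises a copy of `normalize_target` on a few sample hrefs. The function body follows the diff above, except that `Optional[str]` replaces `str | None` so the snippet does not need the `__future__` import. The sample hrefs are hypothetical and do not come from the repository.

```python
from typing import Optional
from urllib.parse import unquote

SKIP_SCHEMES = ("http://", "https://", "mailto:", "tel:")


def normalize_target(raw_href: str) -> Optional[str]:
    """Copy of the helper from the diff: return a local path, or None to skip."""
    href = raw_href.strip()
    if not href or href.startswith("#") or href.startswith(SKIP_SCHEMES):
        return None
    # Strip the fragment so only the file path is checked.
    target = href.split("#", 1)[0].strip()
    if not target:
        return None
    if target.startswith("<") and target.endswith(">"):
        target = target[1:-1]
    # Drop optional Markdown titles: [x](path.md "title")
    target = target.split(' "', 1)[0].split(" '", 1)[0]
    return unquote(target)


# Hypothetical sample hrefs, chosen to hit each branch once.
samples = [
    "https://example.com/page",    # external link, skipped
    "#usage",                      # same-file anchor, skipped
    "data/benchmarks.md#results",  # fragment removed
    'docs/guide.md "Guide"',       # title removed
    "my%20notes.md",               # percent-encoding decoded
    "<my notes.md>",               # angle brackets removed
    "<notes.md#top>",              # fragment split runs first, so "<" survives
]

for href in samples:
    print(repr(href), "->", repr(normalize_target(href)))
```

The last sample shows one edge worth knowing about: because the fragment is split off before the angle-bracket check, an angle-bracketed href that also carries a fragment keeps its leading `<` and would be reported as a missing target.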

bench/dspy/README.md

Lines changed: 4 additions & 0 deletions
@@ -23,6 +23,9 @@ Custom DSPy-style instruction-evolution loop for optimizing STFU.md and STFU.blu
 ## Reproduce
 
 ```bash
+# Optional: override scratch output location
+export STFU_DSPY_DIR=/tmp/stfu-test/dspy
+
 # 1. Install
 python3 -m pip install --user dspy
 
@@ -31,6 +34,7 @@ python3 bench/dspy/expanded_corpus.py
 # → /tmp/stfu-test/dspy/probe_splits_10x.json
 
 # 3. Run optimization on each variant (~30-90 min wall time, ~1500 calls each)
+# Uses the repository's current STFU.md / STFU.blunt.md as seeds.
 python3 bench/dspy/dspy_optimize_v2.py stfu
 python3 bench/dspy/dspy_optimize_v2.py blunt
 # → /tmp/stfu-test/dspy/v2/{stfu,blunt}_best.md

bench/dspy/cross_model_analyze.py

Lines changed: 2 additions & 2 deletions
@@ -1,15 +1,15 @@
 """Analyze cross-model held-out + independent-judge results."""
 import json
 import math
+import os
 import re
 import subprocess
 import sys
 from pathlib import Path
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from collections import defaultdict
 
-R = Path("/tmp/stfu-test/dspy/cross")
-sys.path.insert(0, "/tmp/stfu-test/scripts")
+R = Path(os.environ.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy")) / "cross"
 from cross_model_holdout import judge_pushback_codex, judge_informative_codex
 

bench/dspy/cross_model_holdout.py

Lines changed: 17 additions & 8 deletions
@@ -16,8 +16,10 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
 
-R = Path("/tmp/stfu-test/dspy/cross")
-R.mkdir(exist_ok=True)
+ROOT = Path(__file__).resolve().parents[2]
+DSPY_DIR = Path(os.environ.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy"))
+R = DSPY_DIR / "cross"
+R.mkdir(parents=True, exist_ok=True)
 
 # ── Agent runners — uniform interface ────────────────────────────────
 def run_claude(system: str, user: str) -> str:
@@ -241,20 +243,27 @@ def run_one_agent_probe(agent: str, prompt_label: str, system_prompt: str,
     }
 
 
+def read_optional(path: Path) -> str | None:
+    return path.read_text() if path.exists() else None
+
+
 def main(variant: str):
-    splits = json.loads(Path("/tmp/stfu-test/dspy/probe_splits_10x.json").read_text())
+    splits_path = DSPY_DIR / "probe_splits_10x.json"
+    if not splits_path.exists():
+        raise SystemExit(f"Missing {splits_path}. Run bench/dspy/expanded_corpus.py first.")
+
+    splits = json.loads(splits_path.read_text())
     test = splits[variant]["test"]
 
     if variant == "stfu":
         prompts = {
-            "shipped": Path("/tmp/stfu-test/prompts/old-stfu-v016.md").read_text(),
-            "optimized": Path("/tmp/stfu-test/dspy/v2/stfu_best.md").read_text() if Path("/tmp/stfu-test/dspy/v2/stfu_best.md").exists() else None,
+            "shipped": (ROOT / "STFU.md").read_text(),
+            "optimized": read_optional(DSPY_DIR / "v2" / "stfu_best.md"),
         }
     else:
         prompts = {
-            "shipped": Path("/tmp/stfu-test/prompts/old-stfu-blunt-v015.md").read_text(),
-            "v017": Path("/tmp/stfu-repo/STFU.blunt.md").read_text(),
-            "optimized": Path("/tmp/stfu-test/dspy/v2/blunt_best.md").read_text() if Path("/tmp/stfu-test/dspy/v2/blunt_best.md").exists() else None,
+            "shipped": (ROOT / "STFU.blunt.md").read_text(),
+            "optimized": read_optional(DSPY_DIR / "v2" / "blunt_best.md"),
         }
     prompts = {k: v for k, v in prompts.items() if v}

bench/dspy/dspy_holdout_eval.py

Lines changed: 10 additions & 5 deletions
@@ -10,20 +10,25 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
 
-sys.path.insert(0, "/tmp/stfu-test/scripts")
 from dspy_optimize import (
     run_claude, score_stfu_probe, score_blunt_probe, evaluate_prompt
 )
 
-OUTDIR = "/tmp/stfu-test/dspy"
+ROOT = Path(__file__).resolve().parents[2]
+OUTDIR = os.environ.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy")
 
 
 def main():
-    splits = json.loads(Path(f"{OUTDIR}/probe_splits.json").read_text())
+    splits_path = Path(f"{OUTDIR}/probe_splits.json")
+    if not splits_path.exists():
+        splits_path = Path(f"{OUTDIR}/probe_splits_10x.json")
+    if not splits_path.exists():
+        raise SystemExit(f"Missing {splits_path}. Run bench/dspy/expanded_corpus.py first.")
+    splits = json.loads(splits_path.read_text())
 
     # Read all 4 prompts: STFU shipped, STFU optimized, BLUNT shipped, BLUNT optimized
-    stfu_shipped = Path("/tmp/stfu-test/prompts/old-stfu-v016.md").read_text()
-    blunt_shipped = Path("/tmp/stfu-test/prompts/old-stfu-blunt-v015.md").read_text()
+    stfu_shipped = (ROOT / "STFU.md").read_text()
+    blunt_shipped = (ROOT / "STFU.blunt.md").read_text()
 
     stfu_opt_path = Path(f"{OUTDIR}/stfu_best.md")
     blunt_opt_path = Path(f"{OUTDIR}/blunt_best.md")
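
The three `bench/dspy` scripts in this commit share the same path-handling pattern: an optional `STFU_DSPY_DIR` override, a fallback between the two probe-split filenames, and an optional read for files that may not exist yet. The sketch below gathers that pattern in one place for illustration. `resolve_dspy_dir` and `find_splits` are hypothetical helper names that do not appear in the commit; `read_optional` mirrors the helper added to `cross_model_holdout.py`.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_DSPY_DIR = "/tmp/stfu-test/dspy"  # default used by the scripts in the diff


def resolve_dspy_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return the scratch directory, honoring STFU_DSPY_DIR when it is set."""
    return Path(env.get("STFU_DSPY_DIR", DEFAULT_DSPY_DIR))


def find_splits(dspy_dir: Path) -> Path:
    """Prefer probe_splits.json and fall back to probe_splits_10x.json."""
    splits_path = dspy_dir / "probe_splits.json"
    if not splits_path.exists():
        splits_path = dspy_dir / "probe_splits_10x.json"
    if not splits_path.exists():
        raise SystemExit(f"Missing {splits_path}. Run bench/dspy/expanded_corpus.py first.")
    return splits_path


def read_optional(path: Path) -> Optional[str]:
    """Return the file's text, or None when the file does not exist."""
    return path.read_text() if path.exists() else None


# Demonstration in a throwaway directory so nothing under /tmp/stfu-test is touched.
with tempfile.TemporaryDirectory() as tmp:
    scratch = resolve_dspy_dir({"STFU_DSPY_DIR": tmp})
    (scratch / "probe_splits_10x.json").write_text("{}")
    print(find_splits(scratch).name)                       # probe_splits_10x.json
    print(read_optional(scratch / "v2" / "stfu_best.md"))  # None
```

Passing the environment as a mapping keeps the helper testable without mutating `os.environ`. Whether the scripts should share one module for this is a design choice the commit leaves open.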
