jqbit
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
Lines changed: 12 additions & 10 deletions
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 43 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 4 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 11 additions & 58 deletions
diff --git a/bench/check-md-links.py b/bench/check-md-links.py
Lines changed: 73 additions & 0 deletions
diff --git a/bench/dspy/README.md b/bench/dspy/README.md
Lines changed: 4 additions & 0 deletions
diff --git a/bench/dspy/cross_model_analyze.py b/bench/dspy/cross_model_analyze.py
Lines changed: 2 additions & 2 deletions
diff --git a/bench/dspy/cross_model_holdout.py b/bench/dspy/cross_model_holdout.py
Lines changed: 17 additions & 8 deletions
diff --git a/bench/dspy/dspy_holdout_eval.py b/bench/dspy/dspy_holdout_eval.py
Lines changed: 10 additions & 5 deletions
@@ -1,29 +1,31 @@
 ## What this PR changes
 
-Brief description of the change to `STFU.md`, `STFU.blunt.md`, or other files.
+Brief description of the change to `STFU.md`, `STFU.blunt.md`, docs, or benchmark files.
 
 ## Why
 
-Which prompt-shape, agent, or behaviour this addresses. Reference `BENCHMARKS.md` rows where applicable.
+What failure mode, install path, agent behavior, or documentation gap this addresses. Reference `data/benchmarks.md`, `data/dspy-cross-model-results.md`, or `data/changelog.md` where applicable.
 
 ## Bench impact
 
-If you ran the benchmark with this change, paste the per-agent delta:
+If this changes `STFU.md` or `STFU.blunt.md`, include benchmark or manual before/after evidence:
 
-| agent | STFU.md v0.13 (current) | this PR | Δ |
-|---|---:|---:|---:|
+| agent/app | current | this PR | Δ / verdict |
+|---|---:|---:|---|
 | claude | … | … | … |
 | codex | … | … | … |
 | … | … | … | … |
 
-If you didn't run the bench, that's fine — flag it and a maintainer will run it.
+If you did not run a benchmark, say so and explain why.
+
+Docs-only / CI-only PRs can write `N/A — no prompt behavior changed`.
 
 ## Verification
 
-- [ ] `STFU.md` deploys cleanly to documented coding-agent paths (per `data/agent-locations.md`)
-- [ ] Smoke test passes (`claude -p "What's the git command to undo the last commit but keep changes staged?"` returns the bare command)
-- [ ] No regression on a previously-passing prompt (manual spot check is fine)
+- [ ] Lightweight checks pass (`node --check`, JSON validation, Python compile, Markdown links)
+- [ ] Prompt behavior smoke-tested if prompt files changed
+- [ ] Benchmark results or manual examples included if prompt behavior changed
 
 ## Risk of breaking other agents
 
-Which other agents/prompt-shapes might this rule affect? Anything you'd want extra eyes on?
+Which agents, apps, or prompt shapes might this affect? Anything that needs extra review?
@@ -0,0 +1,43 @@
+name: CI
+
+on:
+  pull_request:
+  push:
+    branches: [main]
+
+permissions:
+  contents: read
+
+jobs:
+  checks:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repository
+        uses: actions/checkout@v4
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: 22
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.x'
+
+      - name: Check JavaScript syntax
+        run: |
+          node --check bench/analyze.js
+          node --check bench/make-charts.js
+
+      - name: Validate JSON data
+        run: |
+          python3 -m json.tool data/benchmarks-summary.json >/dev/null
+          python3 -m json.tool data/benchmarks-matrix.json >/dev/null
+          python3 -m json.tool data/visualizations/charts.json >/dev/null
+
+      - name: Check Python syntax
+        run: python3 -m py_compile bench/dspy/*.py bench/check-md-links.py
+
+      - name: Check Markdown links
+        run: python3 bench/check-md-links.py
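Review note: the "Validate JSON data" step relies on `python3 -m json.tool` exiting non-zero at the first file that fails to parse. A minimal sketch of the same check as a function that reports every failing file in one pass is shown below. The helper name `invalid_json_files` and the temporary file names are hypothetical and not part of this PR.

```python
import json
import tempfile
from pathlib import Path


def invalid_json_files(paths):
    """Return (path, error) pairs for files that are missing or not valid JSON."""
    bad = []
    for path in paths:
        try:
            json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            bad.append((str(path), str(exc)))
    return bad


# Demo on throwaway files so the sketch runs outside the repository.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    broken = Path(tmp) / "broken.json"
    good.write_text('{"agents": 5}', encoding="utf-8")
    broken.write_text('{"agents": 5,', encoding="utf-8")  # truncated on purpose
    bad = invalid_json_files([good, broken])

for path, error in bad:
    print(f"{path}: {error}")
```

In the workflow, the equivalent would be to exit non-zero when the returned list is non-empty.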
@@ -60,15 +60,16 @@ Please include:
 
 ## Running checks
 
-For docs-only changes, just make sure links still work.
-
-For benchmark/chart changes, run:
+CI runs the lightweight checks below on pull requests and on pushes to `main`. Run them locally before opening a PR:
 
 ```bash
 node --check bench/analyze.js
 node --check bench/make-charts.js
 python3 -m json.tool data/benchmarks-summary.json >/dev/null
+python3 -m json.tool data/benchmarks-matrix.json >/dev/null
 python3 -m json.tool data/visualizations/charts.json >/dev/null
+python3 -m py_compile bench/dspy/*.py bench/check-md-links.py
+python3 bench/check-md-links.py
 ```
 
 ## Issues
 
@@ -64,69 +64,22 @@ The difference: STFU.md aims for concise output **without** turning the assistan
 
 ## Benchmarks
 
-The current coding-agent prompt is **871 bytes** (v0.14.3).
+Current prompt sizes:
 
-Reference bench (v0.13.1, 5 agents × 5 prompts): **−82.1% total prose reduction** with **100% average compliance**.
+| File | Bytes |
+|---|---:|
+| [`STFU.md`](STFU.md) | 1,345 |
+| [`STFU.blunt.md`](STFU.blunt.md) | 1,640 |
 
-v0.14.3 controlled ablation (Claude Sonnet 4.6, n=12 single-turn + 24 8-turn calls per condition):
-- **−80.0% prose reduction** vs no-prompt baseline (single-turn, paired t-test p<0.0001, Cohen's d=1.79)
-- **−75.1%** averaged across 8-turn coding conversations
-- No statistically significant decay over 8 turns (slope p=0.28; T1→T8 ratio 0.15)
-- Removed `## Templates` section because it caused engagement-refusal on under-specified prompts (e.g. "TypeError: Cannot read… of undefined" → returned *"Need code or error first."* instead of helping). Compression cost: ~3 pp; reliability gain: substantial.
+Headline results:
 
-See [`data/benchmarks.md`](data/benchmarks.md) and [`data/changelog.md`](data/changelog.md) for details.
+- **STFU.md v0.13.1:** −82.1% total prose reduction, 100% average compliance (5 agents × 5 prompts).
+- **STFU.md v0.14.3:** −80.0% single-turn prose reduction vs no-prompt baseline; −75.1% across 8-turn coding conversations; no statistically significant decay over 8 turns (slope p=0.28).
+- **STFU.blunt.md v0.18.0:** DSPy round-2 + 5-agent cross-model validation; avg pushback 0.848, correct-user agreement 0.912, mean prose 11.0 words, validation phrases 0%.
 
-### STFU.blunt.md DSPy round-2 + cross-model validation (v0.18.0)
+The regular `STFU.md` prompt was tested in two DSPy optimization runs; no candidate beat the shipped v0.16.0 prompt on the current metric. `STFU.blunt.md` improved materially in v0.18.0, especially on opencode pushback (0.38→0.81) and cursor correct-user agreement (0.44→0.89).
 
-Round-2 optimization on a 3-5x larger probe corpus (73 train + 32 held-out per variant), validated **across 5 agent CLIs** (claude, codex, cursor-agent, gemini, opencode) with **independent codex judge** (different model family from generator → eliminates self-bias).
-
-Cross-model results vs v0.15.0 and v0.17.0:
-
-| metric (avg across 5 agents) | v0.15.0 | v0.17.0 | **v0.18.0** |
-|---|---:|---:|---:|
-| pushback rate (sycophancy) | 0.746 | 0.750 | **0.848 ★** |
-| correct-user agree rate | 0.890 | 0.820 ⚠ | **0.912 ★** |
-| prose words mean | 13.6 | 13.1 | **11.0 ★** |
-| validation-phrase rate | 0% | 0% | 0% |
-
-Biggest wins: opencode pushback 0.38→0.81 (+0.43), cursor agree-rate 0.44→0.89 (+0.45), codex prose −37% (p=0.008). The optimizer learned to be more conservative about pushback ("only when clearly warranted") AND more decisive about agreement ("If correct: just 'Yes.' or 'Fine.'").
-
-**STFU.md (regular)**: DSPy round-2 (n=73 train) again **found no improvement** over v0.16.0. Two independent runs confirm v0.16.0 is at a local optimum on this metric. Stays as-is.
-
-See [`data/changelog.md` §[0.18.0]](data/changelog.md) for full per-agent table, statistical analysis, and limitations.
-
-### STFU.blunt.md DSPy round-1 (v0.17.0 — historical)
-
-Empirical instruction evolution via [DSPy](https://github.com/stanfordnlp/dspy)-style optimization (custom COPRO-like loop, 5 candidates × 3 rounds = 15 variations evaluated). Probe set: 25 train + 10 held-out + 6 chat-sanity. Multi-objective scalar metric: pushback rate on sycophancy probes, agreement rate on correct-user probes, override compliance, terseness — minus a length penalty to avoid prompt bloat.
-
-Results (Sonnet 4.6):
-
-| metric | shipped (v0.15.0) | DSPy-optimized (v0.17.0) | Δ |
-|---|---:|---:|---|
-| prompt size (bytes) | 1843 | **1479** | **−20%** |
-| training-set score | 0.743 | **0.819** | **+10%** |
-| held-out (n=10) score | 0.471 | **0.658** | **+0.118 (p=0.15)** |
-| chat-probe mean prose words (n=6) | 17.7 | **14.3** | **−19%** |
-
-The optimizer discovered a new `Confirm ("right?/correct?/r?") → Yes/No first` shape rule that fixed the v0.15.0 failure mode of over-hedging on legitimately-correct user statements (e.g., "Hash maps offer O(1) average-case lookups, right?"). New "Never open with validation" Style line. Statistical significance caveat: n=10 held-out makes p=0.15 expected for real effects; improvement is **directional and consistent** across all three test sets.
-
-For the **regular `STFU.md`** prompt: DSPy optimization across the same loop **found no improvement** — all 15 candidate variations scored lower than the shipped v0.16.0 seed on training (0.540). The current STFU.md is at a local optimum on this metric. Honest result, kept as-is.
-
-See [`data/changelog.md` §[0.17.0]](data/changelog.md) for full methodology, per-probe breakdown, and limitations.
-
-### STFU.blunt.md V1→V2 manual ablation (v0.15.0 — historical)
-
-Earlier two-iteration design (V1→V2) before DSPy optimization. Results vs base STFU and no-prompt control:
-
-| metric | control | STFU | BLUNT V2 |
-|---|---:|---:|---:|
-| pushback rate (sycophancy probes) | 5/6 | 4/6 | **5/6** |
-| validation-phrase rate (all 12 prompts) | 1/12 | 0/12 | 0/12 |
-| override compliance (multi-turn) | 2/2 | 2/2 | **2/2** |
-| plain-coding prose words (terseness regression) | 62.2 | 16.0 | 17.2 |
-| correct-user agreement rate (sanity) | 1/2 | 1/2 | 1/2 |
-
-V2 passed all five pre-committed criteria and shipped as v0.15.0. v0.17.0 supersedes via DSPy-optimized prompt.
+See [`data/benchmarks.md`](data/benchmarks.md), [`data/dspy-cross-model-results.md`](data/dspy-cross-model-results.md), and [`data/changelog.md`](data/changelog.md) for methodology, full tables, caveats, and historical runs.
 
 ## Example outputs
 
 
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+"""Validate local Markdown links in this repository."""
+
+from __future__ import annotations
+
+import re
+import sys
+from pathlib import Path
+from urllib.parse import unquote
+
+ROOT = Path(__file__).resolve().parents[1]
+LINK_RE = re.compile(r"!?\[[^\]]*\]\(([^)]+)\)")
+SKIP_SCHEMES = ("http://", "https://", "mailto:", "tel:")
+
+
+def iter_markdown_lines(path: Path):
+    in_fence = False
+    for lineno, line in enumerate(path.read_text(encoding="utf-8", errors="replace").splitlines(), 1):
+        stripped = line.lstrip()
+        if stripped.startswith("```") or stripped.startswith("~~~"):
+            in_fence = not in_fence
+            continue
+        if not in_fence:
+            yield lineno, line
+
+
+def normalize_target(raw_href: str) -> str | None:
+    href = raw_href.strip()
+    if not href or href.startswith("#") or href.startswith(SKIP_SCHEMES):
+        return None
+
+    # Drop optional Markdown titles first: [x](path.md "title") or [x](<a b.md> "title")
+    target = href.split(' "', 1)[0].split(" '", 1)[0].strip()
+    if target.startswith("<") and target.endswith(">"):
+        target = target[1:-1]
+
+    target = target.split("#", 1)[0].strip()
+    if not target:
+        return None
+
+    return unquote(target)
+
+
+def main() -> int:
+    errors: list[str] = []
+    checked = 0
+
+    for path in sorted(ROOT.rglob("*.md")):
+        if ".git" in path.parts:
+            continue
+        for lineno, line in iter_markdown_lines(path):
+            for match in LINK_RE.finditer(line):
+                target = normalize_target(match.group(1))
+                if target is None:
+                    continue
+                checked += 1
+                resolved = (path.parent / target).resolve()
+                if not resolved.exists():
+                    rel = path.relative_to(ROOT)
+                    errors.append(f"{rel}:{lineno}: missing link target: {match.group(1)}")
+
+    if errors:
+        print("Broken Markdown links:", file=sys.stderr)
+        for error in errors:
+            print(f"  {error}", file=sys.stderr)
+        return 1
+
+    print(f"Checked {checked} local Markdown links.")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
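Review note: to make the link checker's behavior concrete, the sketch below re-declares `LINK_RE`, `SKIP_SCHEMES`, and a condensed `normalize_target`, then runs them over one sample line. The sample text and its paths are made up for illustration. Only inline links of the form `[text](href)` are matched, and the checker only verifies that the returned path exists on disk, so fragments are never validated.

```python
import re
from urllib.parse import unquote

LINK_RE = re.compile(r"!?\[[^\]]*\]\(([^)]+)\)")
SKIP_SCHEMES = ("http://", "https://", "mailto:", "tel:")


def normalize_target(raw_href):
    """Return the local path a link points at, or None if it should be skipped."""
    href = raw_href.strip()
    if not href or href.startswith("#") or href.startswith(SKIP_SCHEMES):
        return None  # empty, same-page anchor, or external link
    target = href.split("#", 1)[0].strip()  # drop the fragment
    if not target:
        return None
    target = target.split(' "', 1)[0].split(" '", 1)[0]  # drop an optional title
    return unquote(target)  # decode %20 and similar escapes


# Hypothetical sample line covering the four common cases.
sample = (
    "See [bench](data/benchmarks.md#results), [site](https://example.com), "
    '[top](#intro), and ![chart](img/my%20chart.png "Chart").'
)
targets = [normalize_target(m.group(1)) for m in LINK_RE.finditer(sample)]
print(targets)
# → ['data/benchmarks.md', None, None, 'img/my chart.png']
```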
@@ -23,6 +23,9 @@ Custom DSPy-style instruction-evolution loop for optimizing STFU.md and STFU.blu
 ## Reproduce
 
 ```bash
+# Optional: override scratch output location
+export STFU_DSPY_DIR=/tmp/stfu-test/dspy
+
 # 1. Install
 python3 -m pip install --user dspy
 
@@ -31,6 +34,7 @@ python3 bench/dspy/expanded_corpus.py
 # → /tmp/stfu-test/dspy/probe_splits_10x.json
 
 # 3. Run optimization on each variant (~30-90 min wall time, ~1500 calls each)
+# Uses the repository's current STFU.md / STFU.blunt.md as seeds.
 python3 bench/dspy/dspy_optimize_v2.py stfu
 python3 bench/dspy/dspy_optimize_v2.py blunt
 # → /tmp/stfu-test/dspy/v2/{stfu,blunt}_best.md
 
@@ -1,15 +1,15 @@
 """Analyze cross-model held-out + independent-judge results."""
 import json
 import math
+import os
 import re
 import subprocess
 import sys
 from pathlib import Path
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from collections import defaultdict
 
-R = Path("/tmp/stfu-test/dspy/cross")
-sys.path.insert(0, "/tmp/stfu-test/scripts")
+R = Path(os.environ.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy")) / "cross"
 from cross_model_holdout import judge_pushback_codex, judge_informative_codex
 
 
 
@@ -16,8 +16,10 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
 
-R = Path("/tmp/stfu-test/dspy/cross")
-R.mkdir(exist_ok=True)
+ROOT = Path(__file__).resolve().parents[2]
+DSPY_DIR = Path(os.environ.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy"))
+R = DSPY_DIR / "cross"
+R.mkdir(parents=True, exist_ok=True)
 
 # ── Agent runners — uniform interface ────────────────────────────────
 def run_claude(system: str, user: str) -> str:
@@ -241,20 +243,27 @@ def run_one_agent_probe(agent: str, prompt_label: str, system_prompt: str,
     }
 
 
+def read_optional(path: Path) -> str | None:
+    return path.read_text() if path.exists() else None
+
+
 def main(variant: str):
-    splits = json.loads(Path("/tmp/stfu-test/dspy/probe_splits_10x.json").read_text())
+    splits_path = DSPY_DIR / "probe_splits_10x.json"
+    if not splits_path.exists():
+        raise SystemExit(f"Missing {splits_path}. Run bench/dspy/expanded_corpus.py first.")
+
+    splits = json.loads(splits_path.read_text())
     test = splits[variant]["test"]
 
     if variant == "stfu":
         prompts = {
-            "shipped": Path("/tmp/stfu-test/prompts/old-stfu-v016.md").read_text(),
-            "optimized": Path("/tmp/stfu-test/dspy/v2/stfu_best.md").read_text() if Path("/tmp/stfu-test/dspy/v2/stfu_best.md").exists() else None,
+            "shipped": (ROOT / "STFU.md").read_text(),
+            "optimized": read_optional(DSPY_DIR / "v2" / "stfu_best.md"),
         }
     else:
         prompts = {
-            "shipped": Path("/tmp/stfu-test/prompts/old-stfu-blunt-v015.md").read_text(),
-            "v017": Path("/tmp/stfu-repo/STFU.blunt.md").read_text(),
-            "optimized": Path("/tmp/stfu-test/dspy/v2/blunt_best.md").read_text() if Path("/tmp/stfu-test/dspy/v2/blunt_best.md").exists() else None,
+            "shipped": (ROOT / "STFU.blunt.md").read_text(),
+            "optimized": read_optional(DSPY_DIR / "v2" / "blunt_best.md"),
         }
     prompts = {k: v for k, v in prompts.items() if v}
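Review note: a minimal sketch of how the new path handling behaves, assuming a throwaway directory layout. `resolve_dspy_dir` and `load_prompts` are hypothetical helpers that condense the logic in this hunk; only `read_optional`, the `STFU_DSPY_DIR` variable, and the file names come from the diff.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def resolve_dspy_dir(env=None) -> Path:
    """Scratch directory: STFU_DSPY_DIR if set, else the previous default."""
    env = os.environ if env is None else env
    return Path(env.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy"))


def read_optional(path: Path) -> str | None:
    return path.read_text() if path.exists() else None


def load_prompts(root: Path, dspy_dir: Path, variant: str) -> dict[str, str]:
    """Shipped prompt from the repo root plus the optimized one, if present."""
    shipped_name = "STFU.md" if variant == "stfu" else "STFU.blunt.md"
    prompts = {
        "shipped": (root / shipped_name).read_text(),
        "optimized": read_optional(dspy_dir / "v2" / f"{variant}_best.md"),
    }
    # Same filter as the diff: drops None and also empty-string prompts.
    return {k: v for k, v in prompts.items() if v}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "repo"
    root.mkdir()
    (root / "STFU.blunt.md").write_text("Be blunt.\n")
    dspy_dir = resolve_dspy_dir({"STFU_DSPY_DIR": str(Path(tmp) / "scratch")})

    before = load_prompts(root, dspy_dir, "blunt")  # no optimized file yet
    (dspy_dir / "v2").mkdir(parents=True)
    (dspy_dir / "v2" / "blunt_best.md").write_text("Be blunter.\n")
    after = load_prompts(root, dspy_dir, "blunt")

print(sorted(before), sorted(after))
# → ['shipped'] ['optimized', 'shipped']
```

One consequence of reading the shipped prompt from the repository root: the `"shipped"` label now means whatever is checked out, not the pinned v0.15.0 / v0.16.0 files the old paths pointed at.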
 
 
@@ -10,20 +10,25 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
 
-sys.path.insert(0, "/tmp/stfu-test/scripts")
 from dspy_optimize import (
     run_claude, score_stfu_probe, score_blunt_probe, evaluate_prompt
 )
 
-OUTDIR = "/tmp/stfu-test/dspy"
+ROOT = Path(__file__).resolve().parents[2]
+OUTDIR = os.environ.get("STFU_DSPY_DIR", "/tmp/stfu-test/dspy")
 
 
 def main():
-    splits = json.loads(Path(f"{OUTDIR}/probe_splits.json").read_text())
+    splits_path = Path(f"{OUTDIR}/probe_splits.json")
+    if not splits_path.exists():
+        splits_path = Path(f"{OUTDIR}/probe_splits_10x.json")
+    if not splits_path.exists():
+        raise SystemExit(f"Missing {splits_path}. Run bench/dspy/expanded_corpus.py first.")
+    splits = json.loads(splits_path.read_text())
 
     # Read all 4 prompts: STFU shipped, STFU optimized, BLUNT shipped, BLUNT optimized
-    stfu_shipped = Path("/tmp/stfu-test/prompts/old-stfu-v016.md").read_text()
-    blunt_shipped = Path("/tmp/stfu-test/prompts/old-stfu-blunt-v015.md").read_text()
+    stfu_shipped = (ROOT / "STFU.md").read_text()
+    blunt_shipped = (ROOT / "STFU.blunt.md").read_text()
 
     stfu_opt_path = Path(f"{OUTDIR}/stfu_best.md")
     blunt_opt_path = Path(f"{OUTDIR}/blunt_best.md")
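Review note: the splits lookup in this hunk prefers `probe_splits.json` and falls back to `probe_splits_10x.json`. The sketch below isolates that order in a hypothetical helper, `find_splits`, and exercises it in a temporary directory. The JSON payload is a placeholder, not real probe data.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def find_splits(outdir: Path) -> Path:
    """Return the first existing splits file, preferring the original name."""
    candidates = [outdir / "probe_splits.json", outdir / "probe_splits_10x.json"]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    # Mirrors the diff: the error names the last path that was tried.
    raise SystemExit(
        f"Missing {candidates[-1]}. Run bench/dspy/expanded_corpus.py first."
    )


placeholder = json.dumps({"stfu": {"test": []}, "blunt": {"test": []}})

with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp)
    (outdir / "probe_splits_10x.json").write_text(placeholder)
    fallback = find_splits(outdir).name  # only the 10x file exists

    (outdir / "probe_splits.json").write_text(placeholder)
    preferred = find_splits(outdir).name  # original name wins when both exist

print(fallback, preferred)
# → probe_splits_10x.json probe_splits.json
```

If a stale `probe_splits.json` from an earlier run is still in the scratch directory, it will be chosen over the newer 10x corpus, which may be worth a log line stating which file was loaded.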