| 1 | +``` |
| 2 | + · ✦ · ✦ · |
| 3 | + ✦ · ⚡ · ✦ |
| 4 | + ░░▒▓████▓▒░░ |
| 5 | + ▒▓█▀ ▀█▓▒ |
| 6 | + ▓█ ◆ ◆ █▓ |
| 7 | + ██ ╲ ╱ ██ |
| 8 | + ▓█ ═══⚒═══ █▓ |
| 9 | + ▒▓█▄ ▄█▓▒ |
| 10 | + ░░▒▓████▓▒░░ |
| 11 | + ▓██▓ |
| 12 | + ╔═══╧══╧═══╗ |
| 13 | + ║ THE FORGE ║ |
| 14 | + ╚══════════╝ |
| 15 | + ▄▄████████████▄▄ |
| 16 | + ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀ |
| 17 | +``` |
| 18 | + |
| 19 | +# forge-loop |
| 20 | + |
| 21 | +**Autoregressive codebase improvement for [Claude Code](https://docs.anthropic.com/en/docs/claude-code).** |
| 22 | + |
| 23 | +[MIT License](LICENSE) |
| 24 | +[Changelog](CHANGELOG.md) |
| 25 | + |
| 26 | +A structured, KPI-driven, self-correcting loop: it tracks metrics (coverage, speed, quality), evaluates with fresh-context subagents, rotates strategies when progress stalls, and stops once every target is met at the same time. |
| 27 | + |
| 28 | +``` |
| 29 | +You: /forge "API controllers" --coverage 90 --speed -30% |
| 30 | + |
| 31 | +Forge: Measuring baseline... 85.2% coverage, 120s |
| 32 | + Strategy: coverage-push → 15 tests for edge cases |
| 33 | + 85.8% (+0.6%), 118s (-2s) ✓ |
| 34 | + ...iterates until all targets met simultaneously... |
| 35 | +``` |
| 36 | + |
| 37 | +--- |
| 38 | + |
| 39 | +## Standing on the shoulders of |
| 40 | + |
| 41 | +- **Ralph Wiggum** — [Geoff Huntley's](https://ghuntley.com/ralph/) foundational work on autonomous AI development loops. "Deterministically bad in an undeterministic world, but eventually consistent." Forge is our implementation of the Ralph loop pattern with structured KPI tracking and strategy rotation. |
| 42 | +- **Andrej Karpathy** — The autoregressive mindset: each output becomes the next input. Karpathy's work on autoregressive models and his advocacy for [vibe coding](https://x.com/karpathy/status/1886192184808149383) informed forge's core loop design — each iteration's KPIs, findings, and lessons become the next iteration's decision context. |
| 43 | +- **Tobi Lutke** — His emphasis on tight feedback loops, continuous iteration, and measuring everything resonated deeply with our approach to autonomous improvement. |
| 44 | +- **SICA** (Self-Improving Coding Agent, [ICLR 2025 SSI-FM Workshop](https://openreview.net/forum?id=gXVQdNXqoc)) — Demonstrated that compounding iterations (17% to 53% SWE-Bench) work when the agent can select strategies based on accumulated evidence. |
| 45 | + |
| 46 | +--- |
| 47 | + |
| 48 | +## How it works |
| 49 | + |
| 50 | +### The Iteration Cycle |
| 51 | + |
| 52 | +Each iteration executes one complete eight-phase cycle: |
| 53 | + |
| 54 | +| Phase | What happens | |
| 55 | +|-------|-------------| |
| 56 | +| **A. Orient** | Read forge-state file, check position + trends + stagnation count | |
| 57 | +| **B. Measure** | Run tests with coverage, capture KPIs | |
| 58 | +| **C. Evaluate** | Every 3rd iteration: spawn fresh-context subagent for unbiased audit | |
| 59 | +| **D. Decide** | Pick strategy from KPI gaps + findings + lessons | |
| 60 | +| **E. Execute** | Apply ONE focused transformation | |
| 61 | +| **F. Verify** | Tests must be green, re-measure KPIs | |
| 62 | +| **G. Record** | Update forge-state with deltas + lessons (the autoregressive step) | |
| 63 | +| **H. Complete** | All targets met simultaneously? Done. Otherwise, next iteration. | |
| 64 | + |
| 65 | +### Strategies |
| 66 | + |
| 67 | +Forge selects from named strategies based on which KPI gap is largest: |
| 68 | + |
| 69 | +| Strategy | When | Impact | |
| 70 | +|----------|------|--------| |
| 71 | +| `coverage-push` | Clear coverage gaps | Coverage | |
| 72 | +| `refactor-for-testability` | Code hard to test | Coverage | |
| 73 | +| `component-extraction` | DRY violations, repeated patterns | Coverage + Quality | |
| 74 | +| `speed-optimization` | Slow tests, sync overuse | Speed | |
| 75 | +| `dead-code-removal` | Unused code flagged by evaluation | Quality + Coverage | |
| 76 | +| `quality-polish` | Naming, complexity, clarity | Quality | |
| 77 | +| `design-system` | Duplicated UI patterns | Quality + Coverage | |
| 78 | + |
| 79 | +### Stagnation Detection |
| 80 | + |
| 81 | +Forge increments a stagnation counter whenever coverage improves by less than 0.1 percentage points for two consecutive iterations. Once the counter reaches 3, forge rotates to a different strategy on its own: either the one with the best historical results or one it has not tried yet. No manual intervention needed. |
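
The stagnation rule can be sketched in a few lines of Python. This is an illustrative reading of the paragraph above, not code from the repository: the function names, the reset-to-zero on progress, and the preference for untried strategies over historically effective ones are all assumptions.

```python
MIN_DELTA = 0.1  # percentage points of coverage, from the rule above
ROTATE_AT = 3    # counter value that triggers a strategy rotation


def update_stagnation(coverage_history, stagnation_count):
    """Return the new counter after the latest coverage measurement.

    coverage_history holds coverage percentages, oldest first, baseline included.
    """
    if len(coverage_history) < 3:
        return stagnation_count  # two deltas are needed before judging
    previous_delta = coverage_history[-2] - coverage_history[-3]
    latest_delta = coverage_history[-1] - coverage_history[-2]
    if previous_delta < MIN_DELTA and latest_delta < MIN_DELTA:
        return stagnation_count + 1
    return 0  # assumption: real progress resets the counter


def pick_next_strategy(current, strategies_tried, all_strategies):
    """Pick the strategy to rotate to (assumed order: untried first)."""
    tried_names = {s["name"] for s in strategies_tried}
    untried = [n for n in all_strategies if n not in tried_names and n != current]
    if untried:
        return untried[0]
    others = [s for s in strategies_tried if s["name"] != current]
    if not others:
        return current  # nothing else to rotate to
    # Otherwise fall back to the best historical coverage gain.
    return max(others, key=lambda s: s["coverage_delta"])["name"]
```

A caller would rotate when `update_stagnation(...) >= ROTATE_AT`, passing the `strategies_tried` list kept in the state file.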
| 82 | + |
| 83 | +### Fresh-Context Evaluation |
| 84 | + |
| 85 | +Every 3rd iteration, forge spawns a subagent that audits the scope with no knowledge of the KPI targets or the iteration history. This guards against anchoring bias, because the agent evaluating the code has no stake in the numbers looking good. |
| 86 | + |
| 87 | +--- |
| 88 | + |
| 89 | +## Installation |
| 90 | + |
| 91 | +```bash |
| 92 | +git clone https://github.com/DjinnFoundry/forge-loop.git |
| 93 | +cd forge-loop |
| 94 | +./install.sh |
| 95 | +``` |
| 96 | + |
| 97 | +The installer symlinks the skill, command, and agent files into your `~/.claude/` directory. |
| 98 | + |
| 99 | +**Important**: You also need to configure the stop hook that drives iteration. See [hooks/README.md](hooks/README.md) for setup instructions. If you already have the Ralph Wiggum stop hook configured, forge works with it automatically. |
| 100 | + |
| 101 | +### Manual installation |
| 102 | + |
| 103 | +```bash |
| 104 | +mkdir -p ~/.claude/skills/forge ~/.claude/commands ~/.claude/agents |
| 105 | + |
| 106 | +cp skills/forge/SKILL.md ~/.claude/skills/forge/SKILL.md |
| 107 | +cp commands/forge.md ~/.claude/commands/forge.md |
| 108 | +cp agents/forge.md ~/.claude/agents/forge.md |
| 109 | + |
| 110 | +# Stop hook — see hooks/README.md for settings.json setup |
| 111 | +``` |
| 112 | + |
| 113 | +--- |
| 114 | + |
| 115 | +## Usage |
| 116 | + |
| 117 | +### Basic |
| 118 | + |
| 119 | +``` |
| 120 | +/forge "LiveView components" --coverage 95 --speed -20% |
| 121 | +``` |
| 122 | + |
| 123 | +### All options |
| 124 | + |
| 125 | +``` |
| 126 | +/forge "SCOPE" --coverage N --speed -N% --quality strict|moderate|lax --max-iterations N |
| 127 | +``` |
| 128 | + |
| 129 | +| Option | Default | Description | |
| 130 | +|--------|---------|-------------| |
| 131 | +| `SCOPE` | (required) | What to improve — quoted string | |
| 132 | +| `--coverage N` | baseline + 2 | Minimum coverage % target | |
| 133 | +| `--speed -N%` | -20% | Target change in test runtime relative to baseline (e.g. `-30%` of a 120s baseline means at most 84s) | |
| 134 | +| `--quality` | moderate | Maximum evaluation findings allowed by severity: strict (0 high, 0 medium) / moderate (0 high, ≤3 medium) / lax (0 high, ≤5 medium) | |
| 135 | +| `--max-iterations` | 20 | Safety limit | |
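
To make the defaults and the completion gate concrete, here is a minimal Python sketch of how the options above could translate into targets. The function names and the `kpis` dictionary keys are hypothetical. The arithmetic follows the table (coverage defaults to baseline + 2, speed is a percentage of the baseline runtime), and the `failures == 0` check reflects the "tests must be green" requirement of the Verify phase.

```python
# Maximum (high, medium) findings allowed per quality level, from the table above.
QUALITY_LIMITS = {"strict": (0, 0), "moderate": (0, 3), "lax": (0, 5)}


def resolve_targets(baseline_coverage, baseline_seconds, coverage=None,
                    speed_pct=-20.0, quality="moderate", max_iterations=20):
    """Turn command options plus the measured baseline into absolute targets."""
    min_coverage = baseline_coverage + 2 if coverage is None else coverage
    max_speed_seconds = baseline_seconds * (100 + speed_pct) / 100
    return {
        "min_coverage": round(min_coverage, 1),
        "max_speed_seconds": round(max_speed_seconds, 1),
        "quality": quality,
        "max_iterations": max_iterations,
    }


def all_targets_met(kpis, targets):
    """Completion gate: every target must hold at the same time."""
    max_high, max_medium = QUALITY_LIMITS[targets["quality"]]
    return (
        kpis["failures"] == 0
        and kpis["coverage"] >= targets["min_coverage"]
        and kpis["speed_seconds"] <= targets["max_speed_seconds"]
        and kpis["high_findings"] <= max_high
        and kpis["medium_findings"] <= max_medium
    )
```

For a baseline of 85.2% coverage and 120s, `resolve_targets(85.2, 120, coverage=90, speed_pct=-30)` gives a coverage floor of 90 and a runtime ceiling of 84 seconds.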
| 136 | + |
| 137 | +### Control |
| 138 | + |
| 139 | +- **Pause**: Forge outputs `RALPH_PAUSE` when it needs your input |
| 140 | +- **Cancel**: `/cancel-ralph` stops the loop |
| 141 | +- **Resume**: Start a new session — it picks up the forge-state file |
| 142 | + |
| 143 | +--- |
| 144 | + |
| 145 | +## State File |
| 146 | + |
| 147 | +Forge persists its state in `.claude/forge-state.SESSION.md` — a YAML frontmatter + markdown log that survives context compaction. Each iteration appends its KPIs, strategy, actions, and lessons. This is the autoregressive memory. |
| 148 | + |
| 149 | +```yaml |
| 150 | +--- |
| 151 | +session_id: "0320-1430-a3b2" |
| 152 | +scope: "API controllers" |
| 153 | +baseline: |
| 154 | + coverage: 85.2 |
| 155 | + speed_seconds: 120 |
| 156 | + tests: 1250 |
| 157 | + failures: 0 |
| 158 | + measured_at: "2026-03-20T14:30:00Z" |
| 159 | +targets: |
| 160 | + min_coverage: 90.0 |
| 161 | + max_speed_seconds: 84 |
| 162 | + quality: "moderate" |
| 163 | + max_iterations: 20 |
| 164 | +current_strategy: "component-extraction" |
| 165 | +stagnation_count: 0 |
| 166 | +strategies_tried: |
| 167 | + - name: "coverage-push" |
| 168 | + iterations: [1, 2] |
| 169 | + coverage_delta: 0.8 |
| 170 | + speed_delta: -5 |
| 171 | +lessons: |
| 172 | + - "async:true on controller tests saves ~3s per file" |
| 173 | +--- |
| 174 | + |
| 175 | +## Iteration 1 — coverage-push |
| 176 | +- Coverage: 85.2 → 85.8 (+0.6%) |
| 177 | +- Speed: 120s → 118s (-2s) |
| 178 | +- Tests: 1250 → 1265 (+15) |
| 179 | +- Actions: Added 15 tests for data_loaders edge cases |
| 180 | +- Reality-check: 2 high, 3 medium findings |
| 181 | +- Lesson: "7 identical try-rescue blocks — extract, don't test each" |
| 182 | +``` |
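
Because the state file is plain text, a tool outside the loop can inspect it with very little code. The Python sketch below splits the frontmatter from the log and reads flat top-level keys using only the standard library. Both helper names are hypothetical, and nested sections such as `baseline` would need a real YAML parser.

```python
import re


def split_frontmatter(text):
    """Split a state file into (frontmatter, markdown_log)."""
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", text, re.DOTALL)
    if match is None:
        raise ValueError("state file has no YAML frontmatter")
    return match.group(1), match.group(2)


def read_top_level_scalar(frontmatter, key):
    """Read a top-level `key: value` line; return None if absent or nested."""
    pattern = rf"^{re.escape(key)}:[ \t]*(.+)$"
    match = re.search(pattern, frontmatter, re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip().strip('"')


SAMPLE = """---
session_id: "0320-1430-a3b2"
current_strategy: "component-extraction"
stagnation_count: 0
baseline:
  coverage: 85.2
---

## Iteration 1
- Tests: 1250 -> 1265 (+15)
"""

frontmatter, log = split_frontmatter(SAMPLE)
print(read_top_level_scalar(frontmatter, "current_strategy"))
print(read_top_level_scalar(frontmatter, "stagnation_count"))
```

Values come back as strings, so a caller would convert `stagnation_count` with `int(...)` before comparing it.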
| 183 | + |
| 184 | +--- |
| 185 | + |
| 186 | +## Architecture |
| 187 | + |
| 188 | +``` |
| 189 | +forge-loop/ |
| 190 | +├── skills/forge/SKILL.md ← The protocol (source of truth) |
| 191 | +├── commands/forge.md ← Claude Code /forge command |
| 192 | +├── agents/forge.md ← Subagent for spawning forge on subsystems |
| 193 | +├── hooks/ ← Iteration engine |
| 194 | +│ ├── README.md ← Hook setup instructions |
| 195 | +│ └── stop-hook.sh ← Stop hook script |
| 196 | +├── install.sh ← Installer script |
| 197 | +├── CHANGELOG.md |
| 198 | +├── CONTRIBUTING.md |
| 199 | +└── README.md |
| 200 | +``` |
| 201 | + |
| 202 | +The iteration engine uses the Ralph loop pattern: each time the Claude Code session tries to exit, the stop hook re-injects the forge prompt. The forge state file provides continuity across iterations and context compactions. |
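
The decision a stop hook makes can be sketched as a small pure function. This is not the repository's `stop-hook.sh`, which is a shell script using `jq`. It is a Python illustration that assumes Claude Code's Stop hook contract, in which printing a JSON object with `"decision": "block"` and a `reason` keeps the session running and passes the reason back to the model. The `state` keys are hypothetical.

```python
import json


def stop_hook_output(state, prompt):
    """Return the JSON a Stop hook would print, or None to let the session end.

    state is a dict with hypothetical keys: iteration, max_iterations,
    targets_met, and paused.
    """
    if state["targets_met"] or state["paused"]:
        return None  # finished, or waiting for the user
    if state["iteration"] >= state["max_iterations"]:
        return None  # safety limit reached
    # Blocking the stop re-injects the prompt and starts the next iteration.
    return json.dumps({"decision": "block", "reason": prompt})
```

Keeping the decision separate from the I/O makes the exit conditions (targets met, pause requested, iteration limit) easy to test without a live session.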
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## Why not just raw loops? |
| 207 | + |
| 208 | +| Aspect | Raw loop | Forge | |
| 209 | +|--------|----------|-------| |
| 210 | +| KPI tracking | Ad-hoc | Structured state file with deltas + trends | |
| 211 | +| Strategy | Single prompt | 7 named strategies, auto-rotation on stagnation | |
| 212 | +| Evaluation | Self-evaluation (anchoring bias) | Fresh-context subagents every 3 iterations | |
| 213 | +| Memory | Context window only | Persistent state file survives compaction | |
| 214 | +| Completion | Manual / hope | Simultaneous multi-KPI gate | |
| 215 | +| Lessons | Lost between iterations | Accumulated, inform strategy selection | |
| 216 | +| Stagnation | Repeats same approach | Detects + rotates after low-delta iterations | |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## Requirements |
| 221 | + |
| 222 | +- [Claude Code](https://docs.anthropic.com/en/docs/claude-code) CLI |
| 223 | +- `jq` (for the stop hook) |
| 224 | +- A project with a test suite that reports coverage |
| 225 | + |
| 226 | +## Adapting for other languages |
| 227 | + |
| 228 | +The skill includes test runner examples for multiple languages (Elixir, Python, JavaScript, Ruby, Go). To adapt: |
| 229 | + |
| 230 | +1. Edit `skills/forge/SKILL.md` — update the MEASURE phase for your test runner |
| 231 | +2. Update the coverage/speed parsing for your output format |
| 232 | +3. Everything else (strategies, stagnation, state format) is language-agnostic |
| 233 | + |
| 234 | +## Contributing |
| 235 | + |
| 236 | +See [CONTRIBUTING.md](CONTRIBUTING.md). |
| 237 | + |
| 238 | +## License |
| 239 | + |
| 240 | +[MIT](LICENSE) |