|
| 1 | +# /iterate — Trinity Training Farm Auto-Pilot |
| 2 | + |
| 3 | +## Cycle (every 15 min or manual /iterate) |
| 4 | + |
| 5 | +### 1. DIAGNOSE (30s) |
| 6 | +Check all 3 accounts via GraphQL. Record: slots free, crashed, running, finished. |
| 7 | + |
| 8 | +### 2. FIX (1 min) |
| 9 | +- Crashed → check logs → fix startCommand/env vars → redeploy |
| 10 | +- Token dead → notify user (can't fix programmatically) |
| 11 | +- Data missing → verify train_100k.txt baked in Docker image |
| 12 | +- Build failed → check Dockerfile.hslm-train, retry with usePreviousImageTag |
| 13 | +- LR=0 / loss flat → use ONLY the cosine schedule (flat schedule = LR dead by 20K steps) |
| 14 | +- startCommand bug → set startCommand to null (Dockerfile ENTRYPOINT runs instead) |
| 15 | + |
| 16 | +### 3. FILL (2 min) |
| 17 | +- Finished → record result in .trinity/experiments.json → launch next from queue |
| 18 | +- Free slot → create service → set env vars → deploy |
| 19 | +- Goal: 0 idle slots |
| 20 | +- Use variableCollectionUpsert + serviceInstanceRedeploy |
| 21 | + |
| 22 | +### 4. ANALYZE (1 min) |
| 23 | +- Compare running experiments by loss/PPL at same step count |
| 24 | +- Kill diverging runs (avg10 rising for 3 consecutive measurements) |
| 25 | +- Identify leader → create variations in queue |
| 26 | +- Current leader: v4R PPL=125, challenger: v13L PPL=187@20K |
| 27 | + |
| 28 | +### 5. CODE (if needed) |
| 29 | +- Missing feature blocks training → implement it in Zig |
| 30 | +- Test fails → fix |
| 31 | +- Performance issue → profile → optimize |
| 32 | +- Always: `zig build` + `zig test` — no bash |
| 33 | + |
| 34 | +### 6. DOCUMENT (30s) |
| 35 | +- Update .trinity/experiments.json with results |
| 36 | +- Update SKILL.md train dashboard |
| 37 | +- Commit if changes exist |
| 38 | + |
| 39 | +## Rules |
| 40 | + |
| 41 | +1. **Cosine schedule ALWAYS**: flat schedule = dead LR (R4 hit a ceiling at loss=6.0) |
| 42 | +2. **Don't touch running services**: act only on crashed and finished ones |
| 43 | +3. **0 idle slots**: every free slot = wasted time |
| 44 | +4. **Zig only**: no bash scripts |
| 45 | +5. **Data in experiment DB**: every result is recorded |
| 46 | +6. **Leader spawns variations**: best config → +/-lr, +/-batch, +features |
| 47 | +7. **Kill the dead**: diverging runs, speed collapse, LR exhausted |
| 48 | + |
| 49 | +## Experiment Priority Queue |
| 50 | + |
| 51 | +1. Repeat leader (v4R config: adam 3e-4 cosine 100K) on latest code |
| 52 | +2. Leader variations (+/-lr, +/-batch, +warmup) |
| 53 | +3. New features (phi-scale, adaptive-sparsity, full-ternary, ternary-schedule) |
| 54 | +4. LR sweep (1e-4 → 5e-3, cosine only) |
| 55 | +5. Batch sweep (66 → 1056, with grad_accum) |
| 56 | +6. LAMB optimizer variants (promising: v13L PPL=187@20K) |
| 57 | + |
| 58 | +## Success Metrics |
| 59 | + |
| 60 | +- PPL < 125 → new King (beats v4R) |
| 61 | +- tok/s > 15K on Railway → code speedup confirmed |
| 62 | +- 30+ parallel experiments → farm loaded |
| 63 | +- No service left crashed for > 10 min → auto-recovery works |
| 64 | + |
| 65 | +## Farm Accounts |
| 66 | + |
| 67 | +| Account | Project ID | Env ID | Token Env | |
| 68 | +|---------|-----------|--------|-----------| |
| 69 | +| primary | aa0efa7f-... | 6748f1ad-... | RAILWAY_API_TOKEN | |
| 70 | +| farm-2 | ca4303d2-... | d8602284-... | RAILWAY_API_TOKEN_2 | |
| 71 | +| farm-3 | 292e8862-... | 912e9084-... | RAILWAY_API_TOKEN_3 | |
| 72 | + |
| 73 | +## Key GraphQL Mutations |
| 74 | + |
| 75 | +``` |
| 76 | +# Set env vars |
| 77 | +variableCollectionUpsert(input: { projectId, environmentId, serviceId, variables: {...} }) |
| 78 | + |
| 79 | +# Clear startCommand (CRITICAL — must be null for Zig entrypoint) |
| 80 | +serviceInstanceUpdate(serviceId, environmentId, input: { startCommand: null }) |
| 81 | + |
| 82 | +# Redeploy |
| 83 | +serviceInstanceRedeploy(serviceId, environmentId) |
| 84 | + |
| 85 | +# Delete stale service |
| 86 | +serviceDelete(id) |
| 87 | +``` |
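
The kill criterion in step 4 (ANALYZE), "avg10 rising 3 measurements in a row", could be checked roughly as follows. This is only an illustrative sketch in Python; the farm's own tooling stays in Zig per rule 4. The function name `is_diverging`, the `rises_needed` parameter, and the reading of avg10 as a 10-sample running average of the loss are assumptions, not taken from the skill file.

```python
def is_diverging(avg10_history, rises_needed=3):
    """Return True when each of the last `rises_needed` measurements
    is strictly higher than the one before it.

    `avg10_history` is a hypothetical list of avg10 values, oldest first
    (assumed here to be a running average of the training loss).
    """
    # Three consecutive rises need four data points to compare.
    if len(avg10_history) < rises_needed + 1:
        return False  # not enough measurements yet, so do not kill

    recent = avg10_history[-(rises_needed + 1):]
    # Compare each measurement with its predecessor.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    print(is_diverging([5.2, 5.1, 5.3, 5.4, 5.6]))  # True: three rises in a row
    print(is_diverging([5.2, 5.1, 5.3, 5.2, 5.4]))  # False: the streak was broken
```

Requiring a strict rise on every step keeps a single noisy measurement from triggering a kill, in line with rule 2 (do not touch a healthy running service).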
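
Rule 1 and the first queue entry (v4R: adam 3e-4 cosine 100K) rely on a cosine learning-rate schedule. The skill file does not spell out the exact formula used by the trainer, so the following is a minimal sketch of the standard cosine decay, written in Python for illustration only. The function name `cosine_lr`, the absence of warmup, and a final learning rate of 0 are assumptions; the peak of 3e-4 and the 100K-step horizon come from the v4R entry.

```python
import math


def cosine_lr(step, total_steps=100_000, peak_lr=3e-4, min_lr=0.0):
    """Standard cosine decay from `peak_lr` at step 0 to `min_lr` at `total_steps`.

    Assumptions (not confirmed by the skill file): no warmup phase,
    and the learning rate stays at `min_lr` after `total_steps`.
    """
    # Clamp the step so values outside [0, total_steps] stay well defined.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 20_000, 50_000, 100_000):
        print(step, cosine_lr(step))
```

Under these assumptions the learning rate is still at roughly 90% of its peak at 20K steps and reaches exactly half of the peak at 50K, so the optimizer keeps making progress well past the 20K mark.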