Commit 7cfc8ac

Antigravity Agent and claude committed
chore: update agent state files and iteration docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f978755 commit 7cfc8ac

5 files changed

Lines changed: 514 additions & 3 deletions

.claude/ITERATE.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# /iterate: Trinity Training Farm Auto-Pilot

## Cycle (every 15 min or manual /iterate)

### 1. DIAGNOSE (30s)
Check all 3 accounts via GraphQL. Record: slots free, crashed, running, finished.

### 2. FIX (1 min)
- Crashed → check logs → fix startCommand/env vars → redeploy
- Token dead → notify user (can't fix programmatically)
- Data missing → verify train_100k.txt baked in Docker image
- Build failed → check Dockerfile.hslm-train, retry with usePreviousImageTag
- LR=0 / loss flat → ONLY cosine schedule (flat = dead by 20K)
- startCommand bug → must be null (use Dockerfile ENTRYPOINT)

### 3. FILL (2 min)
- Finished → record result in .trinity/experiments.json → launch next from queue
- Free slot → create service → set env vars → deploy
- Goal: 0 idle slots
- Use variableCollectionUpsert + serviceInstanceRedeploy

### 4. ANALYZE (1 min)
- Compare running experiments by loss/PPL at same step count
- Kill diverging (avg10 rising 3 measurements in a row)
- Identify leader → create variations in queue
- Current leader: v4R PPL=125, challenger: v13L PPL=187@20K

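The kill rule in step 4 ("avg10 rising 3 measurements in a row") can be sketched as below. This is a minimal illustration in Python, not the farm's Zig code. It assumes `avg10` means the mean of the last 10 logged loss values and reads "3 in a row" as three consecutive increases, which needs four measurements. Both function names are hypothetical.

```python
def avg10(losses):
    """Mean of the last 10 loss values (assumed meaning of 'avg10')."""
    tail = losses[-10:]
    if not tail:
        raise ValueError("no loss values recorded")
    return sum(tail) / len(tail)


def is_diverging(avg10_history, rises=3):
    """True if avg10 increased on each of the last `rises` measurements."""
    if len(avg10_history) < rises + 1:
        return False  # not enough data to see `rises` consecutive increases
    recent = avg10_history[-(rises + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Four most recent values 5.31 < 5.35 < 5.42 < 5.50: three rises in a row.
print(is_diverging([5.40, 5.31, 5.35, 5.42, 5.50]))  # True
```

A single dip resets the count, so a noisy but improving run is not killed by this check.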
### 5. CODE (if needed)
- New feature blocks training → implement in Zig
- Test fails → fix
- Performance issue → profile → optimize
- Always: `zig build` + `zig test`, no bash

### 6. DOCUMENT (30s)
- Update .trinity/experiments.json with results
- Update SKILL.md train dashboard
- Commit if changes exist

## Rules

1. **Cosine schedule ALWAYS**: flat = dead LR (R4 proved ceiling at loss=6.0)
2. **Don't touch running**: only crashed and finished
3. **0 idle slots**: every free slot = wasted time
4. **Zig only**: no bash scripts
5. **Data in experiment DB**: every result recorded
6. **Leader spawns variations**: best config → +/-lr, +/-batch, +features
7. **Kill the dead**: diverging, speed collapse, LR exhausted

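Rule 1 refers to a cosine learning-rate schedule. One common form (cosine decay from a peak to a floor, with optional linear warmup) is sketched below in Python for illustration only. The defaults mirror the v4R config named in this file (3e-4, 100K steps). The `min_lr` floor, the warmup shape, and the function name are assumptions, not taken from the training code.

```python
import math


def cosine_lr(step, total_steps=100_000, peak_lr=3e-4, min_lr=0.0, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 20_000, 50_000, 100_000):
    print(s, cosine_lr(s))
```

Without warmup the rate starts at the peak, reaches half the peak at the midpoint, and hits `min_lr` at `total_steps`.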
## Experiment Priority Queue

1. Repeat leader (v4R config: adam 3e-4 cosine 100K) on latest code
2. Leader variations (+/-lr, +/-batch, +warmup)
3. New features (phi-scale, adaptive-sparsity, full-ternary, ternary-schedule)
4. LR sweep (1e-4 → 5e-3, cosine only)
5. Batch sweep (66 → 1056, with grad_accum)
6. LAMB optimizer variants (promising: v13L PPL=187@20K)

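Queue items 4 and 5 describe sweeps over learning rate and batch size. One way to enumerate them is sketched below in Python for illustration: log-spaced learning rates between the stated endpoints, and batch sizes doubling from 66 to 1056. The number of LR points and the doubling step are assumptions, since the file gives only the endpoints.

```python
def lr_sweep(lo=1e-4, hi=5e-3, points=5):
    """Log-spaced learning rates from lo to hi, endpoints included."""
    if points < 2:
        raise ValueError("need at least two points")
    ratio = (hi / lo) ** (1.0 / (points - 1))
    return [lo * ratio**i for i in range(points)]


def batch_sweep(lo=66, hi=1056):
    """Batch sizes doubling from lo while they stay at or below hi."""
    sizes = []
    size = lo
    while size <= hi:
        sizes.append(size)
        size *= 2
    return sizes


print(batch_sweep())  # [66, 132, 264, 528, 1056]
print(lr_sweep())
```

Doubling fits the stated range exactly because 1056 = 66 × 16.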
## Success Metrics

- PPL < 125 → new King (beats v4R)
- tok/s > 15K on Railway → code speedup confirmed
- 30+ parallel experiments → farm loaded
- 0 crashed > 10 min → auto-recovery works

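The PPL threshold above can be related to training loss if perplexity is the exponential of the mean per-token cross-entropy in nats. That is an assumption about how this farm computes PPL (a base-2 loss would need `2 ** loss` instead). Under it, a small Python sketch converts between the two; the function names are hypothetical.

```python
import math


def perplexity(loss_nats):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


def loss_for_ppl(ppl):
    """Loss in nats that corresponds to a target perplexity."""
    return math.log(ppl)


print(round(perplexity(6.0), 1))    # 403.4
print(round(loss_for_ppl(125), 2))  # 4.83
```

Under this assumption, beating the PPL=125 leader means pushing loss below roughly 4.83, and the loss=6.0 ceiling noted in the rules corresponds to a PPL of about 403.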
## Farm Accounts

| Account | Project ID | Env ID | Token Env |
|---------|-----------|--------|-----------|
| primary | aa0efa7f-... | 6748f1ad-... | RAILWAY_API_TOKEN |
| farm-2 | ca4303d2-... | d8602284-... | RAILWAY_API_TOKEN_2 |
| farm-3 | 292e8862-... | 912e9084-... | RAILWAY_API_TOKEN_3 |

## Key GraphQL Mutations

```
# Set env vars
variableCollectionUpsert(input: { projectId, environmentId, serviceId, variables: {...} })

# Clear startCommand (CRITICAL: must be null for Zig entrypoint)
serviceInstanceUpdate(serviceId, environmentId, input: { startCommand: null })

# Redeploy
serviceInstanceRedeploy(serviceId, environmentId)

# Delete stale service
serviceDelete(id)
```
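Two of the mutations above might be wrapped as GraphQL request bodies as sketched below, in Python for illustration only. The mutation names and arguments come from the list above. The `String!` variable types, the operation names, and the `build_request` helper are assumptions, and the endpoint URL and auth header are deliberately left out. Check the types against the Railway schema before use.

```python
import json

# Assumed variable types; the source lists only argument names.
REDEPLOY = """
mutation Redeploy($serviceId: String!, $environmentId: String!) {
  serviceInstanceRedeploy(serviceId: $serviceId, environmentId: $environmentId)
}
"""

CLEAR_START_COMMAND = """
mutation ClearStartCommand($serviceId: String!, $environmentId: String!) {
  serviceInstanceUpdate(
    serviceId: $serviceId
    environmentId: $environmentId
    input: { startCommand: null }
  )
}
"""


def build_request(query, service_id, environment_id):
    """Serialize a GraphQL request body (hypothetical helper, sends nothing)."""
    return json.dumps({
        "query": query.strip(),
        "variables": {"serviceId": service_id, "environmentId": environment_id},
    })


# Placeholder IDs, not real Railway identifiers.
print(build_request(REDEPLOY, "svc-123", "env-456"))
```

Passing IDs as variables instead of formatting them into the query string keeps the mutation text fixed across all three accounts.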

.ralph/state/wake_count

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-53
+88
