|
1 | 1 | # Night Evolution Map — Cloud Dev Pipeline Hardening |
2 | 2 | # 2026-03-11 night → 2026-03-12 morning |
3 | 3 |
|
4 | | -## Current State (updated) |
5 | | -- 3 PRs merged (#129, #130, #138), 3 closed with conflicts (#132, #133, #139) |
6 | | -- PR #141 (Golden Chain pipeline from agent-140) — CI pending, compile fixes pushed |
7 | | -- Docker image rebuilt with heartbeat + pipefail + Telegram fixes |
8 | | -- 2 Railway services reusable (agent-126, agent-131) |
9 | | -- agent-126 redeployed for issue #134, agent-131 for issue #135 |
| 4 | +## Final Status |
10 | 5 |
|
11 | | -## Evolution Phases (Priority Order) |
| 6 | +### PRs Merged (5) + Closed (2 quality) |
| 7 | +| PR | Title | Source | |
| 8 | +|----|-------|--------| |
| 9 | +| #129 | JSONL event persistence + deduplication | agent-124 | |
| 10 | +| #130 | Git worktree isolation for faster startup | agent-125 | |
| 11 | +| #138 | Buffer size increase, CLAUDE.md.agent to .gitignore | agent-136 | |
| 12 | +| #141 | Golden Chain pipeline — tri cloud pipeline/verify/merge | agent-140 | |
| 13 | +| #142 | Telegram log streaming — batch every 5s + output classifier | agent-131 | |
12 | 14 |
|
13 | | -### Phase 1: Merge Ready PRs ✅ DONE |
14 | | -- [x] PR #129 (JSONL event persistence) → merged |
15 | | -- [x] PR #130 (git worktree isolation) → merged |
16 | | -- [x] PR #138 (fix #136 — buffer + .gitignore) → merged |
17 | | -- [x] PR #132, #133, #139 → closed (conflicts after merges) |
18 | | -- [x] PR #141 (agent-140 Golden Chain) → compile fixes pushed, CI pending |
| 15 | +### PRs Closed (3, superseded by direct fixes) |
| 16 | +| PR | Reason | |
| 17 | +|----|--------| |
| 18 | +| #132 | Merge conflict after #129/#130 merged | |
| 19 | +| #133 | Merge conflict after #129/#130 merged | |
| 20 | +| #139 | Merge conflict, fixes applied directly | |
19 | 21 |
|
20 | | -### Phase 2: Entrypoint Hardening ✅ DONE |
21 | | -- [x] Heartbeat subshell bug → temp file `/tmp/agent_heartbeat_state` |
22 | | -- [x] `report_status()` writes to heartbeat file |
23 | | -- [x] Telegram notification ordering (BEFORE LAST_STATUS update) |
24 | | -- [x] `set -eo pipefail` + `#!/bin/bash` shebang |
25 | | -- [x] HTML escape helper `escape_html()` |
26 | | -- [x] `send_telegram()` uses temp file for JSON (no escaping issues) |
27 | | -- [x] Docker image rebuilt + pushed to GHCR |
| 22 | +### Issues Closed (5) |
| 23 | +| Issue | Resolution | |
| 24 | +|-------|-----------| |
| 25 | +| #134 | Fixed: u32→i64 timestamp, entry_idx dedup | |
| 26 | +| #135 | Fixed: VOLUME shadow, worktree -b branch | |
| 27 | +| #136 | Fixed via PR #138 merge | |
| 28 | +| #137 | Fixed: pipefail, bash shebang, Telegram ordering | |
| 29 | +| #140 | Fixed via PR #141 merge | |
28 | 30 |
|
29 | | -### Phase 3: Orchestrator CLI (partially done by agent-140) |
30 | | -Agent-140's PR #141 adds: |
31 | | -- [x] `tri cloud pipeline <N>` — spawn → monitor → verify → merge → cleanup |
32 | | -- [x] `tri cloud verify <N>` — local zig build check |
33 | | -- [x] `tri cloud merge <N>` — merge PR via gh CLI |
34 | | -- [x] Enhanced `tri cloud agents` — stuck detection, health indicators, elapsed formatting |
35 | | -Already working from before: |
36 | | -- [x] `tri cloud spawn <N>` — calls Railway API |
37 | | -- [x] `tri cloud kill <N>` — delete service |
38 | | -- [x] `tri cloud agents` — list active containers |
39 | | -- [ ] `tri cloud logs <N>` — fetch Railway deploy logs |
40 | | -- [ ] Service recycling in CLI (currently manual via env var update) |
| 31 | +### Direct Commits to Main (3) |
| 32 | +1. `b470c5ae7` — heartbeat subshell + pipefail + Telegram ordering + HTML escape |
| 33 | +2. `fe6dc534e` — u32 overflow, entry_idx duplicates, VOLUME shadow, worktree conflict |
| 34 | +3. Merge commits for PRs #129, #130, #138, #141 |
41 | 35 |
|
42 | | -### Phase 4: Auto-Pipeline (in PR #141) |
43 | | -- [x] Spawn → monitor heartbeats → detect DONE/FAIL (in PR) |
44 | | -- [x] On DONE: fetch PR, run `zig build` locally (in PR) |
45 | | -- [x] On pass: auto-merge PR (in PR) |
46 | | -- [x] On fail: respawn (max 3x) (in PR) |
47 | | -- [ ] Create fix-issue with review on failure |
48 | | -- [ ] Cleanup container after completion |
| 36 | +### Docker Image Rebuilt (2x) |
| 37 | +- First: heartbeat + pipefail + Telegram fixes |
| 38 | +- Second: VOLUME shadow removal + worktree branch fix |
49 | 39 |
|
50 | | -### Phase 5: Monitoring & Metrics |
51 | | -- [x] JSONL event persistence (PR #129 merged) |
52 | | -- [ ] Agent solve rate dashboard |
53 | | -- [ ] Cost per agent tracking |
54 | | -- [ ] Token usage estimation |
55 | | -- [ ] Success/fail/retry counters |
| 40 | +## Phase Completion |
56 | 41 |
|
57 | | -### Phase 6: Agent Intelligence |
58 | | -- [x] Agent reads CLAUDE.md (via SOUL.md injection) |
59 | | -- [x] Better commit messages (include issue number) |
60 | | -- [ ] Agent checks out existing branch for fix-issues |
61 | | -- [ ] Agent runs `zig build -Dci=true` instead of full build |
62 | | -- [ ] Multi-file context awareness |
| 42 | +| Phase | Status | Detail | |
| 43 | +|-------|--------|--------| |
| 44 | +| 1. Merge PRs | DONE | 5 merged, 3 closed |
| 45 | +| 2. Entrypoint Hardening | DONE | 6 fixes applied | |
| 46 | +| 3. Orchestrator CLI | 80% | pipeline/verify/merge added, logs TBD | |
| 47 | +| 4. Auto-Pipeline | 70% | In PR #141, needs testing | |
| 48 | +| 5. Monitoring | 30% | JSONL working, dashboard TBD | |
| 49 | +| 6. Agent Intelligence | 20% | SOUL.md works, branch reuse TBD | |
63 | 50 |
|
64 | | -## Active Agents |
65 | | -- agent-126 → issue #134 (fix PR #129 bugs — u32 timestamp, duplicates, buffer) |
66 | | -- agent-131 → issue #135 (fix PR #130 bugs — VOLUME shadow, worktree conflicts) |
| 51 | +## Remaining Open Issues |
| 52 | +- #131 feat(cloud): Stream all container logs to Telegram in realtime |
| 53 | +- #126 Cloud Dev: Structured ACI protocol |
| 54 | +- #128, #127 FPGA/pipeline TODOs (lower priority) |
67 | 55 |
|
68 | | -## Next Steps |
69 | | -1. Wait for PR #141 CI → merge if passes |
70 | | -2. Monitor agents #134, #135 → review PRs when ready |
71 | | -3. Spawn agent for #137 (fix PR #133 bugs) when slot frees up |
72 | | -4. Create issue for `tri cloud logs` command |
73 | | -5. Create issue for service recycling in CLI |
| 56 | +## Key Fixes Applied |
| 57 | +1. Heartbeat reads from temp file (subshell isolation solved) |
| 58 | +2. Telegram gets notifications on every status change (ordering fix) |
| 59 | +3. HTML escaping + safe JSON via temp files |
| 60 | +4. `#!/bin/bash` + `set -eo pipefail` |
| 61 | +5. `i64` timestamps (no more u32 overflow) |
| 62 | +6. No duplicate JSONL entries |
| 63 | +7. No VOLUME shadowing bare repo |
| 64 | +8. Concurrent agents get unique branches |
| 65 | +9. Golden Chain: `tri cloud pipeline <N>` automates full cycle |
| 66 | +10. Telegram `editMessageText` — 1 dashboard message updated in place |
| 67 | +11. `NO_COLOR=1` in containers for clean output |
| 68 | +12. Worktree lock/unlock prevents accidental pruning |
| 69 | +13. Workflow reuses services instead of delete+create (avoids 25/day limit) |
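Key fixes 1 and 3 above can be sketched as follows. This is a minimal illustration, not the actual entrypoint: the path `/tmp/agent_heartbeat_state` and the names `report_status()` and `escape_html()` come from the notes, while the function bodies, the `STATUS` variable, and the one-tick stand-in for the heartbeat loop are assumptions made for the example.

```bash
#!/bin/bash
set -eo pipefail

# Path from the notes; overridable so the sketch can run anywhere.
HEARTBEAT_FILE="${HEARTBEAT_FILE:-/tmp/agent_heartbeat_state}"

# Assumed body: persist the latest status where every process can read it.
report_status() {
    printf '%s\n' "$1" > "$HEARTBEAT_FILE"
}

# Escape the three characters Telegram's HTML parse mode treats as markup.
# "&" is handled first so the entities produced for "<" and ">" are not re-escaped.
escape_html() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

STATUS="STARTING"            # plain variable: copied into the subshell at fork time
report_status "STARTING"
demo_out="$(mktemp)"

# Stand-in for the heartbeat loop: a single tick, one second after the fork.
(
    sleep 1
    echo "var=$STATUS file=$(cat "$HEARTBEAT_FILE")"
) > "$demo_out" &
heartbeat_pid=$!

STATUS="CODING"              # invisible to the already-forked subshell
report_status "CODING"       # visible: the subshell reads the file on its tick

wait "$heartbeat_pid"
heartbeat_line="$(cat "$demo_out")"
rm -f "$demo_out"
echo "$heartbeat_line"       # var=STARTING file=CODING
echo "$(escape_html '<b>a & b</b>')"
```

The background subshell keeps the value `STATUS` had when it was forked, which is why the status has to travel through a file. The escaping helper uses `sed` rather than Bash pattern substitution because newer Bash versions can treat an unquoted `&` in the replacement as "the matched text".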
74 | 70 |
|
75 | | -## Constraints |
76 | | -- 2 Railway services available (agent-126, agent-131) |
77 | | -- z.ai proxy ~8min per agent run |
78 | | -- Telegram 30 msg/min rate limit |
79 | | -- Docker rebuild ~90s (cached layers) |
| 71 | +## Active Agents (latest cycle — 16:33 UTC) |
| 72 | +- **ubuntu** service → #126 — 🔴 FAILED (0 commits, 619s — issue too abstract for autonomous agent) |
| 73 | +- **Agents Anywhere** service → #131 — 🔵 DONE → PR #142 merged |
| 74 | +- **Agents Anywhere** service → #115 (VIBEE eqlPrimitive fix) — 🔴 DONE but push failed 3x, no PR created |
| 75 | +- **ubuntu** service → #114 (VIBEE undefined Field type) — 🔴 DONE but push failed (git auth bug) |
| 76 | +- **Agents Anywhere** service → #116 (Re-verify stale ast-check) — 🔴 FAILED (gh can't read issue — missing --repo) |
| 77 | +- PR #143 from agent-126 — 🔴 CLOSED (review: grep -oP not portable, worktree cleanup order) |
| 78 | +- **Docker rebuild #3** — fixes: `gh auth setup-git`, `--repo` on all gh commands, PUSH_OK tracking |
| 79 | +- **ubuntu** service → #114 (RETRY) — 🚀 REDEPLOYED 16:55 UTC with fixed image |
| 80 | +- **Agents Anywhere** service → #116 (RETRY) — 🚀 REDEPLOYED 16:55 UTC with fixed image |
| 81 | + |
| 82 | +## Bug Found & Fixed This Cycle |
| 83 | +14. `sleepApplication: true` was set on the "Agents Anywhere" service, so Railway put the container to sleep before the entrypoint ran. Fixed via `serviceInstanceUpdate` + redeploy.
| 84 | + |
| 85 | +## Lessons Learned |
| 86 | +1. Railway MCP `deploy` uploads source, NOT Docker image — use GraphQL API |
| 87 | +2. `startCommand` overrides Docker ENTRYPOINT — must set via serviceInstanceUpdate |
| 88 | +3. Service creation is limited to 25 per day: never delete+create, always reuse
| 89 | +4. `variableCollectionUpsert` needs actual values, not empty shell vars |
| 90 | +5. Service names with spaces break Railway CLI — avoid spaces in service names |
| 91 | +6. `sleepApplication: true` silently kills agent containers — always set to false for batch jobs |
| 92 | +7. Abstract/design issues (#126 "Structured ACI protocol") produce 0 commits — agents need concrete, code-level tasks with specific files/functions to modify |
| 93 | +8. `retry "git push ... 2>/dev/null" || true` silently swallows push failures — agent reports DONE with no PR. Fixed: track PUSH_OK, skip PR creation if push fails, report FAILED explicitly |
| 94 | +9. **CRITICAL**: `gh auth login` authenticates only the `gh` CLI, NOT `git push`. Fixed: `gh auth setup-git`, which registers `gh` as a git credential helper
| 95 | +10. **CRITICAL**: All `gh issue`/`gh pr` commands lacked the `--repo` flag, and bare-repo worktrees have no git remote context for `gh` to infer the repository from. Fixed: extract `GH_REPO` from `REPO_URL`, add `--repo` to all gh calls
| 96 | +11. Docker rebuild #3 deployed with fixes #8-10. Both services redeployed 16:55 UTC |
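Lessons 8 and 10 above can be illustrated with a short sketch. It is written under stated assumptions and is not the entrypoint's real code: `GH_REPO`, `REPO_URL`, and the idea of tracking `PUSH_OK` come from the notes, while `repo_slug`, this argv-style `retry`, `publish`, and the placeholder URL are hypothetical names introduced for the example. The push command is passed in as an argument so the sketch runs without network access.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical helper: derive "owner/name" from an HTTPS or SSH GitHub URL.
repo_slug() {
    local url="$1"
    url="${url%.git}"               # drop an optional .git suffix
    url="${url#*github.com[:/]}"    # drop everything up to and including the host
    printf '%s\n' "$url"
}

# Hypothetical retry wrapper: run a command up to N times, stop at first success.
retry() {
    local attempts="$1" i
    shift
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
    done
    return 1
}

# Run the push command given as arguments and print the status to report.
publish() {
    local push_ok=0                 # plays the role of PUSH_OK
    if retry 3 "$@"; then
        push_ok=1
    fi
    if [ "$push_ok" -eq 1 ]; then
        # A real entrypoint would open the PR here, for example:
        #   gh pr create --repo "$GH_REPO" --fill
        echo "DONE"
    else
        echo "FAILED"               # no PR attempt after a failed push
    fi
}

REPO_URL="https://github.com/example-org/example-repo.git"   # placeholder URL
GH_REPO="$(repo_slug "$REPO_URL")"
echo "$GH_REPO"        # example-org/example-repo
publish true           # DONE
publish false          # FAILED
```

The point of the design is that the push result is never discarded with `|| true` or `2>/dev/null`: the exit status of the retried command decides whether the PR step runs and which status is reported.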