Operating Loops in Production

Running a loop is operations work. This doc covers cost, logging, metrics, and when to pause or kill.

Token & Cost Budgeting

Estimate before scheduling:

npx @cobusgreyling/loop-cost --pattern <id> --cadence <interval> --level L1
npx @cobusgreyling/loop-init . --pattern <id>   # scaffolds loop-budget.md + loop-run-log.md + loop-budget skill

loop-audit scores cost observability and caps L3 until budget + run log + LOOP.md budget section exist.

Rough planning factors:

Factor	Impact
Cadence	Linear multiplier (5m vs 1d = 288× runs/day)
Sub-agents per run	Each = full model + tool round-trips
Context size	Large repos + full CI logs = expensive triage
Verifier model	Stronger model on verifier = worth it for unattended

Example Estimates (order of magnitude)

Assume ~50k tokens per light triage run, ~200k per run with implementer + verifier:

Loop	Cadence	Runs/day	Rough daily tokens
Daily triage (report only)	1d	1	~50k
CI sweeper (light)	15m	96	~5M (if every run is full — avoid)
PR babysitter	5m	288	High — use early exit

Best practice: Triage pass is cheap; spawn sub-agents only when state says actionable. Empty watchlist → exit in <5k tokens.

Budget Rules

## Loop Budget — Project X
- Max tokens/day: 2M (adjust to your plan)
- On exceed: pause schedulers, notify human
- Max sub-agent spawns per run: 3

Encode in skill or scheduler prompt: "If no high-priority items, exit immediately."

Logging Each Run

Minimum log entry (append to loop-run-log.md or structured JSON):

{
  "run_id": "2026-06-09T08:15:00Z",
  "pattern": "daily-triage",
  "duration_s": 45,
  "items_found": 4,
  "actions_taken": 1,
  "escalations": 0,
  "tokens_estimate": 52000,
  "outcome": "success"
}

Human-readable alternative in STATE.md footer:

---
Run log: 2026-06-09 08:15 | 4 findings | 1 worktree opened | 0 escalations

Metrics Dashboard (Template)

Track weekly in a spreadsheet or Notion:

Metric	PR Babysitter	Daily Triage	CI Sweeper
Runs
Actionable findings
Auto-fixes proposed
Human escalations
False positives
Mean time to human awareness
Token spend (est.)

Pattern-specific success metrics are in each pattern.

When to Slow Down

Token budget >80% mid-week
False positive rate >30% on triage
Same item escalated 2+ times in 48h
Major release week — pause auto-fix loops, report-only

When to Pause

Production incident in progress (loop may fight hotfix)
Breaking schema migration underway
Key human reviewer OOO + auto-merge was enabled (don't)

When to Kill a Loop

Consistent S2 failures from failure-modes.md
Cost > value for 2 consecutive weeks
Team mutes all notifications
Pattern replaced by event-driven alternative (e.g. CI Action only)

Kill checklist:

scheduler_delete / disable Automation / remove Action
Archive state file with status: retired
Post-mortem in stories/ (optional but valuable)

Upgrade Path

Report-only (L1) → 1–2 weeks stable triage
       ↓
Small auto-wins (L2) → verifier + worktree + max attempts
       ↓
Connectors (L2+) → PRs/tickets updated automatically
       ↓
Unattended (L3) → only with denylist, budget, metrics, human gates

Never skip L1 for a new pattern on a production repo.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operating Loops in Production

Token & Cost Budgeting

Example Estimates (order of magnitude)

Budget Rules

Logging Each Run

Metrics Dashboard (Template)

When to Slow Down

When to Pause

When to Kill a Loop

Upgrade Path

FilesExpand file tree

operating-loops.md

Latest commit

History

operating-loops.md

File metadata and controls

Operating Loops in Production

Token & Cost Budgeting

Example Estimates (order of magnitude)

Budget Rules

Logging Each Run

Metrics Dashboard (Template)

When to Slow Down

When to Pause

When to Kill a Loop

Upgrade Path