|
| 1 | +# LLM Agent Economics: Closed-Loop Desktop Automation |
| 2 | + |
| 3 | +*Analysis date: March 3, 2026. Pricing verified against Anthropic and OpenAI docs.* |
| 4 | + |
| 5 | +## Context |
| 6 | + |
| 7 | +OpenAdapt uses a closed-loop architecture where: |
| 8 | +1. **Claude Sonnet 4.6** executes desktop tasks via `computer_use` (the agent) |
| 9 | +2. **GPT-4.1-mini** verifies step outcomes via low-res screenshots (the verifier) |
| 10 | +3. A `DemoController` state machine orchestrates retry and replan on failure |
| 11 | + |
| 12 | +This document analyzes the unit economics of this approach and compares alternatives. |
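
The retry-and-replan behaviour in item 3 can be pictured roughly as below. This is a minimal sketch, not OpenAdapt's actual `DemoController`: the function name `run_plan`, the injected callables `execute`, `verify`, and `replan`, and the `max_retries=2` default are all hypothetical stand-ins for the agent call, the verifier call, and the replanning call.

```python
from typing import Callable, List, Tuple


def run_plan(
    steps: List[str],
    execute: Callable[[str], None],      # hypothetical: ask the agent to perform one step
    verify: Callable[[str], bool],       # hypothetical: ask the verifier whether it succeeded
    replan: Callable[[str], List[str]],  # hypothetical: propose replacement steps
    max_retries: int = 2,                # assumed default, not taken from the real controller
) -> List[str]:
    """Run steps in order; retry a failed step, then replan it once, then skip it."""
    log: List[str] = []
    # Each queue entry is (step, already_replanned).
    queue: List[Tuple[str, bool]] = [(step, False) for step in steps]
    while queue:
        step, replanned = queue.pop(0)
        ok = False
        for _ in range(1 + max_retries):  # first attempt plus retries
            execute(step)
            if verify(step):
                ok = True
                break
        if ok:
            log.append(f"verified: {step}")
        elif not replanned:
            # Swap the failed step for new steps; mark them so they are not replanned again.
            queue = [(s, True) for s in replan(step)] + queue
            log.append(f"replanned: {step}")
        else:
            log.append(f"skipped: {step}")
    return log
```

Passing the agent and verifier in as plain callables keeps the loop testable without any API calls; a real controller would also track budgets such as the 30-step limit.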
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## 1. API Pricing (March 2026) |
| 17 | + |
| 18 | +| Model | Input / 1M tokens | Output / 1M tokens | Cache Read | Cache Write (5-min TTL) | |
| 19 | +|-------|-------------------|---------------------|------------|------------------------| |
| 20 | +| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 (10%) | $3.75 (1.25x) | |
| 21 | +| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 (10%) | $6.25 (1.25x) | |
| 22 | +| GPT-4.1-mini | $0.40 | $1.60 | — | — | |
| 23 | +| GPT-4.1-nano | $0.02 | $0.15 | — | — | |
| 24 | + |
| 25 | +### Image token costs |
| 26 | + |
| 27 | +| Provider | Formula | 1280x720 screenshot | Cost per image | |
| 28 | +|----------|---------|---------------------|----------------| |
| 29 | +| Claude | `(width * height) / 750` | ~1,229 tokens | $0.0037 (Sonnet) | |
| 30 | +| GPT-4.1-mini (`detail: low`) | Fixed 85 tokens | 85 tokens | $0.000034 | |
| 31 | + |
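| 32 | +The VLM verifier is ~100x cheaper per image than Claude for two compounding reasons: `detail: low` collapses any image to a fixed 85 tokens (~14x fewer than Claude's ~1,229), and GPT-4.1-mini input tokens cost 7.5x less than Sonnet's ($0.40 vs. $3.00 per 1M). |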
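
The per-image figures in the table can be reproduced with a few lines of arithmetic. The sketch below uses only the formula and prices quoted above; the function and constant names are illustrative.

```python
SONNET_INPUT_PER_MTOK = 3.00      # USD per 1M input tokens, from the pricing table
GPT41_MINI_INPUT_PER_MTOK = 0.40  # USD per 1M input tokens, from the pricing table
LOW_DETAIL_TOKENS = 85            # fixed token count quoted for `detail: low`


def claude_image_tokens(width: int, height: int) -> float:
    """Approximate Claude image tokens using the (width * height) / 750 formula."""
    return (width * height) / 750


claude_tokens = claude_image_tokens(1280, 720)
claude_cost = claude_tokens * SONNET_INPUT_PER_MTOK / 1e6
verifier_cost = LOW_DETAIL_TOKENS * GPT41_MINI_INPUT_PER_MTOK / 1e6
ratio = claude_cost / verifier_cost

print(f"Claude: {claude_tokens:.0f} tokens, ${claude_cost:.4f} per image")
print(f"Verifier: {LOW_DETAIL_TOKENS} tokens, ${verifier_cost:.6f} per image")
print(f"Ratio: ~{ratio:.0f}x")
```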
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +## 2. Measured Cost: Task `04d9aeaf` (LibreOffice Calc) |
| 37 | + |
| 38 | +Task: create a sheet with 4 headers, compute annual changes for 3 asset columns, format as percentages. 21 steps in the human recording. |
| 39 | + |
| 40 | +### 2A. Claude agent (cumulative conversation) |
| 41 | + |
| 42 | +The `ClaudeComputerUseAgent` maintains a **multi-turn conversation** — each API call includes all prior screenshots and messages. This makes cost **quadratic** in task length: |
| 43 | + |
| 44 | +| Step | Input tokens on this call (est., includes full history) | Cumulative screenshots | |
| 45 | +|------|-------------------------------|----------------------| |
| 46 | +| 1 | ~2,500 | 1 | |
| 47 | +| 5 | ~12,000 | 5 | |
| 48 | +| 10 | ~25,000 | 10 | |
| 49 | +| 15 | ~40,000 | 15 | |
| 50 | +| 20 | ~55,000 | 20 | |
| 51 | +| 25 | ~70,000 | 25 | |
| 52 | + |
| 53 | +Per-step composition: ~500 system prompt + ~800 user message + ~400 plan progress + ~1,229 screenshot + ~200 assistant response. |
| 54 | + |
| 55 | +Total across 25 steps (arithmetic-series sum of the per-call inputs, roughly 25 × (2,500 + 70,000) / 2): ~906K input tokens, ~6.3K output tokens. |
| 56 | + |
| 57 | +| Component | Tokens | Cost | |
| 58 | +|-----------|--------|------| |
| 59 | +| Claude input (25 steps) | ~906K | $2.72 | |
| 60 | +| Claude output (25 steps) | ~6.3K | $0.09 | |
| 61 | +| **Claude agent total** | | **~$2.81** | |
| 62 | +| With prompt caching (est. 65% cacheable) | | **~$1.50–2.00** | |
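
The totals above follow from summing the per-call input sizes. The sketch below assumes per-call input grows linearly between the step-1 (~2,500) and step-25 (~70,000) estimates; that linear interpolation is an assumption, so intermediate steps differ slightly from the table.

```python
SONNET_INPUT_PER_MTOK = 3.00    # USD per 1M input tokens
SONNET_OUTPUT_PER_MTOK = 15.00  # USD per 1M output tokens

FIRST_CALL_TOKENS = 2_500
# Assumed constant growth per step, fitted to the step-1 and step-25 estimates.
GROWTH_PER_STEP = (70_000 - 2_500) / 24  # 2,812.5 tokens


def total_input_tokens(n_steps: int) -> float:
    """Sum the input tokens over every call when each call resends the full history."""
    return sum(FIRST_CALL_TOKENS + GROWTH_PER_STEP * (k - 1) for k in range(1, n_steps + 1))


def agent_cost(n_steps: int, output_tokens: float) -> float:
    """Uncached Claude cost in USD for one attempt."""
    input_cost = total_input_tokens(n_steps) * SONNET_INPUT_PER_MTOK
    output_cost = output_tokens * SONNET_OUTPUT_PER_MTOK
    return (input_cost + output_cost) / 1e6


tokens_25 = total_input_tokens(25)
cost_25 = agent_cost(25, 6_300)
doubling_ratio = total_input_tokens(50) / total_input_tokens(25)

print(f"25 steps: {tokens_25:,.0f} input tokens, ${cost_25:.2f}")
print(f"Doubling to 50 steps multiplies input tokens by {doubling_ratio:.2f}")
```

Doubling the step count roughly quadruples the input tokens, which is the quadratic behaviour described above.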
| 63 | + |
| 64 | +### 2B. VLM verifier (independent calls) |
| 65 | + |
| 66 | +Each verification call is independent (no conversation history). With `detail: low`, image cost is negligible. |
| 67 | + |
| 68 | +| Call type | Count | Input tokens/call | Output tokens/call | Total cost | |
| 69 | +|-----------|-------|-------------------|-------------------|------------| |
| 70 | +| Step verification | ~15 | ~285 | ~100 | $0.004 | |
| 71 | +| Replan | ~2 | ~585 | ~500 | $0.002 | |
| 72 | +| Goal verification | ~1 | ~300 | ~100 | $0.0003 | |
| 73 | +| **VLM verifier total** | | | | **~$0.006** | |
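
The verifier rows can be checked the same way. This sketch uses the call counts and token estimates from the table together with the GPT-4.1-mini prices from Section 1; the variable names are illustrative.

```python
MINI_INPUT_PER_MTOK = 0.40   # USD per 1M input tokens
MINI_OUTPUT_PER_MTOK = 1.60  # USD per 1M output tokens

# (call type, count, input tokens per call, output tokens per call), from the table above
VERIFIER_CALLS = [
    ("step verification", 15, 285, 100),
    ("replan", 2, 585, 500),
    ("goal verification", 1, 300, 100),
]


def call_cost(count: int, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of `count` independent calls (no shared conversation history)."""
    per_call = input_tokens * MINI_INPUT_PER_MTOK + output_tokens * MINI_OUTPUT_PER_MTOK
    return count * per_call / 1e6


verifier_costs = {name: call_cost(n, i, o) for name, n, i, o in VERIFIER_CALLS}
verifier_total = sum(verifier_costs.values())

for name, cost in verifier_costs.items():
    print(f"{name}: ${cost:.4f}")
print(f"total: ${verifier_total:.4f}")
```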
| 74 | + |
| 75 | +### 2C. Total per-task cost |
| 76 | + |
| 77 | +| Scenario | Cost | |
| 78 | +|----------|------| |
| 79 | +| Single attempt (25 steps) | **$2.82** | |
| 80 | +| With prompt caching | **$1.50–2.00** | |
| 81 | +| 3 attempts to succeed | **$6.00–8.50** | |
| 82 | +| 5 attempts to succeed | **$10.00–14.10** | |
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +## 3. Cost Scaling |
| 87 | + |
| 88 | +### By task length |
| 89 | + |
| 90 | +Cost grows **quadratically** because each step adds a roughly constant amount of context that every later call must resend: per-call input grows linearly with the step number, so the total is the sum of an arithmetic series. |
| 91 | + |
| 92 | +| Task length | Single attempt | 3 attempts | Human ($20/hr) | |
| 93 | +|-------------|---------------|------------|----------------| |
| 94 | +| 5 steps | $0.30–0.60 | $0.90–1.80 | $0.50 (1.5 min) | |
| 95 | +| 10 steps | $0.80–1.20 | $2.40–3.60 | $0.83 (2.5 min) | |
| 96 | +| 20 steps | $2.00–3.00 | $6.00–9.00 | $1.33 (4 min) | |
| 97 | +| 30 steps | $4.00–6.00 | $12.00–18.00 | $2.00 (6 min) | |
| 98 | +| 50 steps | $8.00–12.00 | $24.00–36.00 | $3.33 (10 min) | |
| 99 | + |
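| 100 | +**Crossover point**: The agent is cost-competitive with a $20/hr human only for short tasks (about 5–10 steps) that succeed on the first attempt, and even there only at the low end of its cost range ($0.30–0.60 vs. $0.50 at 5 steps). From 20 steps upward the human is cheaper. |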
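
The comparison behind the crossover can be made explicit. The sketch below takes the single-attempt ranges and human task times from the table and classifies each row; the verdict labels are illustrative, not terms used elsewhere in this analysis.

```python
HUMAN_HOURLY_RATE = 20.0  # USD per hour

# steps -> (agent low, agent high, human minutes), single attempt, from the table above
ROWS = {
    5: (0.30, 0.60, 1.5),
    10: (0.80, 1.20, 2.5),
    20: (2.00, 3.00, 4.0),
    30: (4.00, 6.00, 6.0),
    50: (8.00, 12.00, 10.0),
}


def human_cost(minutes: float) -> float:
    """Labour cost in USD for a task that takes `minutes`."""
    return HUMAN_HOURLY_RATE * minutes / 60


def verdict(agent_low: float, agent_high: float, human: float) -> str:
    if agent_high < human:
        return "agent cheaper"
    if agent_low > human:
        return "human cheaper"
    return "overlap"  # human cost falls inside the agent's estimated range


verdicts = {steps: verdict(lo, hi, human_cost(mins)) for steps, (lo, hi, mins) in ROWS.items()}
for steps, result in verdicts.items():
    print(f"{steps} steps: {result}")
```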
| 101 | + |
| 102 | +### At scale: 1,000 tasks/day |
| 103 | + |
| 104 | +| Metric | Claude agent (current) | Human workforce | |
| 105 | +|--------|----------------------|-----------------| |
| 106 | +| Cost per task (avg 15-step, 2 attempts) | $3.60 | $1.00 | |
| 107 | +| Daily cost | $3,600 | $1,000 | |
| 108 | +| Monthly cost | $108,000 | $30,000 | |
| 109 | +| Success rate | ~40–60% (est.) | ~95–99% | |
| 110 | +| Latency per task | 10–30 min | 2–5 min | |
| 111 | +| Availability | 24/7, instant scaling | Business hours, hiring lag | |
| 112 | + |
| 113 | +The API agent is **about 3.6x more expensive** than human workers at scale ($108,000 vs. $30,000 per month), with lower reliability. |
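
For reference, the monthly figures are the per-task costs scaled by volume. The sketch assumes a 30-day month, which is what the table's numbers imply.

```python
TASKS_PER_DAY = 1_000
DAYS_PER_MONTH = 30  # assumption implied by $3,600/day -> $108,000/month


def monthly_cost(cost_per_task: float) -> float:
    """Monthly spend in USD at a fixed daily task volume."""
    return cost_per_task * TASKS_PER_DAY * DAYS_PER_MONTH


agent_monthly = monthly_cost(3.60)  # avg 15-step task, 2 attempts
human_monthly = monthly_cost(1.00)
cost_ratio = agent_monthly / human_monthly

print(f"agent: ${agent_monthly:,.0f}/month, human: ${human_monthly:,.0f}/month, ratio: {cost_ratio:.1f}x")
```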
| 114 | + |
| 115 | +--- |
| 116 | + |
| 117 | +## 4. Observed Eval Results |
| 118 | + |
| 119 | +### Without controller (March 2, 2026) |
| 120 | + |
| 121 | +| Run | Steps | WAA Score | Behavior | |
| 122 | +|-----|-------|-----------|----------| |
| 123 | +| Zero-shot | 30/30 | 0% | Productive but unfocused; entered 10 formulas for 2 columns | |
| 124 | +| Demo-conditioned (rigid) | 16/30 | 0% | Confused by UI state mismatch; quit early | |
| 125 | +| Demo-conditioned (multi-level) | 11/30 | 0% | Followed plan precisely; quit early after 1 column | |
| 126 | + |
| 127 | +### With controller (March 3, 2026) |
| 128 | + |
| 129 | +| Metric | Value | |
| 130 | +|--------|-------| |
| 131 | +| Steps used | 25/30 | |
| 132 | +| Duration | ~28 minutes | |
| 133 | +| Steps verified by VLM | 7/13 | |
| 134 | +| Steps failed/skipped | 6/13 | |
| 135 | +| Retries triggered | 2 per failed step | |
| 136 | +| Replans triggered | 1 (right-click → "+" icon) | |
| 137 | +| WAA formal score | 0% (missing cells B3–B6, no % formatting) | |
| 138 | +| VLM goal assessment | "verified" at 90% confidence | |
| 139 | + |
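| 140 | +The controller prevented premature quitting (its main design goal) and demonstrated working retry/replan. The task was only partially completed: every architectural component functioned, but the WAA check found cells B3–B6 missing and no percentage formatting. Note that the VLM goal assessment reported "verified" at 90% confidence while the formal WAA score was 0%, so the verifier's goal-level check produced a false positive on this run. |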
| 141 | + |
| 142 | +--- |
| 143 | + |
| 144 | +## 5. Alternative Approaches |
| 145 | + |
| 146 | +### 5A. Fine-tuned 7B VLM (e.g., Qwen2.5-VL-7B) |
| 147 | + |
| 148 | +| Metric | Value | |
| 149 | +|--------|-------| |
| 150 | +| Inference cost per request | ~$0.000014 (A100 @ $1/hr, ~20 req/s) | |
| 151 | +| Cost per 25-step task | ~$0.00035 | |
| 152 | +| Cost reduction vs Claude | **~8,000x** | |
| 153 | +| Training data needed | 500–1,000 successful trajectories | |
| 154 | +| Training data cost | $3,000–14,000 (at $6–14/trajectory via Claude) | |
| 155 | + |
| 156 | +Reference: ShowUI-Aloha achieves 60.1% on OSWorld with a 2B model using the {Think, Action, Expect} format. |
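
The inference figures in the table follow from the GPU price and throughput. The sketch below assumes one model request per step and a fully utilised GPU; both are simplifications, so real costs would be somewhat higher.

```python
GPU_DOLLARS_PER_HOUR = 1.00   # A100 rental price from the table
REQUESTS_PER_SECOND = 20      # throughput estimate from the table
STEPS_PER_TASK = 25           # assumes one request per step
CLAUDE_COST_PER_TASK = 2.82   # single-attempt cost from Section 2C

cost_per_request = GPU_DOLLARS_PER_HOUR / 3600 / REQUESTS_PER_SECOND
cost_per_task = cost_per_request * STEPS_PER_TASK
reduction_factor = CLAUDE_COST_PER_TASK / cost_per_task

# Training data: 500–1,000 trajectories at $6–14 each.
training_data_cost = (500 * 6, 1_000 * 14)

print(f"per request: ${cost_per_request:.6f}")
print(f"per task: ${cost_per_task:.5f}")
print(f"reduction vs Claude: ~{reduction_factor:,.0f}x")
print(f"training data: ${training_data_cost[0]:,}–${training_data_cost[1]:,}")
```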
| 157 | + |
| 158 | +### 5B. RL-trained model (verl-agent / GiGPO) |
| 159 | + |
| 160 | +| Metric | Value | |
| 161 | +|--------|-------| |
| 162 | +| Training cost (VM + GPU) | $3,000–5,000 one-time | |
| 163 | +| Inference cost | Same as fine-tuned VLM (~$0.00035/task) | |
| 164 | +| Key advantage | Learns from failures; per-step credit via GiGPO | |
| 165 | + |
| 166 | +### 5C. Hybrid architecture (recommended) |
| 167 | + |
| 168 | +| Tier | Role | Model | Cost/task | |
| 169 | +|------|------|-------|-----------| |
| 170 | +| 1. Planning | Generate plans from demos (cached, amortized) | Claude Sonnet | $0.005 | |
| 171 | +| 2. Execution | Step-by-step action selection | Fine-tuned 7B | $0.0004 | |
| 172 | +| 3. Verification | Screenshot-based step checking | GPT-4.1-mini | $0.006 | |
| 173 | +| 4. Recovery | Replan on failure (20% of tasks) | Claude Sonnet | $0.04 | |
| 174 | +| **Total** | | | **~$0.05** | |
| 175 | + |
| 176 | +At 1,000 tasks/day: **$50/day = $1,500/month** (vs. $108K for pure Claude, vs. $30K for humans). |
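
The hybrid total is a straight sum of the tier costs. The sketch assumes each tier's figure is already amortised per task (for example, that the $0.04 recovery cost already reflects the 20% of tasks that need it), since the table adds them directly.

```python
# Per-task cost of each tier in USD, from the table above (assumed already amortised).
TIER_COSTS = {
    "planning": 0.005,
    "execution": 0.0004,
    "verification": 0.006,
    "recovery": 0.04,
}

TASKS_PER_DAY = 1_000
DAYS_PER_MONTH = 30

hybrid_per_task = sum(TIER_COSTS.values())
hybrid_monthly = hybrid_per_task * TASKS_PER_DAY * DAYS_PER_MONTH

claude_monthly = 108_000  # pure Claude agent, from Section 3
human_monthly = 30_000    # human workforce, from Section 3

print(f"hybrid: ${hybrid_per_task:.4f}/task, ${hybrid_monthly:,.0f}/month")
print(f"vs pure Claude: {claude_monthly / hybrid_monthly:.0f}x cheaper")
print(f"vs humans: {human_monthly / hybrid_monthly:.0f}x cheaper")
```

The unrounded sum is about $0.051 per task, which the table rounds to ~$0.05.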
| 177 | + |
| 178 | +--- |
| 179 | + |
| 180 | +## 6. Strategic Phasing |
| 181 | + |
| 182 | +### Phase 1: Loop as product (now → 6 months) |
| 183 | + |
| 184 | +Target high-value enterprise tasks where the human alternative costs $25+/task (30+ minute tasks, after-hours automation, compliance workflows). At $3–14/task, this is a 2–8x savings. |
| 185 | + |
| 186 | +This phase generates both **revenue** and **training data**. |
| 187 | + |
| 188 | +### Phase 2: Hybrid (6–18 months) |
| 189 | + |
| 190 | +Use collected trajectories to train execution models. Deploy tiered architecture (Section 5C). Drop per-task cost to ~$0.05. Competitive moat: trained model + demo library + verification pipeline. |
| 191 | + |
| 192 | +### Phase 3: Trained model as product (18+ months) |
| 193 | + |
| 194 | +Claude used only for cold-start on new task types. Per-task cost approaches hardware-only (~$0.001). Moat: accumulated training data + task-specific weights. |
| 195 | + |
| 196 | +### The flywheel |
| 197 | + |
| 198 | +``` |
| 199 | +Claude agent attempts task (expensive, generates data) |
| 200 | + → VLM verifier labels each step (cheap) |
| 201 | + → Successful trajectories → training data |
| 202 | + → Fine-tune / RL-train smaller model |
| 203 | + → Smaller model handles easy tasks (~free) |
| 204 | + → Claude handles only hard/novel tasks |
| 205 | + → More successes → more training data |
| 206 | + → Smaller model handles more tasks |
| 207 | + → Claude needed less and less |
| 208 | +``` |
| 209 | + |
| 210 | +--- |
| 211 | + |
| 212 | +## 7. Immediate Optimizations |
| 213 | + |
| 214 | +| Optimization | Impact | Effort | |
| 215 | +|-------------|--------|--------| |
| 216 | +| **Prompt caching** (Anthropic) | –30–50% on Claude costs | Low (add cache breakpoints) | |
| 217 | +| **Conversation truncation** (keep last 3–5 screenshots, summarize earlier) | –50–60% on long tasks | Medium | |
| 218 | +| **Switch verifier to GPT-4.1-nano** ($0.02/$0.15) | –95% on verifier costs (already negligible) | Trivial | |
| 219 | +| **Log all (screenshot, action, verification) tuples** | Future training data | Low | |
| 220 | +| **Token usage logging** per API call | Measure actual vs estimated costs | Low | |
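
As an illustration of the "add cache breakpoints" item, the sketch below marks the system prompt and the newest message block with Anthropic's `cache_control` field. The helper name `add_cache_breakpoints` and the message layout are hypothetical; it only builds the request fragments and does not call the API or reflect how `ClaudeComputerUseAgent` assembles its requests.

```python
import copy
from typing import Any, Dict, List

EPHEMERAL = {"type": "ephemeral"}  # cache_control value used by Anthropic prompt caching


def add_cache_breakpoints(system_text: str, messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return `system` and `messages` request fields with two cache breakpoints set.

    One breakpoint goes on the system prompt, one on the last content block of the
    newest message, so the prefix up to that block can be read from cache next call.
    """
    system_blocks = [{"type": "text", "text": system_text, "cache_control": dict(EPHEMERAL)}]
    marked = copy.deepcopy(messages)  # leave the caller's history untouched
    last = marked[-1]
    if isinstance(last["content"], str):
        # Normalise plain-string content into block form so it can carry cache_control.
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = dict(EPHEMERAL)
    return {"system": system_blocks, "messages": marked}
```

Because the input history is copied and never marked in place, calling the helper on every step moves the breakpoint forward instead of accumulating markers.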
| 221 | + |
| 222 | +Conversation truncation is the single highest-impact optimization. Step 25 currently sends ~70K input tokens; keeping only the last 5 screenshots would reduce it to ~15K, cutting total Claude cost by ~60%. |
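
The ~60% figure can be sanity-checked, and the truncation itself sketched, as follows. Both parts rest on assumptions: the cost model reuses a linear per-call growth fitted to the Section 2A estimates and caps each call at ~15K tokens, and `keep_last_screenshots` is a hypothetical helper that assumes screenshots appear as top-level `image` content blocks and swaps older ones for a text stub (a simpler stand-in for summarising them).

```python
import copy
from typing import Any, Dict, List

FIRST_CALL_TOKENS = 2_500
GROWTH_PER_STEP = (70_000 - 2_500) / 24  # assumed linear growth, as in Section 2A


def total_tokens(n_steps: int, cap: float = float("inf")) -> float:
    """Total input tokens over all calls, with each call's context capped at `cap`."""
    return sum(min(FIRST_CALL_TOKENS + GROWTH_PER_STEP * (k - 1), cap) for k in range(1, n_steps + 1))


def keep_last_screenshots(messages: List[Dict[str, Any]], keep: int = 5) -> List[Dict[str, Any]]:
    """Return a copy of `messages` where all but the newest `keep` images become text stubs."""
    trimmed = copy.deepcopy(messages)
    seen = 0
    for message in reversed(trimmed):  # walk from newest to oldest
        content = message["content"]
        if isinstance(content, str):
            continue
        for i in range(len(content) - 1, -1, -1):
            if content[i].get("type") == "image":
                seen += 1
                if seen > keep:
                    content[i] = {"type": "text", "text": "[earlier screenshot omitted]"}
    return trimmed


full = total_tokens(25)
truncated = total_tokens(25, cap=15_000)
saving = 1 - truncated / full
print(f"full: {full:,.0f} tokens, truncated: {truncated:,.0f} tokens, saving: {saving:.0%}")
```

Under these assumptions the saving comes out slightly above 60%, in line with the estimate above.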
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +## 8. Summary |
| 227 | + |
| 228 | +| Approach | Cost/task | Latency | Success rate | Moat | Timeline | |
| 229 | +|----------|-----------|---------|-------------|------|----------| |
| 230 | +| Claude closed-loop (current) | $2.82–14.10 | 10–30 min | ~40–60% | None | Now | |
| 231 | +| + caching + truncation | $1.00–5.00 | 8–20 min | ~40–60% | Low | Weeks | |
| 232 | +| + fine-tuned 7B execution | ~$0.05 | 3–8 min | ~50–70% | Medium | 6 months | |
| 233 | +| + RL-trained model | $0.005–0.05 | 2–5 min | ~60–80% | High | 12 months | |
| 234 | +| Human worker | $1.00–2.50 | 3–5 min | ~95–99% | None | Always | |
| 235 | + |
| 236 | +**Bottom line**: The closed-loop LLM agent is viable today only for high-value tasks where the human alternative costs $25+/task. For general-purpose desktop automation at scale, the economics require a transition to trained smaller models. The demo-conditioned controller + VLM verifier architecture is the right foundation for this data-collection flywheel. |