
[Discussion][vscode] Cost estimation accuracy — linear token counting vs. cumulative context & prompt caching #604

@tianzheng-zhou


Summary

The current cost estimation calculates token counts linearly (each message counted once) and applies full public API prices without considering prompt caching. After analyzing the codebase and the underlying mechanics of multi-turn LLM conversations, I'd like to discuss the accuracy implications and potential improvements.

How token estimation currently works

The extension uses two parallel token-counting systems, plus a cost calculation built on top of them:

  1. Estimated tokens — Character-ratio-based estimation (tokenEstimation.ts), counting each user message and assistant response once:

    Turn 1: estimate(user1) + estimate(resp1)
    Turn 2: estimate(user2) + estimate(resp2)
    Total = simple sum (linear growth)
    
  2. Actual tokens — From API usage fields when available (request.result.promptTokens), which include the full cumulative context and naturally reflect quadratic growth.

  3. Cost estimation — calculateEstimatedCost() uses inputCostPerMillion / outputCostPerMillion from modelPricing.json, with no cache-aware pricing.
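As a rough sketch of how such character-ratio estimation typically works — assuming the common ~4 characters/token heuristic; the actual ratio and logic in tokenEstimation.ts may differ:

```typescript
// Hypothetical sketch of character-ratio token estimation.
// CHARS_PER_TOKEN = 4 is a common rough heuristic for English text,
// not necessarily the value the extension uses.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Linear summation: each message is counted exactly once,
// which is what produces the "linear growth" total described above.
function estimateConversation(messages: string[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m), 0);
}
```

The key property for this discussion is the second function: no message is ever counted twice, regardless of how many turns follow it.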

The gap: what the estimate misses

In reality, each API call re-sends the entire conversation history as input:

Turn 1 input: [System Prompt] + [Tools] + [user1]
Turn 2 input: [System Prompt] + [Tools] + [user1] + [resp1] + [user2]
Turn N input: [System Prompt] + [Tools] + [all previous turns] + [userN]

This means:

  • Per-turn input tokens grow linearly with the turn number → total input consumption grows O(n²)
  • The linear estimate misses: System Prompt (~10-20K tokens), Tool definitions (~2-5K tokens), and cumulative history re-sending
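The gap between the two counting schemes can be put in numbers with a standalone sketch (my own illustration, not the extension's code; systemOverhead stands in for the System Prompt + Tools prefix):

```typescript
// Linear scheme: each turn's user message counted once.
function linearInputTokens(turns: number, userTokens: number): number {
  return turns * userTokens;
}

// Cumulative scheme: turn n re-sends the system prompt, all prior
// user/assistant pairs, and the new user message — so the per-turn
// input grows with n and the total grows quadratically.
function cumulativeInputTokens(
  turns: number,
  userTokens: number,
  respTokens: number,
  systemOverhead: number,
): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    total += systemOverhead + (n - 1) * (userTokens + respTokens) + userTokens;
  }
  return total;
}
```

With the 10-turn, 500 + 500 token example used later in this post and zero system overhead, the linear count is 5,000 input tokens while the cumulative count is 50,000 — a 10x gap before any caching is considered.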

But prompt caching changes the economics significantly

Modern LLM APIs cache repeated prefixes:

| Provider  | Mechanism                  | Cache hit discount | TTL                      |
|-----------|----------------------------|--------------------|--------------------------|
| Anthropic | Explicit (`cache_control`) | 90% off            | 5 min (refreshed on hit) |
| OpenAI    | Automatic prefix matching  | 50% off            | ~5-10 min                |
| Google    | Context Caching API        | 75% off            | Custom                   |

In a multi-turn conversation:

  • System Prompt + Tools: Almost always cached (especially on high-traffic platforms like GitHub Copilot where millions of requests share the same prefix)
  • Previous conversation history: Cached if the gap between turns is < TTL
  • Only the new content (current user message + previous response) is uncached
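Under these assumptions, the uncached portion per turn can be sketched as follows (my own illustration, assuming ideal cache hits on everything except the newest content):

```typescript
// Per-turn uncached input under ideal prefix caching:
// system prompt, tools, and all earlier turns are cache hits;
// only the previous assistant response (now part of the prefix for the
// first time) and the new user message miss the cache.
function uncachedInputTokens(turn: number, userTokens: number, respTokens: number): number {
  return turn === 1 ? userTokens : respTokens + userTokens;
}

function totalUncachedInput(turns: number, userTokens: number, respTokens: number): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    total += uncachedInputTokens(n, userTokens, respTokens);
  }
  return total;
}
```

For the 10-turn, 500 + 500 token example, this gives 9,500 uncached input tokens versus the 5,000 the linear estimate counts — the same order of magnitude, which is why the linear estimate lands closer to the cached cost than to the uncached one.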

This means the linear estimate accidentally approximates the uncached portion — the part that matters most for actual cost.

Quantitative analysis

Using Claude Sonnet 4 ($3/M input, $0.3/M cached, $15/M output), 10-turn conversation, with each turn ~500 input + 500 output tokens:

| Scenario                                               | Input Cost | Output Cost | Total  | vs. Linear Estimate |
|--------------------------------------------------------|------------|-------------|--------|---------------------|
| Linear estimate (current plugin)                       | $0.015     | $0.075      | $0.090 | 1.0x                |
| Actual cost, no caching                                | $0.143     | $0.075      | $0.218 | 2.4x                |
| Actual cost, with caching (System Prompt always warm)  | $0.041     | $0.075      | $0.116 | 1.3x                |

The linear estimate is ~1.3x off from the real cached cost — surprisingly close! The main source of error is the previous turn's response becoming new (uncached) input in the next turn.
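The linear-estimate row is easy to verify by hand; a toy cost helper (my own sketch, not the extension's calculateEstimatedCost) shows the arithmetic:

```typescript
// Toy cost arithmetic for the linear-estimate row above.
// Rates are Claude Sonnet 4 public prices: $3/M input, $15/M output.
function cost(tokens: number, ratePerMillion: number): number {
  return (tokens / 1_000_000) * ratePerMillion;
}

// 10 turns × 500 tokens each, counted once (linear estimate).
const linearInputCost = cost(10 * 500, 3);   // ≈ $0.015
const linearOutputCost = cost(10 * 500, 15); // ≈ $0.075
```

The other two rows additionally depend on the assumed System Prompt + Tools overhead and on which prefix segments hit the cache, so they are harder to reproduce from the per-turn numbers alone.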

Note on platform-level caching: For platforms like GitHub Copilot, the System Prompt KV cache is likely always warm due to high request volume across millions of users. This is not API-level prompt caching with a 5-minute TTL — it's infrastructure-level prefix sharing in the inference engine (vLLM, TensorRT-LLM, etc.). The system prompt's marginal cost per request approaches zero.

Additionally, GitHub likely negotiates enterprise contracts with model providers (Anthropic, OpenAI, Google) with custom pricing, extended cache TTLs, and dedicated inference infrastructure — significantly different from public API rates. The public prices used in modelPricing.json serve as a useful reference benchmark, though the actual platform costs may be 3-10x lower.

What the extension already does well for Claude Desktop/Code

I noticed that claudedesktop.ts and claudecode.ts already read Anthropic's cache token fields:

// claudedesktop.ts:189-191
const inputTokens = (usage.input_tokens || 0)
    + (usage.cache_creation_input_tokens || 0)
    + (usage.cache_read_input_tokens || 0);

However, these are summed into a single inputTokens value rather than tracked separately, so the cost calculation can't distinguish cached vs. uncached tokens.

Suggestions for discussion

I'd love to hear the maintainer's thoughts on these potential improvements:

1. Track cached tokens separately in ModelUsage

export interface ModelUsage {
    [modelName: string]: {
        inputTokens: number;
        outputTokens: number;
        cachedReadTokens?: number;      // new
        cacheCreationTokens?: number;   // new
    };
}

This would allow the cost calculation to apply different rates to cached vs. uncached tokens.
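Building on that shape, the cost calculation could then price each bucket separately. This is a sketch under the assumption that inputTokens holds only uncached input once the cache fields are split out; all names here mirror suggestions 1 and 2 but are illustrative, not the extension's actual API:

```typescript
// Pricing shape from suggestion 2 (modelPricing.json), illustrative.
interface CacheAwarePricing {
  inputCostPerMillion: number;
  outputCostPerMillion: number;
  cachedInputCostPerMillion?: number;
  cacheCreationCostPerMillion?: number;
}

// Per-model entry from the extended ModelUsage in suggestion 1.
interface UsageEntry {
  inputTokens: number; // assumed: uncached input only
  outputTokens: number;
  cachedReadTokens?: number;
  cacheCreationTokens?: number;
}

function calculateCacheAwareCost(usage: UsageEntry, p: CacheAwarePricing): number {
  const M = 1_000_000;
  return (
    (usage.inputTokens / M) * p.inputCostPerMillion +
    (usage.outputTokens / M) * p.outputCostPerMillion +
    // Fall back to the full input rate when cache pricing is absent,
    // which degrades gracefully to today's behavior.
    ((usage.cachedReadTokens ?? 0) / M) * (p.cachedInputCostPerMillion ?? p.inputCostPerMillion) +
    ((usage.cacheCreationTokens ?? 0) / M) * (p.cacheCreationCostPerMillion ?? p.inputCostPerMillion)
  );
}
```

The fallback to inputCostPerMillion means models without cache pricing in modelPricing.json would be costed exactly as they are today.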

2. Add cache-aware pricing to modelPricing.json

{
    "claude-sonnet-4": {
        "inputCostPerMillion": 3.0,
        "outputCostPerMillion": 15.0,
        "cachedInputCostPerMillion": 0.3,
        "cacheCreationCostPerMillion": 3.75
    },
    "gpt-4o": {
        "inputCostPerMillion": 2.5,
        "outputCostPerMillion": 10.0,
        "cachedInputCostPerMillion": 1.25
    }
}

3. Consider simulating cumulative context (optional, advanced)

For sessions without actualTokens data, the extension could reconstruct each turn's full input size from the session log:

Turn N estimated input = configurable_system_overhead 
    + sum(all previous user messages + responses)
    + current user message

This would give a much more realistic token count, though it trades simplicity for accuracy.
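A sketch of that reconstruction, where the Turn shape, estimateTokens heuristic, and overhead parameter are all illustrative assumptions rather than the extension's actual session-log API:

```typescript
// Illustrative session-log shape; the real log format will differ.
interface Turn {
  userText: string;
  responseText: string;
}

// Rough chars-per-token heuristic, standing in for tokenEstimation.ts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Sum each turn's *full* input: system overhead + all prior history
// + the new user message. This reproduces the cumulative (quadratic)
// growth pattern instead of the current linear sum.
function estimateCumulativeInput(turns: Turn[], systemOverhead: number): number {
  let total = 0;
  let history = 0; // tokens of all prior user messages + responses
  for (const t of turns) {
    const userTok = estimateTokens(t.userText);
    total += systemOverhead + history + userTok;
    history += userTok + estimateTokens(t.responseText);
  }
  return total;
}
```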

4. Label the cost estimate more clearly

If implementing cache-aware pricing isn't prioritized, perhaps the cost label could clarify what it represents:

  • "Estimated cost (public API rates, no caching)"
  • vs. current unlabeled "Estimated Cost"

This sets the right expectation for users.

Additional context

  • The promptTokenDetails breakdown (System vs. User context percentage) in the log viewer is excellent — it already shows that System + Tools can be 50-70% of prompt tokens. This data could potentially be leveraged to improve cost estimates.
  • The existing actualTokens path (from API usage responses) remains the gold standard — these improvements would mainly benefit sessions where that data is unavailable.
  • For Copilot users specifically, the estimate serves more as a "what would this cost at public API rates" reference value, which is still very useful for understanding usage patterns and relative costs between sessions.
