Summary
The current cost estimation calculates token counts linearly (each message counted once) and applies full public API prices without considering prompt caching. After analyzing the codebase and the underlying mechanics of multi-turn LLM conversations, I'd like to discuss the accuracy implications and potential improvements.
How token estimation currently works
The extension uses two parallel systems:
- **Estimated tokens** — character-ratio-based estimation (`tokenEstimation.ts`), counting each user message and assistant response once:

  ```
  Turn 1: estimate(user1) + estimate(resp1)
  Turn 2: estimate(user2) + estimate(resp2)
  Total  = simple sum (linear growth)
  ```

- **Actual tokens** — taken from API `usage` fields when available (`request.result.promptTokens`), which include the full cumulative context and therefore naturally reflect quadratic growth.

- **Cost estimation** — `calculateEstimatedCost()` applies `inputCostPerMillion` / `outputCostPerMillion` from `modelPricing.json`, with no cache-aware pricing.
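The character-ratio approach can be sketched as follows (the 4 chars/token ratio is my assumption for illustration, not necessarily the constant `tokenEstimation.ts` actually uses):

```typescript
// Hypothetical sketch of character-ratio token estimation. The ratio of
// ~4 characters per token is an assumed value, not the extension's constant.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Linear total: each message in the session is counted exactly once.
function estimateSessionTokens(messages: string[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m), 0);
}
```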
The gap: what the estimate misses
In reality, each API call re-sends the entire conversation history as input:
```
Turn 1 input: [System Prompt] + [Tools] + [user1]
Turn 2 input: [System Prompt] + [Tools] + [user1] + [resp1] + [user2]
Turn N input: [System Prompt] + [Tools] + [all previous turns] + [userN]
```
This means:
- Input tokens grow linearly per turn → total consumption grows O(n²)
- The linear estimate misses: System Prompt (~10-20K tokens), Tool definitions (~2-5K tokens), and cumulative history re-sending
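To make the gap concrete, here is a rough model using assumed round numbers (15K tokens of system overhead, 500-token messages), which shows per-turn input growing linearly and total input growing quadratically:

```typescript
// Illustrative model of cumulative input when each turn re-sends the full
// history. Overhead and message sizes are assumed round numbers.
const SYSTEM_OVERHEAD = 15_000; // assumed system prompt + tool definitions
const USER_TOKENS = 500;        // assumed tokens per user message
const RESP_TOKENS = 500;        // assumed tokens per assistant response

function turnInputTokens(n: number): number {
  // Turn n re-sends overhead + all previous turns + the new user message.
  return SYSTEM_OVERHEAD + (n - 1) * (USER_TOKENS + RESP_TOKENS) + USER_TOKENS;
}

function totalInputTokens(turns: number): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) total += turnInputTokens(n);
  return total;
}
```

Under these assumptions a 10-turn session actually sends 200,000 input tokens, while the linear estimate would count only 5,000.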
But prompt caching changes the economics significantly
Modern LLM APIs cache repeated prefixes:
| Provider  | Mechanism                  | Cache hit discount | TTL                      |
|-----------|----------------------------|--------------------|--------------------------|
| Anthropic | Explicit (`cache_control`) | 90% off            | 5 min (refreshed on hit) |
| OpenAI    | Automatic prefix matching  | 50% off            | ~5-10 min                |
| Google    | Context Caching API        | 75% off            | Custom                   |
In a multi-turn conversation:
- System Prompt + Tools: Almost always cached (especially on high-traffic platforms like GitHub Copilot where millions of requests share the same prefix)
- Previous conversation history: Cached if the gap between turns is < TTL
- Only the new content (current user message + previous response) is uncached
This means the linear estimate accidentally approximates the uncached portion — the part that matters most for actual cost.
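In cost terms, each turn's input then splits into a discounted cached prefix and a full-price uncached tail. A minimal sketch (the 0.9 discount matches Anthropic's published cache-read pricing; the split itself, i.e. what counts as warm vs. fresh, is an assumption):

```typescript
// Per-turn input cost with a cached prefix. Names are illustrative.
interface TurnInput {
  cachedPrefixTokens: number; // system prompt + tools + warm history
  uncachedTokens: number;     // current user message + previous response
}

function turnInputCost(
  turn: TurnInput,
  inputCostPerMillion: number,
  cacheDiscount: number // e.g. 0.9 for Anthropic cache reads
): number {
  const cached =
    (turn.cachedPrefixTokens / 1e6) * inputCostPerMillion * (1 - cacheDiscount);
  const fresh = (turn.uncachedTokens / 1e6) * inputCostPerMillion;
  return cached + fresh;
}
```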
Quantitative analysis
Using Claude Sonnet 4 ($3/M input, $0.3/M cached, $15/M output), 10-turn conversation, with each turn ~500 input + 500 output tokens:
| Scenario | Input Cost | Output Cost | Total | vs. Linear Estimate |
|----------|------------|-------------|-------|---------------------|
| Linear estimate (current plugin) | $0.015 | $0.075 | $0.090 | 1.0x |
| Actual cost, no caching | $0.150 | $0.075 | $0.225 | 2.5x |
| Actual cost, with caching (System Prompt always warm) | $0.041 | $0.075 | $0.116 | 1.3x |
The linear estimate is ~1.3x off from the real cached cost — surprisingly close! The main source of error is the previous turn's response becoming new (uncached) input in the next turn.
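The scenario arithmetic can be reproduced with a short script. The assumptions are mine: conversation content only (system prompt/tools excluded, since the warm system prefix is treated as near-free), cache reads at $0.3/M, and cache writes billed at the plain input rate; figures may differ slightly from the table depending on rounding and overhead assumptions.

```typescript
// Recomputing the three scenarios under explicit assumptions.
const TURNS = 10;
const USER = 500; // tokens per user message
const RESP = 500; // tokens per assistant response
const INPUT = 3 / 1e6, CACHED = 0.3 / 1e6, OUTPUT = 15 / 1e6;

// Linear estimate: every message counted exactly once.
const linear = TURNS * USER * INPUT + TURNS * RESP * OUTPUT;

// No caching: turn n re-sends all previous user/response pairs as input.
let noCacheInput = 0;
for (let n = 1; n <= TURNS; n++) noCacheInput += (n - 1) * (USER + RESP) + USER;
const noCache = noCacheInput * INPUT + TURNS * RESP * OUTPUT;

// With caching: history up to the previous user message is warm; the
// previous response and the current user message arrive uncached.
let cachedInputCost = 0;
for (let n = 1; n <= TURNS; n++) {
  const fresh = n === 1 ? USER : USER + RESP;
  const warm = n === 1 ? 0 : (n - 2) * (USER + RESP) + USER;
  cachedInputCost += fresh * INPUT + warm * CACHED;
}
const withCache = cachedInputCost + TURNS * RESP * OUTPUT;
// linear ≈ $0.090, noCache ≈ $0.225, withCache ≈ $0.116 (~1.3x linear)
```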
Note on platform-level caching: For platforms like GitHub Copilot, the System Prompt KV cache is likely always warm due to high request volume across millions of users. This is not API-level prompt caching with a 5-minute TTL — it's infrastructure-level prefix sharing in the inference engine (vLLM, TensorRT-LLM, etc.). The system prompt's marginal cost per request approaches zero.
Additionally, GitHub likely negotiates enterprise contracts with model providers (Anthropic, OpenAI, Google) with custom pricing, extended cache TTLs, and dedicated inference infrastructure — significantly different from public API rates. The public prices used in modelPricing.json serve as a useful reference benchmark, though the actual platform costs may be 3-10x lower.
What the extension already does well for Claude Desktop/Code
I noticed that `claudedesktop.ts` and `claudecode.ts` already read Anthropic's cache token fields:

```typescript
// claudedesktop.ts:189-191
const inputTokens = (usage.input_tokens || 0)
  + (usage.cache_creation_input_tokens || 0)
  + (usage.cache_read_input_tokens || 0);
```
However, these are summed into a single `inputTokens` value rather than tracked separately, so the cost calculation can't distinguish cached from uncached tokens.
Suggestions for discussion
I'd love to hear the maintainer's thoughts on these potential improvements:
1. Track cached tokens separately in `ModelUsage`

```typescript
export interface ModelUsage {
  [modelName: string]: {
    inputTokens: number;
    outputTokens: number;
    cachedReadTokens?: number;    // new
    cacheCreationTokens?: number; // new
  };
}
```
This would allow the cost calculation to apply different rates to cached vs. uncached tokens.
2. Add cache-aware pricing to `modelPricing.json`

```json
{
  "claude-sonnet-4": {
    "inputCostPerMillion": 3.0,
    "outputCostPerMillion": 15.0,
    "cachedInputCostPerMillion": 0.3,
    "cacheCreationCostPerMillion": 3.75
  },
  "gpt-4o": {
    "inputCostPerMillion": 2.5,
    "outputCostPerMillion": 10.0,
    "cachedInputCostPerMillion": 1.25
  }
}
```
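Combining the two proposals, the cost calculation could look roughly like this. `calculateCost` and the field names are illustrative, not the extension's actual API; missing cache rates fall back to the plain input rate:

```typescript
// Hypothetical cache-aware cost calculation.
interface ModelPricing {
  inputCostPerMillion: number;
  outputCostPerMillion: number;
  cachedInputCostPerMillion?: number;
  cacheCreationCostPerMillion?: number;
}

interface UsageEntry {
  inputTokens: number; // uncached input only
  outputTokens: number;
  cachedReadTokens?: number;
  cacheCreationTokens?: number;
}

function calculateCost(usage: UsageEntry, p: ModelPricing): number {
  const M = 1e6;
  return (
    (usage.inputTokens / M) * p.inputCostPerMillion +
    (usage.outputTokens / M) * p.outputCostPerMillion +
    ((usage.cachedReadTokens ?? 0) / M) *
      (p.cachedInputCostPerMillion ?? p.inputCostPerMillion) +
    ((usage.cacheCreationTokens ?? 0) / M) *
      (p.cacheCreationCostPerMillion ?? p.inputCostPerMillion)
  );
}
```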
3. Consider simulating cumulative context (optional, advanced)
For sessions without actualTokens data, the extension could reconstruct each turn's full input size from the session log:
```
Turn N estimated input = configurable_system_overhead
                       + sum(all previous user messages + responses)
                       + current user message
```
This would give a much more realistic token count, though it trades simplicity for accuracy.
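Assuming the session log already holds per-message text, the reconstruction could be sketched like this. All names are hypothetical; `estimateTokens` stands in for the extension's existing character-ratio estimator:

```typescript
// Sketch: reconstruct each turn's full input size from a session log.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

interface Turn {
  user: string;
  response: string;
}

function cumulativeEstimate(turns: Turn[], systemOverhead: number): number {
  let historyTokens = 0;
  let total = 0;
  for (const t of turns) {
    // Turn input = overhead + all prior history + the current user message.
    total += systemOverhead + historyTokens + estimateTokens(t.user);
    historyTokens += estimateTokens(t.user) + estimateTokens(t.response);
  }
  return total;
}
```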
4. Label the cost estimate more clearly
If implementing cache-aware pricing isn't prioritized, perhaps the cost label could clarify what it represents:
- "Estimated cost (public API rates, no caching)"
- vs. current unlabeled "Estimated Cost"
This sets the right expectation for users.
Additional context
- The `promptTokenDetails` breakdown (System vs. User context percentage) in the log viewer is excellent — it already shows that System + Tools can be 50-70% of prompt tokens. This data could potentially be leveraged to improve cost estimates.
- The existing `actualTokens` path (from API `usage` responses) remains the gold standard — these improvements would mainly benefit sessions where that data is unavailable.
- For Copilot users specifically, the estimate serves more as a "what would this cost at public API rates" reference value, which is still very useful for understanding usage patterns and relative costs between sessions.
References
- `tokenEstimation.ts`
- `calculateEstimatedCost()`
- `modelPricing.json`
- `claudedesktop.ts`
- `claudecode.ts`