
[Discussion][vscode] Cost estimation accuracy — linear token counting vs. cumulative context & prompt caching #604

@tianzheng-zhou


Summary

The current cost estimation calculates token counts linearly (each message counted once) and applies full public API prices without considering prompt caching. After analyzing the codebase and the underlying mechanics of multi-turn LLM conversations, I'd like to discuss the accuracy implications and potential improvements.

How token estimation currently works

The extension uses two parallel token-counting systems, plus a cost calculation built on top of them:

  1. Estimated tokens — Character-ratio-based estimation (tokenEstimation.ts), counting each user message and assistant response once:

    Turn 1: estimate(user1) + estimate(resp1)
    Turn 2: estimate(user2) + estimate(resp2)
    Total = simple sum (linear growth)
    
  2. Actual tokens — From API usage fields when available (request.result.promptTokens), which include the full cumulative context and naturally reflect quadratic growth.

  3. Cost estimation — calculateEstimatedCost() uses inputCostPerMillion / outputCostPerMillion from modelPricing.json, with no cache-aware pricing.
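As a rough sketch of how such character-ratio estimation typically works — assuming the common ~4 characters/token heuristic; the actual ratio and logic in tokenEstimation.ts may differ:

```typescript
// Hypothetical sketch of character-ratio token estimation.
// CHARS_PER_TOKEN = 4 is a common rough heuristic for English text,
// not necessarily the value the extension uses.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Linear summation: each message is counted exactly once,
// which is what produces the "linear growth" total described above.
function estimateConversation(messages: string[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m), 0);
}
```

The key property for this discussion is the second function: no message is ever counted twice, regardless of how many turns follow it.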

The gap: what the estimate misses

In reality, each API call re-sends the entire conversation history as input:

Turn 1 input: [System Prompt] + [Tools] + [user1]
Turn 2 input: [System Prompt] + [Tools] + [user1] + [resp1] + [user2]
Turn N input: [System Prompt] + [Tools] + [all previous turns] + [userN]

This means:

  • Per-turn input tokens grow linearly with the turn number → total input consumption grows O(n²)
  • The linear estimate misses: System Prompt (~10-20K tokens), Tool definitions (~2-5K tokens), and cumulative history re-sending
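The gap between the two counting schemes can be put in numbers with a standalone sketch (my own illustration, not the extension's code; systemOverhead stands in for the System Prompt + Tools prefix):

```typescript
// Linear scheme: each turn's user message counted once.
function linearInputTokens(turns: number, userTokens: number): number {
  return turns * userTokens;
}

// Cumulative scheme: turn n re-sends the system prompt, all prior
// user/assistant pairs, and the new user message — so the per-turn
// input grows with n and the total grows quadratically.
function cumulativeInputTokens(
  turns: number,
  userTokens: number,
  respTokens: number,
  systemOverhead: number,
): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    total += systemOverhead + (n - 1) * (userTokens + respTokens) + userTokens;
  }
  return total;
}
```

With the 10-turn, 500 + 500 token example used later in this post and zero system overhead, the linear count is 5,000 input tokens while the cumulative count is 50,000 — a 10x gap before any caching is considered.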

But prompt caching changes the economics significantly

Modern LLM APIs cache repeated prefixes:

| Provider  | Mechanism                  | Cache hit discount | TTL                      |
|-----------|----------------------------|--------------------|--------------------------|
| Anthropic | Explicit (`cache_control`) | 90% off            | 5 min (refreshed on hit) |
| OpenAI    | Automatic prefix matching  | 50% off            | ~5-10 min                |
| Google    | Context Caching API        | 75% off            | Custom                   |

In a multi-turn conversation:

  • System Prompt + Tools: Almost always cached (especially on high-traffic platforms like GitHub Copilot where millions of requests share the same prefix)
  • Previous conversation history: Cached if the gap between turns is < TTL
  • Only the new content (current user message + previous response) is uncached
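Under these assumptions, the uncached portion per turn can be sketched as follows (my own illustration, assuming ideal cache hits on everything except the newest content):

```typescript
// Per-turn uncached input under ideal prefix caching:
// system prompt, tools, and all earlier turns are cache hits;
// only the previous assistant response (now part of the prefix for the
// first time) and the new user message miss the cache.
function uncachedInputTokens(turn: number, userTokens: number, respTokens: number): number {
  return turn === 1 ? userTokens : respTokens + userTokens;
}

function totalUncachedInput(turns: number, userTokens: number, respTokens: number): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    total += uncachedInputTokens(n, userTokens, respTokens);
  }
  return total;
}
```

For the 10-turn, 500 + 500 token example, this gives 9,500 uncached input tokens versus the 5,000 the linear estimate counts — the same order of magnitude, which is why the linear estimate lands closer to the cached cost than to the uncached one.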

This means the linear estimate accidentally approximates the uncached portion — the part that matters most for actual cost.

Quantitative analysis

Using Claude Sonnet 4 ($3/M input, $0.3/M cached, $15/M output), 10-turn conversation, with each turn ~500 input + 500 output tokens:

| Scenario                                               | Input Cost | Output Cost | Total  | vs. Linear Estimate |
|--------------------------------------------------------|------------|-------------|--------|---------------------|
| Linear estimate (current plugin)                       | $0.015     | $0.075      | $0.090 | 1.0x                |
| Actual cost, no caching                                | $0.143     | $0.075      | $0.218 | 2.4x                |
| Actual cost, with caching (System Prompt always warm)  | $0.041     | $0.075      | $0.116 | 1.3x                |

The linear estimate is ~1.3x off from the real cached cost — surprisingly close! The main source of error is the previous turn's response becoming new (uncached) input in the next turn.
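The linear-estimate row is easy to verify by hand; a toy cost helper (my own sketch, not the extension's calculateEstimatedCost) shows the arithmetic:

```typescript
// Toy cost arithmetic for the linear-estimate row above.
// Rates are Claude Sonnet 4 public prices: $3/M input, $15/M output.
function cost(tokens: number, ratePerMillion: number): number {
  return (tokens / 1_000_000) * ratePerMillion;
}

// 10 turns × 500 tokens each, counted once (linear estimate).
const linearInputCost = cost(10 * 500, 3);   // ≈ $0.015
const linearOutputCost = cost(10 * 500, 15); // ≈ $0.075
```

The other two rows additionally depend on the assumed System Prompt + Tools overhead and on which prefix segments hit the cache, so they are harder to reproduce from the per-turn numbers alone.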

Note on platform-level caching: For platforms like GitHub Copilot, the System Prompt KV cache is likely always warm due to high request volume across millions of users. This is not API-level prompt caching with a 5-minute TTL — it's infrastructure-level prefix sharing in the inference engine (vLLM, TensorRT-LLM, etc.). The system prompt's marginal cost per request approaches zero.

Additionally, GitHub likely negotiates enterprise contracts with model providers (Anthropic, OpenAI, Google) with custom pricing, extended cache TTLs, and dedicated inference infrastructure — significantly different from public API rates. The public prices used in modelPricing.json serve as a useful reference benchmark, though the actual platform costs may be 3-10x lower.

What the extension already does well for Claude Desktop/Code

I noticed that claudedesktop.ts and claudecode.ts already read Anthropic's cache token fields:

// claudedesktop.ts:189-191
const inputTokens = (usage.input_tokens || 0)
    + (usage.cache_creation_input_tokens || 0)
    + (usage.cache_read_input_tokens || 0);

However, these are summed into a single inputTokens value rather than tracked separately, so the cost calculation can't distinguish cached vs. uncached tokens.

Suggestions for discussion

I'd love to hear the maintainer's thoughts on these potential improvements:

1. Track cached tokens separately in ModelUsage

export interface ModelUsage {
    [modelName: string]: {
        inputTokens: number;
        outputTokens: number;
        cachedReadTokens?: number;      // new
        cacheCreationTokens?: number;   // new
    };
}

This would allow the cost calculation to apply different rates to cached vs. uncached tokens.
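Building on that shape, the cost calculation could then price each bucket separately. This is a sketch under the assumption that inputTokens holds only uncached input once the cache fields are split out; all names here mirror suggestions 1 and 2 but are illustrative, not the extension's actual API:

```typescript
// Pricing shape from suggestion 2 (modelPricing.json), illustrative.
interface CacheAwarePricing {
  inputCostPerMillion: number;
  outputCostPerMillion: number;
  cachedInputCostPerMillion?: number;
  cacheCreationCostPerMillion?: number;
}

// Per-model entry from the extended ModelUsage in suggestion 1.
interface UsageEntry {
  inputTokens: number; // assumed: uncached input only
  outputTokens: number;
  cachedReadTokens?: number;
  cacheCreationTokens?: number;
}

function calculateCacheAwareCost(usage: UsageEntry, p: CacheAwarePricing): number {
  const M = 1_000_000;
  return (
    (usage.inputTokens / M) * p.inputCostPerMillion +
    (usage.outputTokens / M) * p.outputCostPerMillion +
    // Fall back to the full input rate when cache pricing is absent,
    // which degrades gracefully to today's behavior.
    ((usage.cachedReadTokens ?? 0) / M) * (p.cachedInputCostPerMillion ?? p.inputCostPerMillion) +
    ((usage.cacheCreationTokens ?? 0) / M) * (p.cacheCreationCostPerMillion ?? p.inputCostPerMillion)
  );
}
```

The fallback to inputCostPerMillion means models without cache pricing in modelPricing.json would be costed exactly as they are today.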

2. Add cache-aware pricing to modelPricing.json

{
    "claude-sonnet-4": {
        "inputCostPerMillion": 3.0,
        "outputCostPerMillion": 15.0,
        "cachedInputCostPerMillion": 0.3,
        "cacheCreationCostPerMillion": 3.75
    },
    "gpt-4o": {
        "inputCostPerMillion": 2.5,
        "outputCostPerMillion": 10.0,
        "cachedInputCostPerMillion": 1.25
    }
}

3. Consider simulating cumulative context (optional, advanced)

For sessions without actualTokens data, the extension could reconstruct each turn's full input size from the session log:

Turn N estimated input = configurable_system_overhead 
    + sum(all previous user messages + responses)
    + current user message

This would give a much more realistic token count, though it trades simplicity for accuracy.
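A sketch of that reconstruction, where the Turn shape, estimateTokens heuristic, and overhead parameter are all illustrative assumptions rather than the extension's actual session-log API:

```typescript
// Illustrative session-log shape; the real log format will differ.
interface Turn {
  userText: string;
  responseText: string;
}

// Rough chars-per-token heuristic, standing in for tokenEstimation.ts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Sum each turn's *full* input: system overhead + all prior history
// + the new user message. This reproduces the cumulative (quadratic)
// growth pattern instead of the current linear sum.
function estimateCumulativeInput(turns: Turn[], systemOverhead: number): number {
  let total = 0;
  let history = 0; // tokens of all prior user messages + responses
  for (const t of turns) {
    const userTok = estimateTokens(t.userText);
    total += systemOverhead + history + userTok;
    history += userTok + estimateTokens(t.responseText);
  }
  return total;
}
```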

4. Label the cost estimate more clearly

If implementing cache-aware pricing isn't prioritized, perhaps the cost label could clarify what it represents:

  • "Estimated cost (public API rates, no caching)"
  • vs. current unlabeled "Estimated Cost"

This sets the right expectation for users.

Additional context

  • The promptTokenDetails breakdown (System vs. User context percentage) in the log viewer is excellent — it already shows that System + Tools can be 50-70% of prompt tokens. This data could potentially be leveraged to improve cost estimates.
  • The existing actualTokens path (from API usage responses) remains the gold standard — these improvements would mainly benefit sessions where that data is unavailable.
  • For Copilot users specifically, the estimate serves more as a "what would this cost at public API rates" reference value, which is still very useful for understanding usage patterns and relative costs between sessions.
