
Commit f59f9d0

Fix reasoning token cost normalization
1 parent 5b736ce · commit f59f9d0

15 files changed
Lines changed: 247 additions & 17 deletions

docs/_advanced/models.md

Lines changed: 3 additions & 2 deletions

@@ -225,10 +225,11 @@ puts cost.input
 puts cost.output
 puts cost.cache_read
 puts cost.cache_write
+puts cost.thinking
 puts cost.total
 ```

-Costs use RubyLLM's normalized token buckets: standard input, output, cache read, and cache write. See [Tracking Token Usage]({% link _core_features/chat.md %}#tracking-token-usage) for the provider comparison table and what RubyLLM exposes consistently across providers.
+Costs use RubyLLM's normalized token buckets: standard input, billable output, cache read, cache write, and separately priced thinking when the model registry exposes a distinct reasoning-token price. See [Tracking Token Usage]({% link _core_features/chat.md %}#tracking-token-usage) for the provider comparison table and what RubyLLM exposes consistently across providers.

 Most applications use the shorter helpers on messages, chats, and agents:

@@ -245,7 +246,7 @@ cost = RubyLLM::Cost.aggregate(messages.map(&:cost))
 cost.total
 ```

-If pricing is incomplete for tokens that were used, the affected cost and `cost.total` return `nil`.
+If pricing is incomplete for tokens that were used, the affected cost and `cost.total` return `nil`. Cost helpers cover token-priced conversation usage; provider-specific add-ons such as search-query charges remain available in the provider's raw usage payload.

 ## Connecting to Custom Endpoints & Using Unlisted Models
 {: .d-inline-block }
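
The updated models.md prose pairs naturally with a usage sketch. A minimal example, assuming a configured provider key; the model id is illustrative:

```ruby
require 'ruby_llm'

chat = RubyLLM.chat(model: 'o4-mini') # model id is illustrative
chat.ask 'Summarize optimistic vs. pessimistic locking.'

cost = RubyLLM::Cost.aggregate(chat.messages.map(&:cost))
puts cost.output   # billable output bucket
puts cost.thinking # nil unless the registry prices reasoning tokens separately
puts cost.total    # nil if any used bucket lacks pricing
```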

docs/_advanced/rails.md

Lines changed: 2 additions & 1 deletion

@@ -549,11 +549,12 @@ Persisted chats and messages expose the same normalized token and cost helpers a
 message = chat_record.messages.last

 message.input_tokens # Standard input tokens
-message.output_tokens # Output tokens
+message.output_tokens # Billable output tokens
 message.cache_read_tokens # Prompt cache reads
 message.cache_write_tokens # Prompt cache writes

 message.cost.total
+message.cost.thinking # When the model has distinct reasoning-token pricing
 chat_record.cost.total
 ```
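
A hedged sketch of the persisted helpers in a Rails controller; the `Chat` model name assumes the documented `acts_as_chat` setup and the reporting shape is illustrative:

```ruby
chat_record = Chat.find(params[:id]) # assumes the standard acts_as_chat setup

report = chat_record.messages.map do |message|
  {
    role: message.role,
    output_tokens: message.output_tokens, # billable output bucket
    thinking_cost: message.cost.thinking, # nil without distinct reasoning pricing
    total_cost: message.cost.total        # nil when pricing is incomplete
  }
end

puts chat_record.cost.total
```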

docs/_advanced/upgrading.md

Lines changed: 3 additions & 1 deletion

@@ -40,7 +40,7 @@ Use the new cache names in new code:

 ```ruby
 response.input_tokens # Standard input tokens
-response.output_tokens # Output tokens
+response.output_tokens # Billable output tokens
 response.cache_read_tokens # Tokens served from prompt cache
 response.cache_write_tokens # Tokens written to prompt cache

@@ -79,6 +79,8 @@ agent.cost.total

 Cost helpers are available from 1.15 onward. They return `nil` for any cost bucket whose pricing is missing, and `cost.total` is also `nil` when a used bucket has incomplete pricing.

+`thinking_tokens` remains available from 1.10. From 1.15 onward, `output_tokens` is normalized as the billable output bucket. Do not add `thinking_tokens` to `output_tokens` yourself; RubyLLM includes thinking in output when the provider bills it as output, and exposes `cost.thinking` only for models with distinct reasoning-token pricing.
+
 See [Tracking Token Usage]({% link _core_features/chat.md %}#tracking-token-usage) for the provider comparison table and the exact normalized token semantics RubyLLM exposes.

 # Upgrade to 1.14
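
The new upgrading note is easy to get wrong in existing code, so a before/after sketch may help; `response` is assumed to come from an ordinary `chat.ask` call:

```ruby
# Before 1.15, some apps summed buckets by hand. From 1.15 this double-counts
# on providers that already bill thinking as output:
billable = response.output_tokens + response.thinking_tokens.to_i # don't do this

# From 1.15 onward, output_tokens is already the billable output bucket:
billable = response.output_tokens
```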

docs/_core_features/chat.md

Lines changed: 6 additions & 3 deletions

@@ -613,7 +613,7 @@ Understanding token usage is important for managing costs and staying within con
 response = chat.ask "Explain the Ruby Global Interpreter Lock (GIL)."

 input_tokens = response.input_tokens # Standard input tokens
-output_tokens = response.output_tokens # Output tokens
+output_tokens = response.output_tokens # Billable output tokens
 cache_read_tokens = response.cache_read_tokens # Tokens served from the provider's prompt cache - v1.15+
 cache_write_tokens = response.cache_write_tokens # Tokens written to cache - v1.15+
 thinking_tokens = response.thinking_tokens # Thinking tokens when providers report them - v1.10.0+
@@ -632,6 +632,7 @@ puts "Input Cost: $#{format('%.6f', response.cost.input)}" if response.cost.inpu
 puts "Output Cost: $#{format('%.6f', response.cost.output)}" if response.cost.output
 puts "Cache Read Cost: $#{format('%.6f', response.cost.cache_read)}" if response.cost.cache_read
 puts "Cache Write Cost: $#{format('%.6f', response.cost.cache_write)}" if response.cost.cache_write
+puts "Thinking Cost: $#{format('%.6f', response.cost.thinking)}" if response.cost.thinking
 puts "Total Cost: $#{format('%.6f', response.cost.total)}" if response.cost.total

 # Total tokens for the entire conversation so far
@@ -660,9 +661,11 @@ This means the same RubyLLM code works across providers: `input_tokens` for stan

 `cache_read_tokens` and `cache_write_tokens` are available from v1.15+ and are also exposed as `response.tokens.cache_read` and `response.tokens.cache_write`. The older `cached_tokens` and `cache_creation_tokens` methods remain available for compatibility with v1.9.0+ code.

-Thinking token usage is available via `response.thinking_tokens` and `response.tokens.thinking` when providers report it. For providers that do not include thinking token counts, these values remain `nil`.
+Thinking token usage is available via `response.thinking_tokens` and `response.tokens.thinking` when providers report it. For most providers, thinking/reasoning tokens are a breakdown of output work, not an extra bucket to add yourself. RubyLLM keeps `output_tokens` as the billable output bucket: OpenAI-style providers that include reasoning in completion tokens stay as-is, while OpenAI-compatible providers that report reasoning outside completion tokens are normalized so `output_tokens` includes the billable generated total.

-Cost helpers are available from v1.15+. RubyLLM uses token usage from the provider and pricing from the model registry. If the registry is missing pricing for tokens that were used, the affected cost and `cost.total` return `nil` instead of pretending the cost was zero.
+When a model has distinct reasoning-token pricing, `response.cost.thinking` prices that bucket separately. Otherwise, thinking tokens are treated as part of `response.cost.output` and `response.cost.thinking` stays `nil`.
+
+Cost helpers are available from v1.15+. RubyLLM uses token usage from the provider and pricing from the model registry. If the registry is missing pricing for tokens that were used, the affected cost and `cost.total` return `nil` instead of pretending the cost was zero. These helpers cover token-priced conversation usage; provider-specific add-ons such as search-query charges are left to the provider's raw usage payload.

 Refer to the [Working with Models Guide]({% link _advanced/models.md %}) for details on accessing model-specific pricing.
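
Given the new `cost.thinking` semantics, a small branch shows how callers can report either case; a sketch assuming `response` comes from `chat.ask`:

```ruby
cost = response.cost

if cost.thinking
  # The registry prices reasoning tokens separately for this model.
  puts "Reasoning: $#{format('%.6f', cost.thinking)}"
elsif cost.output
  # Thinking is billed as ordinary output; nothing extra to add.
  puts "Output (incl. thinking): $#{format('%.6f', cost.output)}"
end
```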

docs/_core_features/streaming.md

Lines changed: 1 addition & 1 deletion

@@ -58,7 +58,7 @@ Key attributes of a `Chunk`:
 * `chunk.model_id`: The model generating the response (usually present).
 * `chunk.tool_calls`: A hash containing partial or complete tool call information if the model is invoking a [Tool]({% link _core_features/tools.md %}). The arguments might be streamed incrementally.
 * `chunk.input_tokens`: Standard input tokens for the request (often `nil` until the final chunk). From v1.15 onward, cache reads and writes are exposed separately as `chunk.cache_read_tokens` and `chunk.cache_write_tokens` when providers report them.
-* `chunk.output_tokens`: Cumulative output tokens *up to this chunk* (behavior varies by provider, often only accurate in the final chunk).
+* `chunk.output_tokens`: Cumulative billable output tokens *up to this chunk* (behavior varies by provider, often only accurate in the final chunk). From v1.15 onward, this includes thinking/reasoning tokens when the provider bills them as output.
 * `chunk.thinking`: Optional thinking output when providers stream it.

 > Do not rely on token counts being present or accurate in every chunk. They are typically finalized only in the last chunk or the final returned message.
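
Since counts typically land only in the final chunk, a defensive streaming sketch (assuming a `chat` built with `RubyLLM.chat`) looks like:

```ruby
final_output_tokens = nil

chat.ask('Explain fibers in Ruby.') do |chunk|
  print chunk.content
  # Usually nil until the provider's final usage frame arrives.
  final_output_tokens = chunk.output_tokens if chunk.output_tokens
end

puts "\nBillable output tokens: #{final_output_tokens.inspect}"
```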

docs/_core_features/thinking.md

Lines changed: 2 additions & 0 deletions

@@ -96,6 +96,8 @@ response.thinking&.text
 response.thinking_tokens
 ```

+`thinking_tokens` is usually a breakdown of generated output work. From v1.15 onward, RubyLLM normalizes `output_tokens` as the billable output bucket, so you should not add `thinking_tokens` to `output_tokens` for cost calculations. When a model has distinct reasoning-token pricing, the cost is exposed separately as `response.cost.thinking`.
+
 ### Upgrading Existing Installations

 For 1.10 upgrades, consider using the [upgrade guide]({% link _advanced/upgrading.md %}#upgrade-to-1-10) to run the generator.
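
A worked example with illustrative numbers makes the no-double-counting rule concrete:

```ruby
# Illustrative numbers: 500 billable output tokens, 300 of them thinking.
output_tokens   = 500
thinking_tokens = 300
visible_tokens  = output_tokens - thinking_tokens # => 200

# With output at $8 per million tokens and no separate reasoning price,
# all 500 output tokens bill at the output rate:
cost = output_tokens * 8.0 / 1_000_000.0 # => 0.004
```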

lib/ruby_llm/cost.rb

Lines changed: 21 additions & 1 deletion

@@ -3,7 +3,7 @@
 module RubyLLM
   # Represents the cost of token usage for a model response.
   class Cost
-    COMPONENTS = %i[input output cache_read cache_write].freeze
+    COMPONENTS = %i[input output cache_read cache_write thinking].freeze
     PER_MILLION = 1_000_000.0

     attr_reader :tokens, :model
@@ -47,6 +47,12 @@ def cache_write
       amount_for(:cache_write)
     end

+    def thinking
+      amount_for(:thinking)
+    end
+
+    alias reasoning thinking
+
     alias cached_input cache_read
     alias cache_creation cache_write

@@ -66,6 +72,7 @@ def to_h
         output: output,
         cache_read: cache_read,
         cache_write: cache_write,
+        thinking: thinking,
         total: total
       }.compact
     end
@@ -78,6 +85,7 @@ def tokens?

     def missing?(component)
       return @missing.include?(component) if aggregate?
+      return false if component == :thinking && !thinking_priced_separately?

       tokens = tokens_for(component)
       tokens.to_i.positive? && price_for(component).nil?
@@ -121,6 +129,8 @@ def tokens_for(component)
         tokens.cache_read
       when :cache_write
         tokens.cache_write
+      when :thinking
+        tokens.thinking if thinking_priced_separately?
       end
     end

@@ -134,13 +144,23 @@ def price_for(component)
         text_pricing.cache_read_input
       when :cache_write
         text_pricing.cache_write_input
+      when :thinking
+        text_pricing.reasoning_output
       end
     end

     def text_pricing
       model&.pricing&.text_tokens || RubyLLM::Model::PricingCategory.new
     end

+    def thinking_priced_separately?
+      reasoning_price = text_pricing.reasoning_output
+      return false unless reasoning_price
+
+      output_price = text_pricing.output
+      output_price.nil? || reasoning_price != output_price
+    end
+
     def normalize_model(model)
       return RubyLLM.models.find(model.to_s) if model.is_a?(String) || model.is_a?(Symbol)
       return model.to_llm if model.respond_to?(:to_llm)
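
The `thinking_priced_separately?` predicate drives all three thinking hooks above. Its decision table, restated as a standalone sketch (the real method reads prices from the registry via `text_pricing` rather than taking arguments):

```ruby
def thinking_priced_separately?(reasoning_price, output_price)
  return false unless reasoning_price

  output_price.nil? || reasoning_price != output_price
end

thinking_priced_separately?(nil,  8.0) # => false: no reasoning price in the registry
thinking_priced_separately?(8.0,  8.0) # => false: same rate, thinking billed as output
thinking_priced_separately?(24.0, 8.0) # => true:  distinct reasoning rate
thinking_priced_separately?(24.0, nil) # => true:  only the reasoning rate is known
```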

lib/ruby_llm/model/pricing_category.rb

Lines changed: 4 additions & 0 deletions

@@ -27,6 +27,10 @@ def cache_write_input
       standard&.cache_write_input_per_million || standard&.cache_creation_input_per_million
     end

+    def reasoning_output
+      standard&.reasoning_output_per_million
+    end
+
     alias cached_input cache_read_input
     alias cache_creation_input cache_write_input
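
How the new per-million price becomes a bucket amount (mirroring `Cost::PER_MILLION`), with illustrative numbers:

```ruby
reasoning_output_per_million = 24.0 # illustrative registry price, USD per 1M tokens
thinking_tokens = 1_250

thinking_cost = thinking_tokens * reasoning_output_per_million / 1_000_000.0
# => 0.03
```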

lib/ruby_llm/providers/openai/chat.rb

Lines changed: 25 additions & 2 deletions

@@ -61,7 +61,7 @@ def parse_completion_response(response)
          return unless message_data

          usage = data['usage'] || {}
-         thinking_tokens = usage.dig('completion_tokens_details', 'reasoning_tokens')
+         thinking_tokens = thinking_tokens(usage)
          content, thinking_from_blocks = extract_content_and_thinking(message_data['content'])
          thinking_text = thinking_from_blocks || extract_thinking_text(message_data)
          thinking_signature = extract_thinking_signature(message_data)
@@ -72,7 +72,7 @@ def parse_completion_response(response)
            thinking: Thinking.build(text: thinking_text, signature: thinking_signature),
            tool_calls: parse_tool_calls(message_data['tool_calls']),
            input_tokens: input_tokens(usage),
-           output_tokens: usage['completion_tokens'],
+           output_tokens: output_tokens(usage),
            cached_tokens: cache_read_tokens(usage),
            cache_creation_tokens: cache_write_tokens(usage),
            thinking_tokens: thinking_tokens,
@@ -90,6 +90,25 @@ def input_tokens(usage)
          [prompt_tokens.to_i - cache_read_tokens(usage).to_i - cache_write_tokens(usage).to_i, 0].max
        end

+       def output_tokens(usage)
+         completion_tokens = usage['completion_tokens']
+         return unless completion_tokens
+
+         completion_tokens = completion_tokens.to_i
+         generated_tokens = generated_tokens_from_total(usage)
+         return completion_tokens unless generated_tokens && generated_tokens > completion_tokens
+
+         generated_tokens
+       end
+
+       def generated_tokens_from_total(usage)
+         prompt_tokens = usage['prompt_tokens']
+         total_tokens = usage['total_tokens']
+         return unless prompt_tokens && total_tokens
+
+         [total_tokens.to_i - prompt_tokens.to_i, 0].max
+       end
+
        def cache_read_tokens(usage)
          usage.dig('prompt_tokens_details', 'cached_tokens') || usage['prompt_cache_hit_tokens']
        end
@@ -98,6 +117,10 @@ def cache_write_tokens(usage)
          usage.dig('prompt_tokens_details', 'cache_write_tokens') || 0
        end

+       def thinking_tokens(usage)
+         usage.dig('completion_tokens_details', 'reasoning_tokens') || usage['reasoning_tokens']
+       end
+
        def format_messages(messages)
          messages.map do |msg|
            {
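
To see why `output_tokens` prefers `total_tokens - prompt_tokens` when it exceeds `completion_tokens`, here is the arithmetic on a usage payload shaped like an OpenAI-compatible provider that reports reasoning outside `completion_tokens` (numbers illustrative):

```ruby
usage = {
  'prompt_tokens'     => 100,
  'completion_tokens' => 200, # visible text only on this provider
  'total_tokens'      => 550, # prompt + visible + reasoning
  'completion_tokens_details' => { 'reasoning_tokens' => 250 }
}

generated = [usage['total_tokens'] - usage['prompt_tokens'], 0].max # => 450
# 450 > 200, so output_tokens returns the billable generated total:
output_tokens = generated > usage['completion_tokens'] ? generated : usage['completion_tokens']
# => 450, and thinking_tokens (250) is a breakdown of it, not an extra bucket
```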

lib/ruby_llm/providers/openai/streaming.rb

Lines changed: 2 additions & 2 deletions

@@ -27,10 +27,10 @@ def build_chunk(data)
            ),
            tool_calls: parse_tool_calls(delta['tool_calls'], parse_arguments: false),
            input_tokens: OpenAI::Chat.input_tokens(usage),
-           output_tokens: usage['completion_tokens'],
+           output_tokens: OpenAI::Chat.output_tokens(usage),
            cached_tokens: OpenAI::Chat.cache_read_tokens(usage),
            cache_creation_tokens: OpenAI::Chat.cache_write_tokens(usage),
-           thinking_tokens: usage.dig('completion_tokens_details', 'reasoning_tokens')
+           thinking_tokens: OpenAI::Chat.thinking_tokens(usage)
          )
        end
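
Because streaming now delegates to the same `OpenAI::Chat` helpers, a final streamed usage frame and a blocking response normalize identically. A sketch against an illustrative frame, assuming the helpers remain callable as module methods (as the existing `OpenAI::Chat.input_tokens` call suggests):

```ruby
usage = {
  'prompt_tokens' => 40, 'completion_tokens' => 90, 'total_tokens' => 130,
  'completion_tokens_details' => { 'reasoning_tokens' => 30 }
}

RubyLLM::Providers::OpenAI::Chat.output_tokens(usage)   # => 90 (total - prompt == completion)
RubyLLM::Providers::OpenAI::Chat.thinking_tokens(usage) # => 30
```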
