
Commit f59f9d0

Fix reasoning token cost normalization
1 parent 5b736ce · commit f59f9d0

15 files changed
Lines changed: 247 additions & 17 deletions

docs/_advanced/models.md

Lines changed: 3 additions & 2 deletions

@@ -225,10 +225,11 @@ puts cost.input
 puts cost.output
 puts cost.cache_read
 puts cost.cache_write
+puts cost.thinking
 puts cost.total
 ```

-Costs use RubyLLM's normalized token buckets: standard input, output, cache read, and cache write. See [Tracking Token Usage]({% link _core_features/chat.md %}#tracking-token-usage) for the provider comparison table and what RubyLLM exposes consistently across providers.
+Costs use RubyLLM's normalized token buckets: standard input, billable output, cache read, cache write, and separately priced thinking when the model registry exposes a distinct reasoning-token price. See [Tracking Token Usage]({% link _core_features/chat.md %}#tracking-token-usage) for the provider comparison table and what RubyLLM exposes consistently across providers.

 Most applications use the shorter helpers on messages, chats, and agents:

@@ -245,7 +246,7 @@ cost = RubyLLM::Cost.aggregate(messages.map(&:cost))
 cost.total
 ```

-If pricing is incomplete for tokens that were used, the affected cost and `cost.total` return `nil`.
+If pricing is incomplete for tokens that were used, the affected cost and `cost.total` return `nil`. Cost helpers cover token-priced conversation usage; provider-specific add-ons such as search-query charges remain available in the provider's raw usage payload.

 ## Connecting to Custom Endpoints & Using Unlisted Models
 {: .d-inline-block }
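
The updated models.md prose pairs naturally with a usage sketch. A minimal example, assuming a configured provider key; the model id is illustrative:

```ruby
require 'ruby_llm'

chat = RubyLLM.chat(model: 'o4-mini') # model id is illustrative
chat.ask 'Summarize optimistic vs. pessimistic locking.'

cost = RubyLLM::Cost.aggregate(chat.messages.map(&:cost))
puts cost.output   # billable output bucket
puts cost.thinking # nil unless the registry prices reasoning tokens separately
puts cost.total    # nil if any used bucket lacks pricing
```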

docs/_advanced/rails.md

Lines changed: 2 additions & 1 deletion

@@ -549,11 +549,12 @@ Persisted chats and messages expose the same normalized token and cost helpers a
 message = chat_record.messages.last

 message.input_tokens # Standard input tokens
-message.output_tokens # Output tokens
+message.output_tokens # Billable output tokens
 message.cache_read_tokens # Prompt cache reads
 message.cache_write_tokens # Prompt cache writes

 message.cost.total
+message.cost.thinking # When the model has distinct reasoning-token pricing
 chat_record.cost.total
 ```
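
A hedged sketch of the persisted helpers in a Rails controller; the `Chat` model name assumes the documented `acts_as_chat` setup and the reporting shape is illustrative:

```ruby
chat_record = Chat.find(params[:id]) # assumes the standard acts_as_chat setup

report = chat_record.messages.map do |message|
  {
    role: message.role,
    output_tokens: message.output_tokens, # billable output bucket
    thinking_cost: message.cost.thinking, # nil without distinct reasoning pricing
    total_cost: message.cost.total        # nil when pricing is incomplete
  }
end

puts chat_record.cost.total
```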

docs/_advanced/upgrading.md

Lines changed: 3 additions & 1 deletion

@@ -40,7 +40,7 @@ Use the new cache names in new code:

 ```ruby
 response.input_tokens # Standard input tokens
-response.output_tokens # Output tokens
+response.output_tokens # Billable output tokens
 response.cache_read_tokens # Tokens served from prompt cache
 response.cache_write_tokens # Tokens written to prompt cache

@@ -79,6 +79,8 @@ agent.cost.total

 Cost helpers are available from 1.15 onward. They return `nil` for any cost bucket whose pricing is missing, and `cost.total` is also `nil` when a used bucket has incomplete pricing.

+`thinking_tokens` remains available from 1.10. From 1.15 onward, `output_tokens` is normalized as the billable output bucket. Do not add `thinking_tokens` to `output_tokens` yourself; RubyLLM includes thinking in output when the provider bills it as output, and exposes `cost.thinking` only for models with distinct reasoning-token pricing.
+
 See [Tracking Token Usage]({% link _core_features/chat.md %}#tracking-token-usage) for the provider comparison table and the exact normalized token semantics RubyLLM exposes.

 # Upgrade to 1.14
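
The new upgrading note is easy to get wrong in existing code, so a before/after sketch may help; `response` is assumed to come from an ordinary `chat.ask` call:

```ruby
# Before 1.15, some apps summed buckets by hand. From 1.15 this double-counts
# on providers that already bill thinking as output:
billable = response.output_tokens + response.thinking_tokens.to_i # don't do this

# From 1.15 onward, output_tokens is already the billable output bucket:
billable = response.output_tokens
```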

docs/_core_features/chat.md

Lines changed: 6 additions & 3 deletions

@@ -613,7 +613,7 @@ Understanding token usage is important for managing costs and staying within con
 response = chat.ask "Explain the Ruby Global Interpreter Lock (GIL)."

 input_tokens = response.input_tokens # Standard input tokens
-output_tokens = response.output_tokens # Output tokens
+output_tokens = response.output_tokens # Billable output tokens
 cache_read_tokens = response.cache_read_tokens # Tokens served from the provider's prompt cache - v1.15+
 cache_write_tokens = response.cache_write_tokens # Tokens written to cache - v1.15+
 thinking_tokens = response.thinking_tokens # Thinking tokens when providers report them - v1.10.0+
@@ -632,6 +632,7 @@ puts "Input Cost: $#{format('%.6f', response.cost.input)}" if response.cost.inpu
 puts "Output Cost: $#{format('%.6f', response.cost.output)}" if response.cost.output
 puts "Cache Read Cost: $#{format('%.6f', response.cost.cache_read)}" if response.cost.cache_read
 puts "Cache Write Cost: $#{format('%.6f', response.cost.cache_write)}" if response.cost.cache_write
+puts "Thinking Cost: $#{format('%.6f', response.cost.thinking)}" if response.cost.thinking
 puts "Total Cost: $#{format('%.6f', response.cost.total)}" if response.cost.total

 # Total tokens for the entire conversation so far
@@ -660,9 +661,11 @@ This means the same RubyLLM code works across providers: `input_tokens` for stan

 `cache_read_tokens` and `cache_write_tokens` are available from v1.15+ and are also exposed as `response.tokens.cache_read` and `response.tokens.cache_write`. The older `cached_tokens` and `cache_creation_tokens` methods remain available for compatibility with v1.9.0+ code.

-Thinking token usage is available via `response.thinking_tokens` and `response.tokens.thinking` when providers report it. For providers that do not include thinking token counts, these values remain `nil`.
+Thinking token usage is available via `response.thinking_tokens` and `response.tokens.thinking` when providers report it. For most providers, thinking/reasoning tokens are a breakdown of output work, not an extra bucket to add yourself. RubyLLM keeps `output_tokens` as the billable output bucket: OpenAI-style providers that include reasoning in completion tokens stay as-is, while OpenAI-compatible providers that report reasoning outside completion tokens are normalized so `output_tokens` includes the billable generated total.

-Cost helpers are available from v1.15+. RubyLLM uses token usage from the provider and pricing from the model registry. If the registry is missing pricing for tokens that were used, the affected cost and `cost.total` return `nil` instead of pretending the cost was zero.
+When a model has distinct reasoning-token pricing, `response.cost.thinking` prices that bucket separately. Otherwise, thinking tokens are treated as part of `response.cost.output` and `response.cost.thinking` stays `nil`.
+
+Cost helpers are available from v1.15+. RubyLLM uses token usage from the provider and pricing from the model registry. If the registry is missing pricing for tokens that were used, the affected cost and `cost.total` return `nil` instead of pretending the cost was zero. These helpers cover token-priced conversation usage; provider-specific add-ons such as search-query charges are left to the provider's raw usage payload.

 Refer to the [Working with Models Guide]({% link _advanced/models.md %}) for details on accessing model-specific pricing.
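
Given the new `cost.thinking` semantics, a small branch shows how callers can report either case; a sketch assuming `response` comes from `chat.ask`:

```ruby
cost = response.cost

if cost.thinking
  # The registry prices reasoning tokens separately for this model.
  puts "Reasoning: $#{format('%.6f', cost.thinking)}"
elsif cost.output
  # Thinking is billed as ordinary output; nothing extra to add.
  puts "Output (incl. thinking): $#{format('%.6f', cost.output)}"
end
```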

docs/_core_features/streaming.md

Lines changed: 1 addition & 1 deletion

@@ -58,7 +58,7 @@ Key attributes of a `Chunk`:
 * `chunk.model_id`: The model generating the response (usually present).
 * `chunk.tool_calls`: A hash containing partial or complete tool call information if the model is invoking a [Tool]({% link _core_features/tools.md %}). The arguments might be streamed incrementally.
 * `chunk.input_tokens`: Standard input tokens for the request (often `nil` until the final chunk). From v1.15 onward, cache reads and writes are exposed separately as `chunk.cache_read_tokens` and `chunk.cache_write_tokens` when providers report them.
-* `chunk.output_tokens`: Cumulative output tokens *up to this chunk* (behavior varies by provider, often only accurate in the final chunk).
+* `chunk.output_tokens`: Cumulative billable output tokens *up to this chunk* (behavior varies by provider, often only accurate in the final chunk). From v1.15 onward, this includes thinking/reasoning tokens when the provider bills them as output.
 * `chunk.thinking`: Optional thinking output when providers stream it.

 > Do not rely on token counts being present or accurate in every chunk. They are typically finalized only in the last chunk or the final returned message.
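
Since counts typically land only in the final chunk, a defensive streaming sketch (assuming a `chat` built with `RubyLLM.chat`) looks like:

```ruby
final_output_tokens = nil

chat.ask('Explain fibers in Ruby.') do |chunk|
  print chunk.content
  # Usually nil until the provider's final usage frame arrives.
  final_output_tokens = chunk.output_tokens if chunk.output_tokens
end

puts "\nBillable output tokens: #{final_output_tokens.inspect}"
```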

docs/_core_features/thinking.md

Lines changed: 2 additions & 0 deletions

@@ -96,6 +96,8 @@ response.thinking&.text
 response.thinking_tokens
 ```

+`thinking_tokens` is usually a breakdown of generated output work. From v1.15 onward, RubyLLM normalizes `output_tokens` as the billable output bucket, so you should not add `thinking_tokens` to `output_tokens` for cost calculations. When a model has distinct reasoning-token pricing, the cost is exposed separately as `response.cost.thinking`.
+
 ### Upgrading Existing Installations

 For 1.10 upgrades, consider using the [upgrade guide]({% link _advanced/upgrading.md %}#upgrade-to-1-10) to run the generator.
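
A worked example with illustrative numbers makes the no-double-counting rule concrete:

```ruby
# Illustrative numbers: 500 billable output tokens, 300 of them thinking.
output_tokens   = 500
thinking_tokens = 300
visible_tokens  = output_tokens - thinking_tokens # => 200

# With output at $8 per million tokens and no separate reasoning price,
# all 500 output tokens bill at the output rate:
cost = output_tokens * 8.0 / 1_000_000.0 # => 0.004
```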

lib/ruby_llm/cost.rb

Lines changed: 21 additions & 1 deletion

@@ -3,7 +3,7 @@
 module RubyLLM
   # Represents the cost of token usage for a model response.
   class Cost
-    COMPONENTS = %i[input output cache_read cache_write].freeze
+    COMPONENTS = %i[input output cache_read cache_write thinking].freeze
     PER_MILLION = 1_000_000.0

     attr_reader :tokens, :model
@@ -47,6 +47,12 @@ def cache_write
       amount_for(:cache_write)
     end

+    def thinking
+      amount_for(:thinking)
+    end
+
+    alias reasoning thinking
+
     alias cached_input cache_read
     alias cache_creation cache_write

@@ -66,6 +72,7 @@ def to_h
         output: output,
         cache_read: cache_read,
         cache_write: cache_write,
+        thinking: thinking,
         total: total
       }.compact
     end
@@ -78,6 +85,7 @@ def tokens?

     def missing?(component)
       return @missing.include?(component) if aggregate?
+      return false if component == :thinking && !thinking_priced_separately?

       tokens = tokens_for(component)
       tokens.to_i.positive? && price_for(component).nil?
@@ -121,6 +129,8 @@ def tokens_for(component)
         tokens.cache_read
       when :cache_write
         tokens.cache_write
+      when :thinking
+        tokens.thinking if thinking_priced_separately?
       end
     end

@@ -134,13 +144,23 @@ def price_for(component)
         text_pricing.cache_read_input
       when :cache_write
         text_pricing.cache_write_input
+      when :thinking
+        text_pricing.reasoning_output
       end
     end

     def text_pricing
       model&.pricing&.text_tokens || RubyLLM::Model::PricingCategory.new
     end

+    def thinking_priced_separately?
+      reasoning_price = text_pricing.reasoning_output
+      return false unless reasoning_price
+
+      output_price = text_pricing.output
+      output_price.nil? || reasoning_price != output_price
+    end
+
     def normalize_model(model)
       return RubyLLM.models.find(model.to_s) if model.is_a?(String) || model.is_a?(Symbol)
       return model.to_llm if model.respond_to?(:to_llm)
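
The `thinking_priced_separately?` predicate drives all three thinking hooks above. Its decision table, restated as a standalone sketch (the real method reads prices from the registry via `text_pricing` rather than taking arguments):

```ruby
def thinking_priced_separately?(reasoning_price, output_price)
  return false unless reasoning_price

  output_price.nil? || reasoning_price != output_price
end

thinking_priced_separately?(nil,  8.0) # => false: no reasoning price in the registry
thinking_priced_separately?(8.0,  8.0) # => false: same rate, thinking billed as output
thinking_priced_separately?(24.0, 8.0) # => true:  distinct reasoning rate
thinking_priced_separately?(24.0, nil) # => true:  only the reasoning rate is known
```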

lib/ruby_llm/model/pricing_category.rb

Lines changed: 4 additions & 0 deletions

@@ -27,6 +27,10 @@ def cache_write_input
       standard&.cache_write_input_per_million || standard&.cache_creation_input_per_million
     end

+    def reasoning_output
+      standard&.reasoning_output_per_million
+    end
+
     alias cached_input cache_read_input
     alias cache_creation_input cache_write_input
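
How the new per-million price becomes a bucket amount (mirroring `Cost::PER_MILLION`), with illustrative numbers:

```ruby
reasoning_output_per_million = 24.0 # illustrative registry price, USD per 1M tokens
thinking_tokens = 1_250

thinking_cost = thinking_tokens * reasoning_output_per_million / 1_000_000.0
# => 0.03
```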

lib/ruby_llm/providers/openai/chat.rb

Lines changed: 25 additions & 2 deletions

@@ -61,7 +61,7 @@ def parse_completion_response(response)
          return unless message_data

          usage = data['usage'] || {}
-         thinking_tokens = usage.dig('completion_tokens_details', 'reasoning_tokens')
+         thinking_tokens = thinking_tokens(usage)
          content, thinking_from_blocks = extract_content_and_thinking(message_data['content'])
          thinking_text = thinking_from_blocks || extract_thinking_text(message_data)
          thinking_signature = extract_thinking_signature(message_data)
@@ -72,7 +72,7 @@ def parse_completion_response(response)
            thinking: Thinking.build(text: thinking_text, signature: thinking_signature),
            tool_calls: parse_tool_calls(message_data['tool_calls']),
            input_tokens: input_tokens(usage),
-           output_tokens: usage['completion_tokens'],
+           output_tokens: output_tokens(usage),
            cached_tokens: cache_read_tokens(usage),
            cache_creation_tokens: cache_write_tokens(usage),
            thinking_tokens: thinking_tokens,
@@ -90,6 +90,25 @@ def input_tokens(usage)
          [prompt_tokens.to_i - cache_read_tokens(usage).to_i - cache_write_tokens(usage).to_i, 0].max
        end

+       def output_tokens(usage)
+         completion_tokens = usage['completion_tokens']
+         return unless completion_tokens
+
+         completion_tokens = completion_tokens.to_i
+         generated_tokens = generated_tokens_from_total(usage)
+         return completion_tokens unless generated_tokens && generated_tokens > completion_tokens
+
+         generated_tokens
+       end
+
+       def generated_tokens_from_total(usage)
+         prompt_tokens = usage['prompt_tokens']
+         total_tokens = usage['total_tokens']
+         return unless prompt_tokens && total_tokens
+
+         [total_tokens.to_i - prompt_tokens.to_i, 0].max
+       end
+
        def cache_read_tokens(usage)
          usage.dig('prompt_tokens_details', 'cached_tokens') || usage['prompt_cache_hit_tokens']
        end
@@ -98,6 +117,10 @@ def cache_write_tokens(usage)
          usage.dig('prompt_tokens_details', 'cache_write_tokens') || 0
        end

+       def thinking_tokens(usage)
+         usage.dig('completion_tokens_details', 'reasoning_tokens') || usage['reasoning_tokens']
+       end
+
        def format_messages(messages)
          messages.map do |msg|
            {
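
To see why `output_tokens` prefers `total_tokens - prompt_tokens` when it exceeds `completion_tokens`, here is the arithmetic on a usage payload shaped like an OpenAI-compatible provider that reports reasoning outside `completion_tokens` (numbers illustrative):

```ruby
usage = {
  'prompt_tokens'     => 100,
  'completion_tokens' => 200, # visible text only on this provider
  'total_tokens'      => 550, # prompt + visible + reasoning
  'completion_tokens_details' => { 'reasoning_tokens' => 250 }
}

generated = [usage['total_tokens'] - usage['prompt_tokens'], 0].max # => 450
# 450 > 200, so output_tokens returns the billable generated total:
output_tokens = generated > usage['completion_tokens'] ? generated : usage['completion_tokens']
# => 450, and thinking_tokens (250) is a breakdown of it, not an extra bucket
```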

lib/ruby_llm/providers/openai/streaming.rb

Lines changed: 2 additions & 2 deletions

@@ -27,10 +27,10 @@ def build_chunk(data)
            ),
            tool_calls: parse_tool_calls(delta['tool_calls'], parse_arguments: false),
            input_tokens: OpenAI::Chat.input_tokens(usage),
-           output_tokens: usage['completion_tokens'],
+           output_tokens: OpenAI::Chat.output_tokens(usage),
            cached_tokens: OpenAI::Chat.cache_read_tokens(usage),
            cache_creation_tokens: OpenAI::Chat.cache_write_tokens(usage),
-           thinking_tokens: usage.dig('completion_tokens_details', 'reasoning_tokens')
+           thinking_tokens: OpenAI::Chat.thinking_tokens(usage)
          )
        end
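
Because streaming now delegates to the same `OpenAI::Chat` helpers, a final streamed usage frame and a blocking response normalize identically. A sketch against an illustrative frame, assuming the helpers remain callable as module methods (as the existing `OpenAI::Chat.input_tokens` call suggests):

```ruby
usage = {
  'prompt_tokens' => 40, 'completion_tokens' => 90, 'total_tokens' => 130,
  'completion_tokens_details' => { 'reasoning_tokens' => 30 }
}

RubyLLM::Providers::OpenAI::Chat.output_tokens(usage)   # => 90 (total - prompt == completion)
RubyLLM::Providers::OpenAI::Chat.thinking_tokens(usage) # => 30
```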
