Commit d33d0c4

Fix session-summary cost over-billing: discount cache reads, add cache-hit row
1 parent 7761c70 commit d33d0c4

9 files changed

Lines changed: 329 additions & 35 deletions

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -4,8 +4,16 @@ All notable changes to Sofos are documented in this file.

 ## [Unreleased]

+### Fixed
+
+- **Session summary "Estimated cost" now accounts for the cache discount.** `calculate_cost` was billing every input token at the full rate, ignoring the `cache_read_input_tokens` and `cache_creation_input_tokens` fields the 0.2.6 fix had started collecting. With a high cache-hit rate that overstated the bill by ~3× (e.g. for `gpt-5.5` at 75% hit, displayed `$0.50` vs. real `$0.16` per 100k input). The cost function now subtracts cache reads from the uncached portion, prices them at 10% of the base input rate (both providers), and bills Anthropic 5-min cache writes at 125% of base. Provider semantics are normalized inside `calculate_cost`: OpenAI's `input_tokens` already includes cached, Anthropic's excludes them.
+
 ### Added

+- **Cache-hit indicator in the session summary.** When a turn has any cached or written tokens, the summary now shows `cache read: N (M% hit)` and (Anthropic only) `cache write: N` underneath the input row, and the displayed `Input tokens` row now reflects the total the model actually saw (cached + uncached) on both providers — previously Anthropic's row understated by the cached portion.
+
+- **"Finished in Xs" turn-completion marker.** A dimmed `Finished in 1m 34s` line prints after the assistant has fully finished a turn (last text reply, last tool call, last continuation) so the prompt-ready signal is unambiguous. Steer messages typed mid-turn don't reset the timer — they fold into the same turn and the marker still prints once at the end. Skipped on interrupt or error.
+
 - **Bare `"Bash"` entry in allow / deny acts as a blanket rule.** Adding `"Bash"` to `permissions.allow` in `~/.sofos/config.toml` or `.sofos/config.local.toml` auto-passes every bash command (no Yes/No/remember prompt) except those in the built-in forbidden set (`rm`, `chmod`, `sudo`, …) — useful when you've decided to trust sofos with shell access in a project. Symmetrically, `"Bash"` in `permissions.deny` auto-rejects every bash command. The blanket entry beats every more-specific rule (`Bash(cmd:*)` wildcards, exact-match entries, the built-in allow-list); when both lists contain `"Bash"`, deny wins. Structural safety (`>` redirection, `<<`, `git push` and friends, parent traversal, external-path prompts) still applies.

 ### Changed
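
To make the pricing rule from the "Fixed" changelog entry concrete, here is a minimal Rust sketch. It is not the repository's `calculate_cost`: the `Provider` enum, the `input_cost_usd` function, and its signature are hypothetical. The 10% and 125% multipliers and the provider semantics come from the changelog text; the base rate of $5 per million input tokens used in the note below is only inferred from the quoted `$0.50` per 100k figure.

```rust
/// Which provider's `usage` semantics apply (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Provider {
    /// `input_tokens` already includes tokens served from the cache.
    OpenAi,
    /// `input_tokens` excludes cache reads and cache writes.
    Anthropic,
}

/// Estimated input-side cost in USD.
/// `base_rate_per_mtok` is the undiscounted input price per million tokens.
pub fn input_cost_usd(
    provider: Provider,
    input_tokens: u32,
    cache_read_tokens: u32,
    cache_creation_tokens: u32,
    base_rate_per_mtok: f64,
) -> f64 {
    // Normalize to the uncached portion: OpenAI reports a total that
    // contains the cache reads, Anthropic reports uncached tokens only.
    let uncached = match provider {
        Provider::OpenAi => input_tokens.saturating_sub(cache_read_tokens),
        Provider::Anthropic => input_tokens,
    };
    let per_token = base_rate_per_mtok / 1_000_000.0;

    f64::from(uncached) * per_token
        // Cache reads are billed at 10% of the base input rate.
        + f64::from(cache_read_tokens) * per_token * 0.10
        // Cache writes (Anthropic 5-min cache) are billed at 125% of base.
        + f64::from(cache_creation_tokens) * per_token * 1.25
}
```

With 100,000 input tokens at a 75% hit rate and the assumed $5 per million base rate, `input_cost_usd(Provider::OpenAi, 100_000, 75_000, 0, 5.0)` comes to about 0.1625, in line with the changelog's `$0.16`. Billing all 100,000 tokens at the full rate gives the old `$0.50`.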

README.md

Lines changed: 1 addition & 1 deletion
@@ -151,7 +151,7 @@ Analyze https://example.com/chart.png

 ### Cost Tracking

-Exit summary shows token usage and estimated cost (based on official API pricing).
+Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects the cache discount (10% of base input on both providers, plus 125% for Anthropic 5-min cache writes).

 ### CLI Options

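
The `cache read: N (M% hit)` row described in the README can be derived from the session counters. The sketch below is one way such a row might be built, not the code Sofos uses: `total_input_seen`, `cache_read_row`, and the `input_includes_cached` flag are hypothetical names. Adding cache writes to the Anthropic total follows Anthropic's definition of its usage fields; whether Sofos computes the displayed total exactly this way is not visible in this diff.

```rust
/// Total input tokens the model saw, normalized across providers.
/// `input_includes_cached` is true for OpenAI-style usage (total already
/// contains cache reads) and false for Anthropic-style usage (disjoint
/// counters). The flag is a hypothetical stand-in for provider detection.
pub fn total_input_seen(
    input_includes_cached: bool,
    input_tokens: u32,
    cache_read_tokens: u32,
    cache_creation_tokens: u32,
) -> u32 {
    if input_includes_cached {
        input_tokens
    } else {
        input_tokens
            .saturating_add(cache_read_tokens)
            .saturating_add(cache_creation_tokens)
    }
}

/// Builds the cache-read row, or `None` when nothing was served from cache.
pub fn cache_read_row(total_input: u32, cache_read_tokens: u32) -> Option<String> {
    if total_input == 0 || cache_read_tokens == 0 {
        return None;
    }
    let hit_pct = f64::from(cache_read_tokens) / f64::from(total_input) * 100.0;
    Some(format!(
        "cache read: {} ({:.0}% hit)",
        cache_read_tokens, hit_pct
    ))
}
```

For an Anthropic-style session with 25,000 uncached and 75,000 cached tokens, `total_input_seen(false, 25_000, 75_000, 0)` returns 100,000 and the row reports a 75% hit rate.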

src/repl/mod.rs

Lines changed: 29 additions & 12 deletions
@@ -22,7 +22,7 @@ use colored::Colorize;
 use std::path::PathBuf;
 use std::sync::atomic::{AtomicBool, Ordering};
 use std::sync::{Arc, Mutex};
-use std::time::Duration;
+use std::time::{Duration, Instant};
 use tokio::time::sleep;

 /// Shared buffer used by the TUI to inject user messages mid-turn. The UI
@@ -267,6 +267,13 @@ impl Repl {
         user_input: &str,
         pasted_images: Vec<crate::clipboard::PastedImage>,
     ) -> Result<()> {
+        // Record turn start so we can show "Finished in Xs" when the
+        // model is fully done (after every text reply, tool call, and
+        // continuation). Steer messages typed mid-turn don't reset
+        // this — they're folded into the same turn via `SteerQueue` and
+        // the same `process_message` call keeps running until the
+        // agent loop exits.
+        let turn_start = Instant::now();
         let (remaining_text, image_refs) = extract_image_references(user_input);

         let has_images = !image_refs.is_empty() || !pasted_images.is_empty();
@@ -584,8 +591,7 @@ impl Repl {
             }
         };

-        self.session_state
-            .add_tokens(response.usage.input_tokens, response.usage.output_tokens);
+        self.session_state.add_usage(&response.usage);

         let mut handler = ResponseHandler::new(
             self.client.clone(),
@@ -607,13 +613,21 @@ impl Repl {
             &mut self.session_state.display_messages,
             &mut self.session_state.total_input_tokens,
             &mut self.session_state.total_output_tokens,
+            &mut self.session_state.total_cache_read_tokens,
+            &mut self.session_state.total_cache_creation_tokens,
         ));

         // Always preserve conversation state so the AI retains context on retry
         self.session_state.conversation = handler.conversation().clone();

         match result {
-            Ok(_) => Ok(()),
+            Ok(_) => {
+                println!(
+                    "{}",
+                    UI::format_turn_finished(turn_start.elapsed()).dimmed()
+                );
+                Ok(())
+            }
             Err(SofosError::Interrupted) => Ok(()),
             Err(e) => {
                 // Add error context so the AI knows what happened on next turn.
@@ -672,6 +686,8 @@ impl Repl {
             &self.model_config.model,
             self.session_state.total_input_tokens,
             self.session_state.total_output_tokens,
+            self.session_state.total_cache_read_tokens,
+            self.session_state.total_cache_creation_tokens,
         );

         Ok(())
@@ -694,12 +710,14 @@ impl Repl {
         Ok(())
     }

-    pub fn get_session_summary(&self) -> (String, u32, u32) {
-        (
-            self.model_config.model.clone(),
-            self.session_state.total_input_tokens,
-            self.session_state.total_output_tokens,
-        )
+    pub fn get_session_summary(&self) -> tui::event::ExitSummary {
+        tui::event::ExitSummary {
+            model: self.model_config.model.clone(),
+            input_tokens: self.session_state.total_input_tokens,
+            output_tokens: self.session_state.total_output_tokens,
+            cache_read_tokens: self.session_state.total_cache_read_tokens,
+            cache_creation_tokens: self.session_state.total_cache_creation_tokens,
+        }
     }

     pub fn handle_clear_command(&mut self) -> Result<()> {
@@ -998,8 +1016,7 @@ impl Repl {
             .conversation
             .replace_with_summary(summary_text, split_point);

-        self.session_state
-            .add_tokens(response.usage.input_tokens, response.usage.output_tokens);
+        self.session_state.add_usage(&response.usage);

         let tokens_after = self.session_state.conversation.estimate_total_tokens();
         println!(
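
The body of `UI::format_turn_finished` is not part of this diff. A hypothetical stand-in, assuming it renders whole seconds and switches to minutes (and, as a further assumption, hours) for longer turns in the style of the changelog's `Finished in 1m 34s` example, might look like this:

```rust
use std::time::Duration;

/// Hypothetical stand-in for `UI::format_turn_finished`; the real
/// implementation is not shown in this commit. Sub-second precision is
/// dropped and the largest non-zero unit leads the output.
pub fn format_turn_finished(elapsed: Duration) -> String {
    let total_secs = elapsed.as_secs();
    let hours = total_secs / 3600;
    let minutes = (total_secs % 3600) / 60;
    let seconds = total_secs % 60;

    if hours > 0 {
        format!("Finished in {}h {}m {}s", hours, minutes, seconds)
    } else if minutes > 0 {
        format!("Finished in {}m {}s", minutes, seconds)
    } else {
        format!("Finished in {}s", seconds)
    }
}
```

For example, `format_turn_finished(Duration::from_secs(94))` yields `Finished in 1m 34s`. Because `turn_start` is captured once at the top of `process_message`, steer messages folded into the same call extend the measured duration rather than restarting it.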

src/repl/response_handler.rs

Lines changed: 46 additions & 6 deletions
@@ -65,6 +65,24 @@ impl ResponseHandler {
         }
     }

+    /// Fold a `Usage` payload into the per-turn running totals carried
+    /// by `handle_response`. Centralised so the four-counter increment
+    /// stays consistent across the three sites that consume responses
+    /// (auto-continue after reasoning-only blocks, tool-result loop,
+    /// max-iterations summary).
+    fn accumulate_usage(
+        usage: &crate::api::Usage,
+        total_input: &mut u32,
+        total_output: &mut u32,
+        total_cache_read: &mut u32,
+        total_cache_creation: &mut u32,
+    ) {
+        *total_input += usage.input_tokens;
+        *total_output += usage.output_tokens;
+        *total_cache_read += usage.cache_read_input_tokens.unwrap_or(0);
+        *total_cache_creation += usage.cache_creation_input_tokens.unwrap_or(0);
+    }
+
     /// Atomically drain all pending steer messages the user typed while
     /// this turn was running. Returns `None` if the queue is empty, or
     /// `Some(text)` with the messages joined by blank lines (preserving
@@ -83,12 +101,15 @@ impl ResponseHandler {
         Some(messages.join("\n\n"))
     }

+    #[allow(clippy::too_many_arguments)]
     pub async fn handle_response(
         &mut self,
         mut content_blocks: Vec<ContentBlock>,
         display_messages: &mut Vec<DisplayMessage>,
         total_input_tokens: &mut u32,
         total_output_tokens: &mut u32,
+        total_cache_read_tokens: &mut u32,
+        total_cache_creation_tokens: &mut u32,
     ) -> Result<()> {
         let mut iteration = 0;

@@ -108,6 +129,8 @@ impl ResponseHandler {
                     display_messages,
                     total_input_tokens,
                     total_output_tokens,
+                    total_cache_read_tokens,
+                    total_cache_creation_tokens,
                 )
                 .await?;
                 return Ok(());
@@ -154,8 +177,13 @@ impl ResponseHandler {
             {
                 let response = self.get_next_response(&[], display_messages).await?;

-                *total_input_tokens += response.usage.input_tokens;
-                *total_output_tokens += response.usage.output_tokens;
+                Self::accumulate_usage(
+                    &response.usage,
+                    total_input_tokens,
+                    total_output_tokens,
+                    total_cache_read_tokens,
+                    total_cache_creation_tokens,
+                );

                 if response.content.is_empty() {
                     println!(
@@ -223,8 +251,13 @@ impl ResponseHandler {

             let response = self.get_next_response(&tool_uses, display_messages).await?;

-            *total_input_tokens += response.usage.input_tokens;
-            *total_output_tokens += response.usage.output_tokens;
+            Self::accumulate_usage(
+                &response.usage,
+                total_input_tokens,
+                total_output_tokens,
+                total_cache_read_tokens,
+                total_cache_creation_tokens,
+            );

             if std::env::var("SOFOS_DEBUG").is_ok() {
                 eprintln!(
@@ -570,6 +603,8 @@ impl ResponseHandler {
         display_messages: &mut Vec<DisplayMessage>,
         total_input_tokens: &mut u32,
         total_output_tokens: &mut u32,
+        total_cache_read_tokens: &mut u32,
+        total_cache_creation_tokens: &mut u32,
     ) -> Result<()> {
         UI::print_warning("Maximum tool iterations reached. Stopping to prevent infinite loop.");

@@ -601,8 +636,13 @@ impl ResponseHandler {

         match response_result {
             Ok(response) => {
-                *total_input_tokens += response.usage.input_tokens;
-                *total_output_tokens += response.usage.output_tokens;
+                Self::accumulate_usage(
+                    &response.usage,
+                    total_input_tokens,
+                    total_output_tokens,
+                    total_cache_read_tokens,
+                    total_cache_creation_tokens,
+                );

                 for block in &response.content {
                     if let ContentBlock::Text { text } = block {

src/repl/tui/event.rs

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ pub struct ExitSummary {
     pub model: String,
     pub input_tokens: u32,
     pub output_tokens: u32,
+    pub cache_read_tokens: u32,
+    pub cache_creation_tokens: u32,
 }

 /// Tool access mode shown in the status line.

src/repl/tui/mod.rs

Lines changed: 2 additions & 0 deletions
@@ -222,6 +222,8 @@ pub fn run(mut repl: Repl) -> Result<()> {
             &summary.model,
             summary.input_tokens,
             summary.output_tokens,
+            summary.cache_read_tokens,
+            summary.cache_creation_tokens,
         );
         // The summary emits its own leading newline when it prints; if
         // it short-circuited, the cursor is still parked at the end of

src/repl/tui/worker.rs

Lines changed: 4 additions & 6 deletions
@@ -79,6 +79,8 @@ impl<'a> ShutdownSender<'a> {
                 model: String::new(),
                 input_tokens: 0,
                 output_tokens: 0,
+                cache_read_tokens: 0,
+                cache_creation_tokens: 0,
             });
             let _ = self.ui_tx.send(UiEvent::WorkerShutdown(summary));
             self.sent = true;
@@ -167,16 +169,12 @@ fn run(
         }
     }

-    let (model, input_tokens, output_tokens) = repl.get_session_summary();
+    let summary = repl.get_session_summary();
     if let Err(e) = repl.save_current_session() {
         UI::print_warning(&format!("Failed to save session: {}", e));
     }
     flush_captured_streams();
-    shutdown.set_summary(ExitSummary {
-        model,
-        input_tokens,
-        output_tokens,
-    });
+    shutdown.set_summary(summary);
     shutdown.send_now();
 }


src/session/state.rs

Lines changed: 26 additions & 4 deletions
@@ -10,10 +10,26 @@ pub struct SessionState {
     pub conversation: ConversationHistory,
     /// Display-friendly message history for UI
     pub display_messages: Vec<DisplayMessage>,
-    /// Total input tokens consumed in this session
+    /// Total input tokens consumed in this session.
+    /// Provider semantics differ:
+    ///
+    /// - OpenAI Responses API: this is the **total** count, of which
+    /// `total_cache_read_tokens` is a subset.
+    /// - Anthropic Messages API: this is **uncached** new tokens only;
+    /// cache read/creation are tracked separately and disjoint.
+    ///
+    /// `calculate_cost` normalizes this when computing the bill.
     pub total_input_tokens: u32,
     /// Total output tokens generated in this session
     pub total_output_tokens: u32,
+    /// Tokens served from the provider prompt cache (charged at a
+    /// reduced rate). Both providers report this; semantics relative to
+    /// `total_input_tokens` differ as documented above.
+    pub total_cache_read_tokens: u32,
+    /// Tokens written to the Anthropic prompt cache (charged at a
+    /// premium). OpenAI does not surface a creation counter and leaves
+    /// this at 0.
+    pub total_cache_creation_tokens: u32,
 }

 impl SessionState {
@@ -24,6 +40,8 @@ impl SessionState {
             display_messages: Vec::new(),
             total_input_tokens: 0,
             total_output_tokens: 0,
+            total_cache_read_tokens: 0,
+            total_cache_creation_tokens: 0,
         }
     }

@@ -33,10 +51,14 @@ impl SessionState {
         self.display_messages.clear();
         self.total_input_tokens = 0;
         self.total_output_tokens = 0;
+        self.total_cache_read_tokens = 0;
+        self.total_cache_creation_tokens = 0;
     }

-    pub fn add_tokens(&mut self, input_tokens: u32, output_tokens: u32) {
-        self.total_input_tokens += input_tokens;
-        self.total_output_tokens += output_tokens;
+    pub fn add_usage(&mut self, usage: &crate::api::Usage) {
+        self.total_input_tokens += usage.input_tokens;
+        self.total_output_tokens += usage.output_tokens;
+        self.total_cache_read_tokens += usage.cache_read_input_tokens.unwrap_or(0);
+        self.total_cache_creation_tokens += usage.cache_creation_input_tokens.unwrap_or(0);
     }
 }
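
To see how `add_usage` behaves when a provider omits one of the cache counters, here is a self-contained sketch. The `Usage` and `Totals` types are simplified stand-ins for `crate::api::Usage` and `SessionState`, reduced to the fields visible in this diff; anything else about the real types is an assumption.

```rust
/// Simplified stand-in for `crate::api::Usage` (assumed shape: the two
/// cache counters are optional because not every provider reports them).
#[derive(Default)]
pub struct Usage {
    pub input_tokens: u32,
    pub output_tokens: u32,
    pub cache_read_input_tokens: Option<u32>,
    pub cache_creation_input_tokens: Option<u32>,
}

/// Simplified stand-in for the four session counters in `SessionState`.
#[derive(Debug, Default, PartialEq)]
pub struct Totals {
    pub input: u32,
    pub output: u32,
    pub cache_read: u32,
    pub cache_creation: u32,
}

impl Totals {
    /// Folds one response's usage into the running totals. A missing
    /// cache counter contributes zero instead of failing.
    pub fn add_usage(&mut self, usage: &Usage) {
        self.input += usage.input_tokens;
        self.output += usage.output_tokens;
        self.cache_read += usage.cache_read_input_tokens.unwrap_or(0);
        self.cache_creation += usage.cache_creation_input_tokens.unwrap_or(0);
    }
}
```

Taking the whole `Usage` by reference, rather than two bare integers as `add_tokens` did, means a newly collected field only has to be wired up inside `add_usage` and not at every call site.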