per-tier /think budgets, persistent session counters, Anthropic cleanup

alexylon · alexylon · commit 77a11074a5cb · 2026-05-05T20:47:11.000+03:00
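
The counter persistence described in the CHANGELOG.md hunk of this commit can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/session/`. The `u64` field type, the `restore_counters` and `cliff_already_crossed` helpers, the `CLIFF_THRESHOLD` constant, and the strict `>` comparison are all assumptions made for the example. An `Option` stands in for the `#[serde(default)]` handling of older session files that carry no counters.

```rust
/// Mirrors the five counters named in the CHANGELOG entry.
/// The `u64` field type is an assumption for this sketch.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct SessionTokenCounters {
    pub total_input_tokens: u64,
    pub total_output_tokens: u64,
    pub total_cache_read_tokens: u64,
    pub total_cache_creation_tokens: u64,
    pub peak_single_turn_input_tokens: u64,
}

/// Hypothetical constant: the "272K" threshold from the changelog,
/// assumed here to mean 272,000 input tokens.
pub const CLIFF_THRESHOLD: u64 = 272_000;

/// Hypothetical helper. `None` models an older session file with no
/// persisted counters, which falls back to all-zero values.
pub fn restore_counters(persisted: Option<SessionTokenCounters>) -> SessionTokenCounters {
    persisted.unwrap_or_default()
}

/// Hypothetical helper. Reports whether any single turn in the
/// restored session already exceeded the assumed threshold.
pub fn cliff_already_crossed(counters: &SessionTokenCounters) -> bool {
    counters.peak_single_turn_input_tokens > CLIFF_THRESHOLD
}
```

With this shape, a resumed session that saved a peak of 300,000 input tokens still reports the threshold as crossed, while an older file with no counters restores to zeros and behaves as it did before persistence was added.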
diff --git a/AGENTS.md b/AGENTS.md
@@ -415,6 +415,10 @@ This prints:
 - [ ] Docs: Updated README if user-visible change?
 - [ ] UX: Error messages clear and actionable?
 
+## `notes/` directory
+
+Gitignored scratchpad for helper files the user asks to be created there — typically markdown (current proposal/plan files, side references during a refactor, etc.). Safe to read for context; nothing in `notes/` ships with the repo.
+
 ## Non-Negotiable Rules
 
 - Use idiomatic Rust and repository naming conventions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -30,6 +30,8 @@ All notable changes to Sofos are documented in this file.
 
 ### Fixed
 
+- **Session token counters now persist across `--resume`.** Previously every counter (`total_input_tokens`, `total_output_tokens`, `total_cache_read_tokens`, `total_cache_creation_tokens`, `peak_single_turn_input_tokens`) reset to 0 on session reload — the cost line started from zero on resume and the gpt-5.4/5.5 cliff detector forgot whether the 272K threshold had already been crossed. All five counters are now saved as part of the session JSON and restored on load. Older session files (written before this release) default every counter to 0 via `#[serde(default)]` (matching prior behaviour). **Forward-compat note:** if a session file written by this release is later opened by an older sofos, the older binary silently drops the new fields on save; mixing versions against the same session file will lose the persisted counters until you settle on one version.
+- **Empty OpenAI reasoning shells are dropped instead of round-tripped.** When a reasoning output item arrives with `id` but no visible summary AND no `encrypted_content`, the wire shape `{type: "reasoning", id, summary: []}` carries no signal and may be rejected by some OpenAI models. Sofos now skips the block in that exact configuration; reasoning items with either a summary or encrypted CoT are preserved unchanged.
 - **Streaming Anthropic responses now round-trip server-side `compaction` content blocks.** The streaming path used to silently drop them, so on a streaming-enabled Anthropic session the next turn would re-send the pre-compaction history and Anthropic would re-compact (extra cost). The non-streaming path was already correct via serde; this brings streaming into parity.
 - **OpenAI reasoning items round-trip in the right order relative to their assistant message.** Reasoning items were being emitted in the input array *after* the message they preceded, breaking encrypted_content round-trip continuity on the server side. Now correctly placed before.
 - **Tool-cache breakpoint actually lands on Anthropic when OpenAI's web-search tool is registered.** The stamper used to no-op when `OpenAIWebSearch` was the last entry in the tool list, leaving Anthropic with no tool-defs cache breakpoint at all. Now finds the last *Anthropic-compatible* tool to stamp.
diff --git a/src/api/openai.rs b/src/api/openai.rs
@@ -157,43 +157,17 @@ impl OpenAIClient {
                     }
                 }
                 "reasoning" => {
-                    // Pack the whole reasoning output item (id + every
-                    // summary entry + encrypted_content) into one block
-                    // so the next request can round-trip it as a single
-                    // `{type: "reasoning"}` input item. Splitting it
-                    // into one Summary block per entry would lose the
-                    // shared `id`/`encrypted_content` and force the
-                    // server to rederive the hidden chain-of-thought
-                    // on every tool round-trip.
                     let summary_texts: Vec<String> = item
                         .summary
                         .into_iter()
                         .filter(|s| s.summary_type == "summary_text" && !s.text.trim().is_empty())
                         .map(|s| s.text)
                         .collect();
-                    if let Some(rid) = item.id {
-                        // TODO: if `summary_texts` is empty AND
-                        // `item.encrypted_content` is None, we
-                        // round-trip an empty reasoning shell
-                        // (`{type: "reasoning", id, summary: []}`).
-                        // Theoretical wire-shape edge case — OpenAI
-                        // may reject it. Drop the block in that
-                        // configuration if it ever shows up in real
-                        // responses.
-                        content_blocks.push(ContentBlock::Reasoning {
-                            id: rid,
-                            summary: summary_texts,
-                            encrypted_content: item.encrypted_content,
-                        });
-                    } else {
-                        // No id means this isn't a real reasoning item
-                        // (e.g. an old payload predating the field) —
-                        // fall back to per-text Summary blocks so the
-                        // visible reasoning still surfaces.
-                        for text in summary_texts {
-                            content_blocks.push(ContentBlock::Summary { summary: text });
-                        }
-                    }
+                    content_blocks.extend(reasoning_item_to_blocks(
+                        item.id,
+                        summary_texts,
+                        item.encrypted_content,
+                    ));
                 }
                 _ => {
                     if std::env::var("SOFOS_DEBUG").is_ok() {
@@ -577,6 +551,46 @@ struct OpenAIInputTokensDetails {
     cached_tokens: Option<u32>,
 }
 
+/// Convert a single OpenAI `reasoning` output item into the content
+/// blocks sofos stores in conversation history.
+///
+/// With an `id` present, the whole item (id + visible summary +
+/// encrypted CoT) packs into one [`ContentBlock::Reasoning`] so the
+/// next request can round-trip it as a single `{type: "reasoning"}`
+/// input — splitting into per-summary blocks would lose the shared
+/// `id`/`encrypted_content` and force the server to rederive the
+/// hidden chain-of-thought on every tool round-trip.
+///
+/// Two edge cases:
+/// 1. `id` present but neither summary nor encrypted_content — drop
+///    the block. The wire shape `{type: "reasoning", id, summary: []}`
+///    is rejected by some OpenAI models, and the block carries no
+///    signal worth round-tripping anyway.
+/// 2. No `id` (old payloads predating the field) — fall back to
+///    per-text [`ContentBlock::Summary`] blocks so the visible
+///    reasoning still surfaces.
+fn reasoning_item_to_blocks(
+    id: Option<String>,
+    summary_texts: Vec<String>,
+    encrypted_content: Option<String>,
+) -> Vec<ContentBlock> {
+    if let Some(rid) = id {
+        if summary_texts.is_empty() && encrypted_content.is_none() {
+            return Vec::new();
+        }
+        vec![ContentBlock::Reasoning {
+            id: rid,
+            summary: summary_texts,
+            encrypted_content,
+        }]
+    } else {
+        summary_texts
+            .into_iter()
+            .map(|text| ContentBlock::Summary { summary: text })
+            .collect()
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
@@ -756,4 +770,83 @@ mod tests {
         let usage: OpenAIResponseUsage = serde_json::from_value(json).unwrap();
         assert!(usage.input_tokens_details.is_none());
     }
+
+    #[test]
+    fn reasoning_item_drops_empty_shell_when_neither_summary_nor_encrypted() {
+        // `{type: "reasoning", id, summary: []}` with no
+        // encrypted_content carries no signal and some OpenAI models
+        // reject the wire shape — drop instead of round-tripping.
+        let blocks = reasoning_item_to_blocks(Some("rs_abc".to_string()), Vec::new(), None);
+        assert!(
+            blocks.is_empty(),
+            "empty reasoning shell must be dropped, got {blocks:?}"
+        );
+    }
+
+    #[test]
+    fn reasoning_item_keeps_block_when_encrypted_content_present() {
+        // Encrypted CoT alone is enough signal to round-trip — the
+        // server uses it to resume hidden reasoning even with no
+        // visible summary.
+        let blocks = reasoning_item_to_blocks(
+            Some("rs_abc".to_string()),
+            Vec::new(),
+            Some("encrypted_blob".to_string()),
+        );
+        assert_eq!(blocks.len(), 1);
+        assert!(matches!(
+            &blocks[0],
+            ContentBlock::Reasoning {
+                summary,
+                encrypted_content: Some(_),
+                ..
+            } if summary.is_empty()
+        ));
+    }
+
+    #[test]
+    fn reasoning_item_keeps_block_when_summary_present() {
+        let blocks = reasoning_item_to_blocks(
+            Some("rs_abc".to_string()),
+            vec!["thought".to_string()],
+            None,
+        );
+        assert_eq!(blocks.len(), 1);
+        assert!(matches!(
+            &blocks[0],
+            ContentBlock::Reasoning { summary, .. } if summary == &vec!["thought".to_string()]
+        ));
+    }
+
+    #[test]
+    fn reasoning_item_keeps_block_when_both_summary_and_encrypted_present() {
+        // Common path — a reasoning model with `summary: "auto"` and
+        // `include[reasoning.encrypted_content]` returns both. Both
+        // must round-trip on the same block to preserve the link
+        // between the visible summary and the hidden CoT.
+        let blocks = reasoning_item_to_blocks(
+            Some("rs_abc".to_string()),
+            vec!["thought".to_string()],
+            Some("encrypted_blob".to_string()),
+        );
+        assert_eq!(blocks.len(), 1);
+        assert!(matches!(
+            &blocks[0],
+            ContentBlock::Reasoning {
+                summary,
+                encrypted_content: Some(_),
+                ..
+            } if summary == &vec!["thought".to_string()]
+        ));
+    }
+
+    #[test]
+    fn reasoning_item_without_id_falls_back_to_summary_blocks() {
+        // Old payloads predating the `id` field — the visible
+        // reasoning still surfaces but loses its round-trip handle.
+        let blocks = reasoning_item_to_blocks(None, vec!["a".to_string(), "b".to_string()], None);
+        assert_eq!(blocks.len(), 2);
+        assert!(matches!(blocks[0], ContentBlock::Summary { .. }));
+        assert!(matches!(blocks[1], ContentBlock::Summary { .. }));
+    }
 }
diff --git a/src/error.rs b/src/error.rs
@@ -160,8 +160,16 @@ impl SofosError {
                         "Set OPENAI_API_KEY environment variable or use --openai-api-key flag"
                             .to_string(),
                     )
-                } else if msg.contains("thinking_budget") && msg.contains("max_tokens") {
-                    Some("Increase --max-tokens or decrease --thinking-budget".to_string())
+                } else if msg.contains("max_tokens") && msg.contains("thinking-budget ceiling") {
+                    // Matches the validation message in `Repl::new`. The
+                    // suggestion no longer mentions `--thinking-budget`
+                    // because that flag is inert — the legacy budget is
+                    // picked per-effort tier in `request_builder`, not
+                    // from the flag value.
+                    Some(format!(
+                        "Increase --max-tokens above {} or set --reasoning-effort off",
+                        crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH
+                    ))
                 } else {
                     None
                 }
@@ -242,3 +250,40 @@ impl SofosError {
 }
 
 pub type Result<T> = std::result::Result<T, SofosError>;
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    /// Regression: the validation message in `Repl::new` was rewritten
+    /// (`thinking_budget >= max_tokens` → `max_tokens <= legacy thinking-
+    /// budget ceiling`) when `/think` started picking budgets per-effort
+    /// instead of from the inert flag. The classifier here must still
+    /// recognise the new wording and surface a useful hint.
+    #[test]
+    fn config_hint_fires_on_new_max_tokens_validation_message() {
+        // Mirror the exact format string used at the validation site so
+        // a future rewording on either side breaks this test loudly.
+        let err = SofosError::Config(format!(
+            "max_tokens ({}) must exceed the legacy thinking-budget ceiling ({}). \
+             Use a higher --max-tokens or set --reasoning-effort off.",
+            crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH,
+            crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH
+        ));
+        let hint = err.hint().expect("hint must fire on the new message");
+        assert!(
+            hint.contains("Increase --max-tokens"),
+            "hint should mention --max-tokens, got: {hint}"
+        );
+        assert!(
+            hint.contains(&crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH.to_string()),
+            "hint should embed the actual ceiling, got: {hint}"
+        );
+        // `--thinking-budget` is inert; the suggestion must not point
+        // users at a flag that no longer does anything.
+        assert!(
+            !hint.contains("--thinking-budget"),
+            "suggestion must not advise tweaking the inert --thinking-budget flag, got: {hint}"
+        );
+    }
+}
diff --git a/src/main.rs b/src/main.rs
@@ -115,10 +115,15 @@ fn main() -> Result<()> {
             crate::api::anthropic::effort_label(cli.reasoning_effort)
         ));
     } else if cli.reasoning_effort.is_enabled() {
+        // Display the per-effort tier budget actually sent
+        // (`request_builder` no longer reads the inert
+        // `--thinking-budget` flag) so the startup banner matches
+        // what hits the API.
+        let budget = crate::api::anthropic::legacy_thinking_budget(cli.reasoning_effort);
         startup_banner.push_str(&format!(
             "{} (budget: {} tokens)\n",
             "Extended thinking: enabled".bright_green(),
-            cli.thinking_budget
+            budget
         ));
     }
 
diff --git a/src/repl/mod.rs b/src/repl/mod.rs
@@ -14,7 +14,9 @@ use crate::api::{CreateMessageRequest, ImageSource, LlmClient, MessageContentBlo
 use crate::config::{ModelConfig, NORMAL_MODE_MESSAGE, SAFE_MODE_MESSAGE};
 use crate::error::{Result, SofosError};
 use crate::mcp::McpManager;
-use crate::session::{DisplayMessage, HistoryManager, SessionMetadata, SessionState};
+use crate::session::{
+    DisplayMessage, HistoryManager, SessionMetadata, SessionState, SessionTokenCounters,
+};
 use crate::tools::ToolExecutor;
 use crate::tools::image::{ImageLoader, ImageReference, extract_image_references};
 use crate::ui::{UI, set_safe_mode_cursor_style};
@@ -251,7 +253,13 @@ impl Repl {
             format!("effort: {}", crate::api::anthropic::effort_label(effort))
         } else if matches!(self.client, Anthropic(_)) {
             if effort.is_enabled() {
-                format!("thinking: {} tok", self.model_config.thinking_budget)
+                // The legacy non-adaptive shape's `budget_tokens` is
+                // picked from the effort tier in `request_builder`, not
+                // from the (inert) `--thinking-budget` flag. Display the
+                // value we actually send so the status line reflects
+                // reality.
+                let budget = crate::api::anthropic::legacy_thinking_budget(effort);
+                format!("thinking: {} tok", budget)
             } else {
                 "thinking: off".to_string()
             }
@@ -717,6 +725,13 @@ impl Repl {
             self.session_state.conversation.messages(),
             &self.session_state.display_messages,
             self.session_state.conversation.system_prompt(),
+            SessionTokenCounters {
+                total_input_tokens: self.session_state.total_input_tokens,
+                total_output_tokens: self.session_state.total_output_tokens,
+                total_cache_read_tokens: self.session_state.total_cache_read_tokens,
+                total_cache_creation_tokens: self.session_state.total_cache_creation_tokens,
+                peak_single_turn_input_tokens: self.session_state.peak_single_turn_input_tokens,
+            },
         )?;
 
         Ok(())
@@ -789,10 +804,15 @@ impl Repl {
             );
         } else if matches!(self.client, Anthropic(_)) {
             if effort.is_enabled() {
+                // Display the per-effort tier budget actually sent
+                // (`request_builder` no longer reads the inert
+                // `--thinking-budget` flag) so the `/think` output
+                // matches what hits the API.
+                let budget = crate::api::anthropic::legacy_thinking_budget(effort);
                 println!(
                     "\n{} (budget: {} tokens)\n",
                     "Extended thinking: enabled".bright_green(),
-                    self.model_config.thinking_budget
+                    budget
                 );
             } else {
                 println!("\n{}\n", "Extended thinking: disabled".bright_yellow());
@@ -871,6 +891,19 @@ impl Repl {
             .conversation
             .restore_messages(session.api_messages.clone());
         self.session_state.display_messages = session.display_messages.clone();
+        // Restore every persisted token counter so the cost summary
+        // stays accurate across the resume. Older session files written
+        // before persistence was added default the whole
+        // `token_counters` struct to all-zero via `#[serde(default)]`
+        // on each field, matching the pre-persistence behaviour for
+        // those old files.
+        self.session_state.total_input_tokens = session.token_counters.total_input_tokens;
+        self.session_state.total_output_tokens = session.token_counters.total_output_tokens;
+        self.session_state.total_cache_read_tokens = session.token_counters.total_cache_read_tokens;
+        self.session_state.total_cache_creation_tokens =
+            session.token_counters.total_cache_creation_tokens;
+        self.session_state.peak_single_turn_input_tokens =
+            session.token_counters.peak_single_turn_input_tokens;
 
         println!(
             "{} {} ({} messages)",
diff --git a/src/session/history.rs b/src/session/history.rs
diff --git a/src/session/mod.rs b/src/session/mod.rs
diff --git a/src/session/state.rs b/src/session/state.rs