alexylon
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 29 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 22 additions & 14 deletions
diff --git a/src/api/anthropic.rs b/src/api/anthropic.rs
Lines changed: 93 additions & 16 deletions
diff --git a/src/api/mod.rs b/src/api/mod.rs
Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,35 @@ All notable changes to Sofos are documented in this file.
 
 ## [Unreleased]
 
+### Added
+
+- **Anthropic server-side compaction** is now enabled on Claude Opus 4.7, Opus 4.6, and Sonnet 4.6. Sofos sends the `compact-2026-01-12` beta header and a `context_management.edits[type=compact_20260112]` block on every request to those models; when the request crosses the per-model auto-compact threshold (~250K tokens), the API itself summarises older turns and returns a `compaction` content block, dropping the pre-compaction messages server-side on subsequent requests. No extra round-trip — the compaction summary arrives in the same response as the user's reply.
+- **OpenAI encrypted-reasoning round-trip.** Requests that enable reasoning now include `include: ["reasoning.encrypted_content"]`. Sofos captures the opaque encrypted-CoT blob alongside the visible reasoning summary and round-trips both on the next call, so the model resumes its hidden chain-of-thought across tool calls instead of regenerating it. Cuts hidden-reasoning output tokens on multi-call agentic turns.
+- **Per-model `ModelInfo` registry** consolidates context-window, auto-compact threshold, adaptive-thinking flag, server-compaction flag, and pricing (including tiered-pricing rules) into one struct per model. Adding a new model is one struct literal in `src/api/model_info.rs`.
+- **Tiered-pricing detection for GPT-5.4 and GPT-5.5.** Sofos tracks the largest single-turn input observed across the session. If any single prompt crosses the documented 272K threshold, the cost calculator switches to premium rates (2× input, 1.5× output) for the rest of the session — matching what OpenAI actually bills.
+- **1-hour cache TTL on stable prefixes.** System prompt, the last-listed tool definition, and the sticky message anchor now use Anthropic's `ttl: "1h"` ephemeral cache. The rolling breakpoint stays at 5 min because it moves every turn; paying the 2× write premium for a one-turn slot would burn cache writes for nothing.
+- **Middle truncation for tool outputs.** Large bash / search / file-read / diff / MCP outputs preserve both the head and the tail (separated by a `…N tokens truncated…` marker) instead of the head-only cut sofos previously applied. The diagnostic tail (last error line, ripgrep totals, exit messages) now survives truncation.
+- **`compaction` content block type** added to the on-the-wire schema and the saved-session schema, so Anthropic's server-side summaries persist across save / load.
+- **Honest server-side cost line.** The session summary now correctly accounts for the 1-hour cache write premium (200% of base input) on top of the existing 5-minute cache write premium (125%).
+
+### Changed
+
+- **CLI `-t` / `--enable-thinking` is replaced with `-e` / `--reasoning-effort <off|low|medium|high>`** (default `medium`). The previous binary on/off knob is gone — `medium` is now the default-on state because `high` materially raises hidden-reasoning token cost on routine coding work, and `off` is the absolute-cheapest path. **Breaking change**: scripts using `-t` need updating.
+- **`/think on` / `/think off` are replaced with `/think <off|low|medium|high>`.** `/think` (no argument) still shows status. `on` and `off` no longer parse as commands. **Breaking change**.
+- **Auto-compact threshold lowered.** Conversations now compact at ~250K tokens on 1M-window models (Opus 4.7 / 4.6, Sonnet 4.6, GPT-5.4 / 5.5), ~170K on Haiku 4.5, and ~250K on the 400K-window GPT-5.3-Codex. The threshold previously sat at 800K (non-Codex) / 300K (Codex), which meant paying to re-send huge prefixes on every tool round-trip.
+- **Default reasoning effort is now `medium`** (was `high`). This was verified to make the hidden-reasoning bill roughly 3–5× cheaper on routine coding turns. Use `-e high` or `/think high` for hard tasks.
+- **Reasoning summaries are suppressed on the OpenAI thinking-off path.** When `effort: off`, sofos sends `reasoning.effort = "minimal"` with no `summary` field, so the model returns no summary blocks at all (they bill as output tokens).
+- **Model context windows corrected.** Claude Opus 4.7 / 4.6 and Sonnet 4.6 are 1,000,000 tokens (were 200K in the table); GPT-5.4 and GPT-5.5 are 1,050,000 tokens (were 400K). The drop-trim safety floor is now per-model API-aware (95% of the real window) instead of a flat 250K.
+- **Anthropic beta header now opts into both `token-efficient-tools-2025-02-19` and `compact-2026-01-12`.**
+
+### Fixed
+
+- **OpenAI reasoning items round-trip in the right order relative to their assistant message.** Reasoning items were being emitted in the input array *after* the message they preceded, breaking encrypted_content round-trip continuity on the server side. Now correctly placed before.
+- **Tool-cache breakpoint actually lands on Anthropic when OpenAI's web-search tool is registered.** The stamper used to no-op when `OpenAIWebSearch` was the last entry in the tool list, leaving Anthropic with no tool-defs cache breakpoint at all. Now finds the last *Anthropic-compatible* tool to stamp.
+- **OpenAI `Reasoning` blocks no longer leak to Anthropic on provider switch.** A session that started on OpenAI accumulates `Reasoning` content blocks; switching to Anthropic mid-session would have sent those blocks to the Messages API, which doesn't recognise the type. The Anthropic sanitiser now drops them.
+- **`peak_single_turn_input_tokens` is updated for every iteration of multi-tool turns**, not just the first. Long tool chains crossing the GPT-5.5 272K cliff inside the loop now correctly switch the cost line to premium rates.
+- **Stale duplicate cache breakpoint on `read_file_tool` removed.** The tool definition carried an inline `cache_control` that, combined with the request-builder's last-tool stamp, could push the request to a 5th breakpoint (Anthropic allows at most 4 per request).
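
The middle-truncation entry in the Added list above keeps both ends of a large tool output. The following is a minimal sketch of that idea, not the code in `src/api/truncate.rs`: the function name `truncate_middle` is hypothetical, and "tokens" are approximated by whitespace-separated words, which also means whitespace in the kept parts is normalised to single spaces.

```rust
/// Keep the head and the tail of `text` and replace the middle with a marker.
///
/// Assumptions (not taken from the PR): tokens are approximated by
/// whitespace-separated words, and the budget is split evenly between
/// head and tail, with the tail getting the extra token on odd budgets.
pub fn truncate_middle(text: &str, max_tokens: usize) -> String {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    if tokens.len() <= max_tokens {
        // Under budget: return the input untouched.
        return text.to_string();
    }

    let head_len = max_tokens / 2;
    let tail_len = max_tokens - head_len;
    let dropped = tokens.len() - max_tokens;

    let head = tokens[..head_len].join(" ");
    let tail = tokens[tokens.len() - tail_len..].join(" ");

    // Same marker shape as the changelog entry describes.
    format!("{head}\n…{dropped} tokens truncated…\n{tail}")
}
```

With a budget of 4 words, a ten-word input keeps the first two and last two words, so a trailing error line or exit message survives the cut.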
+
 ## [0.2.7] - 2026-05-04
 
 ### Fixed
 
@@ -112,8 +112,8 @@ sofos
 
 - `/resume` - Resume previous session
 - `/clear` - Clear conversation history
-- `/think [on|off]` - Toggle extended thinking (shows status if no arg)
-- `/compact` - Summarize older messages via the LLM to reclaim context tokens (auto-triggers at 80% usage)
+- `/think [off|low|medium|high]` - Set reasoning effort (shows status if no arg)
+- `/compact` - Summarize older messages via the LLM to reclaim context tokens. Triggers automatically at the per-model auto-compact threshold (~250K tokens on 1M-window models, ~170K on Haiku, ~250K on Codex). On Claude Opus 4.7 / 4.6 / Sonnet 4.6 the API itself runs the summarization server-side via the `compact-2026-01-12` beta — no extra round-trip.
 - `/s` - Safe mode (read-only, prompt: **`:`**)
 - `/n` - Normal mode (all tools, prompt: **`>`**)
 - `/exit`, `/quit`, `/q`, `Ctrl+D` - Exit with cost summary
@@ -125,7 +125,7 @@ sofos
 
 **Scrollback:** Sofos runs as an inline viewport at the bottom of your terminal — the rest of the terminal is normal scrollback, so use your terminal emulator's own scrollbar, mouse wheel, and text selection / copy-paste.
 
-**Status line:** Shown below the input box. Updates live as you change state (`/s`, `/n`, `/think`) — model, mode (`normal`/`safe`), reasoning config (`thinking: <N> tok` / `effort: high`), and running token totals.
+**Status line:** Shown below the input box. Updates live as you change state (`/s`, `/n`, `/think`) — model, mode (`normal`/`safe`), reasoning config (`effort: off|low|medium|high` for OpenAI and Claude Opus 4.7+; `thinking: <N> tok` for older Claude models with manual budgets), and running token totals.
 
 ### Image Vision
 
@@ -151,7 +151,9 @@ Analyze https://example.com/chart.png
 
 ### Cost Tracking
 
-Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects the cache discount (10% of base input on both providers, plus 125% for Anthropic 5-min cache writes).
+Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects cache pricing: cache reads bill at 10% of base input on both providers, Anthropic 5-min cache writes at 125%, and Anthropic 1-hour cache writes at 200%.
+
+**Tiered pricing detection.** GPT-5.4 and GPT-5.5 charge a session-wide premium (2× input, 1.5× output) once any single prompt crosses 272K input tokens. Sofos tracks the largest single-turn input observed and switches the cost calculator to premium rates if the cliff is ever crossed, so the displayed cost reflects what OpenAI actually bills.
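
The cache multipliers and the tiered-pricing rule described above can be combined in one estimate. This is a sketch under stated assumptions rather than the Sofos cost calculator: the `Usage` and `Pricing` structs and `estimate_cost` are hypothetical names, prices are supplied by the caller, and the premium multipliers are assumed to apply to every input category, including cached tokens.

```rust
/// Token counts for a session, split by billing category (hypothetical struct).
#[derive(Default, Clone, Copy)]
pub struct Usage {
    pub uncached_input: u64,
    pub cache_read: u64,
    pub cache_write_5m: u64,
    pub cache_write_1h: u64,
    pub output: u64,
    /// Largest input observed on any single request in the session.
    pub peak_single_turn_input: u64,
}

/// Base prices in USD per million tokens (hypothetical struct).
pub struct Pricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
    /// `Some(272_000)` for models with a premium tier, `None` otherwise.
    pub premium_threshold: Option<u64>,
}

/// Estimate session cost in USD.
pub fn estimate_cost(usage: &Usage, pricing: &Pricing) -> f64 {
    // The premium tier applies to the whole session once any single
    // prompt has crossed the threshold.
    let premium = pricing
        .premium_threshold
        .map_or(false, |t| usage.peak_single_turn_input > t);
    let (input_mult, output_mult) = if premium { (2.0, 1.5) } else { (1.0, 1.0) };

    // Weight each input category by its multiplier relative to base input.
    let input_units = usage.uncached_input as f64
        + 0.10 * usage.cache_read as f64
        + 1.25 * usage.cache_write_5m as f64
        + 2.00 * usage.cache_write_1h as f64;

    let input_cost = input_units * pricing.input_per_mtok * input_mult;
    let output_cost = usage.output as f64 * pricing.output_per_mtok * output_mult;
    (input_cost + output_cost) / 1_000_000.0
}
```

Tracking the peak as a running maximum across every request, including each iteration of a multi-tool turn, is what lets the estimate switch tiers as soon as one prompt crosses the threshold.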
 
 ### CLI Options
 
@@ -166,25 +168,31 @@ Exit summary shows token usage and estimated cost based on official API pricing.
     --model <MODEL>          Model to use (default: claude-sonnet-4-6)
     --morph-model <MODEL>    Morph model (default: morph-v3-fast)
     --max-tokens <N>         Max response tokens (default: 32768)
--t, --enable-thinking        Enable extended thinking (default: false)
-    --thinking-budget <N>    Token budget for thinking (Claude only, default: 5120, must be < max-tokens)
+-e, --reasoning-effort <LV>  Reasoning effort: off, low, medium, high (default: medium)
+    --thinking-budget <N>    Token budget for older Claude models with manual budgets (default: 5120, must be < max-tokens). Ignored on Claude Opus 4.7+ and on OpenAI.
 -v, --verbose                Verbose logging
 ```
 
-### Extended Thinking
+### Reasoning Effort
 
-Enable for complex reasoning tasks (disabled by default):
+Sofos exposes four levels — `off`, `low`, `medium`, `high` — applied uniformly across providers. Default is `medium`; `high` is opt-in because it materially raises hidden-reasoning token cost on routine coding work.
 
 ```bash
-sofos -t                                             # Default 5120 token budget (Claude 4.5 / 4.6)
-sofos -t --thinking-budget 10000 --max-tokens 16000  # Custom budget (Claude 4.5 / 4.6)
+sofos -e medium                             # Default — sensible cost/quality balance
+sofos -e high                               # Hard tasks, willing to pay more
+sofos -e off                                # Cheapest path; no reasoning summary
+
+# Mid-session
+/think high                                 # Bump up
+/think off                                  # Drop to minimal
+/think                                      # Show current
 ```
 
-**Note:** Extended thinking works with both Claude and OpenAI models.
+**Per-provider mapping:**
 
-- **Claude 4.5 / 4.6** uses a manual token budget controlled by `--thinking-budget` (default `5120`).
-- **Claude Opus 4.7** uses adaptive thinking — the server picks the budget based on the prompt, and sofos sends `effort: high` when thinking is on and `effort: low` when off. `--thinking-budget` is ignored for this model; the status line shows `effort: high|low` instead of a token count.
-- **OpenAI (gpt-5 models)** — `/think on` sets high reasoning effort and `/think off` sets low. `--thinking-budget` is ignored.
+- **OpenAI (gpt-5 family)** — sends `reasoning.effort` matching the level (`minimal` for `off`, `low`/`medium`/`high` otherwise) and `summary: "auto"` when on, omitted when off.
+- **Claude Opus 4.7** — adaptive thinking; the server picks the budget based on the prompt, and sofos sends `output_config.effort` matching the level (`off` collapses to `low`, the lowest the API accepts). `--thinking-budget` is ignored.
+- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low/medium/high` all enable it with the `--thinking-budget` token budget (default `5120`). The level is treated uniformly here pending per-tier budget mapping.
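
The per-provider mapping in the list above can be written as a handful of small functions. This is an illustrative sketch only: `ReasoningEffort` here is a local stand-in for `super::types::ReasoningEffort`, and the function names other than the behaviour of `effort_label` are assumptions, not the request builder's actual API.

```rust
/// Local stand-in for the four user-facing reasoning levels.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReasoningEffort {
    Off,
    Low,
    Medium,
    High,
}

/// OpenAI `reasoning.effort`: `off` maps to `minimal`.
pub fn openai_effort(effort: ReasoningEffort) -> &'static str {
    match effort {
        ReasoningEffort::Off => "minimal",
        ReasoningEffort::Low => "low",
        ReasoningEffort::Medium => "medium",
        ReasoningEffort::High => "high",
    }
}

/// OpenAI `summary`: `"auto"` when reasoning is on, omitted when off.
pub fn openai_summary(effort: ReasoningEffort) -> Option<&'static str> {
    match effort {
        ReasoningEffort::Off => None,
        _ => Some("auto"),
    }
}

/// Adaptive Claude `output_config.effort`: `off` collapses to `low`,
/// mirroring `effort_label` in `src/api/anthropic.rs`.
pub fn adaptive_claude_effort(effort: ReasoningEffort) -> &'static str {
    match effort {
        ReasoningEffort::Off | ReasoningEffort::Low => "low",
        ReasoningEffort::Medium => "medium",
        ReasoningEffort::High => "high",
    }
}

/// Older Claude models: `off` disables thinking, every other level
/// uses the same `--thinking-budget` value.
pub fn manual_thinking_budget(effort: ReasoningEffort, thinking_budget: u32) -> Option<u32> {
    match effort {
        ReasoningEffort::Off => None,
        _ => Some(thinking_budget),
    }
}
```

Keeping each provider's mapping in its own exhaustive `match` means adding a fifth level later would fail to compile until every provider has decided how to handle it.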
 
 ## Custom Instructions
 
 
@@ -8,27 +8,45 @@ use std::sync::atomic::{AtomicBool, Ordering};
 
 const API_BASE: &str = "https://api.anthropic.com/v1";
 const API_VERSION: &str = "2023-06-01";
-const ANTHROPIC_BETA: &str = "token-efficient-tools-2025-02-19";
+/// Comma-separated list of Anthropic beta features sofos opts in to.
+/// `token-efficient-tools-2025-02-19` shrinks tool-call envelopes;
+/// `compact-2026-01-12` enables server-side compaction (the API
+/// generates the summary itself when the request crosses a configured
+/// trigger, then drops earlier messages on subsequent turns).
+///
+/// TODO: `compact-2026-01-12` only applies to Opus 4.7, Opus 4.6,
+/// and Sonnet 4.6. Sending it on a Haiku 4.5 request relies on
+/// Anthropic's "ignore unknown beta tokens" policy. If Anthropic
+/// ever tightens validation, gate the header per-request based on
+/// `ModelInfo::supports_server_compaction` instead of pinning the
+/// value at client construction.
+const ANTHROPIC_BETA: &str = "token-efficient-tools-2025-02-19,compact-2026-01-12";
 
 /// Return true for models that *only* accept `thinking.type = "adaptive"`
 /// (paired with `output_config.effort`) and reject the legacy
 /// `{type: "enabled", budget_tokens: N}` shape with HTTP 400.
 ///
-/// Currently Opus 4.7 is the sole member of this set; Sonnet/Opus 4.6 and
-/// older continue to accept manual budgets, so we keep them on the old path
-/// to preserve the user's `--thinking-budget` knob.
+/// The set is owned by [`crate::api::ModelInfo`]; this thin wrapper
+/// preserves the call shape used by `request_builder` and `repl::mod`
+/// without forcing those sites to dereference the struct just to
+/// check one bool.
 pub fn requires_adaptive_thinking(model: &str) -> bool {
-    model.starts_with("claude-opus-4-7")
+    super::model_info::lookup(model).requires_adaptive_thinking
 }
 
-/// The string form of an "effort" level derived from the user's
-/// thinking-on/off toggle. Used both for Anthropic's `output_config.effort`
-/// (adaptive models) and OpenAI's `reasoning.effort` — the two APIs
-/// happen to share the same `high` / `low` vocabulary, so one helper
-/// keeps the request builder, TUI status line, startup banner, and
-/// `/think` messages in sync without each site hand-mapping the bool.
-pub fn effort_label(enable_thinking: bool) -> &'static str {
-    if enable_thinking { "high" } else { "low" }
+/// Map a [`ReasoningEffort`] to the string Anthropic's adaptive thinking
+/// expects in `output_config.effort` (Opus 4.7+). The API accepts
+/// `low` / `medium` / `high`; `Off` collapses to `low` because adaptive
+/// thinking has no off-switch — the conversation may already carry
+/// thinking blocks that the server cross-checks against the request,
+/// and dropping `output_config` would 400 the next turn.
+pub fn effort_label(effort: super::types::ReasoningEffort) -> &'static str {
+    use super::types::ReasoningEffort;
+    match effort {
+        ReasoningEffort::Off | ReasoningEffort::Low => "low",
+        ReasoningEffort::Medium => "medium",
+        ReasoningEffort::High => "high",
+    }
 }
 
 #[derive(Clone)]
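
The TODO on `ANTHROPIC_BETA` above suggests gating the compaction beta per request if Anthropic ever tightens header validation. One possible shape for that, as a sketch: the `ModelInfo` struct below is a one-field stand-in for the real registry entry, and `beta_header_for` is a hypothetical helper that does not exist in this PR.

```rust
const TOKEN_EFFICIENT_TOOLS: &str = "token-efficient-tools-2025-02-19";
const SERVER_COMPACTION: &str = "compact-2026-01-12";

/// Stand-in carrying only the flag this sketch needs; the real
/// `ModelInfo` in `src/api/model_info.rs` holds more fields.
pub struct ModelInfo {
    pub supports_server_compaction: bool,
}

/// Build the comma-separated beta header value for one request,
/// including the compaction beta only for models that support it.
pub fn beta_header_for(info: &ModelInfo) -> String {
    let mut features = vec![TOKEN_EFFICIENT_TOOLS];
    if info.supports_server_compaction {
        features.push(SERVER_COMPACTION);
    }
    features.join(",")
}
```

The trade-off is that the header can no longer be pinned once at client construction and has to be set on each outgoing request instead.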
@@ -225,6 +243,18 @@ impl AnthropicClient {
                                     }
                                     current_block_type = None;
                                 }
+                                // TODO: handle `compaction` here.
+                                // Server-side compaction
+                                // (`compact-2026-01-12` beta) emits a
+                                // `compaction` content block with a
+                                // `content` field; the streaming path
+                                // currently drops it, so when
+                                // `use_streaming` is flipped on for
+                                // Anthropic, the next request fails to
+                                // round-trip the summary and Anthropic
+                                // re-compacts (extra cost). The
+                                // non-streaming path handles this via
+                                // serde and works today.
                                 _ => {}
                             }
                         }
@@ -369,7 +399,17 @@ fn sanitize_messages_for_anthropic(messages: Vec<Message>) -> Vec<Message> {
                 let filtered_content = content
                     .into_iter()
                     .filter_map(|block| match block {
+                        // OpenAI reasoning summary block — not part of
+                        // Anthropic's content-block schema; the server
+                        // would reject the unknown type.
                         MessageContentBlock::Summary { .. } => None,
+                        // OpenAI Responses API reasoning item, packed
+                        // with `id` + `encrypted_content`. Carries no
+                        // meaning to Anthropic and uses a `type`
+                        // string the server doesn't recognise. Drop
+                        // before sending so a session that switched
+                        // providers doesn't 400 on the next turn.
+                        MessageContentBlock::Reasoning { .. } => None,
                         other => Some(other),
                     })
                     .collect();
@@ -424,9 +464,12 @@ mod tests {
     }
 
     #[test]
-    fn effort_label_maps_bool_to_high_low() {
-        assert_eq!(effort_label(true), "high");
-        assert_eq!(effort_label(false), "low");
+    fn effort_label_maps_reasoning_levels() {
+        use super::super::types::ReasoningEffort;
+        assert_eq!(effort_label(ReasoningEffort::Off), "low");
+        assert_eq!(effort_label(ReasoningEffort::Low), "low");
+        assert_eq!(effort_label(ReasoningEffort::Medium), "medium");
+        assert_eq!(effort_label(ReasoningEffort::High), "high");
     }
 
     #[test]
@@ -442,6 +485,7 @@ mod tests {
             output_config: Some(OutputConfig::with_effort("high")),
             reasoning: None,
             prompt_cache_key: None,
+            context_management: None,
         };
 
         let json = serde_json::to_value(&request).unwrap();
@@ -464,6 +508,7 @@ mod tests {
             output_config: None,
             reasoning: None,
             prompt_cache_key: None,
+            context_management: None,
         };
 
         let json = serde_json::to_value(&request).unwrap();
@@ -485,9 +530,41 @@ mod tests {
             output_config: None,
             reasoning: None,
             prompt_cache_key: Some("session-1".to_string()),
+            context_management: None,
         };
 
         let prepared = AnthropicClient::prepare_request(request);
         assert!(prepared.prompt_cache_key.is_none());
     }
+
+    #[test]
+    fn sanitizer_drops_openai_reasoning_blocks_before_anthropic_call() {
+        // Regression: a session that started on OpenAI accumulates
+        // `Reasoning` blocks with `id` + `encrypted_content`. Switching
+        // to Anthropic mid-session and forwarding those blocks would
+        // 400 on a content-block-type the server doesn't know.
+        let messages = vec![Message {
+            role: "assistant".to_string(),
+            content: MessageContent::Blocks {
+                content: vec![
+                    MessageContentBlock::Reasoning {
+                        id: "rs_abc".to_string(),
+                        summary: vec!["thought".to_string()],
+                        encrypted_content: Some("blob".to_string()),
+                        cache_control: None,
+                    },
+                    MessageContentBlock::Text {
+                        text: "real reply".to_string(),
+                        cache_control: None,
+                    },
+                ],
+            },
+        }];
+        let cleaned = sanitize_messages_for_anthropic(messages);
+        let MessageContent::Blocks { content } = &cleaned[0].content else {
+            panic!("expected blocks");
+        };
+        assert_eq!(content.len(), 1, "Reasoning block must be dropped");
+        assert!(matches!(content[0], MessageContentBlock::Text { .. }));
+    }
 }
@@ -1,9 +1,13 @@
 pub mod anthropic;
+pub mod model_info;
 pub mod morph;
 pub mod openai;
+pub mod truncate;
 pub mod types;
 pub mod utils;
 
+pub use model_info::ModelInfo;
+
 pub use anthropic::AnthropicClient;
 pub use morph::MorphClient;
 pub use openai::OpenAIClient;