Per-tier legacy /think budgets, per-model Anthropic beta gating, 50K compaction floor

alexylon · alexylon · commit 7a4ba3f612fd · 2026-05-05T20:04:27.000+03:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -23,10 +23,14 @@ All notable changes to Sofos are documented in this file.
 - **Default reasoning effort is now `medium`** (was `high`). Verified roughly 3–5× cheaper hidden-reasoning bill on routine coding turns. Use `-e high` or `/think high` for hard tasks.
 - **Reasoning summaries are suppressed on the OpenAI thinking-off path.** When `effort: off`, sofos sends `reasoning.effort = "minimal"` with no `summary` field, so the model returns no summary blocks at all (they bill as output tokens).
 - **Model context windows corrected.** Claude Opus 4.7 / 4.6 and Sonnet 4.6 are 1,000,000 tokens (were 200K in the table); GPT-5.4 and GPT-5.5 are 1,050,000 tokens (were 400K). The drop-trim safety floor is now per-model API-aware (95% of the real window) instead of a flat 250K.
-- **Anthropic beta header now opts into both `token-efficient-tools-2025-02-19` and `compact-2026-01-12`.**
+- **Anthropic beta header is now picked per-request based on `ModelInfo::supports_server_compaction`.** `token-efficient-tools-2025-02-19` ships on every Anthropic request; `compact-2026-01-12` only ships against models that actually support it (Opus 4.7, Opus 4.6, Sonnet 4.6). Removes the implicit dependency on Anthropic's "ignore unknown beta tokens" policy — if validation ever tightens, only the right requests carry the token.
+- **`/think low|medium|high` on legacy non-adaptive Anthropic models now maps to distinct `budget_tokens` values** (`Low=1024`, `Medium=5120`, `High=16384`) instead of all three collapsing to the `--thinking-budget` flag value. Affects Sonnet 4.5, Opus 4.5/4.6, Haiku 4.5; adaptive models (Opus 4.7+) and OpenAI are unchanged. The `--thinking-budget` CLI flag is now inert on every path; kept for backwards-compatibility and will be removed in a later release.
+- **Startup validation now requires `--max-tokens > 16384` whenever reasoning effort is enabled**, regardless of the current model. Catches a configuration that would have silently 400'd the next request after a runtime `/model` swap to a non-adaptive Anthropic model. Default `--max-tokens 32768` already satisfies the new check.
+- **Server-side compaction trigger clamped to Anthropic's documented 50K floor.** Defends against a hypothetical future small-window model entry whose `auto_compact_at` would otherwise drop below 50K and 400 the request.
 
 ### Fixed
 
+- **Streaming Anthropic responses now round-trip server-side `compaction` content blocks.** The streaming path used to silently drop them, so on a streaming-enabled Anthropic session the next turn would re-send the pre-compaction history and Anthropic would re-compact (extra cost). The non-streaming path was already correct via serde; this brings streaming into parity.
 - **OpenAI reasoning items round-trip in the right order relative to their assistant message.** Reasoning items were being emitted in the input array *after* the message they preceded, breaking encrypted_content round-trip continuity on the server side. Now correctly placed before.
 - **Tool-cache breakpoint actually lands on Anthropic when OpenAI's web-search tool is registered.** The stamper used to no-op when `OpenAIWebSearch` was the last entry in the tool list, leaving Anthropic with no tool-defs cache breakpoint at all. Now finds the last *Anthropic-compatible* tool to stamp.
 - **OpenAI `Reasoning` blocks no longer leak to Anthropic on provider switch.** A session that started on OpenAI accumulates `Reasoning` content blocks; switching to Anthropic mid-session would have sent those blocks to the Messages API, which doesn't recognise the type. The Anthropic sanitiser now drops them.
diff --git a/README.md b/README.md
@@ -167,9 +167,9 @@ Exit summary shows token usage and estimated cost based on official API pricing.
     --morph-api-key <KEY>    Morph API key (overrides env var)
     --model <MODEL>          Model to use (default: claude-sonnet-4-6)
     --morph-model <MODEL>    Morph model (default: morph-v3-fast)
-    --max-tokens <N>         Max response tokens (default: 32768)
+    --max-tokens <N>         Max response tokens (default: 32768; must be > 16384 when reasoning effort is enabled)
 -e, --reasoning-effort <LV>  Reasoning effort: off, low, medium, high (default: medium)
-    --thinking-budget <N>    Token budget for older Claude models with manual budgets (default: 5120, must be < max-tokens). Ignored on Claude Opus 4.7+ and on OpenAI.
+    --thinking-budget <N>    Vestigial. Currently inert on every path: legacy Claude uses a fixed per-tier budget (Low=1024, Medium=5120, High=16384), Claude Opus 4.7+ uses adaptive thinking, OpenAI uses `reasoning.effort`. Kept for backwards-compatibility; will be removed.
 -v, --verbose                Verbose logging
 ```
 
@@ -192,7 +192,7 @@ sofos -e off                                # Cheapest path; no reasoning summar
 
 - **OpenAI (gpt-5 family)** — sends `reasoning.effort` matching the level (`minimal` for `off`, `low`/`medium`/`high` otherwise) and `summary: "auto"` when on, omitted when off.
 - **Claude Opus 4.7** — adaptive thinking; the server picks the budget based on the prompt, and sofos sends `output_config.effort` matching the level (`off` collapses to `low`, the lowest the API accepts). `--thinking-budget` is ignored.
-- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low/medium/high` all enable it with the `--thinking-budget` token budget (default `5120`). The level is treated uniformly here pending per-tier budget mapping.
+- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low`, `medium`, and `high` each map to a distinct legacy `budget_tokens` value (`1024 / 5120 / 16384`) so the slider has a visible effect. `--thinking-budget` is ignored — the per-tier values are the source of truth.
 
 ## Custom Instructions
 
diff --git a/src/api/anthropic.rs b/src/api/anthropic.rs
@@ -8,19 +8,69 @@ use std::sync::atomic::{AtomicBool, Ordering};
 
 const API_BASE: &str = "https://api.anthropic.com/v1";
 const API_VERSION: &str = "2023-06-01";
-/// Comma-separated list of Anthropic beta features sofos opts in to.
-/// `token-efficient-tools-2025-02-19` shrinks tool-call envelopes;
-/// `compact-2026-01-12` enables server-side compaction (the API
-/// generates the summary itself when the request crosses a configured
-/// trigger, then drops earlier messages on subsequent turns).
-///
-/// TODO: `compact-2026-01-12` only applies to Opus 4.7, Opus 4.6,
-/// and Sonnet 4.6. Sending it on a Haiku 4.5 request relies on
-/// Anthropic's "ignore unknown beta tokens" policy. If Anthropic
-/// ever tightens validation, gate the header per-request based on
-/// `ModelInfo::supports_server_compaction` instead of pinning the
-/// value at client construction.
-const ANTHROPIC_BETA: &str = "token-efficient-tools-2025-02-19,compact-2026-01-12";
+/// Header name for Anthropic feature opt-ins.
+const BETA_HEADER_NAME: &str = "anthropic-beta";
+
+/// Universal beta token: shrinks tool-call envelopes. Supported on
+/// every model in the registry, so it ships unconditionally.
+const BETA_TOKEN_EFFICIENT: &str = "token-efficient-tools-2025-02-19";
+
+/// Server-side compaction beta. Gated per-request by [`anthropic_beta_for`]
+/// based on `ModelInfo::supports_server_compaction` so a Haiku 4.5
+/// request doesn't depend on Anthropic's "ignore unknown beta tokens"
+/// policy. The runtime header value comes from
+/// [`BETA_TOKEN_EFFICIENT_AND_COMPACT`] (which embeds this string as a
+/// literal); this const exists so the drift-detection test can verify
+/// the literal stays in sync with its components.
+#[allow(dead_code)]
+const BETA_COMPACT: &str = "compact-2026-01-12";
+
+/// Pre-joined string sent when both betas ship. Spelled out as a
+/// literal because `concat!` only works on literals (so it can't
+/// reference [`BETA_TOKEN_EFFICIENT`]/[`BETA_COMPACT`] directly).
+/// The `beta_with_compact_matches_components` test enforces it stays
+/// in sync with its components.
+const BETA_TOKEN_EFFICIENT_AND_COMPACT: &str =
+    "token-efficient-tools-2025-02-19,compact-2026-01-12";
+
+/// Compute the `anthropic-beta` header value for a single request.
+/// Adds `compact-2026-01-12` when the target model advertises server-
+/// side compaction, otherwise returns the base token unchanged.
+fn anthropic_beta_for(model: &str) -> &'static str {
+    if super::model_info::lookup(model).supports_server_compaction {
+        BETA_TOKEN_EFFICIENT_AND_COMPACT
+    } else {
+        BETA_TOKEN_EFFICIENT
+    }
+}
+
+/// Per-effort `budget_tokens` value for Anthropic's *legacy* non-adaptive
+/// extended-thinking shape (`{type: "enabled", budget_tokens}`). Models
+/// that require adaptive thinking (Opus 4.7+) ignore these and drive
+/// effort through `output_config.effort` instead.
+pub const LEGACY_THINKING_BUDGET_LOW: u32 = 1024;
+pub const LEGACY_THINKING_BUDGET_MEDIUM: u32 = 5120;
+pub const LEGACY_THINKING_BUDGET_HIGH: u32 = 16384;
+
+/// Anthropic's documented minimum trigger value for the
+/// `compact_20260112` context-edit. Triggers below this 400 the
+/// request, so the request builder clamps `auto_compact_at` against
+/// this floor.
+pub const COMPACTION_TRIGGER_FLOOR: u32 = 50_000;
+
+/// Map a [`ReasoningEffort`] to the legacy `budget_tokens` value.
+/// `Off` defensively collapses to `LOW` so callers that forget to
+/// pre-guard with `is_enabled()` don't panic; the request builder
+/// still gates the whole legacy branch behind `is_enabled()` so the
+/// `Off` arm is unreachable in practice.
+pub fn legacy_thinking_budget(effort: super::types::ReasoningEffort) -> u32 {
+    use super::types::ReasoningEffort;
+    match effort {
+        ReasoningEffort::Off | ReasoningEffort::Low => LEGACY_THINKING_BUDGET_LOW,
+        ReasoningEffort::Medium => LEGACY_THINKING_BUDGET_MEDIUM,
+        ReasoningEffort::High => LEGACY_THINKING_BUDGET_HIGH,
+    }
+}
 
 /// Return true for models that *only* accept `thinking.type = "adaptive"`
 /// (paired with `output_config.effort`) and reject the legacy
@@ -63,7 +113,9 @@ impl AnthropicClient {
                 .map_err(|e| SofosError::Config(format!("Invalid API key format: {}", e)))?,
         );
         headers.insert("anthropic-version", HeaderValue::from_static(API_VERSION));
-        headers.insert("anthropic-beta", HeaderValue::from_static(ANTHROPIC_BETA));
+        // `anthropic-beta` is set per-request by `anthropic_beta_for`
+        // so the compaction beta only ships when the target model
+        // actually supports it.
 
         let client = utils::build_http_client(headers, utils::REQUEST_TIMEOUT)?;
 
@@ -107,8 +159,16 @@ impl AnthropicClient {
     ) -> Result<CreateMessageResponse> {
         let url = format!("{}/messages", API_BASE);
         let request = Self::prepare_request(request);
+        let beta = anthropic_beta_for(&request.model);
 
-        let response = utils::send_once("Anthropic", self.client.post(&url).json(&request)).await?;
+        let response = utils::send_once(
+            "Anthropic",
+            self.client
+                .post(&url)
+                .header(BETA_HEADER_NAME, beta)
+                .json(&request),
+        )
+        .await?;
 
         let result = response.json::<CreateMessageResponse>().await?;
         Ok(result)
@@ -127,10 +187,18 @@ impl AnthropicClient {
     {
         let mut request = Self::prepare_request(request);
         request.stream = Some(true);
+        let beta = anthropic_beta_for(&request.model);
 
         let url = format!("{}/messages", API_BASE);
 
-        let response = utils::send_once("Anthropic", self.client.post(&url).json(&request)).await?;
+        let response = utils::send_once(
+            "Anthropic",
+            self.client
+                .post(&url)
+                .header(BETA_HEADER_NAME, beta)
+                .json(&request),
+        )
+        .await?;
 
         let mut byte_stream = response.bytes_stream();
         let mut buffer = String::new();
@@ -243,18 +311,33 @@ impl AnthropicClient {
                                     }
                                     current_block_type = None;
                                 }
-                                // TODO: handle `compaction` here.
                                 // Server-side compaction
                                 // (`compact-2026-01-12` beta) emits a
-                                // `compaction` content block with a
-                                // `content` field; the streaming path
-                                // currently drops it, so when
-                                // `use_streaming` is flipped on for
-                                // Anthropic, the next request fails to
-                                // round-trip the summary and Anthropic
-                                // re-compacts (extra cost). The
-                                // non-streaming path handles this via
-                                // serde and works today.
+                                // `compaction` content block with the
+                                // full summary in `content`. Mirror
+                                // the non-streaming serde path so the
+                                // next turn can round-trip the summary
+                                // and Anthropic doesn't re-compact.
+                                "compaction" => {
+                                    // Mirror the existing "drop malformed
+                                    // payloads silently" pattern used by
+                                    // `web_search_tool_result` above. The
+                                    // non-streaming serde path would error
+                                    // on a missing `content`; in streaming
+                                    // we don't want to kill the whole
+                                    // response over one block, so skip it
+                                    // — losing the summary forces a
+                                    // re-compact next turn but doesn't
+                                    // 400 the request.
+                                    if let Some(content) =
+                                        block.get("content").and_then(|v| v.as_str())
+                                    {
+                                        content_blocks.push(ContentBlock::Compaction {
+                                            content: content.to_string(),
+                                        });
+                                    }
+                                    current_block_type = None;
+                                }
                                 _ => {}
                             }
                         }
@@ -463,6 +546,63 @@ mod tests {
         assert!(!requires_adaptive_thinking(""));
     }
 
+    #[test]
+    fn anthropic_beta_for_gates_compaction_to_supported_models() {
+        // Opus 4.7 is on the compaction-supported list — both betas ship.
+        let with_compact = anthropic_beta_for("claude-opus-4-7");
+        assert!(with_compact.contains(BETA_TOKEN_EFFICIENT));
+        assert!(with_compact.contains(BETA_COMPACT));
+
+        // Haiku 4.5 isn't — only the universal beta should appear so
+        // we don't depend on Anthropic's "ignore unknown beta tokens"
+        // policy if validation ever tightens.
+        let without = anthropic_beta_for("claude-haiku-4-5");
+        assert!(without.contains(BETA_TOKEN_EFFICIENT));
+        assert!(!without.contains(BETA_COMPACT));
+    }
+
+    #[test]
+    fn beta_with_compact_matches_components() {
+        // `BETA_TOKEN_EFFICIENT_AND_COMPACT` is a literal that must
+        // stay in lockstep with its two component consts. Catch drift
+        // here so renaming one component without the other is a test
+        // failure rather than a silent header mismatch in production.
+        assert_eq!(
+            BETA_TOKEN_EFFICIENT_AND_COMPACT,
+            format!("{BETA_TOKEN_EFFICIENT},{BETA_COMPACT}")
+        );
+    }
+
+    #[test]
+    fn legacy_thinking_budget_helper_scales_with_effort() {
+        use super::super::types::ReasoningEffort;
+        assert_eq!(
+            legacy_thinking_budget(ReasoningEffort::Low),
+            LEGACY_THINKING_BUDGET_LOW
+        );
+        assert_eq!(
+            legacy_thinking_budget(ReasoningEffort::Medium),
+            LEGACY_THINKING_BUDGET_MEDIUM
+        );
+        assert_eq!(
+            legacy_thinking_budget(ReasoningEffort::High),
+            LEGACY_THINKING_BUDGET_HIGH
+        );
+        // Defensive default: `Off` collapses to `LOW` rather than
+        // panicking, even though the legacy branch is upstream-guarded.
+        assert_eq!(
+            legacy_thinking_budget(ReasoningEffort::Off),
+            LEGACY_THINKING_BUDGET_LOW
+        );
+        // Compile-time guard: the three tier values must stay strictly
+        // increasing. Runtime `assert!` would be a tautology on consts
+        // (clippy::assertions_on_constants), so check at const-eval time.
+        const _: () = {
+            assert!(LEGACY_THINKING_BUDGET_LOW < LEGACY_THINKING_BUDGET_MEDIUM);
+            assert!(LEGACY_THINKING_BUDGET_MEDIUM < LEGACY_THINKING_BUDGET_HIGH);
+        };
+    }
+
     #[test]
     fn effort_label_maps_reasoning_levels() {
         use super::super::types::ReasoningEffort;
diff --git a/src/cli.rs b/src/cli.rs
@@ -42,6 +42,8 @@ pub struct Cli {
     /// the tool-call JSON, surfacing as "Missing 'path' parameter".
     /// Claude Sonnet 4 and GPT-4.1 both support 32k+; smaller models
     /// cap at their own limit so this is safe as a default.
+    /// Must be > 16384 when reasoning effort is enabled (the legacy
+    /// Anthropic thinking-budget ceiling); the default 32768 satisfies it.
     #[arg(long, default_value = "32768")]
     pub max_tokens: u32,
 
@@ -53,8 +55,11 @@ pub struct Cli {
     #[arg(short = 'e', long, default_value = "medium")]
     pub reasoning_effort: crate::api::ReasoningEffort,
 
-    /// Token budget for non-adaptive Anthropic extended thinking. Ignored
-    /// on OpenAI and on Anthropic adaptive (Opus 4.7+).
+    /// Vestigial. Currently inert on every path: legacy Anthropic uses
+    /// a fixed per-tier budget (Low=1024, Medium=5120, High=16384),
+    /// Anthropic adaptive (Opus 4.7+) uses `output_config.effort`, and
+    /// OpenAI uses `reasoning.effort`. Kept for backwards-compatibility;
+    /// will be removed in a later release.
     #[arg(long, default_value = "5120")]
     pub thinking_budget: u32,
 
diff --git a/src/repl/mod.rs b/src/repl/mod.rs
@@ -139,11 +139,22 @@ impl Repl {
             eprintln!("{}", "Loaded custom instructions".bright_green());
         }
 
-        // Validate thinking budget
-        if config.reasoning_effort.is_enabled() && config.thinking_budget >= config.max_tokens {
+        // Validate that `max_tokens` leaves room for the largest legacy
+        // thinking budget we might send. The actual budget is now picked
+        // per-effort in `request_builder` (Low=1024, Medium=5120,
+        // High=16384) rather than read from the user's `--thinking-budget`
+        // flag, so the invariant we need is `max_tokens > HIGH`. We check
+        // unconditionally on enabled-thinking sessions instead of also
+        // probing the model id, because the model can be swapped mid-
+        // session via `/model` and we don't want a runtime 400.
+        if config.reasoning_effort.is_enabled()
+            && config.max_tokens <= crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH
+        {
             return Err(SofosError::Config(format!(
-                "thinking_budget ({}) must be less than max_tokens ({})",
-                config.thinking_budget, config.max_tokens
+                "max_tokens ({}) must exceed the legacy thinking-budget ceiling ({}). \
+                 Use a higher --max-tokens or set --reasoning-effort off.",
+                config.max_tokens,
+                crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH
             )));
         }
 
diff --git a/src/repl/request_builder.rs b/src/repl/request_builder.rs