Commit 7a4ba3f

Per-tier legacy /think budgets, per-model Anthropic beta gating, 50K compaction floor
1 parent 91b308d commit 7a4ba3f

6 files changed

Lines changed: 252 additions & 51 deletions

CHANGELOG.md

Lines changed: 5 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -23,10 +23,14 @@ All notable changes to Sofos are documented in this file.
2323
- **Default reasoning effort is now `medium`** (was `high`). Verified roughly 3–5× cheaper hidden-reasoning bill on routine coding turns. Use `-e high` or `/think high` for hard tasks.
2424
- **Reasoning summaries are suppressed on the OpenAI thinking-off path.** When `effort: off`, sofos sends `reasoning.effort = "minimal"` with no `summary` field, so the model returns no summary blocks at all (they bill as output tokens).
2525
- **Model context windows corrected.** Claude Opus 4.7 / 4.6 and Sonnet 4.6 are 1,000,000 tokens (were 200K in the table); GPT-5.4 and GPT-5.5 are 1,050,000 tokens (were 400K). The drop-trim safety floor is now per-model API-aware (95% of the real window) instead of a flat 250K.
26-
- **Anthropic beta header now opts into both `token-efficient-tools-2025-02-19` and `compact-2026-01-12`.**
26+
- **Anthropic beta header is now picked per-request based on `ModelInfo::supports_server_compaction`.** `token-efficient-tools-2025-02-19` ships on every Anthropic request; `compact-2026-01-12` only ships against models that actually support it (Opus 4.7, Opus 4.6, Sonnet 4.6). Removes the implicit dependency on Anthropic's "ignore unknown beta tokens" policy — if validation ever tightens, only the right requests carry the token.
27+
- **`/think low|medium|high` on legacy non-adaptive Anthropic models now maps to distinct `budget_tokens` values** (`Low=1024`, `Medium=5120`, `High=16384`) instead of all three collapsing to the `--thinking-budget` flag value. Affects Sonnet 4.5, Opus 4.5/4.6, Haiku 4.5; adaptive models (Opus 4.7+) and OpenAI are unchanged. The `--thinking-budget` CLI flag is now inert on every path; kept for backwards-compatibility and will be removed in a later release.
28+
- **Startup validation now requires `--max-tokens > 16384` whenever reasoning effort is enabled**, regardless of the current model. Catches a configuration that would have silently 400'd the next request after a runtime `/model` swap to a non-adaptive Anthropic model. Default `--max-tokens 32768` already satisfies the new check.
29+
- **Server-side compaction trigger clamped to Anthropic's documented 50K floor.** Defends against a hypothetical future small-window model entry whose `auto_compact_at` would otherwise drop below 50K and 400 the request.
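The request-builder change behind the compaction-floor entry is not shown in this view. A minimal sketch of the clamp, assuming a hypothetical helper named `clamped_compaction_trigger` (the name is an illustration, not taken from the commit) and reusing the `COMPACTION_TRIGGER_FLOOR` value that `src/api/anthropic.rs` defines:

```rust
/// Anthropic's documented minimum trigger, mirroring
/// `COMPACTION_TRIGGER_FLOOR` in `src/api/anthropic.rs`.
pub const COMPACTION_TRIGGER_FLOOR: u32 = 50_000;

/// Hypothetical helper (not from the commit): raise a model's
/// `auto_compact_at` to the floor when it would otherwise fall below it.
/// Values at or above the floor pass through unchanged.
pub fn clamped_compaction_trigger(auto_compact_at: u32) -> u32 {
    auto_compact_at.max(COMPACTION_TRIGGER_FLOOR)
}
```

Under this sketch a small-window model with `auto_compact_at = 30_000` would send `50_000`, while a 1M-window model's larger trigger is left alone.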
2730

2831
### Fixed
2932

33+
- **Streaming Anthropic responses now round-trip server-side `compaction` content blocks.** The streaming path used to silently drop them, so on a streaming-enabled Anthropic session the next turn would re-send the pre-compaction history and Anthropic would re-compact (extra cost). The non-streaming path was already correct via serde; this brings streaming into parity.
3034
- **OpenAI reasoning items round-trip in the right order relative to their assistant message.** Reasoning items were being emitted in the input array *after* the message they preceded, breaking encrypted_content round-trip continuity on the server side. Now correctly placed before.
3135
- **Tool-cache breakpoint actually lands on Anthropic when OpenAI's web-search tool is registered.** The stamper used to no-op when `OpenAIWebSearch` was the last entry in the tool list, leaving Anthropic with no tool-defs cache breakpoint at all. Now finds the last *Anthropic-compatible* tool to stamp.
3236
- **OpenAI `Reasoning` blocks no longer leak to Anthropic on provider switch.** A session that started on OpenAI accumulates `Reasoning` content blocks; switching to Anthropic mid-session would have sent those blocks to the Messages API, which doesn't recognise the type. The Anthropic sanitiser now drops them.
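The tool-cache breakpoint fix can be pictured as "search backwards for the last tool Anthropic accepts, then stamp that one". The sketch below assumes a simplified, hypothetical `Tool` enum and a hypothetical `stamp_cache_breakpoint` function; the real sofos types and stamper are not shown in this commit view.

```rust
/// Hypothetical, simplified tool model used only for this illustration.
#[derive(Debug, PartialEq)]
pub enum Tool {
    Regular { name: String, cache_breakpoint: bool },
    OpenAIWebSearch,
}

impl Tool {
    /// In this sketch only `Regular` tools are sent to Anthropic.
    fn is_anthropic_compatible(&self) -> bool {
        matches!(self, Tool::Regular { .. })
    }
}

/// Stamp the cache breakpoint on the last Anthropic-compatible tool.
/// Returns the stamped index, or `None` when no tool qualifies.
pub fn stamp_cache_breakpoint(tools: &mut [Tool]) -> Option<usize> {
    // `rposition` scans from the end, so a trailing `OpenAIWebSearch`
    // entry is skipped instead of turning the whole call into a no-op.
    let idx = tools.iter().rposition(Tool::is_anthropic_compatible)?;
    if let Tool::Regular { cache_breakpoint, .. } = &mut tools[idx] {
        *cache_breakpoint = true;
    }
    Some(idx)
}
```

With a list of `[read_file, write_file, OpenAIWebSearch]`, this sketch stamps `write_file` (index 1) rather than doing nothing.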

README.md

Lines changed: 3 additions & 3 deletions
@@ -167,9 +167,9 @@ Exit summary shows token usage and estimated cost based on official API pricing.
167167
--morph-api-key <KEY> Morph API key (overrides env var)
168168
--model <MODEL> Model to use (default: claude-sonnet-4-6)
169169
--morph-model <MODEL> Morph model (default: morph-v3-fast)
170-
--max-tokens <N> Max response tokens (default: 32768)
170+
--max-tokens <N> Max response tokens (default: 32768; must be > 16384 when reasoning effort is enabled)
171171
-e, --reasoning-effort <LV> Reasoning effort: off, low, medium, high (default: medium)
172-
--thinking-budget <N> Token budget for older Claude models with manual budgets (default: 5120, must be < max-tokens). Ignored on Claude Opus 4.7+ and on OpenAI.
172+
--thinking-budget <N> Vestigial. Currently inert on every path: legacy Claude uses a fixed per-tier budget (Low=1024, Medium=5120, High=16384), Claude Opus 4.7+ uses adaptive thinking, OpenAI uses `reasoning.effort`. Kept for backwards-compatibility; will be removed.
173173
-v, --verbose Verbose logging
174174
```
175175
@@ -192,7 +192,7 @@ sofos -e off # Cheapest path; no reasoning summar
192192
193193
- **OpenAI (gpt-5 family)** — sends `reasoning.effort` matching the level (`minimal` for `off`, `low`/`medium`/`high` otherwise) and `summary: "auto"` when on, omitted when off.
194194
- **Claude Opus 4.7** — adaptive thinking; the server picks the budget based on the prompt, and sofos sends `output_config.effort` matching the level (`off` collapses to `low`, the lowest the API accepts). `--thinking-budget` is ignored.
195-
- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low/medium/high` all enable it with the `--thinking-budget` token budget (default `5120`). The level is treated uniformly here pending per-tier budget mapping.
195+
- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low`, `medium`, and `high` each map to a distinct legacy `budget_tokens` value (`1024 / 5120 / 16384`) so the slider has a visible effect. `--thinking-budget` is ignored — the per-tier values are the source of truth.
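The three provider bullets above can be summarised as a small mapping table. The following is a sketch only: the `ReasoningEffort` enum here is a stand-in for the real type in `src/api/types`, and the function names (`openai_effort`, `openai_summary`, `adaptive_effort`, `legacy_budget`) are hypothetical, chosen for illustration.

```rust
/// Stand-in for sofos's `ReasoningEffort` (the real type is not shown here).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ReasoningEffort {
    Off,
    Low,
    Medium,
    High,
}

/// OpenAI: `off` is sent as `minimal`; other levels pass through.
pub fn openai_effort(effort: ReasoningEffort) -> &'static str {
    match effort {
        ReasoningEffort::Off => "minimal",
        ReasoningEffort::Low => "low",
        ReasoningEffort::Medium => "medium",
        ReasoningEffort::High => "high",
    }
}

/// OpenAI: `summary: "auto"` when reasoning is on, omitted when off.
pub fn openai_summary(effort: ReasoningEffort) -> Option<&'static str> {
    (effort != ReasoningEffort::Off).then_some("auto")
}

/// Adaptive Claude (Opus 4.7): `off` collapses to `low`.
pub fn adaptive_effort(effort: ReasoningEffort) -> &'static str {
    match effort {
        ReasoningEffort::Off | ReasoningEffort::Low => "low",
        ReasoningEffort::Medium => "medium",
        ReasoningEffort::High => "high",
    }
}

/// Legacy Claude: `off` disables thinking; otherwise a fixed per-tier budget.
pub fn legacy_budget(effort: ReasoningEffort) -> Option<u32> {
    match effort {
        ReasoningEffort::Off => None,
        ReasoningEffort::Low => Some(1024),
        ReasoningEffort::Medium => Some(5120),
        ReasoningEffort::High => Some(16384),
    }
}
```

Returning `Option` for the summary and the legacy budget keeps "field omitted" distinct from "field present with a low value", which is the distinction the README draws for `off`.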
196196
197197
## Custom Instructions
198198

src/api/anthropic.rs

Lines changed: 166 additions & 26 deletions
@@ -8,19 +8,69 @@ use std::sync::atomic::{AtomicBool, Ordering};
88

99
const API_BASE: &str = "https://api.anthropic.com/v1";
1010
const API_VERSION: &str = "2023-06-01";
11-
/// Comma-separated list of Anthropic beta features sofos opts in to.
12-
/// `token-efficient-tools-2025-02-19` shrinks tool-call envelopes;
13-
/// `compact-2026-01-12` enables server-side compaction (the API
14-
/// generates the summary itself when the request crosses a configured
15-
/// trigger, then drops earlier messages on subsequent turns).
16-
///
17-
/// TODO: `compact-2026-01-12` only applies to Opus 4.7, Opus 4.6,
18-
/// and Sonnet 4.6. Sending it on a Haiku 4.5 request relies on
19-
/// Anthropic's "ignore unknown beta tokens" policy. If Anthropic
20-
/// ever tightens validation, gate the header per-request based on
21-
/// `ModelInfo::supports_server_compaction` instead of pinning the
22-
/// value at client construction.
23-
const ANTHROPIC_BETA: &str = "token-efficient-tools-2025-02-19,compact-2026-01-12";
11+
/// Header name for Anthropic feature opt-ins.
12+
const BETA_HEADER_NAME: &str = "anthropic-beta";
13+
14+
/// Universal beta token: shrinks tool-call envelopes. Supported on
15+
/// every model in the registry, so it ships unconditionally.
16+
const BETA_TOKEN_EFFICIENT: &str = "token-efficient-tools-2025-02-19";
17+
18+
/// Server-side compaction beta. Gated per-request by [`anthropic_beta_for`]
19+
/// based on `ModelInfo::supports_server_compaction` so a Haiku 4.5
20+
/// request doesn't depend on Anthropic's "ignore unknown beta tokens"
21+
/// policy. The runtime header value comes from
22+
/// [`BETA_TOKEN_EFFICIENT_AND_COMPACT`] (which embeds this string as a
23+
/// literal); this const exists so the drift-detection test can verify
24+
/// the literal stays in sync with its components.
25+
#[allow(dead_code)]
26+
const BETA_COMPACT: &str = "compact-2026-01-12";
27+
28+
/// Pre-joined string sent when both betas ship. Spelled out as a
29+
/// literal because `concat!` only works on literals (so it can't
30+
/// reference [`BETA_TOKEN_EFFICIENT`]/[`BETA_COMPACT`] directly).
31+
/// The `beta_with_compact_matches_components` test enforces it stays
32+
/// in sync with its components.
33+
const BETA_TOKEN_EFFICIENT_AND_COMPACT: &str =
34+
"token-efficient-tools-2025-02-19,compact-2026-01-12";
35+
36+
/// Compute the `anthropic-beta` header value for a single request.
37+
/// Adds `compact-2026-01-12` when the target model advertises server-
38+
/// side compaction, otherwise returns the base token unchanged.
39+
fn anthropic_beta_for(model: &str) -> &'static str {
40+
if super::model_info::lookup(model).supports_server_compaction {
41+
BETA_TOKEN_EFFICIENT_AND_COMPACT
42+
} else {
43+
BETA_TOKEN_EFFICIENT
44+
}
45+
}
46+
47+
/// Per-effort `budget_tokens` value for Anthropic's *legacy* non-adaptive
48+
/// extended-thinking shape (`{type: "enabled", budget_tokens}`). Models
49+
/// that require adaptive thinking (Opus 4.7+) ignore these and drive
50+
/// effort through `output_config.effort` instead.
51+
pub const LEGACY_THINKING_BUDGET_LOW: u32 = 1024;
52+
pub const LEGACY_THINKING_BUDGET_MEDIUM: u32 = 5120;
53+
pub const LEGACY_THINKING_BUDGET_HIGH: u32 = 16384;
54+
55+
/// Anthropic's documented minimum trigger value for the
56+
/// `compact_20260112` context-edit. Triggers below this 400 the
57+
/// request, so the request builder clamps `auto_compact_at` against
58+
/// this floor.
59+
pub const COMPACTION_TRIGGER_FLOOR: u32 = 50_000;
60+
61+
/// Map a [`ReasoningEffort`] to the legacy `budget_tokens` value.
62+
/// `Off` defensively collapses to `LOW` so callers that forget to
63+
/// pre-guard with `is_enabled()` don't panic; the request builder
64+
/// still gates the whole legacy branch behind `is_enabled()` so the
65+
/// `Off` arm is unreachable in practice.
66+
pub fn legacy_thinking_budget(effort: super::types::ReasoningEffort) -> u32 {
67+
use super::types::ReasoningEffort;
68+
match effort {
69+
ReasoningEffort::Off | ReasoningEffort::Low => LEGACY_THINKING_BUDGET_LOW,
70+
ReasoningEffort::Medium => LEGACY_THINKING_BUDGET_MEDIUM,
71+
ReasoningEffort::High => LEGACY_THINKING_BUDGET_HIGH,
72+
}
73+
}
2474

2575
/// Return true for models that *only* accept `thinking.type = "adaptive"`
2676
/// (paired with `output_config.effort`) and reject the legacy
@@ -63,7 +113,9 @@ impl AnthropicClient {
63113
.map_err(|e| SofosError::Config(format!("Invalid API key format: {}", e)))?,
64114
);
65115
headers.insert("anthropic-version", HeaderValue::from_static(API_VERSION));
66-
headers.insert("anthropic-beta", HeaderValue::from_static(ANTHROPIC_BETA));
116+
// `anthropic-beta` is set per-request by `anthropic_beta_for`
117+
// so the compaction beta only ships when the target model
118+
// actually supports it.
67119

68120
let client = utils::build_http_client(headers, utils::REQUEST_TIMEOUT)?;
69121

@@ -107,8 +159,16 @@ impl AnthropicClient {
107159
) -> Result<CreateMessageResponse> {
108160
let url = format!("{}/messages", API_BASE);
109161
let request = Self::prepare_request(request);
162+
let beta = anthropic_beta_for(&request.model);
110163

111-
let response = utils::send_once("Anthropic", self.client.post(&url).json(&request)).await?;
164+
let response = utils::send_once(
165+
"Anthropic",
166+
self.client
167+
.post(&url)
168+
.header(BETA_HEADER_NAME, beta)
169+
.json(&request),
170+
)
171+
.await?;
112172

113173
let result = response.json::<CreateMessageResponse>().await?;
114174
Ok(result)
@@ -127,10 +187,18 @@ impl AnthropicClient {
127187
{
128188
let mut request = Self::prepare_request(request);
129189
request.stream = Some(true);
190+
let beta = anthropic_beta_for(&request.model);
130191

131192
let url = format!("{}/messages", API_BASE);
132193

133-
let response = utils::send_once("Anthropic", self.client.post(&url).json(&request)).await?;
194+
let response = utils::send_once(
195+
"Anthropic",
196+
self.client
197+
.post(&url)
198+
.header(BETA_HEADER_NAME, beta)
199+
.json(&request),
200+
)
201+
.await?;
134202

135203
let mut byte_stream = response.bytes_stream();
136204
let mut buffer = String::new();
@@ -243,18 +311,33 @@ impl AnthropicClient {
243311
}
244312
current_block_type = None;
245313
}
246-
// TODO: handle `compaction` here.
247314
// Server-side compaction
248315
// (`compact-2026-01-12` beta) emits a
249-
// `compaction` content block with a
250-
// `content` field; the streaming path
251-
// currently drops it, so when
252-
// `use_streaming` is flipped on for
253-
// Anthropic, the next request fails to
254-
// round-trip the summary and Anthropic
255-
// re-compacts (extra cost). The
256-
// non-streaming path handles this via
257-
// serde and works today.
316+
// `compaction` content block with the
317+
// full summary in `content`. Mirror
318+
// the non-streaming serde path so the
319+
// next turn can round-trip the summary
320+
// and Anthropic doesn't re-compact.
321+
"compaction" => {
322+
// Mirror the existing "drop malformed
323+
// payloads silently" pattern used by
324+
// `web_search_tool_result` above. The
325+
// non-streaming serde path would error
326+
// on a missing `content`; in streaming
327+
// we don't want to kill the whole
328+
// response over one block, so skip it
329+
// — losing the summary forces a
330+
// re-compact next turn but doesn't
331+
// 400 the request.
332+
if let Some(content) =
333+
block.get("content").and_then(|v| v.as_str())
334+
{
335+
content_blocks.push(ContentBlock::Compaction {
336+
content: content.to_string(),
337+
});
338+
}
339+
current_block_type = None;
340+
}
258341
_ => {}
259342
}
260343
}
@@ -463,6 +546,63 @@ mod tests {
463546
assert!(!requires_adaptive_thinking(""));
464547
}
465548

549+
#[test]
550+
fn anthropic_beta_for_gates_compaction_to_supported_models() {
551+
// Opus 4.7 is on the compaction-supported list — both betas ship.
552+
let with_compact = anthropic_beta_for("claude-opus-4-7");
553+
assert!(with_compact.contains(BETA_TOKEN_EFFICIENT));
554+
assert!(with_compact.contains(BETA_COMPACT));
555+
556+
// Haiku 4.5 isn't — only the universal beta should appear so
557+
// we don't depend on Anthropic's "ignore unknown beta tokens"
558+
// policy if validation ever tightens.
559+
let without = anthropic_beta_for("claude-haiku-4-5");
560+
assert!(without.contains(BETA_TOKEN_EFFICIENT));
561+
assert!(!without.contains(BETA_COMPACT));
562+
}
563+
564+
#[test]
565+
fn beta_with_compact_matches_components() {
566+
// `BETA_TOKEN_EFFICIENT_AND_COMPACT` is a literal that must
567+
// stay in lockstep with its two component consts. Catch drift
568+
// here so renaming one component without the other is a test
569+
// failure rather than a silent header mismatch in production.
570+
assert_eq!(
571+
BETA_TOKEN_EFFICIENT_AND_COMPACT,
572+
format!("{BETA_TOKEN_EFFICIENT},{BETA_COMPACT}")
573+
);
574+
}
575+
576+
#[test]
577+
fn legacy_thinking_budget_helper_scales_with_effort() {
578+
use super::super::types::ReasoningEffort;
579+
assert_eq!(
580+
legacy_thinking_budget(ReasoningEffort::Low),
581+
LEGACY_THINKING_BUDGET_LOW
582+
);
583+
assert_eq!(
584+
legacy_thinking_budget(ReasoningEffort::Medium),
585+
LEGACY_THINKING_BUDGET_MEDIUM
586+
);
587+
assert_eq!(
588+
legacy_thinking_budget(ReasoningEffort::High),
589+
LEGACY_THINKING_BUDGET_HIGH
590+
);
591+
// Defensive default: `Off` collapses to `LOW` rather than
592+
// panicking, even though the legacy branch is upstream-guarded.
593+
assert_eq!(
594+
legacy_thinking_budget(ReasoningEffort::Off),
595+
LEGACY_THINKING_BUDGET_LOW
596+
);
597+
// Compile-time guard: the three tier values must stay strictly
598+
// increasing. Runtime `assert!` would be a tautology on consts
599+
// (clippy::assertions_on_constants), so check at const-eval time.
600+
const _: () = {
601+
assert!(LEGACY_THINKING_BUDGET_LOW < LEGACY_THINKING_BUDGET_MEDIUM);
602+
assert!(LEGACY_THINKING_BUDGET_MEDIUM < LEGACY_THINKING_BUDGET_HIGH);
603+
};
604+
}
605+
466606
#[test]
467607
fn effort_label_maps_reasoning_levels() {
468608
use super::super::types::ReasoningEffort;
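The commit keeps `BETA_TOKEN_EFFICIENT_AND_COMPACT` as a hand-written literal and guards it with the `beta_with_compact_matches_components` test, because `concat!` cannot reference `const` items. One possible alternative, shown here purely as a sketch and not as what the commit does, is to define each token as a macro that expands to a string literal; `concat!` expands macro invocations in its arguments, so the joined value is derived at compile time. The macro names are hypothetical.

```rust
// Hypothetical alternative, not the commit's approach: each token is a
// macro that expands to a string literal, so `concat!` can join them.
macro_rules! beta_token_efficient {
    () => {
        "token-efficient-tools-2025-02-19"
    };
}

macro_rules! beta_compact {
    () => {
        "compact-2026-01-12"
    };
}

pub const BETA_TOKEN_EFFICIENT: &str = beta_token_efficient!();
pub const BETA_COMPACT: &str = beta_compact!();

/// Derived from the two macros above, so it cannot drift from them.
pub const BETA_TOKEN_EFFICIENT_AND_COMPACT: &str =
    concat!(beta_token_efficient!(), ",", beta_compact!());
```

The trade-off is readability: the literal-plus-test approach in the commit keeps the header value greppable as a single string, while the macro form removes the need for a drift test.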

src/cli.rs

Lines changed: 7 additions & 2 deletions
@@ -42,6 +42,8 @@ pub struct Cli {
4242
/// the tool-call JSON, surfacing as "Missing 'path' parameter".
4343
/// Claude Sonnet 4 and GPT-4.1 both support 32k+; smaller models
4444
/// cap at their own limit so this is safe as a default.
45+
/// Must be > 16384 when reasoning effort is enabled (the legacy
46+
/// Anthropic thinking-budget ceiling); the default 32768 satisfies it.
4547
#[arg(long, default_value = "32768")]
4648
pub max_tokens: u32,
4749

@@ -53,8 +55,11 @@ pub struct Cli {
5355
#[arg(short = 'e', long, default_value = "medium")]
5456
pub reasoning_effort: crate::api::ReasoningEffort,
5557

56-
/// Token budget for non-adaptive Anthropic extended thinking. Ignored
57-
/// on OpenAI and on Anthropic adaptive (Opus 4.7+).
58+
/// Vestigial. Currently inert on every path: legacy Anthropic uses
59+
/// a fixed per-tier budget (Low=1024, Medium=5120, High=16384),
60+
/// Anthropic adaptive (Opus 4.7+) uses `output_config.effort`, and
61+
/// OpenAI uses `reasoning.effort`. Kept for backwards-compatibility;
62+
/// will be removed in a later release.
5863
#[arg(long, default_value = "5120")]
5964
pub thinking_budget: u32,
6065

src/repl/mod.rs

Lines changed: 15 additions & 4 deletions
@@ -139,11 +139,22 @@ impl Repl {
139139
eprintln!("{}", "Loaded custom instructions".bright_green());
140140
}
141141

142-
// Validate thinking budget
143-
if config.reasoning_effort.is_enabled() && config.thinking_budget >= config.max_tokens {
142+
// Validate that `max_tokens` leaves room for the largest legacy
143+
// thinking budget we might send. The actual budget is now picked
144+
// per-effort in `request_builder` (Low=1024, Medium=5120,
145+
// High=16384) rather than read from the user's `--thinking-budget`
146+
// flag, so the invariant we need is `max_tokens > HIGH`. We check
147+
// unconditionally on enabled-thinking sessions instead of also
148+
// probing the model id, because the model can be swapped mid-
149+
// session via `/model` and we don't want a runtime 400.
150+
if config.reasoning_effort.is_enabled()
151+
&& config.max_tokens <= crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH
152+
{
144153
return Err(SofosError::Config(format!(
145-
"thinking_budget ({}) must be less than max_tokens ({})",
146-
config.thinking_budget, config.max_tokens
154+
"max_tokens ({}) must exceed the legacy thinking-budget ceiling ({}). \
155+
Use a higher --max-tokens or set --reasoning-effort off.",
156+
config.max_tokens,
157+
crate::api::anthropic::LEGACY_THINKING_BUDGET_HIGH
147158
)));
148159
}
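The boundary of the new startup check is easy to get wrong by one, since the comparison is `<=` against the `High` budget. A minimal sketch of the same rule as a pure function, assuming a hypothetical `validate_max_tokens` helper and a plain `String` error in place of `SofosError::Config`:

```rust
/// Mirrors `LEGACY_THINKING_BUDGET_HIGH` from `src/api/anthropic.rs`.
pub const LEGACY_THINKING_BUDGET_HIGH: u32 = 16_384;

/// Hypothetical pure-function form of the startup check (the commit
/// performs it inline in the REPL constructor and returns `SofosError`).
pub fn validate_max_tokens(reasoning_enabled: bool, max_tokens: u32) -> Result<(), String> {
    // `max_tokens` must be strictly greater than the largest legacy
    // budget, so 16384 itself is rejected when reasoning is enabled.
    if reasoning_enabled && max_tokens <= LEGACY_THINKING_BUDGET_HIGH {
        return Err(format!(
            "max_tokens ({}) must exceed the legacy thinking-budget ceiling ({})",
            max_tokens, LEGACY_THINKING_BUDGET_HIGH
        ));
    }
    Ok(())
}
```

Under this sketch, `--max-tokens 16384` with reasoning enabled is rejected, `16385` and the default `32768` pass, and any value passes when reasoning effort is `off`.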
149160
