Commit 91b308d

Reduce session cost: 4-level reasoning effort, server-side compaction, 1h cache TTL
1 parent 1d6b8b3 commit 91b308d

24 files changed

Lines changed: 1560 additions & 228 deletions

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -4,6 +4,35 @@ All notable changes to Sofos are documented in this file.
44

55
## [Unreleased]
66

7+
### Added
8+
9+
- **Anthropic server-side compaction** is now enabled on Claude Opus 4.7, Opus 4.6, and Sonnet 4.6. Sofos sends the `compact-2026-01-12` beta header and a `context_management.edits[type=compact_20260112]` block on every request to those models; when the request crosses the per-model auto-compact threshold (~250K tokens), the API itself summarises older turns and returns a `compaction` content block, dropping the pre-compaction messages server-side on subsequent requests. No extra round-trip — the compaction summary arrives in the same response as the user's reply.
10+
- **OpenAI encrypted-reasoning round-trip.** Requests that enable reasoning now include `include: ["reasoning.encrypted_content"]`. Sofos captures the opaque encrypted-CoT blob alongside the visible reasoning summary and round-trips both on the next call, so the model resumes its hidden chain-of-thought across tool calls instead of regenerating it. Cuts hidden-reasoning output tokens on multi-call agentic turns.
11+
- **Per-model `ModelInfo` registry** consolidates context-window, auto-compact threshold, adaptive-thinking flag, server-compaction flag, and pricing (including tiered-pricing rules) into one struct per model. Adding a new model is one struct literal in `src/api/model_info.rs`.
12+
- **Tiered-pricing detection for GPT-5.4 and GPT-5.5.** Sofos tracks the largest single-turn input observed across the session. If any single prompt crosses the documented 272K threshold, the cost calculator switches to premium rates (2× input, 1.5× output) for the rest of the session — matching what OpenAI actually bills.
13+
- **1-hour cache TTL on stable prefixes.** System prompt, the last-listed tool definition, and the sticky message anchor now use Anthropic's `ttl: "1h"` ephemeral cache. The rolling breakpoint stays at 5 min because it moves every turn; paying the 2× write premium for a one-turn slot would burn cache writes for nothing.
14+
- **Middle truncation for tool outputs.** Large bash / search / file-read / diff / MCP outputs preserve both the head and the tail (separated by a `…N tokens truncated…` marker) instead of the head-only cut Sofos previously applied. The diagnostic tail (last error line, ripgrep totals, exit messages) now survives truncation.
15+
- **`compaction` content block type** added to the on-the-wire schema and the saved-session schema, so Anthropic's server-side summaries persist across save / load.
16+
- **Honest server-side cost line.** The session summary now correctly accounts for the 1-hour cache write premium (200% of base input) on top of the existing 5-minute cache write premium (125%).
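
The middle-truncation entry above can be illustrated with a minimal sketch. It is not the code from `src/api/truncate.rs`, which this diff does not show: the function name `truncate_middle` is hypothetical, and character counts stand in for tokens to keep the example self-contained.

```rust
/// Keep the first `head` and last `tail` characters of `text` and replace
/// the middle with a marker. Hypothetical helper; characters stand in for
/// tokens, which the real implementation presumably counts instead.
fn truncate_middle(text: &str, head: usize, tail: usize) -> String {
    let total = text.chars().count();
    // Nothing to cut when the budget already covers the whole text.
    if total <= head + tail {
        return text.to_string();
    }
    let dropped = total - head - tail;
    let head_part: String = text.chars().take(head).collect();
    let tail_part: String = text.chars().skip(total - tail).collect();
    format!("{head_part}\n…{dropped} chars truncated…\n{tail_part}")
}
```

Calling `truncate_middle(output, 2_000, 2_000)` on a long build log would keep the opening context and the final error lines, which is the property the changelog entry is after.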
17+
18+
### Changed
19+
20+
- **CLI `-t` / `--enable-thinking` is replaced with `-e` / `--reasoning-effort <off|low|medium|high>`** (default `medium`). The previous binary on/off knob is gone — `medium` is now the default-on state because `high` materially raises hidden-reasoning token cost on routine coding work, and `off` is the absolute-cheapest path. **Breaking change**: scripts using `-t` need updating.
21+
- **`/think on` / `/think off` are replaced with `/think <off|low|medium|high>`.** `/think` (no argument) still shows status. `on` and `off` no longer parse as commands. **Breaking change**.
22+
- **Auto-compact threshold lowered.** Conversations now compact at ~250K tokens on 1M-window models (Opus 4.7 / 4.6, Sonnet 4.6, GPT-5.4 / 5.5), ~170K on Haiku 4.5, ~250K on the GPT-5.3-Codex 400K window. Previously sat at 800K (non-codex) / 300K (codex), which left meaningful cost on the table re-sending huge prefixes on every tool round-trip.
23+
- **Default reasoning effort is now `medium`** (was `high`). Verified roughly 3–5× cheaper hidden-reasoning bill on routine coding turns. Use `-e high` or `/think high` for hard tasks.
24+
- **Reasoning summaries are suppressed on the OpenAI thinking-off path.** When `effort: off`, sofos sends `reasoning.effort = "minimal"` with no `summary` field, so the model returns no summary blocks at all (they bill as output tokens).
25+
- **Model context windows corrected.** Claude Opus 4.7 / 4.6 and Sonnet 4.6 are 1,000,000 tokens (were 200K in the table); GPT-5.4 and GPT-5.5 are 1,050,000 tokens (were 400K). The drop-trim safety floor is now per-model API-aware (95% of the real window) instead of a flat 250K.
26+
- **Anthropic beta header now opts into both `token-efficient-tools-2025-02-19` and `compact-2026-01-12`.**
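
As a rough illustration of the four-level knob described above, the sketch below parses a level string the way `-e` and `/think` might. The enum mirrors the `ReasoningEffort` variants visible in `src/api/anthropic.rs`; the `FromStr` impl, the case-insensitive matching, and the error text are assumptions, not the project's actual parser.

```rust
use std::str::FromStr;

/// Stand-in for `ReasoningEffort`; `Medium` is the documented default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum ReasoningEffort {
    Off,
    Low,
    #[default]
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    type Err = String;

    /// Accepts only the four documented levels; the old `on` value is
    /// rejected, matching the breaking change noted in the changelog.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "off" => Ok(Self::Off),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!(
                "unknown reasoning effort `{other}`; expected off, low, medium, or high"
            )),
        }
    }
}
```

With this shape, `"high".parse::<ReasoningEffort>()` succeeds while `"on".parse::<ReasoningEffort>()` returns an error that can be shown to the user.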
27+
28+
### Fixed
29+
30+
- **OpenAI reasoning items round-trip in the right order relative to their assistant message.** Reasoning items were being emitted in the input array *after* the message they preceded, breaking encrypted_content round-trip continuity on the server side. Now correctly placed before.
31+
- **Tool-cache breakpoint actually lands on Anthropic when OpenAI's web-search tool is registered.** The stamper used to no-op when `OpenAIWebSearch` was the last entry in the tool list, leaving Anthropic with no tool-defs cache breakpoint at all. Now finds the last *Anthropic-compatible* tool to stamp.
32+
- **OpenAI `Reasoning` blocks no longer leak to Anthropic on provider switch.** A session that started on OpenAI accumulates `Reasoning` content blocks; switching to Anthropic mid-session would have sent those blocks to the Messages API, which doesn't recognise the type. The Anthropic sanitiser now drops them.
33+
- **`peak_single_turn_input_tokens` is updated for every iteration of multi-tool turns**, not just the first. Long tool chains crossing the GPT-5.5 272K cliff inside the loop now correctly switch the cost line to premium rates.
34+
- **Stale duplicate cache breakpoint on `read_file_tool` removed.** The tool definition carried an inline `cache_control` that, combined with the request-builder's last-tool stamp, could push the request to a 5th breakpoint (Anthropic limits to 4).
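
The tool-cache breakpoint fix above amounts to searching backwards for the last tool Anthropic will accept instead of blindly stamping the final entry. A minimal sketch, assuming a simplified `Tool` enum (the real tool types are not shown in this diff, and `stamp_last_compatible` is a hypothetical name):

```rust
/// Simplified stand-in for the tool list entries.
#[derive(Debug, PartialEq)]
enum Tool {
    /// A regular function tool that Anthropic accepts; `cached` models the
    /// presence of a `cache_control` breakpoint.
    Function { name: String, cached: bool },
    /// Stand-in for the OpenAI-only web-search tool.
    OpenAIWebSearch,
}

/// Stamp the cache breakpoint on the last Anthropic-compatible tool and
/// return its index, or `None` if no compatible tool exists.
fn stamp_last_compatible(tools: &mut [Tool]) -> Option<usize> {
    let idx = tools
        .iter()
        .rposition(|t| matches!(t, Tool::Function { .. }))?;
    if let Tool::Function { cached, .. } = &mut tools[idx] {
        *cached = true;
    }
    Some(idx)
}
```

Because the search runs from the end, a trailing `OpenAIWebSearch` entry no longer turns the stamp into a no-op.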
35+
736
## [0.2.7] - 2026-05-04
837

938
### Fixed

README.md

Lines changed: 22 additions & 14 deletions
@@ -112,8 +112,8 @@ sofos
112112

113113
- `/resume` - Resume previous session
114114
- `/clear` - Clear conversation history
115-
- `/think [on|off]` - Toggle extended thinking (shows status if no arg)
116-
- `/compact` - Summarize older messages via the LLM to reclaim context tokens (auto-triggers at 80% usage)
115+
- `/think [off|low|medium|high]` - Set reasoning effort (shows status if no arg)
116+
- `/compact` - Summarize older messages via the LLM to reclaim context tokens. Triggers automatically at the per-model auto-compact threshold (~250K tokens on 1M-window models, ~170K on Haiku, ~250K on Codex). On Claude Opus 4.7 / 4.6 / Sonnet 4.6 the API itself runs the summarization server-side via the `compact-2026-01-12` beta — no extra round-trip.
117117
- `/s` - Safe mode (read-only, prompt: **`:`**)
118118
- `/n` - Normal mode (all tools, prompt: **`>`**)
119119
- `/exit`, `/quit`, `/q`, `Ctrl+D` - Exit with cost summary
@@ -125,7 +125,7 @@ sofos
125125

126126
**Scrollback:** Sofos runs as an inline viewport at the bottom of your terminal — the rest of the terminal is normal scrollback, so use your terminal emulator's own scrollbar, mouse wheel, and text selection / copy-paste.
127127

128-
**Status line:** Shown below the input box. Updates live as you change state (`/s`, `/n`, `/think`) — model, mode (`normal`/`safe`), reasoning config (`thinking: <N> tok` / `effort: high`), and running token totals.
128+
**Status line:** Shown below the input box. Updates live as you change state (`/s`, `/n`, `/think`) — model, mode (`normal`/`safe`), reasoning config (`effort: off|low|medium|high` for OpenAI and Claude Opus 4.7+; `thinking: <N> tok` for older Claude models with manual budgets), and running token totals.
129129

130130
### Image Vision
131131

@@ -151,7 +151,9 @@ Analyze https://example.com/chart.png
151151
152152
### Cost Tracking
153153
154-
Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects the cache discount (10% of base input on both providers, plus 125% for Anthropic 5-min cache writes).
154+
Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects the cache discount (10% of base input on both providers, plus 125% for Anthropic 5-min writes and 200% for 1-hour writes).
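
To make the multipliers concrete, here is a small sketch of how an input-side cost could be weighted. The `Usage` struct and `input_cost_usd` function are hypothetical; only the 10% / 125% / 200% factors come from the paragraph above, and the Anthropic field layout is an assumption.

```rust
/// Hypothetical per-session token counters, split by cache behaviour.
struct Usage {
    uncached_input: u64,
    cache_read: u64,
    cache_write_5m: u64,
    cache_write_1h: u64,
}

/// Estimate input cost in USD given a base price per million input tokens.
/// Cache reads bill at 10% of base, 5-minute writes at 125%, 1-hour writes
/// at 200%.
fn input_cost_usd(u: &Usage, base_per_mtok: f64) -> f64 {
    let weighted = u.uncached_input as f64
        + 0.10 * u.cache_read as f64
        + 1.25 * u.cache_write_5m as f64
        + 2.00 * u.cache_write_1h as f64;
    weighted * base_per_mtok / 1_000_000.0
}
```

The arithmetic shows why the 1-hour TTL is reserved for stable prefixes: a 1-hour write costs 2.00 units against 0.10 per read, so it needs many more reads to amortise than a 5-minute write at 1.25.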
155+
156+
**Tiered pricing detection.** GPT-5.4 and GPT-5.5 charge a session-wide premium (2× input, 1.5× output) once any single prompt crosses 272K input tokens. Sofos tracks the largest single-turn input observed and switches the cost calculator to premium rates if the cliff is ever crossed, so the displayed cost reflects what OpenAI actually bills.
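
A minimal sketch of the peak-tracking idea follows. The field name `peak_single_turn_input_tokens` appears in the changelog; everything else (`SessionCost`, `record_turn`, `estimate_usd`) is hypothetical. Two simplifications are assumed: the threshold is treated as strictly greater than 272K, and the premium multipliers are applied to the whole session's totals.

```rust
/// Documented input-token cliff for GPT-5.4 / GPT-5.5.
const TIER_THRESHOLD_TOKENS: u64 = 272_000;

#[derive(Default)]
struct SessionCost {
    peak_single_turn_input_tokens: u64,
    total_input_tokens: u64,
    total_output_tokens: u64,
}

impl SessionCost {
    /// Record one API call. Call this on every iteration of a tool loop,
    /// not only the first, so the peak is never missed.
    fn record_turn(&mut self, input: u64, output: u64) {
        self.peak_single_turn_input_tokens =
            self.peak_single_turn_input_tokens.max(input);
        self.total_input_tokens += input;
        self.total_output_tokens += output;
    }

    /// True once any single prompt has crossed the cliff.
    fn is_premium(&self) -> bool {
        self.peak_single_turn_input_tokens > TIER_THRESHOLD_TOKENS
    }

    /// Estimate cost in USD from per-million-token base prices.
    fn estimate_usd(&self, input_per_mtok: f64, output_per_mtok: f64) -> f64 {
        let (in_mult, out_mult) = if self.is_premium() { (2.0, 1.5) } else { (1.0, 1.0) };
        (self.total_input_tokens as f64 * input_per_mtok * in_mult
            + self.total_output_tokens as f64 * output_per_mtok * out_mult)
            / 1_000_000.0
    }
}
```

Tracking the maximum rather than the running total matters here: a session can accumulate far more than 272K input tokens across many small prompts without ever triggering the premium tier.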
155157
156158
### CLI Options
157159
@@ -166,25 +168,31 @@ Exit summary shows token usage and estimated cost based on official API pricing.
166168
--model <MODEL> Model to use (default: claude-sonnet-4-6)
167169
--morph-model <MODEL> Morph model (default: morph-v3-fast)
168170
--max-tokens <N> Max response tokens (default: 32768)
169-
-t, --enable-thinking Enable extended thinking (default: false)
170-
--thinking-budget <N> Token budget for thinking (Claude only, default: 5120, must be < max-tokens)
171+
-e, --reasoning-effort <LV> Reasoning effort: off, low, medium, high (default: medium)
172+
--thinking-budget <N> Token budget for older Claude models with manual budgets (default: 5120, must be < max-tokens). Ignored on Claude Opus 4.7+ and on OpenAI.
171173
-v, --verbose Verbose logging
172174
```
173175
174-
### Extended Thinking
176+
### Reasoning Effort
175177
176-
Enable for complex reasoning tasks (disabled by default):
178+
Sofos exposes four levels — `off`, `low`, `medium`, `high` — applied uniformly across providers. Default is `medium`; `high` is opt-in because it materially raises hidden-reasoning token cost on routine coding work.
177179
178180
```bash
179-
sofos -t # Default 5120 token budget (Claude 4.5 / 4.6)
180-
sofos -t --thinking-budget 10000 --max-tokens 16000 # Custom budget (Claude 4.5 / 4.6)
181+
sofos -e medium # Default — sensible cost/quality balance
182+
sofos -e high # Hard tasks, willing to pay more
183+
sofos -e off # Cheapest path; no reasoning summary
184+
185+
# Mid-session
186+
/think high # Bump up
187+
/think off # Drop to minimal
188+
/think # Show current
181189
```
182190
183-
**Note:** Extended thinking works with both Claude and OpenAI models.
191+
**Per-provider mapping:**
184192
185-
- **Claude 4.5 / 4.6** uses a manual token budget controlled by `--thinking-budget` (default `5120`).
186-
- **Claude Opus 4.7** uses adaptive thinking; the server picks the budget based on the prompt, and sofos sends `effort: high` when thinking is on and `effort: low` when off. `--thinking-budget` is ignored for this model; the status line shows `effort: high|low` instead of a token count.
187-
- **OpenAI (gpt-5 models)** — `/think on` sets high reasoning effort and `/think off` sets low. `--thinking-budget` is ignored.
193+
- **OpenAI (gpt-5 family)** — sends `reasoning.effort` matching the level (`minimal` for `off`, `low`/`medium`/`high` otherwise) and `summary: "auto"` when on, omitted when off.
194+
- **Claude Opus 4.7** uses adaptive thinking; the server picks the budget based on the prompt, and sofos sends `output_config.effort` matching the level (`off` collapses to `low`, the lowest the API accepts). `--thinking-budget` is ignored.
195+
- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low/medium/high` all enable it with the `--thinking-budget` token budget (default `5120`). The level is treated uniformly here pending per-tier budget mapping.
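
The OpenAI side of the mapping above can be sketched as a single function returning the `reasoning.effort` string and the optional `summary` value. This is an illustration only: the enum is a local stand-in for `ReasoningEffort`, and `openai_reasoning` is a hypothetical name, not a function from this commit.

```rust
/// Local stand-in for the four reasoning levels.
#[derive(Debug, Clone, Copy)]
enum ReasoningEffort {
    Off,
    Low,
    Medium,
    High,
}

/// Return `(reasoning.effort, summary)` for an OpenAI request.
/// `Off` maps to `minimal` with the summary field omitted (`None`);
/// the other levels pass through with `summary: "auto"`.
fn openai_reasoning(effort: ReasoningEffort) -> (&'static str, Option<&'static str>) {
    match effort {
        ReasoningEffort::Off => ("minimal", None),
        ReasoningEffort::Low => ("low", Some("auto")),
        ReasoningEffort::Medium => ("medium", Some("auto")),
        ReasoningEffort::High => ("high", Some("auto")),
    }
}
```

Contrast this with the Anthropic adaptive path, where `off` collapses to `low` rather than to a separate minimal value.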
188196
189197
## Custom Instructions
190198

src/api/anthropic.rs

Lines changed: 93 additions & 16 deletions
@@ -8,27 +8,45 @@ use std::sync::atomic::{AtomicBool, Ordering};
88

99
const API_BASE: &str = "https://api.anthropic.com/v1";
1010
const API_VERSION: &str = "2023-06-01";
11-
const ANTHROPIC_BETA: &str = "token-efficient-tools-2025-02-19";
11+
/// Comma-separated list of Anthropic beta features sofos opts in to.
12+
/// `token-efficient-tools-2025-02-19` shrinks tool-call envelopes;
13+
/// `compact-2026-01-12` enables server-side compaction (the API
14+
/// generates the summary itself when the request crosses a configured
15+
/// trigger, then drops earlier messages on subsequent turns).
16+
///
17+
/// TODO: `compact-2026-01-12` only applies to Opus 4.7, Opus 4.6,
18+
/// and Sonnet 4.6. Sending it on a Haiku 4.5 request relies on
19+
/// Anthropic's "ignore unknown beta tokens" policy. If Anthropic
20+
/// ever tightens validation, gate the header per-request based on
21+
/// `ModelInfo::supports_server_compaction` instead of pinning the
22+
/// value at client construction.
23+
const ANTHROPIC_BETA: &str = "token-efficient-tools-2025-02-19,compact-2026-01-12";
1224

1325
/// Return true for models that *only* accept `thinking.type = "adaptive"`
1426
/// (paired with `output_config.effort`) and reject the legacy
1527
/// `{type: "enabled", budget_tokens: N}` shape with HTTP 400.
1628
///
17-
/// Currently Opus 4.7 is the sole member of this set; Sonnet/Opus 4.6 and
18-
/// older continue to accept manual budgets, so we keep them on the old path
19-
/// to preserve the user's `--thinking-budget` knob.
29+
/// The set is owned by [`crate::api::ModelInfo`]; this thin wrapper
30+
/// preserves the call shape used by `request_builder` and `repl::mod`
31+
/// without forcing those sites to dereference the struct just to
32+
/// check one bool.
2033
pub fn requires_adaptive_thinking(model: &str) -> bool {
21-
model.starts_with("claude-opus-4-7")
34+
super::model_info::lookup(model).requires_adaptive_thinking
2235
}
2336

24-
/// The string form of an "effort" level derived from the user's
25-
/// thinking-on/off toggle. Used both for Anthropic's `output_config.effort`
26-
/// (adaptive models) and OpenAI's `reasoning.effort` — the two APIs
27-
/// happen to share the same `high` / `low` vocabulary, so one helper
28-
/// keeps the request builder, TUI status line, startup banner, and
29-
/// `/think` messages in sync without each site hand-mapping the bool.
30-
pub fn effort_label(enable_thinking: bool) -> &'static str {
31-
if enable_thinking { "high" } else { "low" }
37+
/// Map a [`ReasoningEffort`] to the string Anthropic's adaptive thinking
38+
/// expects in `output_config.effort` (Opus 4.7+). The API accepts
39+
/// `low` / `medium` / `high`; `Off` collapses to `low` because adaptive
40+
/// thinking has no off-switch — the conversation may already carry
41+
/// thinking blocks that the server cross-checks against the request,
42+
/// and dropping `output_config` would 400 the next turn.
43+
pub fn effort_label(effort: super::types::ReasoningEffort) -> &'static str {
44+
use super::types::ReasoningEffort;
45+
match effort {
46+
ReasoningEffort::Off | ReasoningEffort::Low => "low",
47+
ReasoningEffort::Medium => "medium",
48+
ReasoningEffort::High => "high",
49+
}
3250
}
3351

3452
#[derive(Clone)]
@@ -225,6 +243,18 @@ impl AnthropicClient {
225243
}
226244
current_block_type = None;
227245
}
246+
// TODO: handle `compaction` here.
247+
// Server-side compaction
248+
// (`compact-2026-01-12` beta) emits a
249+
// `compaction` content block with a
250+
// `content` field; the streaming path
251+
// currently drops it, so when
252+
// `use_streaming` is flipped on for
253+
// Anthropic, the next request fails to
254+
// round-trip the summary and Anthropic
255+
// re-compacts (extra cost). The
256+
// non-streaming path handles this via
257+
// serde and works today.
228258
_ => {}
229259
}
230260
}
@@ -369,7 +399,17 @@ fn sanitize_messages_for_anthropic(messages: Vec<Message>) -> Vec<Message> {
369399
let filtered_content = content
370400
.into_iter()
371401
.filter_map(|block| match block {
402+
// OpenAI reasoning summary block — not part of
403+
// Anthropic's content-block schema; the server
404+
// would reject the unknown type.
372405
MessageContentBlock::Summary { .. } => None,
406+
// OpenAI Responses API reasoning item, packed
407+
// with `id` + `encrypted_content`. Carries no
408+
// meaning to Anthropic and uses a `type`
409+
// string the server doesn't recognise. Drop
410+
// before sending so a session that switched
411+
// providers doesn't 400 on the next turn.
412+
MessageContentBlock::Reasoning { .. } => None,
373413
other => Some(other),
374414
})
375415
.collect();
@@ -424,9 +464,12 @@ mod tests {
424464
}
425465

426466
#[test]
427-
fn effort_label_maps_bool_to_high_low() {
428-
assert_eq!(effort_label(true), "high");
429-
assert_eq!(effort_label(false), "low");
467+
fn effort_label_maps_reasoning_levels() {
468+
use super::super::types::ReasoningEffort;
469+
assert_eq!(effort_label(ReasoningEffort::Off), "low");
470+
assert_eq!(effort_label(ReasoningEffort::Low), "low");
471+
assert_eq!(effort_label(ReasoningEffort::Medium), "medium");
472+
assert_eq!(effort_label(ReasoningEffort::High), "high");
430473
}
431474

432475
#[test]
@@ -442,6 +485,7 @@ mod tests {
442485
output_config: Some(OutputConfig::with_effort("high")),
443486
reasoning: None,
444487
prompt_cache_key: None,
488+
context_management: None,
445489
};
446490

447491
let json = serde_json::to_value(&request).unwrap();
@@ -464,6 +508,7 @@ mod tests {
464508
output_config: None,
465509
reasoning: None,
466510
prompt_cache_key: None,
511+
context_management: None,
467512
};
468513

469514
let json = serde_json::to_value(&request).unwrap();
@@ -485,9 +530,41 @@ mod tests {
485530
output_config: None,
486531
reasoning: None,
487532
prompt_cache_key: Some("session-1".to_string()),
533+
context_management: None,
488534
};
489535

490536
let prepared = AnthropicClient::prepare_request(request);
491537
assert!(prepared.prompt_cache_key.is_none());
492538
}
539+
540+
#[test]
541+
fn sanitizer_drops_openai_reasoning_blocks_before_anthropic_call() {
542+
// Regression: a session that started on OpenAI accumulates
543+
// `Reasoning` blocks with `id` + `encrypted_content`. Switching
544+
// to Anthropic mid-session and forwarding those blocks would
545+
// 400 on a content-block-type the server doesn't know.
546+
let messages = vec![Message {
547+
role: "assistant".to_string(),
548+
content: MessageContent::Blocks {
549+
content: vec![
550+
MessageContentBlock::Reasoning {
551+
id: "rs_abc".to_string(),
552+
summary: vec!["thought".to_string()],
553+
encrypted_content: Some("blob".to_string()),
554+
cache_control: None,
555+
},
556+
MessageContentBlock::Text {
557+
text: "real reply".to_string(),
558+
cache_control: None,
559+
},
560+
],
561+
},
562+
}];
563+
let cleaned = sanitize_messages_for_anthropic(messages);
564+
let MessageContent::Blocks { content } = &cleaned[0].content else {
565+
panic!("expected blocks");
566+
};
567+
assert_eq!(content.len(), 1, "Reasoning block must be dropped");
568+
assert!(matches!(content[0], MessageContentBlock::Text { .. }));
569+
}
493570
}

src/api/mod.rs

Lines changed: 4 additions & 0 deletions
@@ -1,9 +1,13 @@
11
pub mod anthropic;
2+
pub mod model_info;
23
pub mod morph;
34
pub mod openai;
5+
pub mod truncate;
46
pub mod types;
57
pub mod utils;
68

9+
pub use model_info::ModelInfo;
10+
711
pub use anthropic::AnthropicClient;
812
pub use morph::MorphClient;
913
pub use openai::OpenAIClient;
