You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CHANGELOG.md
+29Lines changed: 29 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -4,6 +4,35 @@ All notable changes to Sofos are documented in this file.
4
4
5
5
## [Unreleased]
6
6
7
+
### Added
8
+
9
+
-**Anthropic server-side compaction** is now enabled on Claude Opus 4.7, Opus 4.6, and Sonnet 4.6. Sofos sends the `compact-2026-01-12` beta header and a `context_management.edits[type=compact_20260112]` block on every request to those models; when the request crosses the per-model auto-compact threshold (~250K tokens), the API itself summarises older turns and returns a `compaction` content block, dropping the pre-compaction messages server-side on subsequent requests. No extra round-trip — the compaction summary arrives in the same response as the user's reply.
10
+
-**OpenAI encrypted-reasoning round-trip.** Requests that enable reasoning now include `include: ["reasoning.encrypted_content"]`. Sofos captures the opaque encrypted-CoT blob alongside the visible reasoning summary and round-trips both on the next call, so the model resumes its hidden chain-of-thought across tool calls instead of regenerating it. Cuts hidden-reasoning output tokens on multi-call agentic turns.
11
+
-**Per-model `ModelInfo` registry** consolidates context-window, auto-compact threshold, adaptive-thinking flag, server-compaction flag, and pricing (including tiered-pricing rules) into one struct per model. Adding a new model is one struct literal in `src/api/model_info.rs`.
12
+
-**Tiered-pricing detection for GPT-5.4 and GPT-5.5.** Sofos tracks the largest single-turn input observed across the session. If any single prompt crosses the documented 272K threshold, the cost calculator switches to premium rates (2× input, 1.5× output) for the rest of the session — matching what OpenAI actually bills.
13
+
-**1-hour cache TTL on stable prefixes.** System prompt, the last-listed tool definition, and the sticky message anchor now use Anthropic's `ttl: "1h"` ephemeral cache. The rolling breakpoint stays at 5 min because it moves every turn; paying the 2× write premium for a one-turn slot would burn cache writes for nothing.
14
+
-**Middle truncation for tool outputs.** Large bash / search / file-read / diff / MCP outputs preserve both the head and the tail (separated by a `…N tokens truncated…` marker) instead of the head-only cut sofos previously applied. The diagnostic tail (last error line, ripgrep totals, exit messages) now survives truncation.
15
+
-**`compaction` content block type** added to the on-the-wire schema and the saved-session schema, so Anthropic's server-side summaries persist across save / load.
16
+
-**Honest server-side cost line.** The session summary now correctly accounts for the 1-hour cache write premium (200% of base input) on top of the existing 5-minute cache write premium (125%).
17
+
18
+
### Changed
19
+
20
+
-**CLI `-t` / `--enable-thinking` is replaced with `-e` / `--reasoning-effort <off|low|medium|high>`** (default `medium`). The previous binary on/off knob is gone — `medium` is now the default-on state because `high` materially raises hidden-reasoning token cost on routine coding work, and `off` is the absolute-cheapest path. **Breaking change**: scripts using `-t` need updating.
21
+
-**`/think on` / `/think off` are replaced with `/think <off|low|medium|high>`.**`/think` (no argument) still shows status. `on` and `off` no longer parse as commands. **Breaking change**.
22
+
-**Auto-compact threshold lowered.** Conversations now compact at ~250K tokens on 1M-window models (Opus 4.7 / 4.6, Sonnet 4.6, GPT-5.4 / 5.5), ~170K on Haiku 4.5, ~250K on the GPT-5.3-Codex 400K window. Previously sat at 800K (non-codex) / 300K (codex), which left meaningful cost on the table re-sending huge prefixes on every tool round-trip.
23
+
-**Default reasoning effort is now `medium`** (was `high`). Verified roughly 3–5× cheaper hidden-reasoning bill on routine coding turns. Use `-e high` or `/think high` for hard tasks.
24
+
-**Reasoning summaries are suppressed on the OpenAI thinking-off path.** When `effort: off`, sofos sends `reasoning.effort = "minimal"` with no `summary` field, so the model returns no summary blocks at all (they bill as output tokens).
25
+
-**Model context windows corrected.** Claude Opus 4.7 / 4.6 and Sonnet 4.6 are 1,000,000 tokens (were 200K in the table); GPT-5.4 and GPT-5.5 are 1,050,000 tokens (were 400K). The drop-trim safety floor is now per-model API-aware (95% of the real window) instead of a flat 250K.
26
+
-**Anthropic beta header now opts into both `token-efficient-tools-2025-02-19` and `compact-2026-01-12`.**
27
+
28
+
### Fixed
29
+
30
+
-**OpenAI reasoning items round-trip in the right order relative to their assistant message.** Reasoning items were being emitted in the input array *after* the message they preceded, breaking encrypted_content round-trip continuity on the server side. Now correctly placed before.
31
+
-**Tool-cache breakpoint actually lands on Anthropic when OpenAI's web-search tool is registered.** The stamper used to no-op when `OpenAIWebSearch` was the last entry in the tool list, leaving Anthropic with no tool-defs cache breakpoint at all. Now finds the last *Anthropic-compatible* tool to stamp.
32
+
-**OpenAI `Reasoning` blocks no longer leak to Anthropic on provider switch.** A session that started on OpenAI accumulates `Reasoning` content blocks; switching to Anthropic mid-session would have sent those blocks to the Messages API, which doesn't recognise the type. The Anthropic sanitiser now drops them.
33
+
-**`peak_single_turn_input_tokens` is updated for every iteration of multi-tool turns**, not just the first. Long tool chains crossing the GPT-5.5 272K cliff inside the loop now correctly switch the cost line to premium rates.
34
+
-**Stale duplicate cache breakpoint on `read_file_tool` removed.** The tool definition carried an inline `cache_control` that, combined with the request-builder's last-tool stamp, could push the request to a 5th breakpoint (Anthropic limits to 4).
Copy file name to clipboardExpand all lines: README.md
+22-14Lines changed: 22 additions & 14 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -112,8 +112,8 @@ sofos
112
112
113
113
-`/resume` - Resume previous session
114
114
-`/clear` - Clear conversation history
115
-
-`/think [on|off]` - Toggle extended thinking (shows status if no arg)
116
-
-`/compact` - Summarize older messages via the LLM to reclaim context tokens (auto-triggers at 80% usage)
115
+
-`/think [off|low|medium|high]` - Set reasoning effort (shows status if no arg)
116
+
-`/compact` - Summarize older messages via the LLM to reclaim context tokens. Triggers automatically at the per-model auto-compact threshold (~250K tokens on 1M-window models, ~170K on Haiku, ~250K on Codex). On Claude Opus 4.7 / 4.6 / Sonnet 4.6 the API itself runs the summarization server-side via the `compact-2026-01-12` beta — no extra round-trip.
117
117
-`/s` - Safe mode (read-only, prompt: **`:`**)
118
118
-`/n` - Normal mode (all tools, prompt: **`>`**)
119
119
-`/exit`, `/quit`, `/q`, `Ctrl+D` - Exit with cost summary
@@ -125,7 +125,7 @@ sofos
125
125
126
126
**Scrollback:** Sofos runs as an inline viewport at the bottom of your terminal — the rest of the terminal is normal scrollback, so use your terminal emulator's own scrollbar, mouse wheel, and text selection / copy-paste.
127
127
128
-
**Status line:** Shown below the input box. Updates live as you change state (`/s`, `/n`, `/think`) — model, mode (`normal`/`safe`), reasoning config (`thinking: <N> tok`/ `effort: high`), and running token totals.
128
+
**Status line:** Shown below the input box. Updates live as you change state (`/s`, `/n`, `/think`) — model, mode (`normal`/`safe`), reasoning config (`effort: off|low|medium|high` for OpenAI and Claude Opus 4.7+; `thinking: <N> tok`for older Claude models with manual budgets), and running token totals.
Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects the cache discount (10% of base input on both providers, plus 125% for Anthropic 5-min cache writes).
154
+
Exit summary shows token usage and estimated cost based on official API pricing. When the provider prompt cache served any tokens during the session, a `cache read: N (M% hit)` row appears under the input total, and the estimated cost reflects the cache discount (10% of base input on both providers, plus 125% for Anthropic 5-min writes and 200% for 1-hour writes).
155
+
156
+
**Tiered pricing detection.** GPT-5.4 and GPT-5.5 charge a session-wide premium (2× input, 1.5× output) once any single prompt crosses 272K input tokens. Sofos tracks the largest single-turn input observed and switches the cost calculator to premium rates if the cliff is ever crossed, so the displayed cost reflects what OpenAI actually bills.
155
157
156
158
### CLI Options
157
159
@@ -166,25 +168,31 @@ Exit summary shows token usage and estimated cost based on official API pricing.
166
168
--model <MODEL> Model to use (default: claude-sonnet-4-6)
167
169
--morph-model <MODEL> Morph model (default: morph-v3-fast)
168
170
--max-tokens <N> Max response tokens (default: 32768)
--thinking-budget <N> Token budget for older Claude models with manual budgets (default: 5120, must be < max-tokens). Ignored on Claude Opus 4.7+ and on OpenAI.
171
173
-v, --verbose Verbose logging
172
174
```
173
175
174
-
### Extended Thinking
176
+
### Reasoning Effort
175
177
176
-
Enable for complex reasoning tasks (disabled by default):
178
+
Sofos exposes four levels — `off`, `low`, `medium`, `high` — applied uniformly across providers. Default is `medium`; `high` is opt-in because it materially raises hidden-reasoning token cost on routine coding work.
sofos -e medium # Default — sensible cost/quality balance
182
+
sofos -e high # Hard tasks, willing to pay more
183
+
sofos -e off # Cheapest path; no reasoning summary
184
+
185
+
# Mid-session
186
+
/think high # Bump up
187
+
/think off # Drop to minimal
188
+
/think # Show current
181
189
```
182
190
183
-
**Note:** Extended thinking works with both Claude and OpenAI models.
191
+
**Per-provider mapping:**
184
192
185
-
- **Claude 4.5 / 4.6** uses a manual token budget controlled by `--thinking-budget` (default `5120`).
186
-
- **Claude Opus 4.7** uses adaptive thinking — the server picks the budget based on the prompt, and sofos sends `effort: high` when thinking is on and `effort: low` when off. `--thinking-budget` is ignored for this model; the status line shows `effort: high|low` instead of a token count.
187
-
- **OpenAI (gpt-5 models)** — `/think on` sets high reasoning effort and `/think off` sets low. `--thinking-budget` is ignored.
193
+
- **OpenAI (gpt-5 family)** — sends `reasoning.effort` matching the level (`minimal` for `off`, `low`/`medium`/`high` otherwise) and `summary: "auto"` when on, omitted when off.
194
+
- **Claude Opus 4.7** — adaptive thinking; the server picks the budget based on the prompt, and sofos sends `output_config.effort` matching the level (`off` collapses to `low`, the lowest the API accepts). `--thinking-budget` is ignored.
195
+
- **Older Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5)** — `off` disables extended thinking; `low/medium/high` all enable it with the `--thinking-budget` token budget (default `5120`). The level is treated uniformly here pending per-tier budget mapping.
0 commit comments