
Commit 3912652

anandgupta42 and claude authored
feat: context management improvements — overflow recovery, observation masking, token estimation (#35)
* feat: context management improvements — overflow recovery, observation masking, token estimation

  Fix critical bugs in context overflow detection and add production hardening for long-running agent sessions:

  - Fix NamedError.isInstance(null) crash that would kill the agent if a provider returned a null error object
  - Fix isOverflow() headroom gap when limit.input is set — compaction now correctly reserves space for output tokens on models with separate input/output limits (fixes upstream bugs #10634, #8089, #11086)
  - Add compaction loop protection (max 3 attempts) to prevent infinite compact→overflow→compact cycles
  - Replace the generic "[Old tool result content cleared]" with observation masks that preserve tool name, args, output size, and a first-line fingerprint for better model continuity after pruning
  - Content-aware token estimation (code: 3.0, JSON: 3.2, SQL: 3.5, text: 4.0) replacing the flat chars/4 heuristic
  - Add Azure OpenAI overflow detection patterns
  - DE-aware compaction template preserving warehouse, schema, dbt, lineage, and FinOps context during summarization
  - Guard against empty observation masks sending an empty tool_result
  - Add a context management documentation page

  153 tests across 4 test files; 1332 total suite passing.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: reset compactionAttempts counter after successful processing step

  The counter was only incremented but never reset, causing it to accumulate across unrelated user turns within a session. After 3 successful compactions spread across many turns, the 4th would incorrectly trigger the "max attempts" error. Now resets to 0 after each successful non-compaction step. Fixes Sentry review comment on PR #35.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add compaction loop protection tests

  32 tests covering the compactionAttempts counter state machine:

  - Basic counter increment/reset behavior
  - Sentry fix validation: counter resets between successful turns
  - Overflow detection and compact result paths share the counter
  - MAX_COMPACTION_ATTEMPTS (3) loop protection
  - Realistic multi-turn session scenarios
  - isOverflow boundary conditions and config edge cases

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove double-deduction in isOverflow for non-limit.input models

  Sentry correctly flagged that the non-limit.input path was subtracting both maxOutput AND reserved (the 20K buffer), causing compaction to trigger 20K tokens too early for most production models. Simplified both paths to use a single headroom = Math.max(reserved, maxOutput). For default configs (maxOutput=32K > buffer=20K), this matches the original upstream behavior while preserving the P0 fix for limit.input models.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address multi-model code review findings on PR #35

  - Fix text-end overwriting time.start (processor.ts) — use spread to preserve the original start timestamp, matching the reasoning-end handler
  - Guard negative usable in isOverflow (compaction.ts) — return false when headroom exceeds base instead of producing a negative usable value that would trigger compaction on every turn
  - Remove dead "manual" trigger type from the telemetry union
  - Add tests for negative usable edge cases

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update context management docs for review findings

  - Document loop protection (max 3 attempts, reset between turns)
  - Fix the reserved config example (was 4096, should be 20000)
  - Clarify that the reserved field uses max(reserved, max_output) as headroom
  - Add a CJK/emoji token estimation limitation note

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9c019bc commit 3912652

File tree

18 files changed

+2164
-18
lines changed


docs/docs/configure/config.md

Lines changed: 2 additions & 2 deletions
@@ -57,7 +57,7 @@ Configuration is loaded from multiple sources, with later sources overriding ear
 | `skills` | `object` | Skill paths and URLs |
 | `plugin` | `string[]` | Plugin specifiers |
 | `instructions` | `string[]` | Glob patterns for instruction files |
-| `compaction` | `object` | Context compaction settings |
+| `compaction` | `object` | Context compaction settings (see [Context Management](context-management.md)) |
 | `experimental` | `object` | Experimental feature flags |

 ## Value Substitution
@@ -132,4 +132,4 @@ Control how context is managed when conversations grow long:
 | `reserved` || Token buffer to reserve |

 !!! info
-    Compaction automatically summarizes older messages to free up context window space, allowing longer conversations without losing important context.
+    Compaction automatically summarizes older messages to free up context window space, allowing longer conversations without losing important context. See [Context Management](context-management.md) for full details.

docs/docs/configure/context-management.md
Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# Context Management

altimate-code automatically manages conversation context so you can work through long sessions without hitting model limits. When a conversation grows large, the CLI summarizes older messages, prunes stale tool outputs, and recovers from provider overflow errors — all without losing the important details of your work.

## How It Works

Every LLM has a finite context window. As you work, each message, tool call, and tool result adds tokens to the conversation. When the conversation approaches the model's limit, altimate-code takes action:

1. **Prune** — Old tool outputs (file reads, command results, query results) are replaced with compact summaries
2. **Compact** — The entire conversation history is summarized into a continuation prompt
3. **Continue** — The agent picks up where it left off using the summary

This happens automatically by default. You do not need to manually manage context.
## Auto-Compaction

When enabled (the default), altimate-code monitors token usage after each model response. If the conversation is approaching the context limit, it triggers compaction automatically.

During compaction:

- A dedicated compaction agent summarizes the full conversation
- The summary captures goals, progress, discoveries, relevant files, and next steps
- The original messages are retained in session history but the model continues from the summary
- After compaction, the agent automatically continues working if there are clear next steps

You will see a compaction indicator in the TUI when this happens. The conversation continues seamlessly.

!!! tip
    If you notice compaction happening frequently, consider using a model with a larger context window or breaking your task into smaller sessions.
## Observation Masking (Pruning)

Before compaction, altimate-code prunes old tool outputs to reclaim context space. This is called "observation masking."

When a tool output is pruned, it is replaced with a brief fingerprint:

```
[Tool output cleared — read_file(file: src/main.ts) returned 42 lines, 1.2 KB — "import { App } from './app'"]
```

This tells the model what tool was called, what arguments were used, how much output it produced, and the first line of the result — enough to maintain continuity without consuming tokens.

**Pruning rules:**

- Only tool outputs older than the most recent 2 turns are eligible
- The most recent ~40,000 tokens of tool outputs are always preserved
- Pruning only fires when at least 20,000 tokens can be reclaimed
- `skill` tool outputs are never pruned (they contain critical session context)

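The pruning rules above can be sketched as a small selection function. This is illustrative only: the threshold constants match the documented values, but names like `PRUNE_PROTECT` and the `ToolOutput` shape are assumptions, not the actual implementation.

```typescript
const PRUNE_MINIMUM = 20_000 // prune only if at least this many tokens are reclaimed
const PRUNE_PROTECT = 40_000 // always keep the most recent tool-output tokens

interface ToolOutput {
  tokens: number
  turnsAgo: number
  tool: string
}

function selectPrunable(outputs: ToolOutput[]): ToolOutput[] {
  // Walk from newest to oldest, protecting the last 2 turns, `skill`
  // outputs, and the most recent ~40K tokens of tool output.
  let protectedTokens = 0
  const prunable: ToolOutput[] = []
  for (const out of [...outputs].sort((a, b) => a.turnsAgo - b.turnsAgo)) {
    const isProtected =
      out.turnsAgo < 2 || out.tool === "skill" || protectedTokens < PRUNE_PROTECT
    if (isProtected) {
      protectedTokens += out.tokens
      continue
    }
    prunable.push(out)
  }
  // Only fire if the reclaimable total clears the minimum.
  const reclaimed = prunable.reduce((sum, o) => sum + o.tokens, 0)
  return reclaimed >= PRUNE_MINIMUM ? prunable : []
}
```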
## Data Engineering Context

Compaction is aware of data engineering workflows. When summarizing a conversation, the compaction prompt preserves:

- **Warehouse connections** — which databases or warehouses are connected
- **Schema context** — discovered tables, columns, and relationships
- **dbt project state** — models, sources, tests, and project structure
- **Lineage findings** — upstream and downstream dependencies
- **Query patterns** — SQL dialects, anti-patterns, and optimization opportunities
- **FinOps context** — cost findings and warehouse sizing recommendations

This means you can run a long data exploration session and compaction will not lose track of what schemas you discovered, what dbt models you were working with, or what cost optimizations you identified.
## Provider Overflow Detection

If compaction does not trigger in time and the model returns a context overflow error, altimate-code detects it and automatically compacts the conversation.

Overflow detection works with all major providers:

| Provider | Detection |
|----------|-----------|
| Anthropic | "prompt is too long" |
| OpenAI | "exceeds the context window" |
| AWS Bedrock | "input is too long for requested model" |
| Google Gemini | "input token count exceeds the maximum" |
| Azure OpenAI | "the request was too long" |
| Groq | "reduce the length of the messages" |
| OpenRouter / DeepSeek | "maximum context length is N tokens" |
| xAI (Grok) | "maximum prompt length is N" |
| GitHub Copilot | "exceeds the limit of N" |
| Ollama / llama.cpp / LM Studio | Various local server messages |

When an overflow is detected, the CLI automatically compacts and retries. No action is needed on your part.
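Detection amounts to matching the provider's error message against a pattern list. A minimal sketch using a subset of the patterns from the table (the function name and the null-message guard are illustrative):

```typescript
// A subset of the overflow patterns; the full list lives in provider/error.ts.
const OVERFLOW_PATTERNS = [
  /prompt is too long/i, // Anthropic
  /exceeds the context window/i, // OpenAI
  /the request was too long/i, // Azure OpenAI
  /context[_ ]length[_ ]exceeded/i, // Generic fallback
]

function isContextOverflow(message: string | null | undefined): boolean {
  if (!message) return false // guard: providers may return a null error object
  return OVERFLOW_PATTERNS.some((pattern) => pattern.test(message))
}
```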
### Loop Protection

If compaction fails to reduce context sufficiently and overflow keeps recurring, altimate-code stops after 3 consecutive compaction attempts within the same turn. You will see a message asking you to start a new conversation. The counter resets after each successful processing step, so compactions spread across different turns do not count against the limit.

!!! note
    Some providers (such as z.ai) may accept oversized inputs silently. For these, the automatic token-based compaction trigger is the primary safeguard.
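The counter behavior described above can be sketched as a small state machine (illustrative; the real logic lives inside the session processing loop):

```typescript
const MAX_COMPACTION_ATTEMPTS = 3

class CompactionGuard {
  private attempts = 0

  // Called when overflow is detected and compaction is about to run.
  tryCompact(): boolean {
    if (this.attempts >= MAX_COMPACTION_ATTEMPTS) return false // give up, ask for a new session
    this.attempts++
    return true
  }

  // Called after each successful non-compaction processing step, so
  // attempts never accumulate across unrelated turns.
  onSuccessfulStep(): void {
    this.attempts = 0
  }
}
```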
## Configuration

Control context management behavior in `altimate-code.json`:

```json
{
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 20000
  }
}
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `auto` | `boolean` | `true` | Automatically compact when the context window is nearly full |
| `prune` | `boolean` | `true` | Prune old tool outputs before compaction |
| `reserved` | `number` | `20000` | Token buffer to reserve below the context limit. The actual headroom is `max(reserved, model_max_output)`, so this value only takes effect when it exceeds the model's output token limit. Increase if you see frequent overflow errors |

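The headroom rule in the `reserved` description can be expressed directly. This mirrors the `isOverflow` logic in `compaction.ts`, with the option names simplified for illustration:

```typescript
function isOverflow(count: number, opts: {
  contextLimit: number // total context window
  inputLimit?: number // separate input limit, if the model has one
  maxOutput: number // model's maximum output tokens
  reserved: number // configured buffer (default 20000)
}): boolean {
  // Headroom is whichever is larger: the configured buffer or the output budget.
  const headroom = Math.max(opts.reserved, opts.maxOutput)
  const base = opts.inputLimit ?? opts.contextLimit
  if (base <= headroom) return false // guard: never produce a negative usable window
  return count >= base - headroom
}
```

For a 200K-token window with a 32K output limit and the default 20K buffer, the headroom is 32K, so compaction triggers once the conversation reaches 168K tokens.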
### Disabling Auto-Compaction

If you prefer to manage context manually (for example, by starting new sessions), disable auto-compaction:

```json
{
  "compaction": {
    "auto": false
  }
}
```

!!! warning
    With auto-compaction disabled, you may hit context overflow errors during long sessions. The CLI will still detect and recover from these, but the experience will be less smooth.

### Manual Compaction

You can trigger compaction at any time from the TUI by pressing `leader` + `c`, or by using the `/compact` command in conversation. This is useful when you want to create a checkpoint before switching tasks.
## Token Estimation

altimate-code uses content-aware heuristics to estimate token counts without calling a tokenizer. This keeps overhead low while maintaining accuracy.

The estimator detects content type and adjusts its ratio:

| Content Type | Characters per Token | Detection |
|--------------|---------------------|-----------|
| Code | ~3.0 | High density of `{}();=` characters |
| JSON | ~3.2 | Starts with `{` or `[`, high density of `{}[]:,"` |
| SQL | ~3.5 | Contains SQL keywords (`SELECT`, `FROM`, `JOIN`, etc.) |
| Plain text | ~4.0 | Default for prose and markdown |
| Mixed | ~3.7 | Fallback for content that does not match a specific type |

These ratios are tuned against the cl100k_base tokenizer used by GPT-4-class models. The estimator samples the first 500 characters of content to classify it, so the overhead is negligible.

!!! note "Limitations"
    The heuristic uses JavaScript string length (UTF-16 code units), which over-estimates tokens for emoji (2 code units but ~1-2 tokens) and CJK characters. For precise token counting, a future update will integrate a native tokenizer.
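
A sketch of how such a content-aware estimator might look (illustrative; the regex heuristics and density thresholds here are assumptions based on the detection column, not the actual `token.ts` code):

```typescript
// Fraction of characters in `text` matching `pattern`.
function density(text: string, pattern: RegExp): number {
  return (text.match(pattern) ?? []).length / Math.max(text.length, 1)
}

function estimateTokens(text: string): number {
  const sample = text.slice(0, 500) // classify from a short prefix only
  let charsPerToken = 4.0 // plain-text default
  if (/^\s*[{[]/.test(sample) && density(sample, /[{}[\]:,"]/g) > 0.15) {
    charsPerToken = 3.2 // JSON
  } else if (/\b(SELECT|FROM|JOIN|WHERE|GROUP BY)\b/i.test(sample)) {
    charsPerToken = 3.5 // SQL
  } else if (density(sample, /[{}();=]/g) > 0.05) {
    charsPerToken = 3.0 // code
  }
  return Math.ceil(text.length / charsPerToken)
}
```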

docs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -91,6 +91,7 @@ nav:
   - Behavior:
       - Rules: configure/rules.md
       - Permissions: configure/permissions.md
+      - Context Management: configure/context-management.md
   - Formatters: configure/formatters.md
   - Appearance:
       - Themes: configure/themes.md

packages/altimate-code/src/agent/prompt/compaction.txt

Lines changed: 7 additions & 0 deletions
@@ -9,6 +9,13 @@ Focus on information that would be helpful for continuing the conversation, incl
 - Key user requests, constraints, or preferences that should persist
 - Important technical decisions and why they were made

+For data engineering conversations, also preserve:
+- Warehouse connections and discovered schemas/tables
+- dbt project context (models, sources, tests)
+- Lineage findings and query patterns
+- SQL dialects and translation contexts
+- FinOps findings (costs, warehouse sizing)
+
 Your summary should be comprehensive enough to provide context but concise enough to be quickly understood.

 Do not respond to any questions in the conversation, only output the summary.

packages/altimate-code/src/provider/error.ts

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ export namespace ProviderError {
     /greater than the context length/i, // LM Studio
     /context window exceeds limit/i, // MiniMax
     /exceeded model token limit/i, // Kimi For Coding, Moonshot
+    /the request was too long/i, // Azure OpenAI
+    /maximum tokens for requested operation/i, // Azure OpenAI
     /context[_ ]length[_ ]exceeded/i, // Generic fallback
   ]
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Paid Context Management Features

These features are planned for implementation in altimate-core (Rust) and gated behind license key verification.

## 1. Precise Token Counting

**Bridge method:** `context.count_tokens(text, model_family) -> number`

Uses tiktoken-rs in altimate-core for exact model-specific token counts. Replaces the heuristic estimation in `token.ts`. Supports cl100k_base (GPT-4/Claude), o200k_base (GPT-4o), and future tokenizers.

**Benefits:**

- Eliminates 20-30% estimation error
- Precise compaction triggering — no late/early compaction
- Accurate token budget allocation

## 2. Smart Context Scoring

**Bridge method:** `context.score_relevance(items[], query) -> scored_items[]`

Embedding-based relevance scoring for context items. Used before compaction to drop lowest-scoring items first, preserving the most relevant conversation history. Uses a local embeddings model (no external API calls required).

**Benefits:**

- Drops irrelevant context before compaction
- Preserves high-value conversation segments
- Reduces unnecessary compaction cycles
## 3. Schema Compression

**Bridge method:** `context.compress_schema(schema_ddl, token_budget) -> compressed_schema`

Schemonic-style ILP (Integer Linear Programming) optimization. Extends the existing `altimate_core_optimize_context` tool. Achieves ~2x token reduction on schema DDL without accuracy loss by intelligently abbreviating column names, removing redundant constraints, and merging similar table definitions.

**Benefits:**

- Fits 2x more schema context in the same token budget
- No accuracy loss on downstream SQL generation
- Works with all warehouse dialects

## 4. Lineage-Aware Context Selection

**Bridge method:** `context.select_by_lineage(model_name, manifest, hops) -> relevant_tables[]`

Uses the dbt DAG / lineage graph to scope relevant tables. PageRank-style relevance scoring weights tables by proximity and importance in the dependency graph. Configurable hop distance for breadth of context.

**Benefits:**

- Only includes tables relevant to the current model/query
- Reduces schema context by 60-80% for large warehouses
- Leverages existing dbt manifest parsing
## 5. Semantic Schema Catalog

**Bridge method:** `context.generate_catalog(schema, sample_data) -> yaml_catalog`

YAML-based semantic views (similar to Snowflake Cortex Analyst). Auto-generates business descriptions, data types, and relationships from schema + sample data. Serves as a compressed, human-readable schema representation.

**Benefits:**

- Business-friendly context for the LLM
- More token-efficient than raw DDL
- Auto-generates from existing schema metadata

## 6. Context Budget Allocator

**Bridge method:** `context.allocate_budget(model_limit, task_type) -> { system, schema, conversation, output }`

Explicit token allocation across categories. Dynamic adjustment based on task type (query writing vs. debugging vs. optimization). Prevents any single category from consuming the entire context window.

**Benefits:**

- Prevents schema from crowding out conversation history
- Task-appropriate allocation (more schema for query writing, more conversation for debugging)
- Works with the compaction system to respect budgets
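As a sketch of what the allocator's output shape could look like (the category weights below are purely illustrative assumptions; the feature is planned, not implemented):

```typescript
type TaskType = "query_writing" | "debugging" | "optimization"

function allocateBudget(modelLimit: number, taskType: TaskType) {
  // Fractions per category, shifted by task type (assumed values).
  const weights =
    taskType === "query_writing"
      ? { system: 0.05, schema: 0.45, conversation: 0.3, output: 0.2 }
      : taskType === "debugging"
        ? { system: 0.05, schema: 0.2, conversation: 0.55, output: 0.2 }
        : { system: 0.05, schema: 0.35, conversation: 0.4, output: 0.2 }
  return {
    system: Math.floor(modelLimit * weights.system),
    schema: Math.floor(modelLimit * weights.schema),
    conversation: Math.floor(modelLimit * weights.conversation),
    output: Math.floor(modelLimit * weights.output),
  }
}
```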

packages/altimate-code/src/session/compaction.ts

Lines changed: 70 additions & 6 deletions
@@ -14,10 +14,53 @@ import { Agent } from "@/agent/agent"
 import { Plugin } from "@/plugin"
 import { Config } from "@/config/config"
 import { ProviderTransform } from "@/provider/transform"
+import { Telemetry } from "@/telemetry"

 export namespace SessionCompaction {
   const log = Log.create({ service: "session.compaction" })

+  function formatBytes(bytes: number): string {
+    if (bytes < 1024) return `${bytes} B`
+    if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`
+    return `${(bytes / (1024 * 1024)).toFixed(1)} MB`
+  }
+
+  function truncateArgs(input: Record<string, any> | null | undefined, maxLen: number): string {
+    if (!input || typeof input !== "object") return ""
+    let str: string
+    try {
+      str = Object.entries(input)
+        .map(([k, v]) => `${k}: ${JSON.stringify(v)}`)
+        .join(", ")
+    } catch {
+      return "[unserializable]"
+    }
+    if (str.length <= maxLen) return str
+    // Avoid slicing mid-surrogate pair by finding a safe boundary
+    let end = maxLen
+    const code = str.charCodeAt(end - 1)
+    if (code >= 0xd800 && code <= 0xdbff) end--
+    return str.slice(0, end) + "…"
+  }
+
+  export function createObservationMask(part: MessageV2.ToolPart): string {
+    const output =
+      (part.state.status === "completed" ? part.state.output : "") || ""
+    const lines = output.split("\n").length
+    const bytes = Buffer.byteLength(output, "utf8")
+    const args = truncateArgs(
+      part.state.status === "completed" ||
+        part.state.status === "running" ||
+        part.state.status === "error"
+        ? part.state.input
+        : {},
+      80,
+    )
+    const firstLine = output.split("\n")[0]?.slice(0, 80) || ""
+    const fingerprint = firstLine ? ` — "${firstLine}"` : ""
+    return `[Tool output cleared — ${part.tool}(${args}) returned ${lines} lines, ${formatBytes(bytes)}${fingerprint}]`
+  }
+
   export const Event = {
     Compacted: BusEvent.define(
       "session.compacted",
@@ -39,12 +82,12 @@
       input.tokens.total ||
       input.tokens.input + input.tokens.output + input.tokens.cache.read + input.tokens.cache.write

-    const reserved =
-      config.compaction?.reserved ?? Math.min(COMPACTION_BUFFER, ProviderTransform.maxOutputTokens(input.model))
-    const usable = input.model.limit.input
-      ? input.model.limit.input - reserved
-      : context - ProviderTransform.maxOutputTokens(input.model)
-    return count >= usable
+    const maxOutput = ProviderTransform.maxOutputTokens(input.model)
+    const reserved = config.compaction?.reserved ?? COMPACTION_BUFFER
+    const headroom = Math.max(reserved, maxOutput)
+    const base = input.model.limit.input ?? context
+    if (base <= headroom) return false
+    return count >= base - headroom
   }

   export const PRUNE_MINIMUM = 20_000
@@ -90,11 +133,23 @@
     if (pruned > PRUNE_MINIMUM) {
       for (const part of toPrune) {
         if (part.state.status === "completed") {
+          const mask = createObservationMask(part)
           part.state.time.compacted = Date.now()
+          part.state.metadata = {
+            ...part.state.metadata,
+            observation_mask: mask,
+          }
           await Session.updatePart(part)
         }
       }
       log.info("pruned", { count: toPrune.length })
+      Telemetry.track({
+        type: "tool_outputs_pruned",
+        timestamp: Date.now(),
+        session_id: input.sessionID,
+        count: toPrune.length,
+        tokens_pruned: pruned,
+      })
     }
   }

@@ -163,6 +218,15 @@ When constructing the summary, try to stick to this template:
 - [What important instructions did the user give you that are relevant]
 - [If there is a plan or spec, include information about it so next agent can continue using it]

+## Data Context
+
+- [What warehouse(s) or database(s) are we connected to?]
+- [What schemas, tables, or columns were discovered or are relevant?]
+- [What dbt models, sources, or tests are involved?]
+- [Any lineage findings (upstream/downstream dependencies)?]
+- [Any query patterns, anti-patterns, or optimization opportunities found?]
+- [Skip this section entirely if the task is not data-engineering related]
+
 ## Discoveries

 [What notable things were learned during this conversation that would be useful for the next agent to know when continuing the work]
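
For illustration, the mask logic added in this file can be exercised standalone by reducing `ToolPart` to just the fields the function reads (simplified sketch, not the real type):

```typescript
interface SimpleToolPart {
  tool: string
  state: { status: "completed"; output: string; input: Record<string, unknown> }
}

function formatBytes(bytes: number): string {
  if (bytes < 1024) return `${bytes} B`
  if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`
  return `${(bytes / (1024 * 1024)).toFixed(1)} MB`
}

function observationMask(part: SimpleToolPart): string {
  const output = part.state.output || ""
  const lines = output.split("\n").length
  const bytes = Buffer.byteLength(output, "utf8")
  // Untruncated arg rendering; the real code caps this at 80 chars.
  const args = Object.entries(part.state.input)
    .map(([k, v]) => `${k}: ${JSON.stringify(v)}`)
    .join(", ")
  const firstLine = output.split("\n")[0]?.slice(0, 80) || ""
  const fingerprint = firstLine ? ` — "${firstLine}"` : ""
  return `[Tool output cleared — ${part.tool}(${args}) returned ${lines} lines, ${formatBytes(bytes)}${fingerprint}]`
}
```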

packages/altimate-code/src/session/message-v2.ts

Lines changed: 4 additions & 1 deletion
@@ -617,7 +617,10 @@ export namespace MessageV2 {
       if (part.type === "tool") {
         toolNames.add(part.tool)
         if (part.state.status === "completed") {
-          const outputText = part.state.time.compacted ? "[Old tool result content cleared]" : part.state.output
+          const mask = part.state.metadata?.observation_mask
+          const outputText = part.state.time.compacted
+            ? (typeof mask === "string" && mask.length > 0 ? mask : "[Old tool result content cleared]")
+            : part.state.output
           const attachments = part.state.time.compacted ? [] : (part.state.attachments ?? [])

           // For providers that don't support media in tool results, extract media files
