# feat: context management improvements — overflow recovery, observation masking, token estimation #35

**Merged** · 6 commits

## Commits

- `64c004a` feat: context management improvements — overflow recovery, observatio… (anandgupta42)
- `5206379` fix: reset compactionAttempts counter after successful processing step (anandgupta42)
- `599537e` test: add compaction loop protection tests (anandgupta42)
- `5aab4bc` fix: remove double-deduction in isOverflow for non-limit.input models (anandgupta42)
- `3df53b5` fix: address multi-model code review findings on PR #35 (anandgupta42)
- `5a46b9f` docs: update context management docs for review findings (anandgupta42)

---

# Context Management

altimate-code automatically manages conversation context so you can work through long sessions without hitting model limits. When a conversation grows large, the CLI summarizes older messages, prunes stale tool outputs, and recovers from provider overflow errors — all without losing the important details of your work.

## How It Works

Every LLM has a finite context window. As you work, each message, tool call, and tool result adds tokens to the conversation. When the conversation approaches the model's limit, altimate-code takes action:

1. **Prune** — Old tool outputs (file reads, command results, query results) are replaced with compact summaries
2. **Compact** — The entire conversation history is summarized into a continuation prompt
3. **Continue** — The agent picks up where it left off using the summary

This happens automatically by default. You do not need to manually manage context. A rough sketch of the flow is shown below.
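
The following sketch illustrates that flow with hypothetical helper names (`estimateTokens`, `pruneOldToolOutputs`, `compactConversation`); it is an assumption about structure, not the actual implementation:

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Stubs standing in for the real session-layer functions.
declare function pruneOldToolOutputs(messages: Message[]): Message[];
declare function compactConversation(messages: Message[]): Promise<Message[]>;

// Placeholder estimator; the real one is content-aware (see "Token Estimation").
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

async function ensureContextFits(
  messages: Message[],
  contextLimit: number,
  reserved: number,
): Promise<Message[]> {
  if (estimateTokens(messages) < contextLimit - reserved) return messages;

  // Step 1 (Prune): replace old tool outputs with compact fingerprints.
  let next = pruneOldToolOutputs(messages);

  // Step 2 (Compact): if pruning was not enough, summarize the whole
  // history into a continuation prompt.
  if (estimateTokens(next) >= contextLimit - reserved) {
    next = await compactConversation(next);
  }

  // Step 3 (Continue): the agent loop resumes with the smaller history.
  return next;
}
```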

## Auto-Compaction

When enabled (the default), altimate-code monitors token usage after each model response. If the conversation is approaching the context limit, it triggers compaction automatically.

During compaction:

- A dedicated compaction agent summarizes the full conversation
- The summary captures goals, progress, discoveries, relevant files, and next steps
- The original messages are retained in session history, but the model continues from the summary
- After compaction, the agent automatically continues working if there are clear next steps

You will see a compaction indicator in the TUI when this happens. The conversation continues seamlessly.

!!! tip
    If you notice compaction happening frequently, consider using a model with a larger context window or breaking your task into smaller sessions.

## Observation Masking (Pruning)

Before compaction, altimate-code prunes old tool outputs to reclaim context space. This is called "observation masking."

When a tool output is pruned, it is replaced with a brief fingerprint:

```
[Tool output cleared — read_file(file: src/main.ts) returned 42 lines, 1.2 KB — "import { App } from './app'"]
```

This tells the model which tool was called, what arguments were used, how much output it produced, and the first line of the result — enough to maintain continuity without consuming tokens.

**Pruning rules** (a sketch of the fingerprint format follows this list):

- Only tool outputs older than the most recent 2 turns are eligible
- The most recent ~40,000 tokens of tool outputs are always preserved
- Pruning only fires when at least 20,000 tokens can be reclaimed
- `skill` tool outputs are never pruned (they contain critical session context)
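
A minimal sketch of how such a fingerprint could be built; the shape follows the example above, while helper names and the byte-size formatting are assumptions:

```typescript
interface ToolOutput {
  tool: string;                 // e.g. "read_file"
  args: Record<string, string>; // e.g. { file: "src/main.ts" }
  text: string;                 // the full output being cleared
}

// Build a one-line fingerprint for a cleared tool output.
function fingerprint(o: ToolOutput): string {
  const args = Object.entries(o.args)
    .map(([key, value]) => `${key}: ${value}`)
    .join(", ");
  const lines = o.text.split("\n");
  const bytes = new TextEncoder().encode(o.text).length;
  const size = bytes >= 1024 ? `${(bytes / 1024).toFixed(1)} KB` : `${bytes} B`;
  return `[Tool output cleared — ${o.tool}(${args}) returned ${lines.length} lines, ${size} — "${lines[0] ?? ""}"]`;
}

function maybePrune(o: ToolOutput, isRecent: boolean): string {
  // Recent outputs and `skill` outputs are always kept verbatim.
  if (isRecent || o.tool === "skill") return o.text;
  return fingerprint(o);
}
```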

## Data Engineering Context

Compaction is aware of data engineering workflows. When summarizing a conversation, the compaction prompt preserves:

- **Warehouse connections** — which databases or warehouses are connected
- **Schema context** — discovered tables, columns, and relationships
- **dbt project state** — models, sources, tests, and project structure
- **Lineage findings** — upstream and downstream dependencies
- **Query patterns** — SQL dialects, anti-patterns, and optimization opportunities
- **FinOps context** — cost findings and warehouse sizing recommendations

This means you can run a long data exploration session and compaction will not lose track of which schemas you discovered, which dbt models you were working on, or which cost optimizations you identified.

## Provider Overflow Detection

If compaction does not trigger in time and the model returns a context overflow error, altimate-code detects it and automatically compacts the conversation.

Overflow detection works with all major providers:

| Provider | Detection |
|----------|-----------|
| Anthropic | "prompt is too long" |
| OpenAI | "exceeds the context window" |
| AWS Bedrock | "input is too long for requested model" |
| Google Gemini | "input token count exceeds the maximum" |
| Azure OpenAI | "the request was too long" |
| Groq | "reduce the length of the messages" |
| OpenRouter / DeepSeek | "maximum context length is N tokens" |
| xAI (Grok) | "maximum prompt length is N" |
| GitHub Copilot | "exceeds the limit of N" |
| Ollama / llama.cpp / LM Studio | Various local server messages |

When an overflow is detected, the CLI automatically compacts and retries. No action is needed on your part. A sketch of the matching logic follows.
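
Conceptually, detection is a match of the provider's error message against known phrases. The sketch below uses the phrases from the table; the actual matching logic in the CLI may differ:

```typescript
// Known overflow phrases, one per provider family (see table above).
const OVERFLOW_PATTERNS: RegExp[] = [
  /prompt is too long/i,                    // Anthropic
  /exceeds the context window/i,            // OpenAI
  /input is too long for requested model/i, // AWS Bedrock
  /input token count exceeds the maximum/i, // Google Gemini
  /the request was too long/i,              // Azure OpenAI
  /reduce the length of the messages/i,     // Groq
  /maximum context length is \d+ tokens/i,  // OpenRouter / DeepSeek
  /maximum prompt length is \d+/i,          // xAI (Grok)
  /exceeds the limit of \d+/i,              // GitHub Copilot
];

function isContextOverflowError(errorMessage: string): boolean {
  return OVERFLOW_PATTERNS.some((pattern) => pattern.test(errorMessage));
}
```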

!!! note
    Some providers (such as z.ai) may accept oversized inputs silently. For these, the automatic token-based compaction trigger is the primary safeguard.

## Configuration

Control context management behavior in `altimate-code.json`:

```json
{
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 4096
  }
}
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `auto` | `boolean` | `true` | Automatically compact when the context window is nearly full |
| `prune` | `boolean` | `true` | Prune old tool outputs before compaction |
| `reserved` | `number` | `20000` | Token buffer to reserve below the context limit. Increase this if you see frequent overflow errors |
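
Per the review discussion on this PR, the `reserved` buffer combines with the model's maximum output tokens into a single headroom deduction. A sketch of that trigger condition (function and parameter names are illustrative):

```typescript
// Auto-compaction trigger: compact once usage eats into the headroom,
// where headroom is the larger of `reserved` and the model's max output.
function shouldCompact(
  usedTokens: number,
  contextLimit: number,
  reserved: number,  // compaction.reserved (default 20000)
  maxOutput: number, // model's max output tokens (often ~32K)
): boolean {
  const headroom = Math.max(reserved, maxOutput);
  return usedTokens >= contextLimit - headroom;
}
```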

### Disabling Auto-Compaction

If you prefer to manage context manually (for example, by starting new sessions), disable auto-compaction:

```json
{
  "compaction": {
    "auto": false
  }
}
```

!!! warning
    With auto-compaction disabled, you may hit context overflow errors during long sessions. The CLI will still detect and recover from these, but the experience will be less smooth.

### Manual Compaction

You can trigger compaction at any time from the TUI by pressing `leader` + `c`, or by using the `/compact` command in conversation. This is useful when you want to create a checkpoint before switching tasks.

## Token Estimation

altimate-code uses content-aware heuristics to estimate token counts without calling a tokenizer. This keeps overhead low while maintaining accuracy.

The estimator detects content type and adjusts its ratio:

| Content Type | Characters per Token | Detection |
|--------------|---------------------|-----------|
| Code | ~3.0 | High density of `{}();=` characters |
| JSON | ~3.2 | Starts with `{` or `[`, high density of `{}[]:,"` |
| SQL | ~3.5 | Contains SQL keywords (`SELECT`, `FROM`, `JOIN`, etc.) |
| Plain text | ~4.0 | Default for prose and markdown |
| Mixed | ~3.7 | Fallback for content that does not match a specific type |

These ratios are tuned against the cl100k_base tokenizer used by Claude and GPT-4 models. The estimator samples the first 500 characters of content to classify it, so the overhead is negligible.
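
A simplified sketch of this estimator using the table's ratios; the density thresholds here are invented for illustration, since the real classifier's cutoffs are not documented:

```typescript
const SQL_KEYWORDS = /\b(SELECT|FROM|JOIN|WHERE|GROUP BY|INSERT|UPDATE)\b/i;

// Classify a sample and return the chars-per-token ratio from the table.
function charsPerToken(sample: string): number {
  if (sample.length === 0) return 4.0;
  const density = (re: RegExp) =>
    (sample.match(re) ?? []).length / sample.length;
  const trimmed = sample.trimStart();
  const looksLikeJson = trimmed.startsWith("{") || trimmed.startsWith("[");
  if (looksLikeJson && density(/[{}\[\]:,"]/g) > 0.15) return 3.2; // JSON
  if (density(/[{}();=]/g) > 0.05) return 3.0; // code
  if (SQL_KEYWORDS.test(sample)) return 3.5; // SQL
  return 4.0; // plain text (a mixed-content fallback of ~3.7 also exists)
}

function estimateTokenCount(text: string): number {
  // Classify from the first 500 characters only, as described above.
  const ratio = charsPerToken(text.slice(0, 500));
  return Math.ceil(text.length / ratio);
}
```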

---

## `packages/altimate-code/src/session/PAID_CONTEXT_FEATURES.md` (69 additions, 0 deletions)

# Paid Context Management Features

These features are planned for implementation in altimate-core (Rust) and gated behind license key verification.

## 1. Precise Token Counting

**Bridge method:** `context.count_tokens(text, model_family) -> number`

Uses tiktoken-rs in altimate-core for exact model-specific token counts. Replaces the heuristic estimation in `token.ts`. Supports cl100k_base (GPT-4/Claude), o200k_base (GPT-4o), and future tokenizers.

**Benefits:**

- Eliminates 20-30% estimation error
- Precise compaction triggering — no late or early compaction
- Accurate token budget allocation

## 2. Smart Context Scoring

**Bridge method:** `context.score_relevance(items[], query) -> scored_items[]`

Embedding-based relevance scoring for context items. Used before compaction to drop the lowest-scoring items first, preserving the most relevant conversation history. Uses a local embeddings model (no external API calls required).

**Benefits:**

- Drops irrelevant context before compaction
- Preserves high-value conversation segments
- Reduces unnecessary compaction cycles

## 3. Schema Compression

**Bridge method:** `context.compress_schema(schema_ddl, token_budget) -> compressed_schema`

Schemonic-style ILP (Integer Linear Programming) optimization. Extends the existing `altimate_core_optimize_context` tool. Achieves ~2x token reduction on schema DDL without accuracy loss by intelligently abbreviating column names, removing redundant constraints, and merging similar table definitions.

**Benefits:**

- Fits 2x more schema context in the same token budget
- No accuracy loss on downstream SQL generation
- Works with all warehouse dialects

## 4. Lineage-Aware Context Selection

**Bridge method:** `context.select_by_lineage(model_name, manifest, hops) -> relevant_tables[]`

Uses the dbt DAG / lineage graph to scope relevant tables. PageRank-style relevance scoring weights tables by proximity and importance in the dependency graph. A configurable hop distance controls the breadth of context.

**Benefits:**

- Only includes tables relevant to the current model or query
- Reduces schema context by 60-80% for large warehouses
- Leverages existing dbt manifest parsing

## 5. Semantic Schema Catalog

**Bridge method:** `context.generate_catalog(schema, sample_data) -> yaml_catalog`

YAML-based semantic views (similar to Snowflake Cortex Analyst). Auto-generates business descriptions, data types, and relationships from schema plus sample data. Serves as a compressed, human-readable schema representation.

**Benefits:**

- Business-friendly context for the LLM
- More token-efficient than raw DDL
- Auto-generates from existing schema metadata

## 6. Context Budget Allocator

**Bridge method:** `context.allocate_budget(model_limit, task_type) -> { system, schema, conversation, output }`

Explicit token allocation across categories, with dynamic adjustment based on task type (query writing vs. debugging vs. optimization). Prevents any single category from consuming the entire context window.

**Benefits:**

- Prevents schema from crowding out conversation history
- Task-appropriate allocation (more schema for query writing, more conversation for debugging)
- Works with the compaction system to respect budgets
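
Taken together, the planned bridge surface could be typed on the TypeScript side roughly as follows. These typings are an assumption inferred from the signatures above, not a published API; the `task_type` labels in particular are guessed:

```typescript
// Hypothetical typings for the planned altimate-core context bridge.
interface ScoredItem {
  id: string;
  text: string;
  score: number; // relevance score assigned by the local embeddings model
}

interface ContextBridge {
  count_tokens(text: string, modelFamily: string): number;
  score_relevance(items: { id: string; text: string }[], query: string): ScoredItem[];
  compress_schema(schemaDdl: string, tokenBudget: number): string;
  select_by_lineage(modelName: string, manifest: unknown, hops: number): string[];
  generate_catalog(schema: string, sampleData: unknown): string; // YAML text
  allocate_budget(
    modelLimit: number,
    taskType: "query_writing" | "debugging" | "optimization", // assumed labels
  ): { system: number; schema: number; conversation: number; output: number };
}
```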

---

## Review Comment

Valid finding — fixed. The non-`limit.input` path was incorrectly subtracting both `maxOutput` and `reserved` (a double deduction of 20K tokens). Simplified both paths to use a single `headroom = Math.max(reserved, maxOutput)` deduction, which matches the original behavior for default configs (`maxOutput` typically dominates at 32K) while still respecting a custom `reserved` config when set higher.
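
A worked illustration of the fix, using assumed values (a 200K context limit is hypothetical; only the headroom formula comes from the comment above):

```typescript
// Non-limit.input path, before vs. after the fix.
const contextLimit = 200_000; // assumed model context window
const maxOutput = 32_000;     // typical max output tokens
const reserved = 20_000;      // default compaction.reserved

// Before: both values subtracted, understating the usable window
// by min(reserved, maxOutput) = 20K tokens.
const usableBefore = contextLimit - maxOutput - reserved; // 148,000

// After: a single headroom deduction, consistent with the limit.input path.
const usableAfter = contextLimit - Math.max(reserved, maxOutput); // 168,000

const isOverflow = (usedTokens: number): boolean => usedTokens >= usableAfter;
```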