Enhance documentation with updates on token recommendation placements and model-switching guidelines for improved cost efficiency

Scribe · Scribe · commit 001d60e461aa · 2026-06-17T19:21:20.000+02:00
diff --git a/.squad/agents/bender/history.md b/.squad/agents/bender/history.md
@@ -97,3 +97,7 @@
 - Hermes completed the semantic cleanup, not just file removal: plugin metadata now advertises the guide + agent surface, repo docs no longer teach repo-local skill installation, and late Part 4 numbering is contiguous again.
 - Live filesystem and `rg` checks satisfied the main acceptance criteria: no top-level `skills/`, no `skills-lock.json`, and one canonical image copy under `docs/assets/`.
 - Remaining gap is environmental, not yet a content failure: strict MkDocs build still cannot be executed here because `mkdocs` is not installed.
+
+### 2026-06-17: Token Recommendation Placement Review — Approved
+
+- Hermes placed RTK Windows cautions, VS Code extension/profile cleanup, custom-agent cost control, model-switch risk, and Copilot CLI AIC value framing in natural locations without overstating hidden internals or exact billing math.
diff --git a/.squad/agents/hermes/history.md b/.squad/agents/hermes/history.md
@@ -9,6 +9,13 @@
 
 ## Learnings
 
+### 2026-06-17: Placement for Tool, Profile, Model-Switch, and AIC Recommendations
+
+- Added RTK Windows caveat beside existing RTK setup, not as a new technique. Readers need the warning where they copy commands.
+- Extension/profile cleanup belongs with MCP/tool costs and practical agent setup because extension-injected tools behave like hidden context surface.
+- Model mid-chat switching belongs in model-pricing anti-patterns with careful cache/history wording; avoid claiming fixed implementation internals.
+- Copilot CLI AIC counter fits habit-building/monthly maintenance: value framing is a behavior loop, not a setup prerequisite.
+
 ### 2026-04-14: Wrote TOKEN-OPTIMIZATION-GUIDE.md (v1)
 
 - 1107 lines. Followed Leela's outline (5 parts), filled with Farnsworth's data.
diff --git a/.squad/decisions.md b/.squad/decisions.md
@@ -196,6 +196,13 @@
 - Cleanup stays semantic, not just structural: docs and plugin metadata must stop advertising shipped installable skills.
 - Validation target stays bounded: live `rg` and filesystem checks must pass; strict MkDocs build remains desirable but may be blocked if `mkdocs` is unavailable.
 
+### 2026-06-17: Token Recommendations Placement
+
+**Author:** Hermes | **Status:** Active | **Requested by:** Marco Olivo
+
+- Placed RTK Windows caution in MCP/tool-cost sections, VS Code extension/profile and custom-agent guidance in MCP/practical setup, model-switch cache risk in model-pricing anti-patterns, and Copilot CLI AIC value-framing in habit-building maintenance.
+- Each recommendation now sits beside the mechanism it affects, avoiding a new page and keeping README changes limited to high-impact quick-start nudges.
+
 ## Governance
 
 - All meaningful changes require team consensus
diff --git a/README.md b/README.md
@@ -21,12 +21,12 @@ Don't have time to read the full guide? Do these today and cut your token usage:
 | 1 | **Request code-only responses** — add `Code only, no explanation.` to `copilot-instructions.md`. Highest per-token ROI: output costs 5× more than input, and this cuts 40-70% of output on every code task, permanently | Shrinks response length | 0 minutes |
 | 2 | **Constrain output format by default** — add `Bullets over paragraphs. No explanations unless asked.` to `copilot-instructions.md` | Keeps answers terse | 0 minutes |
 | 3 | **Shrink your always-on context** — compress `copilot-instructions.md` AND prune `AGENTS.md` to landmines only. Every token in either file is billed on every interaction (and every agent step). Strip filler, delete anything the agent discovers by reading code, delete LLM-generated `/init` boilerplate | Reduces always-on input/context | 15 minutes |
-| 4 | **Default to Auto model selection** — use Auto as the baseline because it chooses from the supported Auto pool and gives a paid-plan discount. Pin higher-cost models manually when a task clearly justifies them. See [Model Selection & Pricing](docs/11-models-and-pricing.md) | Lowers billed rate on eligible usage | 0 minutes |
+| 4 | **Default to Auto model selection** — use Auto as the baseline because it chooses from the supported Auto pool and gives a paid-plan discount. Pin higher-cost models manually only when a task clearly justifies them, and start a fresh chat when switching cost lanes in a long session. See [Model Selection & Pricing](docs/11-models-and-pricing.md) | Lowers billed rate on eligible usage | 0 minutes |
 | 5 | **Use Ask Mode for simple questions** — reserve Agent Mode for multi-step tasks | Avoids agent overhead | 0 minutes (just choose the right mode) |
 | 6 | **Scope context with `applyTo:` paths** — split one large instructions file into small scoped ones that load only when relevant | Reduces always-on input/context | 15 minutes |
 | 7 | **Be precise in your prompts** — "Add null check to `getUser()`" not "Can you please look at this and maybe add some error handling?" Note: your typed prompt is a small fraction of total input; precision matters more for quality than for raw token savings | Improves task targeting | 0 minutes |
 | 8 | **Retune prompts to the target model** — provider prompting guides change by model/version. Paste the official guide URL into Copilot and ask it to adapt `.github/copilot-instructions.md`, agent profiles, or app prompts for the model you actually use | Reduces rework | 10 minutes per model change |
-| 9 | **Audit your MCP servers** — disable servers you're not using; each costs ~100-500 tokens per agent step | Removes tool/schema overhead | 5 minutes |
+| 9 | **Audit your MCP servers and injected tools** — disable unused MCP servers and VS Code extensions that add skills/tools; use a clean coding profile or focused custom agent for repeat workflows. Each MCP tool costs ~100-500 tokens per agent step | Removes tool/schema overhead | 5-10 minutes |
 | 10 | **Convert rich files to Markdown before AI work** — `.docx`, `.pdf`, `.pptx`, `.xlsx`, HTML, images, audio, video, and ZIPs carry format tax. [Marc Bara's writeup](https://medium.com/@marc.bara.iniesta/your-docx-is-wasting-33-of-your-ai-budget-86a3d229d042) shows the cost; use [Microsoft MarkItDown](https://github.com/microsoft/markitdown) before chat, agent, or RAG ingestion | Reduces noisy input context | 5 minutes |
 | 11 | **Run `/chronicle improve` weekly** (**Copilot CLI only**, experimental) — this slash command works in interactive Copilot CLI sessions, not as a general Copilot Chat feature. It finds recurring confusion in your CLI session history and generates custom-instruction fixes so the same misread intent stops costing tokens forever | Cuts recurring rework | 2 minutes per run |
 | 12 | **Try CodeAct for long tool chains** (**Copilot CLI only**, optional external plugin) — [`copilot-codeact-plugin`](https://github.com/jsturtevant/copilot-codeact-plugin) collapses multi-step tool chains into one sandboxed execution, which can reduce repeated replay of system prompt, prior messages, and tool definitions | Reduces tool-loop replay | 10-15 minutes |
@@ -124,7 +124,7 @@ Ranked by cost impact. Output first — it costs 5× more per token than input.
 1. **Output control** — "Code only, no explanation" + terse default in `copilot-instructions.md`. 40-70% output savings on code tasks, 30-60% across all interactions. One instruction, permanent.
 2. **Shrink always-on context** (`copilot-instructions.md` + `AGENTS.md`) — compress filler, prune to landmines only, delete LLM-generated boilerplate. Compounds on every interaction and agent step; 20-23% agent-task reduction plus better correctness
 3. **Ask Mode for simple questions** — 60-90% savings by avoiding Agent overhead
-4. **Audit MCP servers** — disable unused servers, save 5K-190K tokens per agent task
+4. **Audit MCP servers and injected tools** — disable unused servers/extensions, or use a clean coding profile/custom agent, to save 5K-190K tokens per agent task
 5. **Auto model selection** — lower-cost default routing plus paid-plan discount on eligible usage, zero effort
 6. **Convert rich files to Markdown first** — avoid paying for Word/PDF/HTML layout noise in chat, agent, and RAG workflows
 7. **Retune prompts to the target model** — better first-pass output reduces repeated clarification turns
diff --git a/docs/06-workflow-optimization.md b/docs/06-workflow-optimization.md
@@ -99,7 +99,7 @@ Keep the claim bounded: this guide is **not** benchmarking CodeAct itself. The p
 
 CodeAct reduces the *number* of tool calls. [**RTK (Rust Token Killer)**](https://github.com/rtk-ai/rtk) reduces the *size* of each tool call's result. They address different sides of the same problem and can be used together.
 
-RTK is a CLI proxy that intercepts `git`, `cargo test`, `grep`, `ls`, and 100+ other dev commands and compresses their output before it reaches the agent — 60–90% savings per command. Unlike CodeAct, RTK works in all Copilot surfaces (VS Code, CLI, and other AI tools), not just Copilot CLI. See [MCP & Tool Costs §2.7.7](08-mcp-tool-costs.md#277-compress-tool-output-at-the-source-rtk) for setup and the full command list.
+RTK is a CLI proxy that intercepts `git`, `cargo test`, `grep`, `ls`, and 100+ other dev commands and compresses their output before it reaches the agent — 60–90% savings per command. Unlike CodeAct, RTK is not limited to Copilot CLI; it can help across Copilot surfaces when the shell hook is reliable. Treat Windows setups as a pilot, not a default rollout. See [MCP & Tool Costs §2.7.7](08-mcp-tool-costs.md#277-compress-tool-output-at-the-source-rtk) for setup and the full command list.
 
 ## 2.5.4 Default to Auto Model Selection
 
diff --git a/docs/08-mcp-tool-costs.md b/docs/08-mcp-tool-costs.md
@@ -18,7 +18,7 @@ Free Space:    55.3k (28%)
 Buffer:        40.4k (20%)
 ```
 
-**VS Code Copilot:** no equivalent command, but you can estimate your `System/Tools` baseline by counting active MCP servers × tools × ~200 tokens average (see §2.7.2).
+**VS Code Copilot:** no equivalent command, but you can estimate your `System/Tools` baseline by counting active MCP servers × tools × ~200 tokens average (see §2.7.2). Also audit extensions that add skills, agents, MCP servers, or tool surfaces. If an extension injects tools you do not need for coding, disable it for that workspace or move coding work into a VS Code profile with only the essentials enabled.
 
 **The critical distinction — always-loaded vs. on-demand:**
 
@@ -151,6 +151,8 @@ Don't enable every MCP server globally. Use workspace-level configuration:
 
 **The rule:** If you don't need it for the current task, disable it. You can always re-enable it later. Every idle MCP server costs tokens on every agent step.
 
+**VS Code extensions count too.** MCP servers are the obvious source of tool schemas, but some extensions also add skills, chat participants, agent profiles, or tool surfaces that can appear in the AI context. For cost-sensitive coding sessions, keep a lean VS Code profile: core language tooling, GitHub Copilot, and only the MCP/tools needed for that repo. Disable everything else at the workspace or profile level.
+
 ## 2.7.6 Practical Guidance
 
 1. **Audit your MCP servers** — run through your enabled servers. Do you actually use all of them? Disable the rest
@@ -160,7 +162,8 @@ Don't enable every MCP server globally. Use workspace-level configuration:
 5. **Custom instructions help** — add "Minimize tool calls. Read files only when necessary." to reduce call frequency
 6. **Use skills instead of MCPs for occasional capabilities** — MCP tool schemas load on every step whether used or not. Skills load only title and description upfront; the full content pulls on demand. If a capability is used in fewer than half your sessions, a skill is cheaper. See [Practical Setup §4.2](10-practical-setup.md#mcps-vs-skills-eager-vs-lazy-context-loading) for the full comparison
 7. **Optional, Copilot CLI only: try CodeAct for long tool chains** — external plugin [`copilot-codeact-plugin`](https://github.com/jsturtevant/copilot-codeact-plugin) collapses many small tool hops into one sandboxed execution. That does not shrink any one server's schema, but it can reduce how often the full tool catalog gets replayed on CLI-heavy tasks
-8. **Compress tool output at the source with RTK** — [RTK (Rust Token Killer)](https://github.com/rtk-ai/rtk) is a CLI proxy that filters the *results* of shell commands before they reach the agent. Confirmed to work well in VS Code Copilot (repo-by-repo setup). Reductions are real but vary by command and project output volume. See §2.7.7
+8. **Use a focused custom agent for repeat coding workflows** — a custom agent can carry a narrow tool list and stable instructions, so the same coding workflow starts with the same active surface instead of whatever the default chat currently exposes. Where your Copilot surface supports model selection in agent/profile files, pin the intended model there too
+9. **Compress tool output at the source with RTK** — [RTK (Rust Token Killer)](https://github.com/rtk-ai/rtk) is a CLI proxy that filters the *results* of shell commands before they reach the agent. Confirmed to work well in VS Code Copilot on macOS/Linux with repo-by-repo setup. Treat Windows as experimental and validate before rolling it out broadly. Reductions are real but vary by command and project output volume. See §2.7.7
 
 ## 2.7.7 Compress Tool Output at the Source: RTK
 
@@ -198,6 +201,8 @@ brew install rtk
 curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/refs/heads/master/install.sh | sh
 ```
 
+**Windows caveat:** RTK is most reliable today in Unix-like shells on macOS and Linux. On Windows, shell-hook behavior and path handling can be brittle, especially across PowerShell, Git Bash, WSL, and VS Code agent execution. Treat it as a pilot, not a default recommendation: test it on the exact repo and shell your team uses, and skip it if the setup causes command failures or noisy behavior.
+
 **Setting up for VS Code Copilot — per-repo:**
 
 For VS Code Copilot, RTK installs a PreToolUse hook scoped to the current repository. Run this once inside each repo where you want RTK active:
diff --git a/docs/10-practical-setup.md b/docs/10-practical-setup.md
@@ -297,7 +297,9 @@ Mock: external services only. No impl mocking.
 Coverage: branch coverage ≥80%.
 ```
 
-Focused agents carry less instruction overhead than a general-purpose instruction set.
+Focused agents carry less instruction overhead than a general-purpose instruction set. They also give you a stable control surface: the same task profile can declare the tools it is allowed to use, the instructions it carries, and, where your Copilot surface supports it, the model it should use. For repeat coding workflows, prefer a focused custom agent over the default agent when you care about predictable cost. The default agent inherits more of the current environment: active tools, extension-provided surfaces, and whatever model is currently selected.
+
+Keep the tool list narrow. This repo's `agents/token-saver.agent.md` is the pattern: built-in `bash`, `edit`, and `view`; no duplicate filesystem MCP; terse output rules; explicit tool minimization.
 
 ### 4.3.6 Compress Shell Command Output with RTK
 
@@ -317,6 +319,8 @@ rtk init --copilot
 
 RTK installs a PreToolUse hook into the current repository. Repeat per repo — there is no global VS Code Copilot install. Once active, the hook is transparent: your terminal is unchanged; only the agent's Bash tool calls are intercepted.
 
+On Windows, validate RTK before recommending it to a team. The hook path can be more fragile across PowerShell, Git Bash, WSL, and VS Code agent execution. If RTK adds setup friction or command failures, skip it and focus first on clean profiles, fewer MCP servers, precise prompts, and shorter command output.
+
 Commands with verbose output (test failures, large diffs) see the biggest reductions. Short-output commands see smaller gains. Actual savings depend on your project's output volume.
 
 Combine with `copilot-setup-steps.yml` (§4.3.2) and precise issue descriptions (§4.3.3) for maximum session efficiency. Full setup, command list, and other AI tool support: [MCP & Tool Costs §2.7.7](08-mcp-tool-costs.md#277-compress-tool-output-at-the-source-rtk).
@@ -335,8 +339,10 @@ Combine with `copilot-setup-steps.yml` (§4.3.2) and precise issue descriptions
 - Review your `copilot-instructions.md` — has it grown? Compress it back down
 - Check if any memory files have gotten verbose — compress them back down
 - Audit which files are habitually open in your editor — close ones you're not working on (open tabs auto-feed context)
+- Audit VS Code profiles and extensions — disable extensions that inject AI skills, agents, MCP servers, or tools unless the current repo needs them
 - (Business/Enterprise) Review repository / org **Content Exclusion** settings for new sensitive paths
 - Check your model usage — are you pinning high-effort models for tasks Auto would route to a cheaper tier?
+- In Copilot CLI, watch the bottom-right **AIC** counter. Divide by 100 for the approximate dollar value, then ask whether the output was worth more than it cost. If spend is high for weak output, treat that as feedback on prompt scope, context size, tool count, or model choice
 - Review budgets, user-level caps, and model policies before expanding premium access further
 - When default model changes, retune prompts/instructions against that provider's current prompting guide
 - Check token usage by user/team — are agents and power users driving outsized consumption? See [Enterprise Governance](12-enterprise-governance.md)
@@ -461,6 +467,8 @@ Relevant settings that affect agent token usage:
 
 **`maxRequests`** caps how many tool-call requests the agent can make. Lower = fewer tokens, but the agent might not finish complex tasks. Start at 10-15, increase only when needed.
 
+For repeat workflows, pair this with a custom agent profile and a clean VS Code profile. Disable extensions that inject skills, agents, MCP servers, or tool surfaces you do not need for coding. The most predictable setup is boring: one focused agent, one intended model, and only the tools required for the repo.
+
 ### 4.5.5 Custom Instructions for Agent Efficiency
 
 Add to `.github/copilot-instructions.md`:
diff --git a/docs/11-models-and-pricing.md b/docs/11-models-and-pricing.md
@@ -122,11 +122,14 @@ This is especially relevant when comparing a cheap reasoning-capable model at `m
 ### Anti-patterns
 
 - Leaving an expensive premium model pinned for the whole session
+- Changing models mid-chat in a long session without thinking about accumulated context. Prior messages, tool results, and cacheable prefixes can still be part of the next request; switching into a higher-cost lane can make that carried context more expensive than starting fresh
 - Assuming Auto will escalate to Opus when a task gets hard
 - Using vendor API prices and Copilot pricing signals as if they were the same metric
 - Recommending a model without checking whether the plan includes it
 - Turning on every premium model for the whole org before checking who actually needs it
 
+**Model-switch rule:** choose the cost lane before the work starts. If you need to move from cheap/Auto to a premium model for a hard subtask, start a fresh chat with only the relevant summary and files. This leaves the original session's context stable and cache-friendly, and avoids dragging a long low-value history into a higher-cost request. The exact billing implementation can change by surface and plan, so treat this as risk control rather than guaranteed repricing math.
+
 ## Org Rollout Rule: Review Before Enablement
 
 For teams, model choice is a governance problem as much as a prompt problem.