Replies: 2 comments
I'm looking at lots of different ways of doing this, but the current state is still better than the default, because you end up doing less work and therefore spending fewer tokens anyway. We are actively working on making the algorithm smaller, tighter, and more efficient, as well as looking for opportunities to improve token usage throughout the system.
I just made a model router that presents itself as a single model. It has high availability and, configured correctly, is more efficient than any one model. I still have some tweaking to do and am still exploring different free models to see what could actually be completely free. There are three modes (free-first, balanced, and deep), plus a speed attribute that affects which models are prioritized. Beyond that, models are prioritized based on the kind of task coming in. The router learns over time which models perform better, and those get promoted in the fallback chain. A small, relevant slice of the context is passed between models when they change places, and some input and output is filtered to save extra tokens. Unfortunately, today's code assistants are very bloated and send a lot of unnecessary tokens, which makes free models very hard to use in the long run. So I am currently working on one of my own that is as slimmed down as possible to get use out of free models. https://github.com/marbad1994/makkorch-model-router
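The fallback-chain behavior described above can be sketched roughly like this. This is an illustrative toy, not the actual makkorch-model-router code: the class name, scoring rule, and mode handling are all assumptions.

```python
class ModelRouter:
    """Illustrative fallback-chain router: prefers free models and
    promotes models that succeed more often (assumed behavior)."""

    def __init__(self, models, mode="free-first"):
        # models: list of (name, is_free) tuples in initial priority order
        self.mode = mode
        self.chain = list(models)
        self.stats = {name: {"ok": 0, "fail": 0} for name, _ in models}

    def _score(self, name):
        s = self.stats[name]
        total = s["ok"] + s["fail"]
        return s["ok"] / total if total else 0.5  # unseen models get a neutral score

    def _ordered(self):
        # free-first mode: free models before paid ones, each group by learned score
        if self.mode == "free-first":
            key = lambda m: (not m[1], -self._score(m[0]))
        else:  # "balanced" / "deep": learned score only (simplification)
            key = lambda m: -self._score(m[0])
        return sorted(self.chain, key=key)

    def complete(self, prompt, call_model):
        # Walk the fallback chain until one model answers
        for name, _ in self._ordered():
            try:
                result = call_model(name, prompt)
                self.stats[name]["ok"] += 1  # success promotes the model over time
                return name, result
            except Exception:
                self.stats[name]["fail"] += 1  # failure demotes it
        raise RuntimeError("all models failed")
```

A model that keeps erroring accumulates failures, its score drops, and the sort pushes it down the chain, which is the "learns over time" part in miniature.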
Token Optimization Strategy for Claude Max Plan
The Problem
I'm on the $200 Claude Max plan and hitting my weekly token limit consistently—currently at 90% usage with three days still remaining before reset. I've been running Opus extensively without much optimization, and while switching to the API is an option, I'm not ready to potentially spend several hundred more dollars in a couple of days.
I finally asked DORA (my PAI instance) for optimization recommendations. The biggest opportunities identified were:
My main concern: The structured loading approach for SKILL.md might break the algorithm that currently works well, especially when combined with switching to a less capable model. I'm interested in hearing from others who've implemented similar optimizations.
I'm implementing the restructure now and will report back in a few days on the results. For context, here are the findings and recommendations from my red team analysis:
Token Usage Breakdown
Here's where tokens are actually being consumed:
Optimization Recommendations (Priority Order)
1. Restructure SKILL.md (HIGHEST IMPACT, Zero Quality Risk)
The Issue: The 83KB SKILL.md loads ~20,750 tokens into EVERY turn.
The Solution: Split it into:
Impact: Saves 15K-18K tokens per turn. At 20 turns/session, that's 300K-360K fewer input tokens per session.
2. Use Sonnet as Default (With Strategic Guardrails)
Blindly switching to Sonnet creates correctness regressions. The smart approach:
Use Sonnet for:
Stay on Opus for:
Key insight: Don't rely on manual /model switching. Consider having the CapabilityRecommender hook suggest a model tier as part of effort-level classification.
3. Specify Model Tiers on Agent Spawns
Note: Red Team specifically needs Sonnet or better—Haiku produces shallow critiques that appear thorough but lack depth.
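A spawn-time tier map might look like the sketch below. The agent names and tier assignments are illustrative (not the actual PAI configuration); the one constraint taken from the note above is that Red Team is pinned to Sonnet or better.

```python
# Hypothetical tier map: which model each spawned agent type gets by default.
AGENT_MODEL_TIERS = {
    "researcher": "haiku",    # broad, shallow retrieval is fine on a small model
    "summarizer": "haiku",
    "coder": "sonnet",
    "architect": "opus",      # high-stakes design decisions stay on the top tier
    "red-team": "sonnet",     # never below Sonnet: Haiku critiques look thorough but lack depth
}

def model_for_agent(agent_type, default="sonnet"):
    """Pick the model tier for a spawned agent, falling back to a safe default."""
    return AGENT_MODEL_TIERS.get(agent_type, default)
```

Encoding the tiers at spawn time, rather than relying on whatever model the parent session happens to be running, is what keeps the Sonnet-by-default strategy from silently degrading the agents that genuinely need a stronger model.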
4. Quick Wins (Low Effort, Immediate Impact)
What NOT To Do (Red Team Warnings)
Realistic Savings Estimate
Discussion Question
Has anyone else implemented structured loading of their SKILL.md for context saving? What were your results? I'm particularly interested in whether this approach maintained quality while reducing token consumption.
I'll update this thread in a few days with my results from the restructure.