Replies: 2 comments
I'm looking at lots of different ways of doing this, but the current state is still better than the default, because you end up doing less work and therefore spending fewer tokens anyway. We are actively working on making the algorithm smaller, tighter, and more efficient, as well as looking for opportunities to improve token usage throughout the system.
I just made a model router that presents itself as a single model. It has high availability and, configured correctly, is more efficient than any one model. I still have some tweaking to do and am still exploring different free models to see what could actually be completely free. There are three modes (free-first, balanced, and deep), plus a speed attribute that affects which models are prioritized. Beyond that, models are prioritized based on the kind of task coming in. The router learns over time which models perform better, and those get promoted in the fallback chain. A small, relevant slice of the context is passed between models when they change places, and some input and output is filtered to save extra tokens. Unfortunately, today's code assistants are very bloated and send a lot of unnecessary tokens, which makes free models very hard to use in the long run. So I am currently working on one of my own that is as slimmed down as possible to get use out of free models. https://github.com/marbad1994/makkorch-model-router
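The fallback-chain behavior described above can be sketched roughly like this. This is an illustrative toy, not the actual makkorch-model-router code: the class name, scoring rule, and mode handling are all assumptions.

```python
class ModelRouter:
    """Illustrative fallback-chain router: prefers free models and
    promotes models that succeed more often (assumed behavior)."""

    def __init__(self, models, mode="free-first"):
        # models: list of (name, is_free) tuples in initial priority order
        self.mode = mode
        self.chain = list(models)
        self.stats = {name: {"ok": 0, "fail": 0} for name, _ in models}

    def _score(self, name):
        s = self.stats[name]
        total = s["ok"] + s["fail"]
        return s["ok"] / total if total else 0.5  # unseen models get a neutral score

    def _ordered(self):
        # free-first mode: free models before paid ones, each group by learned score
        if self.mode == "free-first":
            key = lambda m: (not m[1], -self._score(m[0]))
        else:  # "balanced" / "deep": learned score only (simplification)
            key = lambda m: -self._score(m[0])
        return sorted(self.chain, key=key)

    def complete(self, prompt, call_model):
        # Walk the fallback chain until one model answers
        for name, _ in self._ordered():
            try:
                result = call_model(name, prompt)
                self.stats[name]["ok"] += 1  # success promotes the model over time
                return name, result
            except Exception:
                self.stats[name]["fail"] += 1  # failure demotes it
        raise RuntimeError("all models failed")
```

A model that keeps erroring accumulates failures, its score drops, and the sort pushes it down the chain, which is the "learns over time" part in miniature.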
Token Optimization Strategy for Claude Max Plan
The Problem
I'm on the $200 Claude Max plan and hitting my weekly token limit consistently—currently at 90% usage with three days still remaining before reset. I've been running Opus extensively without much optimization, and while switching to the API is an option, I'm not ready to potentially spend several hundred more dollars in a couple of days.
I finally asked DORA (my PAI instance) for optimization recommendations. The biggest opportunities identified were:
My main concern: The structured loading approach for SKILL.md might break the algorithm that currently works well, especially when combined with switching to a less capable model. I'm interested in hearing from others who've implemented similar optimizations.
I'm implementing the restructure now and will report back in a few days on the results. For context, here are the findings and recommendations from my red team analysis:
Token Usage Breakdown
Here's where tokens are actually being consumed:
Optimization Recommendations (Priority Order)
1. Restructure SKILL.md (HIGHEST IMPACT, Zero Quality Risk)
The Issue: The 83KB SKILL.md loads ~20,750 tokens into EVERY turn.
The Solution: Split it into:
Impact: Saves 15K-18K tokens per turn. At 20 turns/session, that's 300K-360K fewer input tokens per session.
2. Use Sonnet as Default (With Strategic Guardrails)
Blindly switching to Sonnet creates correctness regressions. The smart approach:
Use Sonnet for:
Stay on Opus for:
Key insight: Don't rely on manual /model switching. Consider having the CapabilityRecommender hook suggest a model tier as part of effort-level classification.
3. Specify Model Tiers on Agent Spawns
Note: Red Team specifically needs Sonnet or better—Haiku produces shallow critiques that appear thorough but lack depth.
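A spawn-time tier map might look like the sketch below. The agent names and tier assignments are illustrative (not the actual PAI configuration); the one constraint taken from the note above is that Red Team is pinned to Sonnet or better.

```python
# Hypothetical tier map: which model each spawned agent type gets by default.
AGENT_MODEL_TIERS = {
    "researcher": "haiku",    # broad, shallow retrieval is fine on a small model
    "summarizer": "haiku",
    "coder": "sonnet",
    "architect": "opus",      # high-stakes design decisions stay on the top tier
    "red-team": "sonnet",     # never below Sonnet: Haiku critiques look thorough but lack depth
}

def model_for_agent(agent_type, default="sonnet"):
    """Pick the model tier for a spawned agent, falling back to a safe default."""
    return AGENT_MODEL_TIERS.get(agent_type, default)
```

Encoding the tiers at spawn time, rather than relying on whatever model the parent session happens to be running, is what keeps the Sonnet-by-default strategy from silently degrading the agents that genuinely need a stronger model.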
4. Quick Wins (Low Effort, Immediate Impact)
What NOT To Do (Red Team Warnings)
Realistic Savings Estimate
Discussion Question
Has anyone else implemented structured loading of their SKILL.md for context saving? What were your results? I'm particularly interested in whether this approach maintained quality while reducing token consumption.
I'll update this thread in a few days with my results from the restructure.