Official Implementation of "Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding" (ICML'25)
Python · Updated May 14, 2026
Biological code organization system with 1,029+ production-ready snippets - 95% token reduction for Claude/GPT with AI-powered discovery & offline packs
Reduce Claude AI token consumption. Zero install to start. Python 3.7+ for auto-manifest generation.
Advanced token reduction and prompt optimization framework for LLMs, featuring linguistic, algorithmic, and architectural patterns.
Do dense LMs develop MoE-like specialization as they scale? Measure it, visualize it, and turn it into speed.
Packet-Switched Attention for stable 2-bit quantized MoE inference, with variance-aware routing and Protocol C benchmarks.
TokenCave is a browser extension for Claude AI that helps you monitor and optimize token usage with real-time counters, usage insights, and a “caveman mode” that dramatically reduces output length while preserving technical accuracy.