| 1 | +<div align="center"> |
| 2 | + |
| 3 | +# ✂️ tok Architecture |
| 4 | + |
| 5 | +**Tokenizer, Compressor & Secrets Scanner for AI Agents** |
| 6 | + |
| 9 | + |
| 10 | +</div> |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## 🎯 Overview |
| 15 | + |
| 16 | +tok is a tokenization, compression, secret-scanning, and rate-limiting library for AI coding agents. It reduces LLM token costs by **60–90%** through input compression, output filtering, and transparent command rewriting. |
| 17 | + |
| 18 | +> 💡 Pure Go library — no network service, no CLI required. |
| 19 | + |
| 20 | +--- |
| 21 | + |
| 22 | +## 🧱 Components |
| 23 | + |
| 24 | +``` |
| 25 | +tok/ |
| 26 | +├── api/openapi.yaml 📜 Library API surface reference |
| 27 | +├── tok.go 📤 Public API: Compress(), EstimateTokens() |
| 28 | +├── compressor.go 🔄 Reusable Compressor struct |
| 29 | +├── options.go ⚙️ Option, Mode, Tier, With* functions, presets |
| 30 | +├── secrets.go 🔒 SecretDetector, DetectSecrets(), RedactSecrets() |
| 31 | +├── stats.go 📊 Stats returned from Compress() |
| 32 | +├── stream.go 📡 Stream processing |
| 33 | +└── internal/ |
| 34 | + ├── core/ 🧮 BPE tokenizer, token estimation |
| 35 | + ├── filter/ 🔧 31-layer filter pipeline, tier configs |
| 36 | + ├── codeaware/ 💻 Language-specific compression rules |
| 37 | + ├── secrets/ 🔑 Regex patterns, entropy analysis, allowlists |
| 38 | + ├── cache/ 💾 Compression result caching |
| 39 | + ├── fastops/ ⚡ Performance-critical operations |
| 40 | + └── config/ ⚙️ Configuration management |
| 41 | +``` |
| 42 | + |
| 43 | +--- |
| 44 | + |
| 45 | +## 📤 Public API |
| 46 | + |
| 47 | +```go |
| 48 | +// 🗜️ One-shot compression |
| 49 | +compressed, stats, err := tok.Compress(text, |
| 50 | + tok.WithTier(tok.TierCode), |
| 51 | + tok.WithBudget(4000), |
| 52 | + tok.WithQuery("implement OAuth flow"), |
| 53 | +) |
| 54 | + |
| 55 | +// 🔄 Reusable compressor (caches tokenizer state) |
| 56 | +c := tok.NewCompressor(tok.Aggressive) |
| 57 | +compressed, stats, err = c.Compress(text) // plain assignment: the variables are already declared above |
| 58 | + |
| 59 | +// 📊 Token estimation |
| 60 | +approx := tok.EstimateTokens(text) // fast, ±5% |
| 61 | +precise := tok.EstimateTokensPrecise(text) // BPE-accurate |
| 62 | + |
| 63 | +// 🧮 Warmup (call at startup to avoid first-call latency) |
| 64 | +tok.WarmupTokenizer() |
| 65 | + |
| 66 | +// 🔒 Secret detection |
| 67 | +matches := tok.DefaultSecretDetector().DetectSecrets(text) |
| 68 | +redacted := tok.DefaultSecretDetector().RedactSecrets(text) |
| 69 | +``` |
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## 📊 Compression Tiers |
| 74 | + |
| 75 | +| Tier | Description | Savings | |
| 76 | +|------|-------------|:-------:| |
| 77 | +| 🟢 `TierSurface` | Light deduplication | ~10% | |
| 78 | +| 🟡 `TierTrim` | Whitespace + comments | ~20% | |
| 79 | +| 🟠 `TierExtract` | Key information extraction | ~35% | |
| 80 | +| 🔵 `TierCode` | Code-aware compression | ~45% | |
| 81 | +| 🔴 `TierCore` | Semantic core extraction | ~55% | |
| 82 | +| 🟣 `TierLog` | Log file optimization | ~70% | |
| 83 | +| ⚡ `TierAdaptive` | Adaptive per content type | varies | |
| 84 | + |
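To make the savings column concrete, here is a minimal arithmetic sketch that applies the nominal percentages from the table to an input size and picks the least aggressive tier that would fit a token budget. The `tierSaving` type and the `remainingTokens` and `lightestTierFor` helpers are hypothetical names introduced for illustration only; they are not part of tok's API, and real savings will vary with the content.

```go
package main

import "fmt"

// tierSaving pairs a tier name with its nominal savings percentage.
// Hypothetical type for illustration; not part of tok.
type tierSaving struct {
	Name string
	Pct  int
}

// tierSavings lists the nominal figures from the table, lightest first.
// TierAdaptive is omitted because its savings vary.
var tierSavings = []tierSaving{
	{"TierSurface", 10},
	{"TierTrim", 20},
	{"TierExtract", 35},
	{"TierCode", 45},
	{"TierCore", 55},
	{"TierLog", 70},
}

// remainingTokens returns the token count left after removing pct percent.
func remainingTokens(tokens, pct int) int {
	return tokens * (100 - pct) / 100
}

// lightestTierFor returns the first tier whose nominal output fits the
// budget, or ok == false when even the most aggressive tier does not fit.
func lightestTierFor(tokens, budget int) (name string, ok bool) {
	for _, t := range tierSavings {
		if remainingTokens(tokens, t.Pct) <= budget {
			return t.Name, true
		}
	}
	return "", false
}

func main() {
	for _, t := range tierSavings {
		fmt.Printf("%-12s 10000 -> %d tokens\n", t.Name, remainingTokens(10000, t.Pct))
	}
	name, ok := lightestTierFor(10000, 4000)
	fmt.Println(name, ok)
}
```

In practice the tier would also be chosen by content type (for example, `TierLog` targets log files), so a budget-only selection like this is a rough planning aid and not a recommendation.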
| 85 | +--- |
| 86 | + |
| 87 | +## 🔒 Secret Detection |
| 88 | + |
| 89 | +| Strategy | Description | |
| 90 | +|----------|-------------| |
| 91 | +| 🔑 **Pattern-based** | Regex for API keys, JWTs, connection strings, SSH keys | |
| 92 | +| 📊 **Entropy-based** | Shannon entropy analysis (threshold: 4.5) | |
| 93 | +| 📋 **Allowlists** | Prevent false positives on known-safe patterns | |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## 🔗 Ecosystem Usage |
| 98 | + |
| 99 | +| Consumer | Usage | |
| 100 | +|----------|-------| |
| 101 | +| 🦅 **hawk** | Context window management | |
| 102 | +| 🦅 **eyrie** | Response compression | |
| 103 | +| 🧠 **yaad** | Token budget enforcement in recall | |