| 1 | +# AGENTS.md — Tok |
| 2 | + |
| 3 | +Tokenizer, compression, secrets scanning, and rate limiting library for AI coding agents. |
| 4 | + |
| 5 | +## Design Principles |
| 6 | + |
| 7 | +- **Library only** — no CLI, no binary |
| 8 | +- **Token-efficient** — optimized for context window management |
| 9 | +- **Security-first** — secrets scanning prevents credential leaks |
| 10 | + |
| 11 | +## Build & Test |
| 12 | + |
| 13 | +```bash |
| 14 | +go test ./... # Run all tests |
| 15 | +go test -race ./... # Race detector |
| 16 | +go test -coverprofile=c.out ./... # Coverage |
| 17 | +go vet ./... # Static analysis |
| 18 | +gofumpt -w . # Format |
| 19 | +``` |
| 20 | + |
| 21 | +## Architecture |
| 22 | + |
| 23 | +- `tokenizer.go` — Token counting and estimation |
| 24 | +- `compressor.go` — Context compression strategies |
| 25 | +- `secrets.go` — Secrets scanning and redaction |
| 26 | +- `ratelimit.go` — Rate limiting for API calls |
| 27 | +- `budget.go` — Token budget management |
| 28 | +- `filter.go` — Content filtering and validation |
| 29 | + |
| 30 | +## Conventions |
| 31 | + |
| 32 | +- Go 1.26+, pure Go, no CGO |
| 33 | +- Table-driven tests |
| 34 | +- Conventional Commits: `feat:`, `fix:`, `docs:`, `refactor:`, `test:` |
| 35 | +- No `Co-authored-by:` trailers (auto-stripped by a Git hook) |
| 36 | +- `gofumpt` formatting enforced in CI |
| 37 | +- Quality.yml coverage threshold: 30% |
| 38 | + |
| 39 | +## Common Pitfalls |
| 40 | + |
| 41 | +- Token estimation is approximate — don't rely on exact counts |
| 42 | +- Secrets scanning has false positives — use allowlists for known patterns |
| 43 | +- Rate limiter tests are timing-sensitive; assert on tolerant time windows rather than exact durations |
| 44 | + |
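One common way to keep the rate limiter timing pitfall above under control is to inject the clock, so tests advance time by hand instead of sleeping. The sketch below is not tok's rate limiter: `limiter`, `newLimiter`, `Allow`, and `scenario` are hypothetical names used only to illustrate the idea.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a hypothetical minimal limiter that permits one call per
// interval. It only illustrates clock injection; tok's API may differ.
type limiter struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
	now      func() time.Time // injectable clock; pass time.Now in production
}

func newLimiter(interval time.Duration, now func() time.Time) *limiter {
	return &limiter{interval: interval, now: now}
}

// Allow reports whether a call is permitted at the current clock reading.
func (l *limiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	t := l.now()
	if !l.last.IsZero() && t.Sub(l.last) < l.interval {
		return false
	}
	l.last = t
	return true
}

// scenario replays a fixed sequence against a fake clock, so no result
// depends on real elapsed time or scheduler jitter.
func scenario() []bool {
	clock := time.Unix(0, 0)
	l := newLimiter(100*time.Millisecond, func() time.Time { return clock })

	first := l.Allow()  // nothing recorded yet
	second := l.Allow() // same instant as the first call
	clock = clock.Add(100 * time.Millisecond)
	third := l.Allow() // one full interval later
	return []bool{first, second, third}
}

func main() {
	fmt.Println(scenario()) // prints "[true false true]"
}
```

With a fake clock the assertions are exact and deterministic. Tests that must use the real clock are usually safer asserting a lower bound on elapsed time than an exact duration.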
| 45 | +## Naming Conventions |
| 46 | + |
| 47 | +- **Top-level functions are verbs**: `Compress()`, `EstimateTokens()`, `EstimateTokensPrecise()`, `WarmupTokenizer()` |
| 48 | +- **Option pattern**: `Option` interface with `optFunc` adapter — same pattern as sight and inspect |
| 49 | +- **Preset options are bare vars**: `Minimal`, `Aggressive`, `Surface`, `Adaptive`, `Code`, `Log` — exported `var Option` values |
| 50 | +- **Mode is a string type**: `Mode` with constants `ModeMinimal`, `ModeAggressive` |
| 51 | +- **Tier is a string type**: `Tier` with constants `TierSurface`, `TierTrim`, `TierExtract`, `TierCore`, `TierCode`, `TierLog`, `TierThread`, `TierAdaptive` |
| 52 | +- **Internal packages**: `internal/core/` (tokenizer), `internal/filter/` (pipeline), `internal/secrets/` (detector), `internal/codeaware/` (code-specific) |
| 53 | +- **Secret detector pattern**: `DefaultSecretDetector()` returns singleton, `NewSecretDetector()` creates fresh instance |
| 54 | +- **SecretMatch is a type alias**: `type SecretMatch = secrets.SecretMatch` — re-exports from internal package |
| 55 | +- **Stats struct**: returned from `Compress()` — `OriginalTokens`, `FinalTokens`, compression ratio fields |
| 56 | + |
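The option conventions above (an `Option` interface, an `optFunc` adapter, bare preset vars, and a string-typed `Mode`) can be sketched as follows. The `config` struct, its fields, the `newConfig` helper, the string values of the constants, and the `ModeMinimal` default are assumptions made for illustration; tok's actual definitions may differ.

```go
package main

import "fmt"

// Mode is a string type, following the naming convention above.
type Mode string

const (
	ModeMinimal    Mode = "minimal"    // string value is an assumption
	ModeAggressive Mode = "aggressive" // string value is an assumption
)

// config is a hypothetical settings struct that options mutate.
type config struct {
	mode   Mode
	budget int
}

// Option is anything that can adjust a config.
type Option interface {
	apply(*config)
}

// optFunc adapts a plain function to the Option interface.
type optFunc func(*config)

func (f optFunc) apply(c *config) { f(c) }

// WithMode and WithBudget are constructor-style options.
func WithMode(m Mode) Option {
	return optFunc(func(c *config) { c.mode = m })
}

func WithBudget(tokens int) Option {
	return optFunc(func(c *config) { c.budget = tokens })
}

// Aggressive is a preset exposed as a bare exported var of type Option.
var Aggressive Option = WithMode(ModeAggressive)

// newConfig applies options in order over assumed defaults.
func newConfig(opts ...Option) config {
	c := config{mode: ModeMinimal}
	for _, o := range opts {
		o.apply(&c)
	}
	return c
}

func main() {
	c := newConfig(Aggressive, WithBudget(50))
	fmt.Println(c.mode, c.budget) // prints "aggressive 50"
}
```

Because `apply` is unexported, only this package can implement `Option`. That keeps the set of options closed while still letting callers mix presets and `With*` constructors freely.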
| 57 | +## API Patterns |
| 58 | + |
| 59 | +- **One-shot compression**: `tok.Compress(text, opts...)` — creates pipeline internally, returns `(string, Stats)` |
| 60 | +- **Reusable compressor**: `tok.NewCompressor(opts...)` returns `*Compressor` with `Compress(text)` method — reuses caches |
| 61 | +- **Token estimation**: `EstimateTokens(text)` for fast approximation, `EstimateTokensPrecise(text)` for BPE accuracy |
| 62 | +- **Warmup**: `WarmupTokenizer()` pre-initializes BPE tokenizer in background — call at startup to avoid first-call latency |
| 63 | +- **Budget constraint**: `WithBudget(tokens)` option hard-limits output token count — pipeline truncates to fit |
| 64 | +- **Query-driven filtering**: `WithQuery(intent)` option provides goal context for relevance-based filtering |
| 65 | +- **Tier selection**: `WithTier(TierCode)` selects pre-built pipeline profile — each tier has different layer counts |
| 66 | +- **Mode selection**: `WithMode(ModeAggressive)` controls compression aggressiveness within a tier |
| 67 | +- **Secret detection**: `DefaultSecretDetector().DetectSecrets(text)` returns `[]SecretMatch`; `.RedactSecrets(text)` returns redacted string |
| 68 | +- **Entropy-based detection**: `DetectAndRedactWithEntropy(text, threshold)` — pattern matching + Shannon entropy analysis |
| 69 | + |
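For readers unfamiliar with the entropy half of entropy-based detection, here is a minimal sketch of Shannon entropy over bytes. It is not the code behind `DetectAndRedactWithEntropy`: `shannonEntropy` and `looksRandom` are hypothetical helpers, and the threshold semantics (bits per byte, inclusive) are an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy of s in bits per byte.
// Repetitive strings score near 0; uniformly random bytes approach 8.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// looksRandom is a hypothetical helper: it flags strings whose entropy
// meets or exceeds the threshold, which is how a high-entropy token
// could be caught even when no regex pattern matches it.
func looksRandom(s string, threshold float64) bool {
	return shannonEntropy(s) >= threshold
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy("aaaaaaaa")) // prints "0.00"
	fmt.Printf("%.2f\n", shannonEntropy("abcdefgh")) // prints "3.00"
	fmt.Println(looksRandom("abcdefgh", 2.5))        // prints "true"
}
```

Short strings cap out at low entropy (an 8-byte string can never exceed 3 bits per byte). Any threshold therefore interacts with a minimum candidate length, which is one source of the false positives noted under Common Pitfalls.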
| 70 | +## Testing Patterns |
| 71 | + |
| 72 | +- **External test package**: `package tok_test` — tests import `tok` as a consumer would |
| 73 | +- **Simple assertions**: `TestCompress` checks non-empty output and non-zero `OriginalTokens` — minimal, focused |
| 74 | +- **Empty input test**: `TestCompress_Empty` — verify empty string returns empty string and zero stats |
| 75 | +- **Preset smoke tests**: `TestCompress_Aggressive`, `TestCompress_WithTier`, `TestCompress_WithQuery` — each preset/option tested |
| 76 | +- **Budget test**: `TestCompress_WithBudget` — create large input, compress with budget 50, verify `FinalTokens <= 60` |
| 77 | +- **Concurrent safety test**: `TestCompress_Concurrent` — 10 goroutines compressing same input with `sync.WaitGroup` |
| 78 | +- **Token estimation test**: `TestEstimateTokens` — verify non-zero for known input |
| 79 | +- **Reusable compressor test**: `TestNewCompressor` — create compressor, call `Compress()` twice, verify both return results |
| 80 | +- **Secret detection tests**: `secrets_test.go` — pattern matching, entropy edge cases, allowlist exclusions |
| 81 | +- **Bench tests**: `internal/` subdirectories — performance-critical paths |
| 82 | + |
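The concurrent safety pattern listed above (10 goroutines, one shared input, `sync.WaitGroup`) can be sketched without the library. Here `compress` is a hypothetical stand-in that only collapses whitespace, and `runConcurrent` is an illustrative helper; neither is tok's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// compress is a hypothetical stand-in for the real compression call.
// It collapses runs of whitespace into single spaces.
func compress(text string) string {
	return strings.Join(strings.Fields(text), " ")
}

// runConcurrent calls compress from n goroutines on the same input and
// returns every result so the caller can compare them.
func runConcurrent(input string, n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so the slice
			// needs no extra locking.
			results[i] = compress(input)
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	results := runConcurrent("a   b\n\n c", 10)
	for _, r := range results {
		if r != results[0] {
			fmt.Println("mismatch between goroutines")
			return
		}
	}
	fmt.Println(len(results), results[0]) // prints "10 a b c"
}
```

A test shaped like this is most useful under `go test -race ./...`, because the race detector reports unsynchronized access to shared state even when every result happens to match.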
| 83 | +## Refactoring Guidelines |
| 84 | + |
| 85 | +- **Safe to refactor**: `internal/filter/` pipeline layers — add, remove, reorder filter stages |
| 86 | +- **Safe to refactor**: `internal/core/` tokenizer — improve estimation accuracy, add new tokenizers |
| 87 | +- **Safe to refactor**: `internal/secrets/` patterns — add new detection patterns, tune entropy threshold |
| 88 | +- **Safe to refactor**: `internal/codeaware/` — language-specific compression rules |
| 89 | +- **Do not touch**: `Compress()` function signature — primary API contract |
| 90 | +- **Do not touch**: `Option` interface and preset vars — used by all consumers |
| 91 | +- **Do not touch**: `Stats` struct fields — returned from every `Compress()` call |
| 92 | +- **Do not touch**: `SecretDetector` public methods — used by hawk for secret scanning |
| 93 | +- **Do not touch**: `Tier` and `Mode` constants — referenced in consumers' configs and CLI flags |
| 94 | +- **Safe to extend**: add new `Tier` values, new filter layers, new secret patterns, new compression strategies |
| 95 | +- **When adding a tier**: add constant to `Tier` type, implement pipeline config in `internal/filter/` |
| 96 | + |
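The two steps for adding a tier (a new `Tier` constant plus a pipeline config) might look roughly like the sketch below. Everything beyond the `Tier` type and the `TierSurface`/`TierCode` names is hypothetical: `TierDigest`, `pipelineConfig`, `tierConfigs`, `configFor`, the string values, and the layer names are invented for illustration and do not describe the contents of `internal/filter/`.

```go
package main

import "fmt"

// Tier is a string type, as described in the naming conventions.
type Tier string

const (
	TierSurface Tier = "surface" // string values are assumptions
	TierCode    Tier = "code"
	TierDigest  Tier = "digest" // step 1: hypothetical new tier constant
)

// pipelineConfig is a hypothetical stand-in for a tier's pipeline profile.
type pipelineConfig struct {
	layers []string // ordered filter layer names (made up here)
}

// tierConfigs maps each tier to its profile.
// Step 2: a new tier needs an entry here.
var tierConfigs = map[Tier]pipelineConfig{
	TierSurface: {layers: []string{"whitespace"}},
	TierCode:    {layers: []string{"whitespace", "comments"}},
	TierDigest:  {layers: []string{"whitespace", "dedupe", "summarize"}},
}

// configFor returns the profile for t, or false if t is unregistered.
func configFor(t Tier) (pipelineConfig, bool) {
	cfg, ok := tierConfigs[t]
	return cfg, ok
}

func main() {
	cfg, ok := configFor(TierDigest)
	fmt.Println(ok, len(cfg.layers)) // prints "true 3"

	_, ok = configFor(Tier("unknown"))
	fmt.Println(ok) // prints "false"
}
```

Returning an explicit `ok` makes a constant with no registered config fail loudly at lookup time rather than silently running an empty pipeline.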
| 97 | +## Key File Locations |
| 98 | + |
| 99 | +| What | Where | |
| 100 | +|---|---| |
| 101 | +| Public API entry point | `tok.go` (`Compress()`, `EstimateTokens()`, `WarmupTokenizer()`) | |
| 102 | +| Reusable compressor | `compressor.go` (`Compressor` struct) | |
| 103 | +| Options & presets | `options.go` (`Option`, `Mode`, `Tier`, `With*` functions, preset vars) | |
| 104 | +| Secret detection | `secrets.go` (`SecretDetector`, `DetectSecrets()`, `RedactSecrets()`) | |
| 105 | +| Stats type | `stats.go` (returned from `Compress()`) | |
| 106 | +| Stream processing | `stream.go` | |
| 107 | +| Core tokenizer | `internal/core/` (BPE tokenizer, estimation) | |
| 108 | +| Filter pipeline | `internal/filter/` (pipeline coordinator, tier configs, layer execution) | |
| 109 | +| Code-aware filters | `internal/codeaware/` (language-specific compression) | |
| 110 | +| Secret patterns | `internal/secrets/` (regex patterns, entropy analysis, allowlists) | |
| 111 | +| Utility functions | `internal/utils/` | |
| 112 | +| Main test file | `tok_test.go` (compression, estimation, concurrency, presets) | |
| 113 | +| Secret tests | `secrets_test.go` | |
| 114 | +| Compression tests | `compressor_test.go` (if it exists) |
| 115 | +| Benchmark tests | `internal/*/bench_test.go` | |
| 116 | +| Linter config | `.golangci.yml` (govet, ineffassign, misspell — minimal) | |