Commit df4d8e4

chore: restore AGENTS.md — project instructions for AI agents
1 parent 7321c44 commit df4d8e4

1 file changed

Lines changed: 116 additions & 0 deletions

AGENTS.md

@@ -0,0 +1,116 @@
# AGENTS.md — Tok

Tokenizer, compression, secrets scanning, and rate limiting library for AI coding agents.

## Design Principles

- **Library only** — no CLI, no binary
- **Token-efficient** — optimized for context window management
- **Security-first** — secrets scanning prevents credential leaks

## Build & Test

```bash
go test ./...                       # Run all tests
go test -race ./...                 # Race detector
go test -coverprofile=c.out ./...   # Coverage
go vet ./...                        # Static analysis
gofumpt -w .                        # Format
```

## Architecture

- `tokenizer.go` — Token counting and estimation
- `compressor.go` — Context compression strategies
- `secrets.go` — Secrets scanning and redaction
- `ratelimit.go` — Rate limiting for API calls
- `budget.go` — Token budget management
- `filter.go` — Content filtering and validation

## Conventions

- Go 1.26+, pure Go, no CGO
- Table-driven tests
- Conventional Commits: `feat:`, `fix:`, `docs:`, `refactor:`, `test:`
- No `Co-authored-by:` trailers (auto-stripped by githook)
- `gofumpt` formatting enforced in CI
- Quality.yml coverage threshold: 30%

## Common Pitfalls

- Token estimation is approximate — don't rely on exact counts
- Secrets scanning has false positives — use allowlists for known patterns
- Rate limiter tests need careful timing assertions

## Naming Conventions

- **Top-level functions are verbs**: `Compress()`, `EstimateTokens()`, `EstimateTokensPrecise()`, `WarmupTokenizer()`
- **Option pattern**: `Option` interface with `optFunc` adapter — same pattern as sight and inspect
- **Preset options are bare vars**: `Minimal`, `Aggressive`, `Surface`, `Adaptive`, `Code`, `Log` — exported `var Option` values
- **Mode is a string type**: `Mode` with constants `ModeMinimal`, `ModeAggressive`
- **Tier is a string type**: `Tier` with constants `TierSurface`, `TierTrim`, `TierExtract`, `TierCore`, `TierCode`, `TierLog`, `TierThread`, `TierAdaptive`
- **Internal packages**: `internal/core/` (tokenizer), `internal/filter/` (pipeline), `internal/secrets/` (detector), `internal/codeaware/` (code-specific)
- **Secret detector pattern**: `DefaultSecretDetector()` returns singleton, `NewSecretDetector()` creates fresh instance
- **SecretMatch is a type alias**: `type SecretMatch = secrets.SecretMatch` — re-exports from internal package
- **Stats struct**: returned from `Compress()`, with `OriginalTokens`, `FinalTokens`, and compression ratio fields

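The option conventions above can be pictured with a minimal, self-contained sketch. It reuses the names `Option`, `optFunc`, `Mode`, `WithMode`, `WithBudget`, and `Aggressive` from this file, but the `config` struct, the `apply` method, `newConfig`, and the string values of the constants are assumptions made for illustration, not the library's source.

```go
package main

import "fmt"

// Mode is a string type, as the conventions describe.
// The concrete string values here are assumed.
type Mode string

const (
	ModeMinimal    Mode = "minimal"
	ModeAggressive Mode = "aggressive"
)

// config is a hypothetical settings struct that options mutate.
type config struct {
	mode   Mode
	budget int
}

// Option mutates a config. Keeping apply unexported means only this
// package can define new options.
type Option interface{ apply(*config) }

// optFunc adapts a plain function to the Option interface.
type optFunc func(*config)

func (f optFunc) apply(c *config) { f(c) }

func WithMode(m Mode) Option  { return optFunc(func(c *config) { c.mode = m }) }
func WithBudget(n int) Option { return optFunc(func(c *config) { c.budget = n }) }

// Aggressive is a preset: an exported Option value, passed without calling it.
var Aggressive Option = WithMode(ModeAggressive)

// newConfig applies options in order on top of assumed defaults.
func newConfig(opts ...Option) config {
	c := config{mode: ModeMinimal}
	for _, o := range opts {
		o.apply(&c)
	}
	return c
}

func main() {
	c := newConfig(Aggressive, WithBudget(50))
	fmt.Println(c.mode, c.budget) // aggressive 50
}
```

Because presets and `With*` functions share one interface, callers can mix them freely in a single variadic argument list.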
## API Patterns

- **One-shot compression**: `tok.Compress(text, opts...)` — creates pipeline internally, returns `(string, Stats)`
- **Reusable compressor**: `tok.NewCompressor(opts...)` returns `*Compressor` with `Compress(text)` method — reuses caches
- **Token estimation**: `EstimateTokens(text)` for fast approximation, `EstimateTokensPrecise(text)` for BPE accuracy
- **Warmup**: `WarmupTokenizer()` pre-initializes BPE tokenizer in background — call at startup to avoid first-call latency
- **Budget constraint**: `WithBudget(tokens)` option hard-limits output token count — pipeline truncates to fit
- **Query-driven filtering**: `WithQuery(intent)` option provides goal context for relevance-based filtering
- **Tier selection**: `WithTier(TierCode)` selects pre-built pipeline profile — each tier has different layer counts
- **Mode selection**: `WithMode(ModeAggressive)` controls compression aggressiveness within a tier
- **Secret detection**: `DefaultSecretDetector().DetectSecrets(text)` returns `[]SecretMatch`; `.RedactSecrets(text)` returns redacted string
- **Entropy-based detection**: `DetectAndRedactWithEntropy(text, threshold)` — pattern matching + Shannon entropy analysis

## Testing Patterns

- **External test package**: `package tok_test` — tests import `tok` as a consumer would
- **Simple assertions**: `TestCompress` checks non-empty output and non-zero `OriginalTokens` — minimal, focused
- **Empty input test**: `TestCompress_Empty` — verify empty string returns empty string and zero stats
- **Preset smoke tests**: `TestCompress_Aggressive`, `TestCompress_WithTier`, `TestCompress_WithQuery` — each preset/option tested
- **Budget test**: `TestCompress_WithBudget` — create large input, compress with budget 50, verify `FinalTokens <= 60`
- **Concurrent safety test**: `TestCompress_Concurrent` — 10 goroutines compressing same input with `sync.WaitGroup`
- **Token estimation test**: `TestEstimateTokens` — verify non-zero for known input
- **Reusable compressor test**: `TestNewCompressor` — create compressor, call `Compress()` twice, verify both return results
- **Secret detection tests**: `secrets_test.go` — pattern matching, entropy edge cases, allowlist exclusions
- **Bench tests**: `internal/` subdirectories — performance-critical paths

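The shape of the concurrent safety test can be sketched without the library itself. In the example below, `compress` is a stand-in that merely collapses whitespace, and `runConcurrent` is a hypothetical helper; the real `TestCompress_Concurrent` may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// compress is a stand-in for the real entry point: it collapses runs of
// whitespace so the example has something deterministic to call.
func compress(text string) string {
	return strings.Join(strings.Fields(text), " ")
}

// runConcurrent calls compress from n goroutines and returns every result.
func runConcurrent(input string, n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = compress(input) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait() // all writes are complete and visible after Wait returns
	return results
}

func main() {
	results := runConcurrent("a   b \n c", 10)
	for _, r := range results {
		if r != "a b c" {
			fmt.Println("mismatch:", r)
			return
		}
	}
	fmt.Println("all", len(results), "results agree")
}
```

A check like this is most useful when run under `go test -race ./...`, since the race detector reports shared-state bugs that identical outputs alone would not reveal.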
## Refactoring Guidelines

- **Safe to refactor**: `internal/filter/` pipeline layers — add, remove, reorder filter stages
- **Safe to refactor**: `internal/core/` tokenizer — improve estimation accuracy, add new tokenizers
- **Safe to refactor**: `internal/secrets/` patterns — add new detection patterns, tune entropy threshold
- **Safe to refactor**: `internal/codeaware/` — language-specific compression rules
- **Do not touch**: `Compress()` function signature — primary API contract
- **Do not touch**: `Option` interface and preset vars — used by all consumers
- **Do not touch**: `Stats` struct fields — returned from every `Compress()` call
- **Do not touch**: `SecretDetector` public methods — used by hawk for secret scanning
- **Do not touch**: `Tier` and `Mode` constants — referenced in configs and CLI flags
- **Safe to extend**: add new `Tier` values, new filter layers, new secret patterns, new compression strategies
- **When adding a tier**: add constant to `Tier` type, implement pipeline config in `internal/filter/`

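As a rough picture of the two steps for adding a tier, the sketch below adds a constant and registers a matching pipeline profile. Only `Tier`, `TierSurface`, and `TierCode` come from this file; `TierDigest`, `pipelineConfig`, `tierConfigs`, `configFor`, the layer names, and the string values are all hypothetical, and the real layout in `internal/filter/` may look quite different.

```go
package main

import "fmt"

// Tier is a string type; the string values below are assumed.
type Tier string

const (
	TierSurface Tier = "surface"
	TierCode    Tier = "code"
	TierDigest  Tier = "digest" // step 1: hypothetical new tier constant
)

// pipelineConfig is a hypothetical per-tier profile listing filter layers.
type pipelineConfig struct {
	layers []string
}

// tierConfigs maps each tier to its profile; layer names are made up.
var tierConfigs = map[Tier]pipelineConfig{
	TierSurface: {layers: []string{"whitespace"}},
	TierCode:    {layers: []string{"whitespace", "comments"}},
	// step 2: register the pipeline config for the new tier
	TierDigest: {layers: []string{"whitespace", "dedupe", "summarize"}},
}

// configFor looks up a tier and fails loudly on unknown values.
func configFor(t Tier) (pipelineConfig, error) {
	cfg, ok := tierConfigs[t]
	if !ok {
		return pipelineConfig{}, fmt.Errorf("unknown tier %q", t)
	}
	return cfg, nil
}

func main() {
	cfg, err := configFor(TierDigest)
	fmt.Println(len(cfg.layers), err) // 3 <nil>
}
```

Returning an error for unregistered tiers makes it hard to add a constant and forget its configuration.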
## Key File Locations

| What | Where |
|---|---|
| Public API entry point | `tok.go` (`Compress()`, `EstimateTokens()`, `WarmupTokenizer()`) |
| Reusable compressor | `compressor.go` (`Compressor` struct) |
| Options & presets | `options.go` (`Option`, `Mode`, `Tier`, `With*` functions, preset vars) |
| Secret detection | `secrets.go` (`SecretDetector`, `DetectSecrets()`, `RedactSecrets()`) |
| Stats type | `stats.go` (returned from `Compress()`) |
| Stream processing | `stream.go` |
| Core tokenizer | `internal/core/` (BPE tokenizer, estimation) |
| Filter pipeline | `internal/filter/` (pipeline coordinator, tier configs, layer execution) |
| Code-aware filters | `internal/codeaware/` (language-specific compression) |
| Secret patterns | `internal/secrets/` (regex patterns, entropy analysis, allowlists) |
| Utility functions | `internal/utils/` |
| Main test file | `tok_test.go` (compression, estimation, concurrency, presets) |
| Secret tests | `secrets_test.go` |
| Compression tests | `compressor_test.go` (if it exists) |
| Benchmark tests | `internal/*/bench_test.go` |
| Linter config | `.golangci.yml` (govet, ineffassign, misspell — minimal) |
