chore: restore AGENTS.md — project instructions for AI agents

Patel230 · Patel230 · commit df4d8e4b41da · 2026-06-02T17:45:44.000+05:30
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,116 @@
+# AGENTS.md — Tok
+
+Tokenizer, compression, secrets scanning, and rate limiting library for AI coding agents.
+
+## Design Principles
+
+- **Library only** — no CLI, no binary
+- **Token-efficient** — optimized for context window management
+- **Security-first** — secrets scanning prevents credential leaks
+
+## Build & Test
+
+```bash
+go test ./...                    # Run all tests
+go test -race ./...              # Race detector
+go test -coverprofile=c.out ./... # Coverage
+go vet ./...                     # Static analysis
+gofumpt -w .                     # Format
+```
+
+## Architecture
+
+- `tokenizer.go` — Token counting and estimation
+- `compressor.go` — Context compression strategies
+- `secrets.go` — Secrets scanning and redaction
+- `ratelimit.go` — Rate limiting for API calls
+- `budget.go` — Token budget management
+- `filter.go` — Content filtering and validation
+
+## Conventions
+
+- Go 1.26+, pure Go, no CGO
+- Table-driven tests
+- Conventional Commits: `feat:`, `fix:`, `docs:`, `refactor:`, `test:`
+- No `Co-authored-by:` trailers (auto-stripped by githook)
+- `gofumpt` formatting enforced in CI
+- Quality.yml coverage threshold: 30%
+
+## Common Pitfalls
+
+- Token estimation is approximate — don't rely on exact counts
+- Secrets scanning has false positives — use allowlists for known patterns
+- Rate limiter tests need careful timing assertions
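+
+One way to keep rate limiter tests deterministic is to inject the clock instead of sleeping. The sketch below is a minimal token bucket written for illustration; `bucket`, `newBucket`, and `allow` are hypothetical names, not the API in `ratelimit.go`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// bucket is a hypothetical token-bucket limiter; it is not tok's real type.
+type bucket struct {
+	capacity float64
+	tokens   float64
+	rate     float64 // tokens refilled per second
+	last     time.Time
+	now      func() time.Time // injected clock; tests pass a fake
+}
+
+func newBucket(capacity, rate float64, now func() time.Time) *bucket {
+	return &bucket{capacity: capacity, tokens: capacity, rate: rate, last: now(), now: now}
+}
+
+// allow refills based on elapsed clock time, then tries to take one token.
+func (b *bucket) allow() bool {
+	t := b.now()
+	b.tokens += t.Sub(b.last).Seconds() * b.rate
+	if b.tokens > b.capacity {
+		b.tokens = b.capacity
+	}
+	b.last = t
+	if b.tokens < 1 {
+		return false
+	}
+	b.tokens--
+	return true
+}
+
+func main() {
+	clock := time.Unix(0, 0)
+	b := newBucket(2, 1, func() time.Time { return clock })
+
+	fmt.Println(b.allow(), b.allow(), b.allow()) // burst of 2 allowed, third denied
+	clock = clock.Add(time.Second)               // advance fake time, no sleep
+	fmt.Println(b.allow())                       // one token refilled
+}
+```
+
+Advancing the fake clock replaces `time.Sleep`, so the assertions do not depend on scheduler timing.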
+
+## Naming Conventions
+
+- **Top-level functions are verbs**: `Compress()`, `EstimateTokens()`, `EstimateTokensPrecise()`, `WarmupTokenizer()`
+- **Option pattern**: `Option` interface with `optFunc` adapter — same pattern as sight and inspect
+- **Preset options are bare vars**: `Minimal`, `Aggressive`, `Surface`, `Adaptive`, `Code`, `Log` — exported `var Option` values
+- **Mode is a string type**: `Mode` with constants `ModeMinimal`, `ModeAggressive`
+- **Tier is a string type**: `Tier` with constants `TierSurface`, `TierTrim`, `TierExtract`, `TierCore`, `TierCode`, `TierLog`, `TierThread`, `TierAdaptive`
+- **Internal packages**: `internal/core/` (tokenizer), `internal/filter/` (pipeline), `internal/secrets/` (detector), `internal/codeaware/` (code-specific)
+- **Secret detector pattern**: `DefaultSecretDetector()` returns singleton, `NewSecretDetector()` creates fresh instance
+- **SecretMatch is a type alias**: `type SecretMatch = secrets.SecretMatch` — re-exports from internal package
+- **Stats struct**: returned from `Compress()` — `OriginalTokens`, `FinalTokens`, compression ratio fields
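+
+The option and preset conventions above can be sketched as follows. This is an illustrative reconstruction, not the code in `options.go`: the `config` struct, the `apply` method name, the `resolve` helper, the defaults, and the string values of the constants are all assumptions.
+
+```go
+package main
+
+import "fmt"
+
+// Mode and Tier are string types, as the conventions above describe.
+type (
+	Mode string
+	Tier string
+)
+
+const (
+	ModeMinimal    Mode = "minimal" // string values are assumed
+	ModeAggressive Mode = "aggressive"
+	TierSurface    Tier = "surface"
+	TierCode       Tier = "code"
+)
+
+// config is a hypothetical settings struct; the real field set is not shown here.
+type config struct {
+	mode   Mode
+	tier   Tier
+	budget int
+}
+
+// Option is an interface so both With* results and preset vars satisfy it.
+type Option interface{ apply(*config) }
+
+// optFunc adapts a plain function to the Option interface.
+type optFunc func(*config)
+
+func (f optFunc) apply(c *config) { f(c) }
+
+func WithMode(m Mode) Option { return optFunc(func(c *config) { c.mode = m }) }
+
+func WithTier(t Tier) Option { return optFunc(func(c *config) { c.tier = t }) }
+
+func WithBudget(n int) Option { return optFunc(func(c *config) { c.budget = n }) }
+
+// Preset options are bare exported vars, usable without a call.
+var (
+	Aggressive Option = WithMode(ModeAggressive)
+	Code       Option = WithTier(TierCode)
+)
+
+// resolve applies options in order over assumed defaults; later options win.
+func resolve(opts ...Option) config {
+	c := config{mode: ModeMinimal, tier: TierSurface}
+	for _, o := range opts {
+		o.apply(&c)
+	}
+	return c
+}
+
+func main() {
+	c := resolve(Aggressive, Code, WithBudget(50))
+	fmt.Println(c.mode, c.tier, c.budget)
+}
+```
+
+Because presets and `With*` results share one interface, callers can mix them freely in a single variadic argument list.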
+
+## API Patterns
+
+- **One-shot compression**: `tok.Compress(text, opts...)` — creates pipeline internally, returns `(string, Stats)`
+- **Reusable compressor**: `tok.NewCompressor(opts...)` returns `*Compressor` with `Compress(text)` method — reuses caches
+- **Token estimation**: `EstimateTokens(text)` for fast approximation, `EstimateTokensPrecise(text)` for BPE accuracy
+- **Warmup**: `WarmupTokenizer()` pre-initializes BPE tokenizer in background — call at startup to avoid first-call latency
+- **Budget constraint**: `WithBudget(tokens)` option hard-limits output token count — pipeline truncates to fit
+- **Query-driven filtering**: `WithQuery(intent)` option provides goal context for relevance-based filtering
+- **Tier selection**: `WithTier(TierCode)` selects pre-built pipeline profile — each tier has different layer counts
+- **Mode selection**: `WithMode(ModeAggressive)` controls compression aggressiveness within a tier
+- **Secret detection**: `DefaultSecretDetector().DetectSecrets(text)` returns `[]SecretMatch`; `.RedactSecrets(text)` returns redacted string
+- **Entropy-based detection**: `DetectAndRedactWithEntropy(text, threshold)` — pattern matching + Shannon entropy analysis
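+
+The entropy half of entropy-based detection can be illustrated with a byte-frequency Shannon entropy calculation. This is a sketch under assumptions: `shannonEntropy` and `looksRandom` are hypothetical helpers, and the 16-character minimum and 3.5 threshold are arbitrary example values, not tok's defaults.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+)
+
+// shannonEntropy returns bits per byte for s, computed over byte frequencies.
+func shannonEntropy(s string) float64 {
+	if s == "" {
+		return 0
+	}
+	var counts [256]int
+	for i := 0; i < len(s); i++ {
+		counts[s[i]]++
+	}
+	n := float64(len(s))
+	h := 0.0
+	for _, c := range counts {
+		if c == 0 {
+			continue
+		}
+		p := float64(c) / n
+		h -= p * math.Log2(p)
+	}
+	return h
+}
+
+// looksRandom is a hypothetical check: long strings at or above the threshold are flagged.
+func looksRandom(s string, threshold float64) bool {
+	return len(s) >= 16 && shannonEntropy(s) >= threshold
+}
+
+func main() {
+	for _, s := range []string{"aaaaaaaaaaaaaaaaaaaa", "q8Zr2LmX0vT9bKc4HsWy"} {
+		fmt.Printf("%-22s entropy=%.2f flagged=%v\n", s, shannonEntropy(s), looksRandom(s, 3.5))
+	}
+}
+```
+
+Entropy alone flags any long random-looking string, which is one source of the false positives noted under Common Pitfalls.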
+
+## Testing Patterns
+
+- **External test package**: `package tok_test` — tests import `tok` as a consumer would
+- **Simple assertions**: `TestCompress` checks non-empty output and non-zero `OriginalTokens` — minimal, focused
+- **Empty input test**: `TestCompress_Empty` — verify empty string returns empty string and zero stats
+- **Preset smoke tests**: `TestCompress_Aggressive`, `TestCompress_WithTier`, `TestCompress_WithQuery` — each preset/option tested
+- **Budget test**: `TestCompress_WithBudget` — create large input, compress with budget 50, verify `FinalTokens <= 60` (the assertion leaves slack above the budget)
+- **Concurrent safety test**: `TestCompress_Concurrent` — 10 goroutines compressing same input with `sync.WaitGroup`
+- **Token estimation test**: `TestEstimateTokens` — verify non-zero for known input
+- **Reusable compressor test**: `TestNewCompressor` — create compressor, call `Compress()` twice, verify both return results
+- **Secret detection tests**: `secrets_test.go` — pattern matching, entropy edge cases, allowlist exclusions
+- **Bench tests**: live in `internal/` subdirectories and cover performance-critical paths
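+
+The concurrent safety test follows a common shape: start N goroutines, have each write to its own slot, wait, then compare. In the sketch below `compress` is a stand-in that only collapses whitespace and `runConcurrent` is a hypothetical helper; a real test would call `tok.Compress` from `package tok_test` and run under `go test -race`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+	"sync"
+)
+
+// compress is a stand-in for tok.Compress; it only collapses whitespace.
+func compress(text string) string {
+	return strings.Join(strings.Fields(text), " ")
+}
+
+// runConcurrent calls compress from n goroutines and collects each result.
+func runConcurrent(input string, n int) []string {
+	results := make([]string, n)
+	var wg sync.WaitGroup
+	for i := 0; i < n; i++ {
+		wg.Add(1)
+		go func(i int) {
+			defer wg.Done()
+			results[i] = compress(input) // each goroutine writes only its own slot
+		}(i)
+	}
+	wg.Wait()
+	return results
+}
+
+func main() {
+	input := "a   b \n c"
+	want := compress(input)
+	for i, got := range runConcurrent(input, 10) {
+		if got != want {
+			fmt.Printf("goroutine %d: got %q, want %q\n", i, got, want)
+			return
+		}
+	}
+	fmt.Println("all goroutines agree")
+}
+```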
+
+## Refactoring Guidelines
+
+- **Safe to refactor**: `internal/filter/` pipeline layers — add, remove, reorder filter stages
+- **Safe to refactor**: `internal/core/` tokenizer — improve estimation accuracy, add new tokenizers
+- **Safe to refactor**: `internal/secrets/` patterns — add new detection patterns, tune entropy threshold
+- **Safe to refactor**: `internal/codeaware/` — language-specific compression rules
+- **Do not touch**: `Compress()` function signature — primary API contract
+- **Do not touch**: `Option` interface and preset vars — used by all consumers
+- **Do not touch**: `Stats` struct fields — returned from every `Compress()` call
+- **Do not touch**: `SecretDetector` public methods — used by hawk for secret scanning
+- **Do not touch**: `Tier` and `Mode` constants — referenced in consumers' configs and CLI flags
+- **Safe to extend**: add new `Tier` values, new filter layers, new secret patterns, new compression strategies
+- **When adding a tier**: add constant to `Tier` type, implement pipeline config in `internal/filter/`
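+
+The two steps for adding a tier might look like the sketch below. Everything here is illustrative: `TierDiff`, the `tierLayers` map, the layer names, `layersFor`, and the constant string values are assumptions, since the actual config shape in `internal/filter/` is not described here.
+
+```go
+package main
+
+import "fmt"
+
+type Tier string
+
+const (
+	TierSurface Tier = "surface" // string values are assumed
+	TierCode    Tier = "code"
+	TierDiff    Tier = "diff" // step 1: hypothetical new tier constant
+)
+
+// tierLayers stands in for the pipeline config in internal/filter/.
+// Layer names are illustrative only.
+var tierLayers = map[Tier][]string{
+	TierSurface: {"whitespace"},
+	TierCode:    {"whitespace", "comments", "imports"},
+	TierDiff:    {"whitespace", "context-lines"}, // step 2: its pipeline config
+}
+
+// layersFor reports an error for a tier that has no pipeline config.
+func layersFor(t Tier) ([]string, error) {
+	layers, ok := tierLayers[t]
+	if !ok {
+		return nil, fmt.Errorf("unknown tier %q", t)
+	}
+	return layers, nil
+}
+
+func main() {
+	fmt.Println(layersFor(TierDiff))
+	fmt.Println(layersFor(Tier("missing")))
+}
+```
+
+Returning an error for an unconfigured tier makes a constant added without its pipeline config fail loudly instead of silently falling back.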
+
+## Key File Locations
+
+| What | Where |
+|---|---|
+| Public API entry point | `tok.go` (`Compress()`, `EstimateTokens()`, `WarmupTokenizer()`) |
+| Reusable compressor | `compressor.go` (`Compressor` struct) |
+| Options & presets | `options.go` (`Option`, `Mode`, `Tier`, `With*` functions, preset vars) |
+| Secret detection | `secrets.go` (`SecretDetector`, `DetectSecrets()`, `RedactSecrets()`) |
+| Stats type | `stats.go` (returned from `Compress()`) |
+| Stream processing | `stream.go` |
+| Core tokenizer | `internal/core/` (BPE tokenizer, estimation) |
+| Filter pipeline | `internal/filter/` (pipeline coordinator, tier configs, layer execution) |
+| Code-aware filters | `internal/codeaware/` (language-specific compression) |
+| Secret patterns | `internal/secrets/` (regex patterns, entropy analysis, allowlists) |
+| Utility functions | `internal/utils/` |
+| Main test file | `tok_test.go` (compression, estimation, concurrency, presets) |
+| Secret tests | `secrets_test.go` |
+| Compression tests | `compressor_test.go` (if it exists) |
+| Benchmark tests | `internal/*/bench_test.go` |
+| Linter config | `.golangci.yml` (govet, ineffassign, misspell — minimal) |