ctrlb-hq · lakshmanpasala · Apr 10, 2026
diff --git a/.cursor/rules/log-analysis.mdc b/.cursor/rules/log-analysis.mdc
@@ -0,0 +1,21 @@
+---
+description: Use ctrlb-decompose to analyze log files before reading them raw
+globs: **/*.log
+alwaysApply: false
+---
+
+# Log File Analysis
+
+This project includes ctrlb-decompose, a log pattern extraction tool. When you encounter a `.log` file or are asked to analyze logs, run ctrlb-decompose first to get a structural summary:
+
+```bash
+./target/release/ctrlb-decompose <file> --llm --context 2 --quiet 2>/dev/null
+```
+
+This compresses raw log lines into patterns with typed variables and quantile stats (p50/p99), and flags anomalies. Error and warning patterns are listed first.
+
+For JSON output: `./target/release/ctrlb-decompose <file> --json --quiet 2>/dev/null`
+
+For piped input: `kubectl logs pod | ./target/release/ctrlb-decompose --llm --source-label "pod-name"`
+
+Workflow: decompose first, identify patterns of interest, then grep or use `--context N` with a higher N to see more examples.
diff --git a/.cursorrules b/.cursorrules
@@ -0,0 +1,35 @@
+This repository contains ctrlb-decompose, a CLI tool that compresses raw log lines into structural patterns with typed variables, quantile statistics, anomaly detection, and temporal correlations.
+
+## Log Analysis
+
+When working with log files (>100 lines), run ctrlb-decompose first to get a structural summary before reading raw logs:
+
+```bash
+./target/release/ctrlb-decompose <file> --llm --context 2 --quiet 2>/dev/null
+```
+
+This compresses thousands of log lines into a handful of actionable patterns with:
+- Error/warning patterns prioritized first
+- Typed variables (IPv4, UUID, Duration, HexID, Integer, Enum, etc.)
+- Quantile statistics (p50, p99, min, max) for numeric variables
+- Example raw lines for each pattern
+
+Key flags:
+- `--llm` — Token-efficient markdown output (default context=2, banner suppressed)
+- `--json` — Structured JSON output for programmatic use
+- `--source-label <name>` — Tag output with a source name (e.g., pod name)
+- `--sim-threshold <0.0-1.0>` — Tune clustering granularity (default 0.5, lower = fewer patterns)
+- `--top <N>` — Limit to top N patterns (default 20)
+- `--context <N>` — Example lines per pattern
+
+Works with stdin: `kubectl logs pod | ./target/release/ctrlb-decompose --llm`
+
+Workflow: decompose first, identify patterns of interest, then drill into the raw file with grep or a higher `--context`.
+
+## Build
+
+```bash
+cargo build --release
+cargo test --locked
+cargo clippy
+```
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -0,0 +1,25 @@
+This repository contains ctrlb-decompose, a CLI tool that compresses raw log lines into structural patterns with typed variables, quantile statistics, and anomaly detection.
+
+## Log Analysis
+
+When working with log files (>100 lines), run ctrlb-decompose first:
+
+```bash
+./target/release/ctrlb-decompose <file> --llm --context 2 --quiet 2>/dev/null
+```
+
+Key flags:
+- `--llm` — Token-efficient markdown (errors first, quantile stats inline)
+- `--json` — Structured JSON for programmatic use
+- `--source-label <name>` — Tag output with source name
+- `--sim-threshold <0.0-1.0>` — Tune clustering granularity (default 0.5)
+
+Works with stdin: `kubectl logs pod | ./target/release/ctrlb-decompose --llm`
+
+## Build & Test
+
+```bash
+cargo build --release
+cargo test --locked
+cargo clippy
+```
diff --git a/.gitignore b/.gitignore
@@ -27,4 +27,8 @@ target
 /logs
 
 # WASM build output (regenerated by wasm-pack)
-/web/pkg
+/web/pkg
+
+# Perf profiling data
+perf.data
+perf.data.old
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,53 @@
+# AGENTS.md
+
+This file provides guidance to Codex and other AI agents working with code in this repository.
+
+## Project
+
+ctrlb-decompose compresses raw log lines into structural patterns with typed variables, quantile statistics, anomaly detection, and temporal correlations. It runs as a CLI tool, a Rust library, or a WASM module.
+
+## Build & Test
+
+```bash
+cargo build --release
+cargo test --locked
+cargo clippy
+```
+
+## Log Analysis
+
+When working with log files (>100 lines), run ctrlb-decompose first to get a structural summary before reading raw logs:
+
+```bash
+./target/release/ctrlb-decompose <file> --llm --context 2 --quiet 2>/dev/null
+```
+
+Key flags:
+- `--llm` — Token-efficient markdown output (banner auto-suppressed, default context=2)
+- `--json` — Structured JSON for programmatic use
+- `--source-label <name>` — Tag output with source name
+- `--sim-threshold <0.0-1.0>` — Tune clustering (default 0.5, lower = fewer patterns)
+- `--top <N>` — Top N patterns (default 20)
+- `--context <N>` — Example lines per pattern
+
+Works with stdin: `kubectl logs pod | ./target/release/ctrlb-decompose --llm`
+
+Workflow: decompose first, identify patterns, then drill into raw logs with grep or higher `--context`.
+
+## Architecture
+
+Single-pass streaming pipeline built around two stages, normalization (CLP) and clustering (Drain3):
+
+1. Timestamp extraction (`src/timestamp.rs`)
+2. CLP encoding (`src/extraction/clp/`) — normalizes variables into typed placeholders
+3. Drain3 clustering (`src/extraction/drain3.rs`) — tree-based prefix clustering with LRU eviction
+4. Variable classification (`src/extraction/drain3.rs`) — semantic types: IPv4, UUID, Duration, HexID, Integer, Float, Enum, String
+5. Statistics (`src/stats.rs`) — DDSketch quantiles, HyperLogLog cardinality, top-k, reservoir sampling
+6. Anomaly detection (`src/anomaly.rs`) — frequency spikes, error cascades, bimodal distributions
+7. Scoring & correlation (`src/scoring.rs`, `src/correlation.rs`)
+8. Output formatting (`src/format/`) — human, llm, json
+
+Entry points:
+- CLI: `main.rs` -> `lib.rs::run(args)`
+- Library: `lib.rs::process_log_text(input, opts)`
+- WASM: `wasm.rs::analyze_logs(input, format, top_n, context_lines)`
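
Note on step 5 of the AGENTS.md architecture list (outside the patch hunks): the "reservoir sampling" used to keep a bounded set of example lines per pattern can be illustrated with classic Algorithm R. The sketch below is only an illustration under assumptions. The `Reservoir` and `XorShift64` types, their methods, and the seed parameter are hypothetical names, not taken from `src/stats.rs`, and the real implementation may differ.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Hypothetical helper; not suitable for anything security-sensitive.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in 0..bound (slight modulo bias, acceptable for a sketch).
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Fixed-capacity sample of the lines offered so far (Algorithm R).
/// Hypothetical type, named for illustration only.
pub struct Reservoir {
    capacity: usize,
    seen: u64,
    samples: Vec<String>,
    rng: XorShift64,
}

impl Reservoir {
    pub fn new(capacity: usize, seed: u64) -> Self {
        Reservoir {
            capacity,
            seen: 0,
            samples: Vec::with_capacity(capacity),
            rng: XorShift64::new(seed),
        }
    }

    /// Offer one line; memory use never exceeds `capacity` lines.
    pub fn offer(&mut self, line: &str) {
        self.seen += 1;
        // Fill phase: keep everything until the reservoir is full.
        if self.samples.len() < self.capacity {
            self.samples.push(line.to_string());
            return;
        }
        if self.capacity == 0 {
            return;
        }
        // Replacement phase: the new line survives with probability
        // capacity / seen, overwriting a uniformly chosen slot.
        let j = self.rng.below(self.seen);
        if j < self.capacity as u64 {
            self.samples[j as usize] = line.to_string();
        }
    }

    pub fn samples(&self) -> &[String] {
        &self.samples
    }

    pub fn seen(&self) -> u64 {
        self.seen
    }
}
```

Calling `offer` once per matching line keeps at most `capacity` examples no matter how long the stream is, which is consistent with the single-pass, memory-bounded design the file describes.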
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,93 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project
+
+ctrlb-decompose compresses raw log lines into structural patterns with typed variables, quantile statistics, anomaly detection, and temporal correlations. It runs as a CLI tool, a Rust library, or a WASM module in the browser.
+
+## Build & Test Commands
+
+```bash
+# Build
+cargo build
+cargo build --release
+
+# Test
+cargo test --locked
+cargo test <test_name>             # Run a single test
+
+# Lint
+cargo clippy
+
+# Build without default features (library-only, no CLI)
+cargo build --no-default-features
+
+# WASM build
+wasm-pack build --target web --out-dir web/pkg -- --no-default-features --features wasm
+```
+
+## Architecture
+
+**Single-pass streaming pipeline** built around two stages, normalization (CLP) and clustering (Drain3):
+
+1. **Timestamp extraction** (`src/timestamp.rs`) — regex-based, stripped before further processing
+2. **CLP encoding** (`src/extraction/clp/`) — normalizes variables (ints, floats, IPs, hex) into typed placeholders
+3. **Drain3 clustering** (`src/extraction/drain3.rs`) — tree-based prefix clustering on logtypes with LRU eviction
+4. **Variable classification** (`src/extraction/drain3.rs`) — merges CLP-decoded values with Drain3 wildcards, classifies into semantic types (IPv4, UUID, Duration, HexID, Integer, Float, Enum, String, etc.)
+5. **Statistics** (`src/stats.rs`) — DDSketch quantiles (~200 bytes/slot), HyperLogLog++ cardinality, top-k, temporal bucketing, reservoir-sampled examples
+6. **Anomaly detection** (`src/anomaly.rs`) — frequency spikes, error cascades, bimodal distributions, low cardinality
+7. **Scoring & correlation** (`src/scoring.rs`, `src/correlation.rs`) — keyword severity, Pearson temporal co-occurrence, shared variables
+8. **Output formatting** (`src/format/`) — human (ANSI terminal), llm (compact markdown), json (structured)
+
+**Entry points:**
+- CLI: `main.rs` → `lib.rs::run(args)`
+- Library: `lib.rs::process_log_text(input, opts) -> AnalysisOutput`
+- WASM: `wasm.rs::analyze_logs(input, format, top_n, context_lines) -> String`
+
+## Feature Gates
+
+- `cli` (default) — includes `clap` and `colored` for terminal use
+- `wasm` — includes `wasm-bindgen` and `serde-wasm-bindgen` for browser use
+- The core library is WASM-safe (no stdin/filesystem deps)
+- Crate type is `["cdylib", "rlib"]` for dual WASM + library output
+
+## Key Design Decisions
+
+- **Single-pass streaming**: no second pass over data; all stats accumulated incrementally
+- **Memory-bounded**: Drain3 LRU (default 10k clusters), DDSketch fixed-size quantiles, HyperLogLog++ fixed-size cardinality, reservoir sampling for examples
+- **Lazy regex compilation**: `once_cell::sync::Lazy` for all regex patterns
+- **Minimum Rust version**: 1.94.0
+
+## Testing
+
+Tests are in `tests/integration.rs` using the fixture at `tests/fixtures/sample.log`. Snapshot testing uses `insta`. Benchmarks use `criterion` in `benches/pipeline.rs`.
+
+## CI
+
+- `ci.yml` — runs on push to main and PRs: `cargo test`, `cargo clippy`, feature matrix (no-default-features, wasm target)
+- `release.yml` — triggered by `v*.*.*` tags: cargo-dist multi-platform binaries + homebrew
+- `wasm-deploy.yml` — deploys WASM build to GitHub Pages on push to main
+
+## Using ctrlb-decompose for log analysis
+
+When working with large log files (>100 lines), use ctrlb-decompose to get a structured summary before reading raw logs. This saves context-window space and surfaces errors and anomalies immediately.
+
+```bash
+# Analyze a log file (LLM-optimized output, 2 example lines per pattern)
+./target/release/ctrlb-decompose <file> --llm --context 2 --quiet 2>/dev/null
+
+# Pipe from any command
+kubectl logs <pod> | ./target/release/ctrlb-decompose --llm --context 2 --quiet --source-label "pod-name" 2>/dev/null
+journalctl -n 5000 | ./target/release/ctrlb-decompose --llm --context 2 --quiet 2>/dev/null
+
+# JSON output for programmatic use
+./target/release/ctrlb-decompose <file> --json --quiet 2>/dev/null
+
+# Tune clustering granularity (lower = more aggressive merging, higher = more patterns)
+./target/release/ctrlb-decompose <file> --llm --sim-threshold 0.6
+```
+
+In LLM mode, the banner is suppressed automatically. The `--quiet` flag suppresses the progress line on stderr.
+
+**Workflow**: Run `--llm` first to identify patterns of interest, then use `--context N` with higher N or grep for specific patterns in the raw file.
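
Note on step 7 of the CLAUDE.md architecture list (outside the patch hunks): "Pearson temporal co-occurrence" presumably means correlating the per-time-bucket counts of two patterns. The function below is a minimal sketch of that calculation under that assumption. The name `pearson` and its signature are hypothetical and are not taken from `src/correlation.rs`.

```rust
/// Pearson correlation between two equally long series of per-bucket counts.
/// Returns None when the inputs differ in length, have fewer than two
/// buckets, or either series has zero variance (correlation is undefined).
/// Hypothetical helper, named for illustration only.
pub fn pearson(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.len() < 2 {
        return None;
    }
    let n = a.len() as f64;
    let mean_a = a.iter().sum::<f64>() / n;
    let mean_b = b.iter().sum::<f64>() / n;

    // Accumulate covariance and both variances in one loop.
    let mut cov = 0.0;
    let mut var_a = 0.0;
    let mut var_b = 0.0;
    for (x, y) in a.iter().zip(b) {
        let dx = x - mean_a;
        let dy = y - mean_b;
        cov += dx * dy;
        var_a += dx * dx;
        var_b += dy * dy;
    }
    if var_a == 0.0 || var_b == 0.0 {
        return None;
    }
    Some(cov / (var_a.sqrt() * var_b.sqrt()))
}
```

Two patterns whose bucket counts rise and fall together score near 1.0, while a pattern that fires only when another goes quiet scores near -1.0. Returning `None` for a constant series avoids dividing by zero.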