diff --git a/README.md b/README.md index 56caa6f..7cc0a5a 100644 --- a/README.md +++ b/README.md @@ -2,11 +2,15 @@ [![PyPI - Version](https://img.shields.io/pypi/v/infer-check?logo=PyPi&color=%233775A9)](https://pypi.org/project/infer-check/) [![Run tests and upload coverage](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml) +[![codecov](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check/graph/badge.svg?token=FWG0Z5YHUS)](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check) +[![Docs](https://img.shields.io/badge/docs-mkdocs-blue)](https://nullpointerdepressivedisorder.github.io/infer-check) **Catches the correctness bugs that benchmarks miss in LLM inference engines.** Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct. +> **[Read the full documentation](https://nullpointerdepressivedisorder.github.io/infer-check)** + ## The problem Every LLM inference engine has correctness bugs that benchmarks don't catch: @@ -18,43 +22,6 @@ Every LLM inference engine has correctness bugs that benchmarks don't catch: These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically. -## Example results - -Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark. - -### Quantization sweep - -4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst: - -``` - Llama-3.1-8B: bf16 vs 4bit -┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ -┃ prompt suite ┃ identical ┃ severe ┃ mean_similarity ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ -│ adversarial-numerics │ 0/30 │ 23/30 │ 0.311 │ -│ reasoning │ 1/50 │ 35/50 │ 0.384 │ -│ code │ 0/49 │ 30/49 │ 0.452 │ -└───────────────────────┴───────────┴──────────┴─────────────────┘ -``` - -A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme. - -### Dense vs. MoE comparison - -Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures. - -### Cross-backend diff - -mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself. - -### Determinism - -Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0. 
- -### Stress test - -vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected. - ## Installation ``` @@ -64,27 +31,43 @@ pip install infer-check pip install "infer-check[mlx]" ``` -## Usage +## Quick start -### Quantization sweep +Compare two quantizations head-to-head: + +``` +infer-check compare \ + mlx-community/Llama-3.1-8B-Instruct-4bit \ + mlx-community/Llama-3.1-8B-Instruct-8bit \ + --prompts adversarial-numerics +``` -Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo. Use `--max-tokens` to control generation length (defaults to 1024) and `--num-prompts` to limit the number of prompts used. +Run a full quantization sweep: ``` infer-check sweep \ --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\ 8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\ 4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \ - --backend mlx-lm \ - --prompts reasoning \ - --max-tokens 512 \ - --num-prompts 10 \ - --output ./results/sweep/ + --prompts reasoning ``` -`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`, `quant-sensitive`) or a path to any `.jsonl` file. +## Commands + +| Command | Purpose | Docs | +| --- | --- | --- | +| `sweep` | Compare pre-quantized models against a baseline | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/sweep/) | +| `compare` | Head-to-head comparison of two models or quantizations | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/compare/) | +| `diff` | Compare outputs across different backends for the same model | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/diff/) | +| `determinism` | Test output reproducibility at temperature=0 | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/determinism/) | +| `stress` | Test correctness under concurrent load | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/stress/) | +| `report` | Generate HTML/JSON reports from saved results | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/report/) | + +## Example results + +Results from running `infer-check` on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm. -The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable. +### Quantization sweep ``` Sweep Summary @@ -97,87 +80,46 @@ The baseline is automatically run twice as a self-check — if it's not 50/50 id └─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘ ``` -### Cross-backend diff - -Same model, same quant, different inference paths. Catches serving-layer bugs. +A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. -``` -# Start vllm-mlx in another terminal: -# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000 - -infer-check diff \ - --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ - --backends "mlx-lm,openai-compat" \ - --base-urls ",http://127.0.0.1:8000" \ - --prompts reasoning \ - --output ./results/diff/ -``` - -Uses `/v1/chat/completions` by default (`--chat`) so server-side chat templates match the local backend. Pass `--no-chat` for raw `/v1/completions`. 
- -### Determinism - -Same prompt N times at temperature=0. Output should be bit-identical every run. - -``` -infer-check determinism \ - --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ - --backend mlx-lm \ - --prompts determinism \ - --runs 20 \ - --output ./results/determinism/ -``` +### Cross-backend diff -### Stress test +mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected. -Concurrent requests through a serving backend. Tests KV cache correctness under load. +### Determinism & stress -``` -infer-check stress \ - --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ - --backend openai-compat \ - --base-url http://127.0.0.1:8000 \ - --prompts reasoning \ - --concurrency 1,2,4,8 \ - --output ./results/stress/ -``` +100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8. -### Report +## Supported backends -Generate an HTML report from all saved results. +| Backend | Type | Use case | +|-------------------| --- | --- | +| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs | +| **llama-cpp** | HTTP | `llama-server` via `/completion` endpoint | +| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon | +| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) | -``` -infer-check report ./results/ --format html -``` +See the [backends documentation](https://nullpointerdepressivedisorder.github.io/infer-check/backends/) for setup and configuration details. ## Prompt suites -Curated prompts targeting known quantization failure modes: +Six curated suites ship with the package — no need to clone the repo: | Suite | Count | Purpose | | --- | --- | --- | -| `reasoning.jsonl` | 50 | Multi-step math and logic | -| `code.jsonl` | 49 | Python, JSON, SQL generation | -| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision | -| `long-context.jsonl` | 10 | Tables and transcripts with recall questions | -| `quant-sensitive.jsonl` | 20 | Multi-digit arithmetic, long CoT, precise syntax | -| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing | +| `reasoning` | 50 | Multi-step math and logic | +| `code` | 49 | Python, JSON, SQL generation | +| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision | +| `long-context` | 10 | Tables and transcripts with recall questions | +| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax | +| `determinism` | 50 | High-entropy continuations for determinism testing | -All suites ship with the package — no need to clone the repo. 
Custom suites are JSONL files with one object per line (default `max_tokens` is 1024): +Custom suites are JSONL files with one object per line: ```json {"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512} ``` -## Supported backends - -| Backend | Type | Use case | -| --- | --- | --- | -| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs | -| **llama.cpp** | HTTP | `llama-server` via `/completion` endpoint | -| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon | -| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) | - ## Roadmap - [ ] GGUF backend (direct llama.cpp integration without HTTP) diff --git a/docs/backends.md b/docs/backends.md new file mode 100644 index 0000000..0d82caf --- /dev/null +++ b/docs/backends.md @@ -0,0 +1,154 @@ +# Backends + +`infer-check` supports four inference backends. Each backend implements a common protocol for generation, health checks, and cleanup. + +## Overview + +| Backend | Type | Default URL | Use case | +|-------------------|------|-------------|----------| +| **mlx-lm** | In-process | (local) | Local Apple Silicon inference with logprobs | +| **llama-cpp** | HTTP | `http://127.0.0.1:8080` | llama-server via `/completion` endpoint | +| **vllm-mlx** | HTTP | `http://127.0.0.1:8000` | Continuous batching on Apple Silicon | +| **openai-compat** | HTTP | `http://127.0.0.1:11434/v1` | Any OpenAI-compatible server (vLLM, SGLang, Ollama) | + +## mlx-lm + +In-process inference using Apple's MLX framework. Runs directly on Apple Silicon with no server required. + +**Install**: `pip install "infer-check[mlx]"` (requires `mlx` and `mlx-lm` packages) + +**Features**: + +- Generates per-token logprobs when available via `generate_step()` +- Falls back to simple generation if logprobs aren't supported +- Lazy model loading -- the model is downloaded and loaded on first use, not at import time +- Single-threaded sequential inference + +**When to use**: Local testing on Mac. Best baseline for quantization sweeps since it runs natively with no serving layer overhead. + +**Example**: + +```bash +infer-check compare \ + mlx-community/Llama-3.1-8B-Instruct-bf16 \ + mlx-community/Llama-3.1-8B-Instruct-4bit +``` + +## llama.cpp + +HTTP backend targeting [llama-server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) (the built-in HTTP server from llama.cpp). + +**Setup**: Start llama-server separately: + +```bash +llama-server -m /path/to/model.gguf --port 8080 +``` + +**Features**: + +- Uses the `/completion` endpoint for text generation +- Requests top-10 token probabilities and converts them to logprobs +- Aligns token distributions by ID metadata for cross-backend comparison +- 120-second request timeout + +**When to use**: Testing GGUF models served via llama.cpp. Good for comparing GGUF quantization formats against each other or against MLX native quantization. + +**Example**: + +```bash +infer-check determinism \ + --model my-model \ + --backend llama-cpp \ + --base-url http://127.0.0.1:8080 \ + --prompts determinism \ + --runs 20 +``` + +## vllm-mlx + +HTTP backend for [vllm-mlx](https://github.com/vllm-project/vllm-mlx), a continuous-batching inference server for Apple Silicon. Extends the OpenAI-compatible backend with model-aware health checks. 
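For intuition, a model-aware health check can be as small as querying the server's model list and checking for the expected ID. The sketch below assumes the standard OpenAI-style `/v1/models` endpoint and a hypothetical helper name; it is not vllm-mlx's or infer-check's actual code:

```python
import requests

def model_is_loaded(base_url: str, expected_model: str, timeout: float = 5.0) -> bool:
    """Hypothetical health check: ask the server which models it is serving."""
    try:
        resp = requests.get(f"{base_url}/v1/models", timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # server down, unreachable, or returned an error
    served = {m.get("id", "") for m in resp.json().get("data", [])}
    return expected_model in served
```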
+ +**Setup**: Start vllm-mlx separately: + +```bash +vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000 +``` + +**Features**: + +- Inherits all capabilities from the openai-compat backend +- Model-aware health check verifies the expected model is loaded +- Supports both `/v1/chat/completions` and `/v1/completions` endpoints + +**When to use**: Testing continuous-batching serving layer correctness. Ideal for `diff` and `stress` commands to verify the serving layer doesn't introduce divergence. + +**Example**: + +```bash +infer-check diff \ + --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ + --backends "mlx-lm,vllm-mlx" \ + --base-urls ",http://127.0.0.1:8000" \ + --prompts reasoning +``` + +## openai-compat + +Generic backend for any server that implements the OpenAI API format. Works with vLLM, SGLang, Ollama, and others. + +**Features**: + +- Supports both `/v1/chat/completions` and `/v1/completions` endpoints +- Requests logprobs with graceful fallback if unsupported +- 120-second request timeout +- Detailed error messages for connection, timeout, and HTTP errors + +**When to use**: Any OpenAI-compatible server. This is the most flexible backend and the default for Ollama-style model tags. + +**Default URLs by resolution**: + +| Model source | Default URL | +|-------------|-------------| +| Ollama tags (e.g., `llama3.1:8b`) | `http://127.0.0.1:11434/v1` | +| Custom server | Use `--base-url` | + +**Example with Ollama**: + +```bash +infer-check compare \ + ollama:llama3.1:8b-instruct-q4_K_M \ + ollama:llama3.1:8b-instruct-q8_0 +``` + +**Example with custom server**: + +```bash +infer-check stress \ + --model my-model \ + --backend openai-compat \ + --base-url http://my-server:8000/v1 \ + --prompts reasoning \ + --concurrency 1,2,4,8 +``` + +## Chat vs completions + +HTTP backends support two endpoint modes: + +- **Chat** (`--chat`, default) -- uses `/v1/chat/completions`. The server applies its chat template. Use this when the server is configured with the correct chat template for your model. +- **Completions** (`--no-chat`) -- uses `/v1/completions`. Sends raw text with no template. Use this for raw text generation or when you want to control the prompt format yourself. + +The `--chat` / `--no-chat` flag applies to the `diff` command. The `compare` command always uses completions mode to avoid template differences between backends. + +## Backend selection + +Backends are selected in different ways depending on the command: + +| Command | How backend is chosen | +|---------|----------------------| +| `compare` | Auto-detected from each model spec | +| `sweep` | `--backend` flag (shared across all models) or auto-detected | +| `diff` | `--backends` flag (explicit list) | +| `stress` | `--backend` flag or auto-detected from model | +| `determinism` | `--backend` flag or auto-detected from model | +| `report` | N/A (operates on saved results) | diff --git a/docs/commands/compare.md b/docs/commands/compare.md new file mode 100644 index 0000000..aa24c13 --- /dev/null +++ b/docs/commands/compare.md @@ -0,0 +1,95 @@ +# compare + +Head-to-head comparison of two models, quantizations, or backends. Auto-detects the backend from model specs, or accepts explicit prefixes. + +## CLI Reference + +::: mkdocs-click + :module: infer_check.cli + :command: main + :prog_name: infer-check + :subcommand: compare + :style: table + :show_subcommand_aliases: + +## How it works + +1. 
**Resolve models** -- each model spec is resolved to a backend type, model ID, and base URL using the [model resolution rules](../getting-started.md#model-resolution). +2. **Generate pass A** -- runs all prompts through the first backend. +3. **Generate pass B** -- runs all prompts through the second backend. +4. **Compare** -- for each prompt, computes text similarity, severity classification, and KL divergence (when logprobs are available). +5. **Answer extraction** -- extracts the functional answer from each response based on the prompt category (numeric, boolean, code, JSON, or raw text). Computes flip rate from extracted answers. +6. **Per-category breakdown** -- groups results by prompt category and computes per-category flip rate and mean similarity. +7. **Report** -- optionally generates an HTML report with detailed comparisons. + +## Model spec prefixes + +| Prefix | Backend | Example | +|--------|---------|---------| +| `mlx:` | mlx-lm | `mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` | +| `ollama:` | openai-compat | `ollama:llama3.1:8b-instruct-q4_K_M` | +| `gguf:` | llama-cpp | `gguf:/path/to/model.gguf` | +| `vllm-mlx:` | vllm-mlx | `vllm-mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` | + +Without a prefix, the backend is auto-detected from the model path. See [Getting Started](../getting-started.md#model-resolution) for full resolution rules. + +## Examples + +Two MLX quantizations: + +```bash +infer-check compare \ + mlx-community/Llama-3.1-8B-Instruct-4bit \ + mlx-community/Llama-3.1-8B-Instruct-8bit +``` + +MLX native vs Ollama GGUF: + +```bash +infer-check compare \ + mlx-community/Llama-3.1-8B-Instruct-4bit \ + ollama:llama3.1:8b-instruct-q4_K_M +``` + +With custom labels and limited prompts: + +```bash +infer-check --num-prompts 10 compare \ + mlx-community/Llama-3.1-8B-Instruct-4bit \ + mlx-community/Llama-3.1-8B-Instruct-8bit \ + --label-a "4bit" \ + --label-b "8bit" \ + --prompts reasoning \ + --no-report +``` + +## Output + +The command displays three tables: + +**Summary table** -- overall metrics: + +| Metric | Description | +|--------|-------------| +| prompts | Total number of prompts tested | +| flip rate | Fraction of prompts where the extracted answer changed | +| mean KL divergence | Average KL(baseline \|\| test) across prompts | +| mean text similarity | Average text similarity (0-1) | +| identical / minor / moderate / severe | Severity tier counts | + +**Per-category breakdown** -- flip rate and mean similarity by prompt category. + +**Flipped prompts detail** -- for each prompt where the answer flipped, shows the prompt text, category, extraction strategy, both answers, and similarity score. + +## Output format + +Results are saved as a JSON file containing a `CompareResult` with: + +- `model_a`, `model_b` -- model configuration labels +- `backend_a`, `backend_b` -- backend names +- `comparisons` -- all per-prompt `ComparisonResult` objects +- `flip_rate` -- fraction of prompts with answer flips +- `mean_kl_divergence` -- average KL divergence +- `mean_text_similarity` -- average text similarity +- `per_category_stats` -- per-category breakdown with flip rate, mean similarity, count +- `timestamp` -- when the comparison completed diff --git a/docs/commands/determinism.md b/docs/commands/determinism.md new file mode 100644 index 0000000..cbca339 --- /dev/null +++ b/docs/commands/determinism.md @@ -0,0 +1,69 @@ +# determinism + +Test whether a backend produces identical outputs across repeated runs at temperature=0. 
A correctly implemented inference engine should produce bit-identical output for the same prompt and parameters every time. + +## CLI Reference + +::: mkdocs-click + :module: infer_check.cli + :command: main + :prog_name: infer-check + :subcommand: determinism + :style: table + :show_subcommand_aliases: + +## How it works + +1. **Force temperature=0** -- all prompts are run at temperature=0 to ensure deterministic sampling. +2. **Repeat runs** -- each prompt is sent to the backend N times (default 100). +3. **Count identical outputs** -- counts how many runs produced the exact same text as the most common output. +4. **Find divergence positions** -- for each pair of non-identical outputs, identifies the first token position where they diverge. +5. **Compute score** -- determinism score = identical_count / num_runs (1.0 = perfect). + +## Example + +```bash +infer-check determinism \ + --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ + --backend mlx-lm \ + --prompts determinism \ + --runs 20 \ + --output ./results/determinism/ +``` + +Output: + +``` + Determinism Summary +┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ prompt_id ┃ runs ┃ identical ┃ determinism_score ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ d1a2b3c4-... │ 20 │ 20 │ 100.00% │ +│ e5f6a7b8-... │ 20 │ 20 │ 100.00% │ +│ ... │ 20 │ 20 │ 100.00% │ +└──────────────────────────────────────┴──────┴───────────┴───────────────────┘ + +Overall determinism score: 100.00% +``` + +## What non-determinism means + +A determinism score below 100% indicates that the backend is not producing consistent output at temperature=0. Common causes: + +- **Floating-point non-determinism** in GPU kernels (different thread scheduling leads to different rounding) +- **KV cache bugs** that accumulate errors across requests +- **Batching interference** where concurrent requests affect each other's outputs +- **Buggy sampling implementations** that don't properly handle temperature=0 + +!!! warning + Non-determinism at temperature=0 is always a bug in the inference engine, not a property of the model. A correct implementation must produce identical output for identical inputs. + +## Output format + +Results are saved as a JSON array of `DeterminismResult` objects, each containing: + +- `prompt_id` -- reference to the prompt +- `num_runs` -- total number of runs +- `identical_count` -- how many runs matched the most common output +- `divergence_positions` -- token indices where any pair of runs diverged +- `determinism_score` -- identical_count / num_runs diff --git a/docs/commands/diff.md b/docs/commands/diff.md new file mode 100644 index 0000000..0686c1d --- /dev/null +++ b/docs/commands/diff.md @@ -0,0 +1,83 @@ +# diff + +Compare outputs across different backends for the same model and prompts. Catches serving-layer bugs by holding the model and quantization constant while varying the inference path. + +## CLI Reference + +::: mkdocs-click + :module: infer_check.cli + :command: main + :prog_name: infer-check + :subcommand: diff + :style: table + :show_subcommand_aliases: + +## How it works + +1. **Build backends** -- creates a backend instance for each entry in `--backends`, using the shared `--model` and optional `--quant`. +2. **Baseline pass** -- generates outputs for all prompts using the first backend. +3. **Test passes** -- generates outputs for all prompts using each remaining backend. +4. 
**Compare** -- each test backend's outputs are compared against the baseline, producing per-prompt `ComparisonResult` objects with severity, text similarity, and flip metadata. +5. **Summary table** -- groups results by test backend and displays failure rate, flip rate, and mean similarity. + +## Base URL matching + +The `--base-urls` option is positionally matched to `--backends`. Use an empty entry for backends that don't need a URL (e.g., mlx-lm): + +```bash +--backends "mlx-lm,openai-compat" \ +--base-urls ",http://127.0.0.1:8000" +``` + +This gives mlx-lm no URL (local inference) and openai-compat the vllm-mlx server URL. + +## Examples + +mlx-lm vs vllm-mlx serving layer: + +```bash +# Start vllm-mlx in another terminal: +# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000 + +infer-check diff \ + --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ + --backends "mlx-lm,openai-compat" \ + --base-urls ",http://127.0.0.1:8000" \ + --prompts reasoning \ + --output ./results/diff/ +``` + +With raw completions endpoint (no chat template): + +```bash +infer-check diff \ + --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ + --backends "mlx-lm,openai-compat" \ + --base-urls ",http://127.0.0.1:8000" \ + --prompts reasoning \ + --no-chat +``` + +## Output + +``` + Diff Summary +┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ +┃ test_backend ┃ failures ┃ failure_rate ┃ flip_rate ┃ mean_similarity ┃ +┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ +│ openai-compat │ 0 │ 0.00% │ 0.0% │ 1.0000 │ +└───────────────┴──────────┴──────────────┴───────────┴─────────────────┘ +``` + +A 100% similarity with 0% flip rate means the serving layer introduces zero divergence -- any output differences in production come from quantization, not the backend itself. + +## Output format + +Results are saved as a JSON array of `ComparisonResult` objects, each containing: + +- `baseline` / `test` -- the `InferenceResult` from each backend +- `kl_divergence` -- KL(baseline || test) if logprobs are available +- `token_divergence_index` -- first token where the outputs differ +- `text_similarity` -- 0-1 similarity score +- `is_failure` -- true if similarity < 0.5 +- `metadata` -- includes severity, flip status, answers, extraction strategy diff --git a/docs/commands/report.md b/docs/commands/report.md new file mode 100644 index 0000000..63e421e --- /dev/null +++ b/docs/commands/report.md @@ -0,0 +1,46 @@ +# report + +Generate a report from previously saved result JSON files. Supports HTML and JSON output formats. + +## CLI Reference + +::: mkdocs-click + :module: infer_check.cli + :command: main + :prog_name: infer-check + :subcommand: report + :style: table + :show_subcommand_aliases: + +## How it works + +1. **Scan** -- recursively finds all `.json` files in the results directory. +2. **Load** -- reads each file and collects all result objects. Handles both single objects and arrays. Skips files that fail to parse. +3. **Generate** -- delegates to the format-specific exporter (HTML or JSON). +4. **Open** -- for HTML reports, automatically opens the report in your default browser. 
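The scan-and-load step boils down to a recursive glob plus tolerant JSON parsing. A minimal sketch of that idea, with a hypothetical helper name rather than the tool's actual internals:

```python
import json
from pathlib import Path

def load_results(results_dir: str) -> list[dict]:
    """Collect result objects from every .json file under results_dir."""
    collected: list[dict] = []
    for path in Path(results_dir).rglob("*.json"):
        try:
            data = json.loads(path.read_text())
        except (json.JSONDecodeError, OSError):
            continue  # skip files that fail to parse
        # Handle both a single result object and an array of results.
        collected.extend(data if isinstance(data, list) else [data])
    return collected
```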
+ +## Examples + +Generate an HTML report from all results: + +```bash +infer-check report ./results/ --format html +``` + +Generate a JSON report to a specific file: + +```bash +infer-check report ./results/ --format json --output ./summary.json +``` + +Report from a specific command's results: + +```bash +infer-check report ./results/compare/ --format html +``` + +## Notes + +- The report command does not have `--max-tokens` or `--num-prompts` options since it operates on previously generated results. +- Result files from any command (sweep, compare, diff, stress, determinism) can be mixed in the same directory. The report handles heterogeneous result types. +- If the HTML reporting module is not available, a minimal HTML page with raw JSON data is generated as a fallback. diff --git a/docs/commands/stress.md b/docs/commands/stress.md new file mode 100644 index 0000000..32c87d5 --- /dev/null +++ b/docs/commands/stress.md @@ -0,0 +1,65 @@ +# stress + +Stress-test a backend with varying concurrency levels. Tests whether concurrent requests cause output divergence, KV cache corruption, or errors. + +## CLI Reference + +::: mkdocs-click + :module: infer_check.cli + :command: main + :prog_name: infer-check + :subcommand: stress + :style: table + :show_subcommand_aliases: + +## How it works + +1. **Baseline pass** -- runs all prompts at concurrency=1 (the first level). These outputs become the reference. +2. **Concurrent passes** -- for each concurrency level, runs all prompts with that many concurrent requests using `asyncio.Semaphore`. +3. **Consistency check** -- compares each concurrent output against the baseline (concurrency=1) output for the same prompt. +4. **Error tracking** -- counts failed requests at each concurrency level. +5. **Summary** -- displays output consistency and error count per concurrency level. + +## Example + +```bash +infer-check stress \ + --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \ + --backend openai-compat \ + --base-url http://127.0.0.1:8000 \ + --prompts reasoning \ + --concurrency 1,2,4,8 \ + --output ./results/stress/ +``` + +Output: + +``` + Stress Test Summary +┏━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ +┃ concurrency ┃ errors ┃ output_consistency ┃ +┡━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ +│ 1 │ 0 │ 100.00% │ +│ 2 │ 0 │ 100.00% │ +│ 4 │ 0 │ 100.00% │ +│ 8 │ 0 │ 100.00% │ +└─────────────┴────────┴─────────────────────┘ +``` + +## What to look for + +- **Errors at high concurrency** -- the backend is failing under load. Check server logs for OOM, timeout, or connection errors. +- **Dropping output consistency** -- concurrent requests are interfering with each other. This is a strong signal of KV cache corruption or batch-dependent computation bugs. +- **Consistency drop at a specific threshold** -- if consistency drops sharply at concurrency N, the backend likely has a fixed-size buffer or cache that overflows at that level. + +!!! tip + For HTTP backends (openai-compat, vllm-mlx, llama-cpp), make sure the server is running before starting the stress test. The `--base-url` option lets you point to any running server. 
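For intuition, the semaphore-bounded fan-out (step 2) and the consistency check (step 3) look roughly like the sketch below -- a simplified illustration with a hypothetical `generate()` coroutine, not the tool's actual code:

```python
import asyncio

async def run_level(prompts: list[str], concurrency: int,
                    baseline: dict[str, str], generate) -> float:
    """Run all prompts with at most `concurrency` in flight; return output consistency."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> str:
        async with sem:  # caps the number of concurrent requests to the server
            return await generate(prompt)

    outputs = await asyncio.gather(*(bounded(p) for p in prompts))
    matches = sum(out == baseline[p] for p, out in zip(prompts, outputs))
    return matches / len(prompts)  # 1.0 means no divergence under load
```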
+ +## Output format + +Results are saved as a JSON array of `StressResult` objects, each containing: + +- `concurrency_level` -- the concurrency level tested +- `results` -- all `InferenceResult` objects from that level +- `error_count` -- number of failed requests +- `output_consistency` -- fraction of outputs matching the baseline (concurrency=1) diff --git a/docs/commands/sweep.md b/docs/commands/sweep.md new file mode 100644 index 0000000..3236bb1 --- /dev/null +++ b/docs/commands/sweep.md @@ -0,0 +1,70 @@ +# sweep + +Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo or local path. The first model (or `--baseline`) is the reference; all others are compared against it. + +## CLI Reference + +::: mkdocs-click + :module: infer_check.cli + :command: main + :prog_name: infer-check + :subcommand: sweep + :style: table + :show_subcommand_aliases: + +## How it works + +1. **Parse models** -- splits the `--models` string into label/path pairs and creates a backend for each. +2. **Baseline self-check** -- runs the baseline model twice on all prompts and compares the results. If the baseline isn't perfectly deterministic (50/50 identical), you'll see a warning. This tells you whether your comparison data is reliable. +3. **Test comparisons** -- runs every other quantization against the baseline and computes per-prompt metrics (text similarity, severity, KL divergence). +4. **Checkpoint saves** -- results are saved incrementally after each quantization level completes, so partial results survive interruptions. +5. **Summary table** -- displays a table grouped by quantization level with severity breakdowns. + +## Model format + +The `--models` option accepts comma-separated entries. Each entry can be: + +- **Labeled**: `label=model_path` (e.g., `bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16`) +- **Unlabeled**: just the model path (the last path component becomes the label) + +You need at least 2 models (one baseline + one test). + +## Example + +Full sweep across three quantization levels: + +```bash +infer-check --max-tokens 512 --num-prompts 10 sweep \ + --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\ + 8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\ + 4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \ + --backend mlx-lm \ + --prompts reasoning \ + --output ./results/sweep/ +``` + +Output: + +``` + Sweep Summary +┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ +┃ quant_level ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃ +┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ +│ bf16 (self-check) │ 50/50 │ 0/50 │ 0/50 │ 0/50 │ 1.0000 │ +│ 8bit │ 20/50 │ 9/50 │ 12/50 │ 9/50 │ 0.8067 │ +│ 4bit │ 1/50 │ 3/50 │ 11/50 │ 35/50 │ 0.3837 │ +└─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘ +``` + +The self-check row confirms the baseline is deterministic. The 4-bit row shows 35/50 severe divergences -- a clear signal of quantization-induced correctness degradation. 
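The `--models` parsing rule from the "Model format" section above is simple enough to show directly. A sketch of that labeling logic, using a hypothetical helper name rather than the actual CLI internals:

```python
def parse_models(spec: str) -> list[tuple[str, str]]:
    """Split a --models string into (label, model_path) pairs."""
    pairs: list[tuple[str, str]] = []
    for entry in (e.strip() for e in spec.split(",") if e.strip()):
        if "=" in entry:
            label, path = entry.split("=", 1)  # labeled: "bf16=mlx-community/..."
        else:
            # unlabeled: the last path component becomes the label
            label, path = entry.rsplit("/", 1)[-1], entry
        pairs.append((label.strip(), path.strip()))
    if len(pairs) < 2:
        raise ValueError("need at least a baseline and one test model")
    return pairs
```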
+ +## Output format + +Results are saved as a JSON file containing a `SweepResult` with: + +- `model_id` -- the baseline model identifier +- `backend_name` -- the backend used +- `quantization_levels` -- list of quantization labels +- `comparisons` -- all per-prompt `ComparisonResult` objects +- `timestamp` -- when the sweep completed +- `summary` -- aggregate statistics (mean KL, failure counts) diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..5678b3a --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,139 @@ +# Getting Started + +## Requirements + +- Python >= 3.11 +- macOS with Apple Silicon (for mlx-lm backend) or Linux +- At least one supported backend + +## Installation + +```bash +pip install infer-check +``` + +For Apple Silicon users who want local inference via MLX: + +```bash +pip install "infer-check[mlx]" +``` + +To verify the installation: + +```bash +infer-check --version +``` + +## Global options + +These options apply to all commands: + +| Option | Default | Description | +|--------|---------|-------------| +| `--max-tokens` | `1024` | Default max tokens for generation. Applies to all prompts unless they specify their own. | +| `--num-prompts` | all | Limit the number of prompts to use from a suite. | +| `--version` | | Show version and exit. | + +Global options go before the subcommand: + +```bash +infer-check --max-tokens 512 --num-prompts 10 compare ... +``` + +## Your first test + +The simplest way to start is with the `compare` command. It takes two model specs and runs them against a prompt suite: + +```bash +infer-check compare \ + mlx-community/Llama-3.1-8B-Instruct-4bit \ + mlx-community/Llama-3.1-8B-Instruct-8bit \ + --prompts adversarial-numerics +``` + +This will: + +1. Auto-detect the backend (mlx-lm for `mlx-community/` repos) +2. Load 30 adversarial-numerics prompts +3. Run each prompt through both models +4. Compare outputs and compute metrics (flip rate, KL divergence, text similarity) +5. Display a summary table and save JSON results to `./results/compare/` + +## Prompt suites + +`infer-check` ships with curated prompt suites targeting known quantization failure modes: + +| Suite | Count | Purpose | +|-------|-------|---------| +| `reasoning` | 50 | Multi-step math and logic | +| `code` | 49 | Python, JSON, SQL generation | +| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision | +| `long-context` | 10 | Tables and transcripts with recall questions | +| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax | +| `determinism` | 50 | High-entropy continuations for determinism testing | + +All suites ship with the package -- no need to clone the repo. 
Use them by name: + +```bash +--prompts reasoning +--prompts adversarial-numerics +``` + +### Custom prompt suites + +Create a `.jsonl` file with one JSON object per line: + +```json +{"id": "custom-001", "text": "What is 2^31 - 1?", "category": "math", "max_tokens": 512} +{"id": "custom-002", "text": "Write a Python function to sort a list.", "category": "code"} +``` + +| Field | Required | Default | Description | +|-------|----------|---------|-------------| +| `id` | no | auto-generated UUID | Unique identifier | +| `text` | yes | | The prompt text | +| `category` | no | `"general"` | Task category (used for per-category breakdowns) | +| `max_tokens` | no | `1024` | Max generation tokens for this prompt | + +Then pass the path: + +```bash +--prompts ./my-custom-suite.jsonl +``` + +## Model resolution + +The `compare` command auto-detects the backend from the model spec. You can also use explicit prefixes: + +| Prefix | Backend | Example | +|--------|---------|---------| +| `mlx:` | mlx-lm | `mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` | +| `ollama:` | openai-compat (Ollama) | `ollama:llama3.1:8b-instruct-q4_K_M` | +| `gguf:` | llama-cpp | `gguf:/path/to/model.gguf` | +| `vllm-mlx:` | vllm-mlx | `vllm-mlx:mlx-community/Llama-3.1-8B-Instruct-4bit` | + +Without a prefix, resolution follows these rules: + +1. Path ends in `.gguf` --> llama-cpp +2. Repo starts with `mlx-community/` or contains `-mlx` --> mlx-lm +3. Repo contains `gguf`, `bartowski`, `maziyarpanahi`, `mradermacher` --> llama-cpp +4. Contains `:` but no `/` (Ollama tag style) --> openai-compat +5. Fallback --> mlx-lm + +## Output and results + +All commands save JSON results to their `--output` directory (defaults to `./results//`). Result files include timestamps in their filenames to avoid overwrites. + +Generate an HTML report from any results directory: + +```bash +infer-check report ./results/ --format html +``` + +See [Interpreting Results](interpreting-results.md) for details on what the metrics mean. + +## Next steps + +- [Commands reference](commands/sweep.md) -- full details on every command +- [Backends](backends.md) -- supported backends and configuration +- [Interpreting Results](interpreting-results.md) -- understanding metrics and severity levels diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..ef76bac --- /dev/null +++ b/docs/index.md @@ -0,0 +1,80 @@ +# infer-check + +**Correctness and reliability testing for LLM inference engines.** + +Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are *smart* -- `infer-check` tests whether engines are *correct*. + +## The problem + +Every LLM inference engine has correctness bugs that benchmarks don't catch: + +- **KV cache NaN pollution** in vLLM-Ascend permanently corrupts all subsequent requests +- **FP8 KV quantization** in vLLM causes repeated garbage output +- **32.5% element mismatches** in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs +- **Batch-size-dependent output** where tokens change depending on concurrent request count + +These aren't model quality problems -- they're engine correctness failures. `infer-check` runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically. 
+ +## What it does + +`infer-check` provides six commands for testing inference correctness: + +| Command | Purpose | +|---------|---------| +| [`sweep`](commands/sweep.md) | Compare pre-quantized models against a baseline | +| [`compare`](commands/compare.md) | Head-to-head comparison of two models or quantizations | +| [`diff`](commands/diff.md) | Compare outputs across different backends for the same model | +| [`determinism`](commands/determinism.md) | Test output reproducibility at temperature=0 | +| [`stress`](commands/stress.md) | Test correctness under concurrent load | +| [`report`](commands/report.md) | Generate HTML/JSON reports from saved results | + +## Example results + +Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B on Apple Silicon using mlx-lm and vllm-mlx. + +### Quantization sweep + +4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation: + +``` + Llama-3.1-8B: bf16 vs 4bit +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ +┃ prompt suite ┃ identical ┃ severe ┃ mean_similarity ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ +│ adversarial-numerics │ 0/30 │ 23/30 │ 0.311 │ +│ reasoning │ 1/50 │ 35/50 │ 0.384 │ +│ code │ 0/49 │ 30/49 │ 0.452 │ +└───────────────────────┴───────────┴──────────┴─────────────────┘ +``` + +### Cross-backend diff + +mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). The vllm-mlx serving layer introduced zero divergence. + +### Determinism + +Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0. + +### Stress test + +vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected. + +## Quick start + +```bash +pip install infer-check + +# With MLX backend support (Apple Silicon) +pip install "infer-check[mlx]" +``` + +Then run your first comparison: + +```bash +infer-check compare \ + mlx-community/Llama-3.1-8B-Instruct-4bit \ + mlx-community/Llama-3.1-8B-Instruct-8bit \ + --prompts adversarial-numerics +``` + +See the [Getting Started](getting-started.md) guide for a full walkthrough. diff --git a/docs/interpreting-results.md b/docs/interpreting-results.md new file mode 100644 index 0000000..e908e7b --- /dev/null +++ b/docs/interpreting-results.md @@ -0,0 +1,153 @@ +# Interpreting Results + +This guide explains the metrics `infer-check` computes and how to interpret them. + +## Severity tiers + +Every per-prompt comparison is classified into one of four severity tiers based on text similarity: + +| Severity | Similarity range | Meaning | +|----------|-----------------|---------| +| **identical** | 1.0 | Outputs are character-for-character the same. | +| **minor** | >= 0.8 | Small wording differences. The answer is functionally the same. | +| **moderate** | >= 0.5 | Significant differences, but some overlap. May or may not affect correctness. | +| **severe** | < 0.5 | Outputs are fundamentally different. Likely a correctness failure. | + +Text similarity is computed using `difflib.SequenceMatcher`, which measures the ratio of matching characters between the two outputs. + +## Flip rate + +The flip rate is the fraction of prompts where the **functional answer** changed between the two models or backends. 
Unlike text similarity (which measures surface-level text overlap), flip rate uses answer extraction to determine whether the actual answer is different. + +### Answer extraction strategies + +`infer-check` extracts the functional answer from each response based on the prompt's category: + +| Category | Strategy | What it extracts | +|----------|----------|-----------------| +| Numeric prompts | `numeric` | Last number in the response (integers, decimals, scientific notation) | +| Boolean prompts | `boolean` | Yes/no with negation detection | +| Code prompts | `code` | Fenced code blocks with whitespace normalization | +| JSON prompts | `json` | Parsed and canonicalized JSON | +| Everything else | `raw` | Full lowercased text | + +A "flip" occurs when the extracted answers from two models don't match. For example: + +- Model A answers "42", Model B answers "43" --> **flipped** (numeric) +- Model A answers "Yes", Model B answers "No" --> **flipped** (boolean) +- Model A and B give the same code but different commentary --> **not flipped** (code blocks match) + +### Flip rate vs severity + +These metrics capture different things: + +- **Severity** measures how different the full text outputs are +- **Flip rate** measures whether the functional answer changed + +A response can have "severe" text divergence but no flip (e.g., different reasoning paths reaching the same answer), or "minor" text divergence with a flip (e.g., nearly identical text except the final number is wrong). + +Flip rate is generally the more actionable metric for assessing correctness. + +## KL divergence + +KL divergence (Kullback-Leibler divergence) measures how different the token probability distributions are between two backends. It's computed as KL(baseline || test) -- how much information is lost when using the test distribution to approximate the baseline. + +| KL divergence | Interpretation | +|---------------|----------------| +| 0.0 | Identical distributions | +| < 0.01 | Very similar -- negligible difference | +| 0.01 - 0.1 | Small differences in token probabilities | +| 0.1 - 1.0 | Moderate divergence -- different confidence levels | +| > 1.0 | Large divergence -- fundamentally different predictions | + +!!! note + KL divergence is only available when both backends provide logprobs or token probability distributions. Not all backends support this. When unavailable, the field is `null` in the output. + +## Text similarity + +A 0-1 score from `difflib.SequenceMatcher` measuring character-level overlap. Used to classify severity tiers. + +| Score | Interpretation | +|-------|----------------| +| 1.0 | Identical output | +| 0.9+ | Very similar -- minor rewording | +| 0.7-0.9 | Moderately similar -- different phrasing, same general content | +| 0.5-0.7 | Partially similar -- some shared content | +| < 0.5 | Mostly different -- classified as "severe" | + +## Token divergence index + +The index of the first token where the baseline and test outputs diverge. A low index (e.g., 0-5) means the outputs diverge early and are likely completely different. A high index means the outputs share a common prefix before diverging. + +## Determinism score + +For determinism tests, the score is: + +``` +determinism_score = identical_count / num_runs +``` + +- **1.0** (100%) -- all runs produced identical output. The backend is deterministic. +- **< 1.0** -- some runs produced different output at temperature=0. This is a bug. 
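Both quantities are easy to derive from the per-run outputs. A minimal sketch, assuming each run's output is available as text and as a token list (not the tool's exact implementation):

```python
from collections import Counter

def determinism_score(runs: list[str]) -> float:
    """Fraction of runs matching the most common output (1.0 = fully deterministic)."""
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)

def first_divergence(tokens_a: list[str], tokens_b: list[str]) -> int | None:
    """Index of the first token where two runs differ, or None if identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    # One output may be a strict prefix of the other.
    return None if len(tokens_a) == len(tokens_b) else min(len(tokens_a), len(tokens_b))
```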
+ +The `divergence_positions` field lists the token indices where pairs of runs first diverged, helping locate where non-determinism creeps in. + +## Output consistency (stress tests) + +For stress tests, output consistency is: + +``` +output_consistency = identical_to_baseline / total_compared +``` + +Where the baseline is the output from concurrency=1. This measures whether increasing concurrency changes the outputs. + +- **100%** -- concurrent requests don't affect output. The backend is correct under load. +- **< 100%** -- some outputs changed under concurrency. Investigate KV cache correctness and batch-dependent computation. + +## Per-category stats + +The `compare` command breaks down results by prompt category (as defined in the prompt suite). Each category gets: + +| Stat | Description | +|------|-------------| +| `count` | Number of prompts in this category | +| `flip_rate` | Fraction of prompts with answer flips | +| `mean_similarity` | Average text similarity | + +This helps identify which task types are most affected by quantization or backend differences. Numerical tasks typically show the highest degradation. + +## Reading the summary tables + +### Sweep table + +``` +┃ quant_level ┃ identical ┃ minor ┃ moderate ┃ severe ┃ mean_similarity ┃ +``` + +- **Self-check row**: The baseline compared against itself. Should be 100% identical. If not, your baseline isn't deterministic and all other comparisons are unreliable. +- **Test rows**: Each quantization level compared against the baseline. More severe divergences = more correctness degradation. + +### Compare table + +``` +┃ metric ┃ value ┃ +``` + +Look at flip rate first -- it's the most direct measure of correctness. Then check severity tiers for the distribution of divergence. The flipped prompts detail table shows exactly which prompts broke and what the answers changed to. + +### Diff table + +``` +┃ test_backend ┃ failures ┃ failure_rate ┃ flip_rate ┃ mean_similarity ┃ +``` + +Failures indicate the backend returned errors. Flip rate and mean similarity show whether the serving layer changes outputs. Ideally, a diff test shows 0 failures, 0% flip rate, and 1.0 similarity. + +### Stress table + +``` +┃ concurrency ┃ errors ┃ output_consistency ┃ +``` + +Look for errors and consistency drops at higher concurrency levels. A sudden drop at a specific concurrency level often indicates a buffer overflow or cache corruption bug in the backend. 
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css new file mode 100644 index 0000000..72ce56a --- /dev/null +++ b/docs/stylesheets/extra.css @@ -0,0 +1,8 @@ +[data-md-color-scheme="slate"] { + --md-primary-fg-color: #0ea5e9; + --md-primary-bg-color: #0c1117; +} + +[data-md-color-scheme="default"] { + --md-primary-fg-color: #0284c7; +} diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..dedf69d --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,56 @@ +site_name: infer-check +site_url: https://nullpointerdepressivedisorder.github.io/infer-check +site_description: Cross-backend differential testing for LLM inference correctness +repo_url: https://github.com/NullPointerDepressiveDisorder/infer-check +repo_name: infer-check + +theme: + name: material + palette: + - scheme: slate + primary: custom + toggle: + icon: material/brightness-4 + name: Switch to light mode + - scheme: default + primary: custom + toggle: + icon: material/brightness-7 + name: Switch to dark mode + features: + - navigation.sections + - navigation.expand + - content.code.copy + - search.highlight + +extra_css: + - stylesheets/extra.css + +markdown_extensions: + - mkdocs-click + - admonition + - pymdownx.details + - pymdownx.superfences + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.tabbed: + alternate_style: true + - attr_list + - md_in_html + +plugins: + - search + +nav: + - Home: index.md + - Getting Started: getting-started.md + - Commands: + - sweep: commands/sweep.md + - compare: commands/compare.md + - diff: commands/diff.md + - determinism: commands/determinism.md + - stress: commands/stress.md + - report: commands/report.md + - Backends: backends.md + - Interpreting Results: interpreting-results.md diff --git a/pyproject.toml b/pyproject.toml index 1654c6a..b23ded4 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -63,6 +63,12 @@ dev = [ "pytest-cov>=7.0.0", "ruff>=0.15.5", ] +docs = [ + "mkdocs>=1.6.1", + "mkdocs-click>=0.9.0", + "mkdocs-material>=9.7.6", + "pymdown-extensions>=10.21.2", +] [tool.ruff] target-version = "py311"