
Commit 47a1ac2

Docs/init (#15)
* docs: add initial MkDocs configuration and custom theme styles
* docs: add comprehensive getting started guide with installation, usage, and prompt suite details
* docs: add project overview, problem statement, and example results to index.md
* docs: add detailed backend reference covering supported inference backends and usage
* docs: add interpreting results and report usage guides
* docs: add guides for compare and diff commands with usage, options, and output details
* docs: add guides for sweep, determinism, and stress commands with usage, options, and output details
* docs: rewrite README with streamlined usage, command overview, and updated examples
* docs: update command guides to use mkdocs-click CLI reference blocks
* docs: add documentation dependencies group to optional dependencies
* docs: update backend tables to rename llama.cpp to llama-cpp and align formatting
1 parent cca00fe commit 47a1ac2

14 files changed

Lines changed: 1075 additions & 109 deletions

README.md

Lines changed: 51 additions & 109 deletions
@@ -2,11 +2,15 @@

 [![PyPI - Version](https://img.shields.io/pypi/v/infer-check?logo=PyPi&color=%233775A9)](https://pypi.org/project/infer-check/)
 [![Run tests and upload coverage](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml)
+[![codecov](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check/graph/badge.svg?token=FWG0Z5YHUS)](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check)
+[![Docs](https://img.shields.io/badge/docs-mkdocs-blue)](https://nullpointerdepressivedisorder.github.io/infer-check)

 **Catches the correctness bugs that benchmarks miss in LLM inference engines.**

 Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct.

+> **[Read the full documentation](https://nullpointerdepressivedisorder.github.io/infer-check)**
+
 ## The problem

 Every LLM inference engine has correctness bugs that benchmarks don't catch:
@@ -18,43 +22,6 @@ Every LLM inference engine has correctness bugs that benchmarks don't catch:

 These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.

-## Example results
-
-Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.
-
-### Quantization sweep
-
-4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst:
-
-```
-Llama-3.1-8B: bf16 vs 4bit
-┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
-┃ prompt suite          ┃ identical ┃ severe   ┃ mean_similarity ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
-│ adversarial-numerics  │ 0/30      │ 23/30    │ 0.311           │
-│ reasoning             │ 1/50      │ 35/50    │ 0.384           │
-│ code                  │ 0/49      │ 30/49    │ 0.452           │
-└───────────────────────┴───────────┴──────────┴─────────────────┘
-```
-
-A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.
-
-### Dense vs. MoE comparison
-
-Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.
-
-### Cross-backend diff
-
-mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.
-
-### Determinism
-
-Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.
-
-### Stress test
-
-vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.
-
 ## Installation

 ```
@@ -64,27 +31,43 @@ pip install infer-check
 pip install "infer-check[mlx]"
 ```

-## Usage
+## Quick start

-### Quantization sweep
+Compare two quantizations head-to-head:
+
+```
+infer-check compare \
+  mlx-community/Llama-3.1-8B-Instruct-4bit \
+  mlx-community/Llama-3.1-8B-Instruct-8bit \
+  --prompts adversarial-numerics
+```

-Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo. Use `--max-tokens` to control generation length (defaults to 1024) and `--num-prompts` to limit the number of prompts used.
+Run a full quantization sweep:

 ```
 infer-check sweep \
   --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
 8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
 4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
-  --backend mlx-lm \
-  --prompts reasoning \
-  --max-tokens 512 \
-  --num-prompts 10 \
-  --output ./results/sweep/
+  --prompts reasoning
 ```

-`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`, `quant-sensitive`) or a path to any `.jsonl` file.
+## Commands
+
+| Command | Purpose | Docs |
+| --- | --- | --- |
+| `sweep` | Compare pre-quantized models against a baseline | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/sweep/) |
+| `compare` | Head-to-head comparison of two models or quantizations | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/compare/) |
+| `diff` | Compare outputs across different backends for the same model | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/diff/) |
+| `determinism` | Test output reproducibility at temperature=0 | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/determinism/) |
+| `stress` | Test correctness under concurrent load | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/stress/) |
+| `report` | Generate HTML/JSON reports from saved results | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/report/) |
+
+## Example results
+
+Results from running `infer-check` on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.

-The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.
+### Quantization sweep

 ```
 Sweep Summary
@@ -97,87 +80,46 @@ The baseline is automatically run twice as a self-check — if it's not 50/50 id
 └─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘
 ```

-### Cross-backend diff
-
-Same model, same quant, different inference paths. Catches serving-layer bugs.
+A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly.

-```
-# Start vllm-mlx in another terminal:
-# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
-
-infer-check diff \
-  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
-  --backends "mlx-lm,openai-compat" \
-  --base-urls ",http://127.0.0.1:8000" \
-  --prompts reasoning \
-  --output ./results/diff/
-```
-
-Uses `/v1/chat/completions` by default (`--chat`) so server-side chat templates match the local backend. Pass `--no-chat` for raw `/v1/completions`.
-
-### Determinism
-
-Same prompt N times at temperature=0. Output should be bit-identical every run.
-
-```
-infer-check determinism \
-  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
-  --backend mlx-lm \
-  --prompts determinism \
-  --runs 20 \
-  --output ./results/determinism/
-```
+### Cross-backend diff

-### Stress test
+mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected.

-Concurrent requests through a serving backend. Tests KV cache correctness under load.
+### Determinism & stress

-```
-infer-check stress \
-  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
-  --backend openai-compat \
-  --base-url http://127.0.0.1:8000 \
-  --prompts reasoning \
-  --concurrency 1,2,4,8 \
-  --output ./results/stress/
-```
+100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8.

-### Report
+## Supported backends

-Generate an HTML report from all saved results.
+| Backend | Type | Use case |
+|-------------------| --- | --- |
+| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
+| **llama-cpp** | HTTP | `llama-server` via `/completion` endpoint |
+| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
+| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

-```
-infer-check report ./results/ --format html
-```
+See the [backends documentation](https://nullpointerdepressivedisorder.github.io/infer-check/backends/) for setup and configuration details.

 ## Prompt suites

-Curated prompts targeting known quantization failure modes:
+Six curated suites ship with the package — no need to clone the repo:

 | Suite | Count | Purpose |
 | --- | --- | --- |
-| `reasoning.jsonl` | 50 | Multi-step math and logic |
-| `code.jsonl` | 49 | Python, JSON, SQL generation |
-| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision |
-| `long-context.jsonl` | 10 | Tables and transcripts with recall questions |
-| `quant-sensitive.jsonl` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
-| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing |
+| `reasoning` | 50 | Multi-step math and logic |
+| `code` | 49 | Python, JSON, SQL generation |
+| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision |
+| `long-context` | 10 | Tables and transcripts with recall questions |
+| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
+| `determinism` | 50 | High-entropy continuations for determinism testing |

-All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line (default `max_tokens` is 1024):
+Custom suites are JSONL files with one object per line:

 ```json
 {"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
 ```

-## Supported backends
-
-| Backend | Type | Use case |
-| --- | --- | --- |
-| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
-| **llama.cpp** | HTTP | `llama-server` via `/completion` endpoint |
-| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
-| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |
-
 ## Roadmap

 - [ ] GGUF backend (direct llama.cpp integration without HTTP)

docs/backends.md

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
# Backends

`infer-check` supports four inference backends. Each backend implements a common protocol for generation, health checks, and cleanup.

## Overview

| Backend | Type | Default URL | Use case |
|-------------------|------|-------------|----------|
| **mlx-lm** | In-process | (local) | Local Apple Silicon inference with logprobs |
| **llama-cpp** | HTTP | `http://127.0.0.1:8080` | llama-server via `/completion` endpoint |
| **vllm-mlx** | HTTP | `http://127.0.0.1:8000` | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | `http://127.0.0.1:11434/v1` | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

## mlx-lm

In-process inference using Apple's MLX framework. Runs directly on Apple Silicon with no server required.

**Install**: `pip install "infer-check[mlx]"` (requires `mlx` and `mlx-lm` packages)

**Features**:

- Generates per-token logprobs when available via `generate_step()`
- Falls back to simple generation if logprobs aren't supported
- Lazy model loading -- the model is downloaded and loaded on first use, not at import time
- Single-threaded sequential inference

**When to use**: Local testing on Mac. Best baseline for quantization sweeps since it runs natively with no serving layer overhead.

**Example**:

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit
```

## llama.cpp

HTTP backend targeting [llama-server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) (the built-in HTTP server from llama.cpp).

**Setup**: Start llama-server separately:

```bash
llama-server -m /path/to/model.gguf --port 8080
```

**Features**:

- Uses the `/completion` endpoint for text generation
- Requests top-10 token probabilities and converts them to logprobs
- Aligns token distributions by ID metadata for cross-backend comparison
- 120-second request timeout

**When to use**: Testing GGUF models served via llama.cpp. Good for comparing GGUF quantization formats against each other or against MLX native quantization.

**Example**:

```bash
infer-check determinism \
  --model my-model \
  --backend llama-cpp \
  --base-url http://127.0.0.1:8080 \
  --prompts determinism \
  --runs 20
```

## vllm-mlx

HTTP backend for [vllm-mlx](https://github.com/vllm-project/vllm-mlx), a continuous-batching inference server for Apple Silicon. Extends the OpenAI-compatible backend with model-aware health checks.

**Setup**: Start vllm-mlx separately:

```bash
vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
```

**Features**:

- Inherits all capabilities from the openai-compat backend
- Model-aware health check verifies the expected model is loaded
- Supports both `/v1/chat/completions` and `/v1/completions` endpoints

**When to use**: Testing continuous-batching serving layer correctness. Ideal for `diff` and `stress` commands to verify the serving layer doesn't introduce divergence.

**Example**:

```bash
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning
```
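
The same server can also be exercised under load. A minimal sketch of a stress run against it, assuming the `vllm-mlx` backend name is accepted by `--backend` the same way it is by `--backends` above:

```bash
# Hammer the running vllm-mlx server at increasing concurrency levels.
infer-check stress \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend vllm-mlx \
  --base-url http://127.0.0.1:8000 \
  --prompts reasoning \
  --concurrency 1,2,4,8
```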

## openai-compat

Generic backend for any server that implements the OpenAI API format. Works with vLLM, SGLang, Ollama, and others.

**Features**:

- Supports both `/v1/chat/completions` and `/v1/completions` endpoints
- Requests logprobs with graceful fallback if unsupported
- 120-second request timeout
- Detailed error messages for connection, timeout, and HTTP errors

**When to use**: Any OpenAI-compatible server. This is the most flexible backend and the default for Ollama-style model tags.

**Default URLs by resolution**:

| Model source | Default URL |
|-------------|-------------|
| Ollama tags (e.g., `llama3.1:8b`) | `http://127.0.0.1:11434/v1` |
| Custom server | Use `--base-url` |

**Example with Ollama**:

```bash
infer-check compare \
  ollama:llama3.1:8b-instruct-q4_K_M \
  ollama:llama3.1:8b-instruct-q8_0
```

**Example with custom server**:

```bash
infer-check stress \
  --model my-model \
  --backend openai-compat \
  --base-url http://my-server:8000/v1 \
  --prompts reasoning \
  --concurrency 1,2,4,8
```

## Chat vs completions

HTTP backends support two endpoint modes:

- **Chat** (`--chat`, default) -- uses `/v1/chat/completions`. The server applies its chat template. Use this when the server is configured with the correct chat template for your model.
- **Completions** (`--no-chat`) -- uses `/v1/completions`. Sends raw text with no template. Use this for raw text generation or when you want to control the prompt format yourself.

The `--chat` / `--no-chat` flag applies to the `diff` command. The `compare` command always uses completions mode to avoid template differences between backends.
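
For example, a minimal `diff` sketch that bypasses the server-side chat template and sends raw text to `/v1/completions` (same flags as the vllm-mlx example above, plus `--no-chat`):

```bash
# Compare raw completions instead of templated chat completions.
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --no-chat
```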

## Backend selection

Backends are selected in different ways depending on the command:

| Command | How backend is chosen |
|---------|----------------------|
| `compare` | Auto-detected from each model spec |
| `sweep` | `--backend` flag (shared across all models) or auto-detected |
| `diff` | `--backends` flag (explicit list) |
| `stress` | `--backend` flag or auto-detected from model |
| `determinism` | `--backend` flag or auto-detected from model |
| `report` | N/A (operates on saved results) |
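
When auto-detection isn't what you want, the backend can be pinned explicitly. A minimal sketch that forces `mlx-lm` for a sweep, reusing the `--models` spec syntax from the README:

```bash
# Run the whole sweep through the in-process mlx-lm backend.
infer-check sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --backend mlx-lm \
  --prompts reasoning
```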
