
Commit 47a1ac2

Docs/init (#15)
* docs: add initial MkDocs configuration and custom theme styles
* docs: add comprehensive getting started guide with installation, usage, and prompt suite details
* docs: add project overview, problem statement, and example results to index.md
* docs: add detailed backend reference covering supported inference backends and usage
* docs: add interpreting results and report usage guides
* docs: add guides for compare and diff commands with usage, options, and output details
* docs: add guides for sweep, determinism, and stress commands with usage, options, and output details
* docs: rewrite README with streamlined usage, command overview, and updated examples
* docs: update command guides to use mkdocs-click CLI reference blocks
* docs: add documentation dependencies group to optional dependencies
* docs: update backend tables to rename llama.cpp to llama-cpp and align formatting
1 parent cca00fe commit 47a1ac2

14 files changed

Lines changed: 1075 additions & 109 deletions

README.md

Lines changed: 51 additions & 109 deletions
@@ -2,11 +2,15 @@

 [![PyPI - Version](https://img.shields.io/pypi/v/infer-check?logo=PyPi&color=%233775A9)](https://pypi.org/project/infer-check/)
 [![Run tests and upload coverage](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml)
+[![codecov](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check/graph/badge.svg?token=FWG0Z5YHUS)](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check)
+[![Docs](https://img.shields.io/badge/docs-mkdocs-blue)](https://nullpointerdepressivedisorder.github.io/infer-check)

 **Catches the correctness bugs that benchmarks miss in LLM inference engines.**

 Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct.

+> **[Read the full documentation](https://nullpointerdepressivedisorder.github.io/infer-check)**
+
 ## The problem

 Every LLM inference engine has correctness bugs that benchmarks don't catch:
@@ -18,43 +22,6 @@ Every LLM inference engine has correctness bugs that benchmarks don't catch:

 These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.

-## Example results
-
-Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.
-
-### Quantization sweep
-
-4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst:
-
-```
-Llama-3.1-8B: bf16 vs 4bit
-┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
-┃ prompt suite          ┃ identical ┃ severe   ┃ mean_similarity ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
-│ adversarial-numerics  │ 0/30      │ 23/30    │ 0.311           │
-│ reasoning             │ 1/50      │ 35/50    │ 0.384           │
-│ code                  │ 0/49      │ 30/49    │ 0.452           │
-└───────────────────────┴───────────┴──────────┴─────────────────┘
-```
-
-A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.
-
-### Dense vs. MoE comparison
-
-Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.
-
-### Cross-backend diff
-
-mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.
-
-### Determinism
-
-Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.
-
-### Stress test
-
-vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.
-
 ## Installation

 ```
@@ -64,27 +31,43 @@ pip install infer-check
 pip install "infer-check[mlx]"
 ```

-## Usage
+## Quick start

-### Quantization sweep
+Compare two quantizations head-to-head:
+
+```
+infer-check compare \
+  mlx-community/Llama-3.1-8B-Instruct-4bit \
+  mlx-community/Llama-3.1-8B-Instruct-8bit \
+  --prompts adversarial-numerics
+```

-Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo. Use `--max-tokens` to control generation length (defaults to 1024) and `--num-prompts` to limit the number of prompts used.
+Run a full quantization sweep:

 ```
 infer-check sweep \
   --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
 8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
 4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
-  --backend mlx-lm \
-  --prompts reasoning \
-  --max-tokens 512 \
-  --num-prompts 10 \
-  --output ./results/sweep/
+  --prompts reasoning
 ```

-`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`, `quant-sensitive`) or a path to any `.jsonl` file.
+## Commands
+
+| Command | Purpose | Docs |
+| --- | --- | --- |
+| `sweep` | Compare pre-quantized models against a baseline | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/sweep/) |
+| `compare` | Head-to-head comparison of two models or quantizations | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/compare/) |
+| `diff` | Compare outputs across different backends for the same model | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/diff/) |
+| `determinism` | Test output reproducibility at temperature=0 | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/determinism/) |
+| `stress` | Test correctness under concurrent load | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/stress/) |
+| `report` | Generate HTML/JSON reports from saved results | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/report/) |
+
+## Example results
+
+Results from running `infer-check` on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.

-The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.
+### Quantization sweep

 ```
 Sweep Summary
@@ -97,87 +80,46 @@ The baseline is automatically run twice as a self-check — if it's not 50/50 id
 └─────────────────────┴───────────┴───────┴──────────┴────────┴─────────────────┘
 ```

-### Cross-backend diff
-
-Same model, same quant, different inference paths. Catches serving-layer bugs.
+A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly.

-```
-# Start vllm-mlx in another terminal:
-# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
-
-infer-check diff \
-  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
-  --backends "mlx-lm,openai-compat" \
-  --base-urls ",http://127.0.0.1:8000" \
-  --prompts reasoning \
-  --output ./results/diff/
-```
-
-Uses `/v1/chat/completions` by default (`--chat`) so server-side chat templates match the local backend. Pass `--no-chat` for raw `/v1/completions`.
-
-### Determinism
-
-Same prompt N times at temperature=0. Output should be bit-identical every run.
-
-```
-infer-check determinism \
-  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
-  --backend mlx-lm \
-  --prompts determinism \
-  --runs 20 \
-  --output ./results/determinism/
-```
+### Cross-backend diff

-### Stress test
+mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected.

-Concurrent requests through a serving backend. Tests KV cache correctness under load.
+### Determinism & stress

-```
-infer-check stress \
-  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
-  --backend openai-compat \
-  --base-url http://127.0.0.1:8000 \
-  --prompts reasoning \
-  --concurrency 1,2,4,8 \
-  --output ./results/stress/
-```
+100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8.

-### Report
+## Supported backends

-Generate an HTML report from all saved results.
+| Backend | Type | Use case |
+|-------------------| --- | --- |
+| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
+| **llama-cpp** | HTTP | `llama-server` via `/completion` endpoint |
+| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
+| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

-```
-infer-check report ./results/ --format html
-```
+See the [backends documentation](https://nullpointerdepressivedisorder.github.io/infer-check/backends/) for setup and configuration details.

 ## Prompt suites

-Curated prompts targeting known quantization failure modes:
+Six curated suites ship with the package — no need to clone the repo:

 | Suite | Count | Purpose |
 | --- | --- | --- |
-| `reasoning.jsonl` | 50 | Multi-step math and logic |
-| `code.jsonl` | 49 | Python, JSON, SQL generation |
-| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision |
-| `long-context.jsonl` | 10 | Tables and transcripts with recall questions |
-| `quant-sensitive.jsonl` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
-| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing |
+| `reasoning` | 50 | Multi-step math and logic |
+| `code` | 49 | Python, JSON, SQL generation |
+| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision |
+| `long-context` | 10 | Tables and transcripts with recall questions |
+| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
+| `determinism` | 50 | High-entropy continuations for determinism testing |

-All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line (default `max_tokens` is 1024):
+Custom suites are JSONL files with one object per line:

 ```json
 {"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
 ```

-## Supported backends
-
-| Backend | Type | Use case |
-| --- | --- | --- |
-| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
-| **llama.cpp** | HTTP | `llama-server` via `/completion` endpoint |
-| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
-| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |
-
 ## Roadmap

 - [ ] GGUF backend (direct llama.cpp integration without HTTP)

docs/backends.md

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
# Backends

`infer-check` supports four inference backends. Each backend implements a common protocol for generation, health checks, and cleanup.

## Overview

| Backend | Type | Default URL | Use case |
|-------------------|------|-------------|----------|
| **mlx-lm** | In-process | (local) | Local Apple Silicon inference with logprobs |
| **llama-cpp** | HTTP | `http://127.0.0.1:8080` | llama-server via `/completion` endpoint |
| **vllm-mlx** | HTTP | `http://127.0.0.1:8000` | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | `http://127.0.0.1:11434/v1` | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

## mlx-lm

In-process inference using Apple's MLX framework. Runs directly on Apple Silicon with no server required.

**Install**: `pip install "infer-check[mlx]"` (requires `mlx` and `mlx-lm` packages)

**Features**:

- Generates per-token logprobs when available via `generate_step()`
- Falls back to simple generation if logprobs aren't supported
- Lazy model loading -- the model is downloaded and loaded on first use, not at import time
- Single-threaded sequential inference

**When to use**: Local testing on Mac. Best baseline for quantization sweeps since it runs natively with no serving layer overhead.

**Example**:

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit
```

## llama.cpp

HTTP backend targeting [llama-server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) (the built-in HTTP server from llama.cpp).

**Setup**: Start llama-server separately:

```bash
llama-server -m /path/to/model.gguf --port 8080
```

**Features**:

- Uses the `/completion` endpoint for text generation
- Requests top-10 token probabilities and converts them to logprobs
- Aligns token distributions by ID metadata for cross-backend comparison
- 120-second request timeout

**When to use**: Testing GGUF models served via llama.cpp. Good for comparing GGUF quantization formats against each other or against MLX native quantization.

**Example**:

```bash
infer-check determinism \
  --model my-model \
  --backend llama-cpp \
  --base-url http://127.0.0.1:8080 \
  --prompts determinism \
  --runs 20
```

## vllm-mlx

HTTP backend for [vllm-mlx](https://github.com/vllm-project/vllm-mlx), a continuous-batching inference server for Apple Silicon. Extends the OpenAI-compatible backend with model-aware health checks.

**Setup**: Start vllm-mlx separately:

```bash
vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
```

**Features**:

- Inherits all capabilities from the openai-compat backend
- Model-aware health check verifies the expected model is loaded
- Supports both `/v1/chat/completions` and `/v1/completions` endpoints

**When to use**: Testing continuous-batching serving layer correctness. Ideal for `diff` and `stress` commands to verify the serving layer doesn't introduce divergence.

**Example**:

```bash
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning
```
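
The same server can also be exercised under load. A minimal sketch of a stress run against it, assuming the `vllm-mlx` backend name is accepted by `--backend` the same way it is by `--backends` above:

```bash
# Hammer the running vllm-mlx server at increasing concurrency levels.
infer-check stress \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend vllm-mlx \
  --base-url http://127.0.0.1:8000 \
  --prompts reasoning \
  --concurrency 1,2,4,8
```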

## openai-compat

Generic backend for any server that implements the OpenAI API format. Works with vLLM, SGLang, Ollama, and others.

**Features**:

- Supports both `/v1/chat/completions` and `/v1/completions` endpoints
- Requests logprobs with graceful fallback if unsupported
- 120-second request timeout
- Detailed error messages for connection, timeout, and HTTP errors

**When to use**: Any OpenAI-compatible server. This is the most flexible backend and the default for Ollama-style model tags.

**Default URLs by resolution**:

| Model source | Default URL |
|-------------|-------------|
| Ollama tags (e.g., `llama3.1:8b`) | `http://127.0.0.1:11434/v1` |
| Custom server | Use `--base-url` |

**Example with Ollama**:

```bash
infer-check compare \
  ollama:llama3.1:8b-instruct-q4_K_M \
  ollama:llama3.1:8b-instruct-q8_0
```

**Example with custom server**:

```bash
infer-check stress \
  --model my-model \
  --backend openai-compat \
  --base-url http://my-server:8000/v1 \
  --prompts reasoning \
  --concurrency 1,2,4,8
```

## Chat vs completions

HTTP backends support two endpoint modes:

- **Chat** (`--chat`, default) -- uses `/v1/chat/completions`. The server applies its chat template. Use this when the server is configured with the correct chat template for your model.
- **Completions** (`--no-chat`) -- uses `/v1/completions`. Sends raw text with no template. Use this for raw text generation or when you want to control the prompt format yourself.

The `--chat` / `--no-chat` flag applies to the `diff` command. The `compare` command always uses completions mode to avoid template differences between backends.
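
For example, a minimal `diff` sketch that bypasses the server-side chat template and sends raw text to `/v1/completions` (same flags as the vllm-mlx example above, plus `--no-chat`):

```bash
# Compare raw completions instead of templated chat completions.
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --no-chat
```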

## Backend selection

Backends are selected in different ways depending on the command:

| Command | How backend is chosen |
|---------|----------------------|
| `compare` | Auto-detected from each model spec |
| `sweep` | `--backend` flag (shared across all models) or auto-detected |
| `diff` | `--backends` flag (explicit list) |
| `stress` | `--backend` flag or auto-detected from model |
| `determinism` | `--backend` flag or auto-detected from model |
| `report` | N/A (operates on saved results) |
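
When auto-detection isn't what you want, the backend can be pinned explicitly. A minimal sketch that forces `mlx-lm` for a sweep, reusing the `--models` spec syntax from the README:

```bash
# Run the whole sweep through the in-process mlx-lm backend.
infer-check sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --backend mlx-lm \
  --prompts reasoning
```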
