* docs: add initial MkDocs configuration and custom theme styles
* docs: add comprehensive getting started guide with installation, usage, and prompt suite details
* docs: add project overview, problem statement, and example results to index.md
* docs: add detailed backend reference covering supported inference backends and usage
* docs: add interpreting results and report usage guides
* docs: add guides for compare and diff commands with usage, options, and output details
* docs: add guides for sweep, determinism, and stress commands with usage, options, and output details
* docs: rewrite README with streamlined usage, command overview, and updated examples
* docs: update command guides to use mkdocs-click CLI reference blocks
* docs: add documentation dependencies group to optional dependencies
* docs: update backend tables to rename llama.cpp to llama-cpp and align formatting
[](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml)
**Catches the correctness bugs that benchmarks miss in LLM inference engines.**
Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct.
> **[Read the full documentation](https://nullpointerdepressivedisorder.github.io/infer-check)**
## The problem
Every LLM inference engine has correctness bugs that benchmarks don't catch.
These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
## Example results
Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.
A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.
### Dense vs. MoE comparison
Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.
### Cross-backend diff
mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.
### Determinism
Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.
### Stress test
vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.
## Installation
```
pip install infer-check
```

With the MLX backend for Apple Silicon:

```
pip install "infer-check[mlx]"
```
## Quick start
Compare two quantizations head-to-head:
```
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```
Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo. Use `--max-tokens` to control generation length (defaults to 1024) and `--num-prompts` to limit the number of prompts used.
`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`, `quant-sensitive`) or a path to any `.jsonl` file.
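For example, a compare run capped at shorter generations against a custom prompt file might look like this (the `./my-prompts.jsonl` path is a hypothetical placeholder; the flags are the ones described above):

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompts ./my-prompts.jsonl \
  --max-tokens 256 \
  --num-prompts 10
```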
## Commands
| Command | Purpose | Docs |
| --- | --- | --- |
| `sweep` | Compare pre-quantized models against a baseline | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/sweep/) |
| `compare` | Head-to-head comparison of two models or quantizations | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/compare/) |
| `diff` | Compare outputs across different backends for the same model | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/diff/) |
| `determinism` | Test output reproducibility at temperature=0 | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/determinism/) |
| `stress` | Test correctness under concurrent load | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/stress/) |
| `report` | Generate HTML/JSON reports from saved results | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/report/) |
## Example results
Results from running `infer-check` on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.
The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.
### Quantization sweep
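The summary below came from sweeping MLX quantizations against the bf16 baseline. A rough sketch of the invocation follows; the positional form (baseline first, then the quantized variants) is an assumption modeled on the `compare` syntax above, so check the sweep docs for the exact arguments:

```bash
# Sketch only: argument order (baseline, then variants) is assumed.
infer-check sweep \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --backend mlx-lm \
  --prompts reasoning
```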
```
Sweep Summary
```
Cross-backend diffs cover the complementary case: same model, same quant, different inference paths. That catches serving-layer bugs.
## Supported backends

| Backend | Mode | Default URL | Description |
| --- | --- | --- | --- |
| **mlx-lm** | In-process | (local) | Local Apple Silicon inference with logprobs |
| **llama-cpp** | HTTP | `http://127.0.0.1:8080` | llama-server via `/completion` endpoint |
| **vllm-mlx** | HTTP | `http://127.0.0.1:8000` | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | `http://127.0.0.1:11434/v1` | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |
## mlx-lm
In-process inference using Apple's MLX framework. Runs directly on Apple Silicon with no server required.
**Install**: `pip install "infer-check[mlx]"` (requires `mlx` and `mlx-lm` packages)
**Features**:
- Generates per-token logprobs when available via `generate_step()`
- Falls back to simple generation if logprobs aren't supported
- Lazy model loading -- the model is downloaded and loaded on first use, not at import time
- Single-threaded sequential inference
**When to use**: Local testing on Mac. Best baseline for quantization sweeps since it runs natively with no serving layer overhead.
**Example**:
```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit
```
## llama.cpp
HTTP backend targeting [llama-server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) (the built-in HTTP server from llama.cpp).
**Setup**: Start llama-server separately:
```bash
llama-server -m /path/to/model.gguf --port 8080
```
**Features**:
- Uses the `/completion` endpoint for text generation
- Requests top-10 token probabilities and converts them to logprobs
- Aligns token distributions by ID metadata for cross-backend comparison
- 120-second request timeout
**When to use**: Testing GGUF models served via llama.cpp. Good for comparing GGUF quantization formats against each other or against MLX native quantization.
**Example**:
```bash
infer-check determinism \
  --model my-model \
  --backend llama-cpp \
  --base-url http://127.0.0.1:8080 \
  --prompts determinism \
  --runs 20
```
## vllm-mlx
HTTP backend for [vllm-mlx](https://github.com/vllm-project/vllm-mlx), a continuous-batching inference server for Apple Silicon. Extends the OpenAI-compatible backend with model-aware health checks.
**Features**:

- Inherits all capabilities from the openai-compat backend
- Model-aware health check verifies the expected model is loaded
- Supports both `/v1/chat/completions` and `/v1/completions` endpoints
**When to use**: Testing continuous-batching serving layer correctness. Ideal for `diff` and `stress` commands to verify the serving layer doesn't introduce divergence.
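As a sketch, a stress run against a local vllm-mlx server might look like the following. The flag names mirror the determinism example above and the default port from the table; treat the exact options as assumptions (anything concurrency-related is omitted here) and check the stress docs:

```bash
# Sketch: flags mirror the determinism example; concurrency options omitted.
infer-check stress \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --backend vllm-mlx \
  --base-url http://127.0.0.1:8000 \
  --prompts reasoning
```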
**Request modes**:

- **Chat** (`--chat`, default) -- uses `/v1/chat/completions`. The server applies its chat template. Use this when the server is configured with the correct chat template for your model.
- **Completions** (`--no-chat`) -- uses `/v1/completions`. Sends raw text with no template. Use this for raw text generation or when you want to control the prompt format yourself.
The `--chat` / `--no-chat` flag applies to the `diff` command. The `compare` command always uses completions mode to avoid template differences between backends.
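For instance, a cross-backend diff in completions mode could be sketched as below; how `diff` accepts the model and the `--backends` list (space-separated here) is an assumption, so confirm the exact syntax in the diff docs:

```bash
# Sketch: the --model flag and the --backends value syntax are assumed.
infer-check diff \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --backends mlx-lm vllm-mlx \
  --no-chat \
  --prompts reasoning
```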
## Backend selection
Backends are selected in different ways depending on the command:
| Command | How backend is chosen |
| --- | --- |
| `compare` | Auto-detected from each model spec |
| `sweep` | `--backend` flag (shared across all models) or auto-detected |
| `diff` | `--backends` flag (explicit list) |
| `stress` | `--backend` flag or auto-detected from model |
| `determinism` | `--backend` flag or auto-detected from model |
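For example, passing `--backend` pins the backend explicitly, while omitting it falls back to auto-detection from the model spec (a sketch using flags shown earlier in this reference):

```bash
# Explicit override; drop --backend to let it be auto-detected from the model spec.
infer-check determinism \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --prompts determinism \
  --runs 20
```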