160 changes: 51 additions & 109 deletions README.md
@@ -2,11 +2,15 @@

[![PyPI - Version](https://img.shields.io/pypi/v/infer-check?logo=PyPi&color=%233775A9)](https://pypi.org/project/infer-check/)
[![Run tests and upload coverage](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml)
[![codecov](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check/graph/badge.svg?token=FWG0Z5YHUS)](https://codecov.io/gh/NullPointerDepressiveDisorder/infer-check)
[![Docs](https://img.shields.io/badge/docs-mkdocs-blue)](https://nullpointerdepressivedisorder.github.io/infer-check)

**Catches the correctness bugs that benchmarks miss in LLM inference engines.**

Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct.

> **[Read the full documentation](https://nullpointerdepressivedisorder.github.io/infer-check)**

## The problem

Every LLM inference engine has correctness bugs that benchmarks don't catch:
@@ -18,43 +22,6 @@ Every LLM inference engine has correctness bugs that benchmarks don't catch:

These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.

## Example results

Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.

### Quantization sweep

4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst:

```
Llama-3.1-8B: bf16 vs 4bit
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ prompt suite ┃ identical ┃ severe ┃ mean_similarity ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ adversarial-numerics │ 0/30 │ 23/30 │ 0.311 │
│ reasoning │ 1/50 │ 35/50 │ 0.384 │
│ code │ 0/49 │ 30/49 │ 0.452 │
└───────────────────────┴───────────┴──────────┴─────────────────┘
```

A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.

### Dense vs. MoE comparison

Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.

### Cross-backend diff

mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.

### Determinism

Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.

### Stress test

vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.

## Installation

@@ -64,27 +31,43 @@
```
pip install infer-check
pip install "infer-check[mlx]"
```

## Usage
## Quick start

### Quantization sweep
Compare two quantizations head-to-head:

```
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts adversarial-numerics
```

Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo. Use `--max-tokens` to control generation length (defaults to 1024) and `--num-prompts` to limit the number of prompts used.
Run a full quantization sweep:

```
infer-check sweep \
  --models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
4bit=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit" \
  --backend mlx-lm \
  --prompts reasoning \
  --max-tokens 512 \
  --num-prompts 10 \
  --output ./results/sweep/
  --prompts reasoning
```

`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`, `quant-sensitive`) or a path to any `.jsonl` file.
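
For example, pointing `--prompts` at a custom file (the path below is illustrative):

```
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts ./my-prompts.jsonl
```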
## Commands

| Command | Purpose | Docs |
| --- | --- | --- |
| `sweep` | Compare pre-quantized models against a baseline | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/sweep/) |
| `compare` | Head-to-head comparison of two models or quantizations | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/compare/) |
| `diff` | Compare outputs across different backends for the same model | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/diff/) |
| `determinism` | Test output reproducibility at temperature=0 | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/determinism/) |
| `stress` | Test correctness under concurrent load | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/stress/) |
| `report` | Generate HTML/JSON reports from saved results | [docs](https://nullpointerdepressivedisorder.github.io/infer-check/commands/report/) |

## Example results

Results from running `infer-check` on Llama-3.1-8B-Instruct on Apple Silicon using mlx-lm.

The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.
### Quantization sweep

@@ -97,87 +80,46 @@
```
Sweep Summary
(table rows collapsed in the diff view)
```

### Cross-backend diff

Same model, same quant, different inference paths. Catches serving-layer bugs.
A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly.

```
# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000

infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,openai-compat" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --output ./results/diff/
```

Uses `/v1/chat/completions` by default (`--chat`) so server-side chat templates match the local backend. Pass `--no-chat` for raw `/v1/completions`.
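
For example, the same diff in raw completions mode (the output path is illustrative):

```
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,openai-compat" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --no-chat \
  --output ./results/diff-raw/
```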

### Determinism

Same prompt N times at temperature=0. Output should be bit-identical every run.

```
infer-check determinism \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend mlx-lm \
  --prompts determinism \
  --runs 20 \
  --output ./results/determinism/
```
### Cross-backend diff

### Stress test
mlx-lm vs vllm-mlx at temperature=0: 50/50 identical (reasoning) and 30/30 identical (numerics). Zero serving-layer divergence detected.

Concurrent requests through a serving backend. Tests KV cache correctness under load.
### Determinism & stress

```
infer-check stress \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backend openai-compat \
  --base-url http://127.0.0.1:8000 \
  --prompts reasoning \
  --concurrency 1,2,4,8 \
  --output ./results/stress/
```
100% determinism across 20 runs per prompt at temperature=0. 100% output consistency at concurrency levels 1/2/4/8.

### Report
## Supported backends

Generate an HTML report from all saved results.
| Backend | Type | Use case |
|-------------------| --- | --- |
| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
| **llama-cpp** | HTTP | `llama-server` via `/completion` endpoint |
| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

```
infer-check report ./results/ --format html
```
See the [backends documentation](https://nullpointerdepressivedisorder.github.io/infer-check/backends/) for setup and configuration details.

## Prompt suites

Curated prompts targeting known quantization failure modes:
Six curated suites ship with the package — no need to clone the repo:

| Suite | Count | Purpose |
| --- | --- | --- |
| `reasoning.jsonl` | 50 | Multi-step math and logic |
| `code.jsonl` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context.jsonl` | 10 | Tables and transcripts with recall questions |
| `quant-sensitive.jsonl` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing |
| `reasoning` | 50 | Multi-step math and logic |
| `code` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context` | 10 | Tables and transcripts with recall questions |
| `quant-sensitive` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| `determinism` | 50 | High-entropy continuations for determinism testing |

All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line (default `max_tokens` is 1024):
Custom suites are JSONL files with one object per line:

```json
{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
```
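
A minimal sketch of authoring and running a custom suite (the filename, IDs, and prompt text are illustrative):

```
cat > my-suite.jsonl <<'EOF'
{"id": "custom-001", "text": "What is 1234 * 5678?", "category": "math", "max_tokens": 128}
{"id": "custom-002", "text": "Emit a JSON object with keys a, b, c.", "category": "code", "max_tokens": 128}
EOF

infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-4bit \
  mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompts ./my-suite.jsonl
```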

## Supported backends

| Backend | Type | Use case |
| --- | --- | --- |
| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
| **llama.cpp** | HTTP | `llama-server` via `/completion` endpoint |
| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

## Roadmap

- [ ] GGUF backend (direct llama.cpp integration without HTTP)
154 changes: 154 additions & 0 deletions docs/backends.md
@@ -0,0 +1,154 @@
# Backends

`infer-check` supports four inference backends. Each backend implements a common protocol for generation, health checks, and cleanup.

## Overview

| Backend | Type | Default URL | Use case |
|-------------------|------|-------------|----------|
| **mlx-lm** | In-process | (local) | Local Apple Silicon inference with logprobs |
| **llama-cpp** | HTTP | `http://127.0.0.1:8080` | llama-server via `/completion` endpoint |
| **vllm-mlx** | HTTP | `http://127.0.0.1:8000` | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | `http://127.0.0.1:11434/v1` | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

## mlx-lm

In-process inference using Apple's MLX framework. Runs directly on Apple Silicon with no server required.

**Install**: `pip install "infer-check[mlx]"` (requires `mlx` and `mlx-lm` packages)

**Features**:

- Generates per-token logprobs when available via `generate_step()`
- Falls back to simple generation if logprobs aren't supported
- Lazy model loading -- the model is downloaded and loaded on first use, not at import time
- Single-threaded sequential inference

**When to use**: Local testing on Mac. Best baseline for quantization sweeps since it runs natively with no serving layer overhead.

**Example**:

```bash
infer-check compare \
  mlx-community/Llama-3.1-8B-Instruct-bf16 \
  mlx-community/Llama-3.1-8B-Instruct-4bit
```

## llama.cpp

HTTP backend targeting [llama-server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) (the built-in HTTP server from llama.cpp).

**Setup**: Start llama-server separately:

```bash
llama-server -m /path/to/model.gguf --port 8080
```

**Features**:

- Uses the `/completion` endpoint for text generation
- Requests top-10 token probabilities and converts them to logprobs
- Aligns token distributions by ID metadata for cross-backend comparison
- 120-second request timeout

**When to use**: Testing GGUF models served via llama.cpp. Good for comparing GGUF quantization formats against each other or against MLX native quantization.

**Example**:

```bash
infer-check determinism \
  --model my-model \
  --backend llama-cpp \
  --base-url http://127.0.0.1:8080 \
  --prompts determinism \
  --runs 20
```

## vllm-mlx

HTTP backend for [vllm-mlx](https://github.com/vllm-project/vllm-mlx), a continuous-batching inference server for Apple Silicon. Extends the OpenAI-compatible backend with model-aware health checks.

**Setup**: Start vllm-mlx separately:

```bash
vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000
```

**Features**:

- Inherits all capabilities from the openai-compat backend
- Model-aware health check verifies the expected model is loaded
- Supports both `/v1/chat/completions` and `/v1/completions` endpoints

**When to use**: Testing continuous-batching serving layer correctness. Ideal for `diff` and `stress` commands to verify the serving layer doesn't introduce divergence.

**Example**:

```bash
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning
```

## openai-compat

Generic backend for any server that implements the OpenAI API format. Works with vLLM, SGLang, Ollama, and others.

**Features**:

- Supports both `/v1/chat/completions` and `/v1/completions` endpoints
- Requests logprobs with graceful fallback if unsupported
- 120-second request timeout
- Detailed error messages for connection, timeout, and HTTP errors

**When to use**: Any OpenAI-compatible server. This is the most flexible backend and the default for Ollama-style model tags.

**Default URLs by resolution**:

| Model source | Default URL |
|-------------|-------------|
| Ollama tags (e.g., `llama3.1:8b`) | `http://127.0.0.1:11434/v1` |
| Custom server | Use `--base-url` |

**Example with Ollama**:

```bash
infer-check compare \
  ollama:llama3.1:8b-instruct-q4_K_M \
  ollama:llama3.1:8b-instruct-q8_0
```

**Example with custom server**:

```bash
infer-check stress \
  --model my-model \
  --backend openai-compat \
  --base-url http://my-server:8000/v1 \
  --prompts reasoning \
  --concurrency 1,2,4,8
```

## Chat vs completions

HTTP backends support two endpoint modes:

- **Chat** (`--chat`, default) -- uses `/v1/chat/completions`. The server applies its chat template. Use this when the server is configured with the correct chat template for your model.
- **Completions** (`--no-chat`) -- uses `/v1/completions`. Sends raw text with no template. Use this for raw text generation or when you want to control the prompt format yourself.

The `--chat` / `--no-chat` flag applies to the `diff` command. The `compare` command always uses completions mode to avoid template differences between backends.
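
For instance, rerunning the cross-backend diff from the vllm-mlx section in completions mode only requires adding the flag:

```bash
infer-check diff \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --backends "mlx-lm,vllm-mlx" \
  --base-urls ",http://127.0.0.1:8000" \
  --prompts reasoning \
  --no-chat
```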

## Backend selection

Backends are selected in different ways depending on the command:

| Command | How backend is chosen |
|---------|----------------------|
| `compare` | Auto-detected from each model spec |
| `sweep` | `--backend` flag (shared across all models) or auto-detected |
| `diff` | `--backends` flag (explicit list) |
| `stress` | `--backend` flag or auto-detected from model |
| `determinism` | `--backend` flag or auto-detected from model |
| `report` | N/A (operates on saved results) |
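
For example, a model given as an Ollama-style tag needs no `--backend` flag at all (the tag below is illustrative; resolution follows the openai-compat defaults described above):

```bash
infer-check determinism \
  --model ollama:llama3.1:8b-instruct-q4_K_M \
  --prompts determinism \
  --runs 20
```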