Hetero-Paged-Infer

A High-Performance LLM Inference Engine with PagedAttention & Continuous Batching

⚠️ Development Status: This project is in early development (v0.1.0). It currently uses a Mock GPU executor for testing and demonstration purposes. Real CUDA kernel support is planned but not yet implemented.

English | 中文 | Documentation

Overview

Hetero-Paged-Infer is an inference engine for Large Language Models (LLMs) written in Rust. It implements PagedAttention and continuous batching, techniques popularized by vLLM, behind a modular, testable architecture intended for future production deployment.

Feature	Description	Status
PagedAttention KV Cache	Block-based memory management; published PagedAttention results report <5% memory waste	✅
Continuous Batching	Dynamic prefill/decode scheduling	✅
Memory Pressure Awareness	Configurable OOM prevention	✅
Modular Architecture	Trait-based abstractions	✅
Comprehensive Testing	121+ tests	✅
OpenAI-Compatible Server	`/v1/completions` + `/v1/chat/completions` + SSE	✅
CUDA Kernels	Real GPU execution	🚧 Planned
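
The Continuous Batching row above can be illustrated with a small scheduler loop. This is a minimal sketch, not the crate's scheduler: `Request`, `Scheduler`, and `step` are hypothetical names, and the admission rule (batch size plus a token budget, mirroring `--max-batch-size` and `--max-total-tokens`) is an assumption about how those limits might be applied.

```rust
use std::collections::VecDeque;

/// Hypothetical request state: prompt length plus generation progress.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u32,
    pub prompt_len: usize,
    pub generated: usize,
    pub max_tokens: usize,
}

impl Request {
    fn total_tokens(&self) -> usize {
        self.prompt_len + self.generated
    }
    fn done(&self) -> bool {
        self.generated >= self.max_tokens
    }
}

/// Hypothetical scheduler holding a waiting queue and the running batch.
pub struct Scheduler {
    pub waiting: VecDeque<Request>,
    pub running: Vec<Request>,
    pub max_batch_size: usize,
    pub max_total_tokens: usize,
}

impl Scheduler {
    /// One iteration: admit new requests, decode one token per running
    /// sequence, retire finished ones. Returns the ids finished this step.
    pub fn step(&mut self) -> Vec<u32> {
        // 1. Admit waiting requests while both limits hold (assumed policy).
        let mut budget: usize = self.running.iter().map(|r| r.total_tokens()).sum();
        while self.running.len() < self.max_batch_size {
            let fits = match self.waiting.front() {
                Some(next) => budget + next.prompt_len <= self.max_total_tokens,
                None => false,
            };
            if !fits {
                break;
            }
            let req = self.waiting.pop_front().expect("front() was Some");
            budget += req.prompt_len;
            self.running.push(req);
        }
        // 2. Stand-in for the model call: every running sequence gains one token.
        for r in &mut self.running {
            r.generated += 1;
        }
        // 3. Finished sequences leave at once, freeing a slot for the next step.
        let mut finished = Vec::new();
        self.running.retain(|r| {
            if r.done() {
                finished.push(r.id);
                false
            } else {
                true
            }
        });
        finished
    }
}
```

Because slots are refilled on every step, a short request queued behind a long one can start as soon as any running sequence finishes, rather than waiting for the whole batch to drain.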

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        InferenceEngine (CPU)                          │
├──────────────────────────────────────────────────────────────────────┤
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────────────┐  │
│  │ Tokenizer  │  │ Scheduler  │  │      KV Cache Manager          │  │
│  │            │  │            │  │   BlockPool + PageTable        │  │
│  └─────┬──────┘  └─────┬──────┘  └───────────────┬────────────────┘  │
│        │               │                         │                    │
├────────┼───────────────┼─────────────────────────────────────────────┤
│        │        ┌──────▼──────┐                                       │
│        │        │ GPU Executor│  (CUDA / Mock)                        │
│        │        └──────┬──────┘                                       │
│        │        ┌──────▼──────┐                                       │
│        └───────►│  KV Cache   │  (GPU Memory)                         │
│                 └─────────────┘                                       │
└──────────────────────────────────────────────────────────────────────┘
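
The diagram shows the executor slot as "CUDA / Mock", and the feature table mentions trait-based abstractions. One way such a boundary could look is sketched below. The names `Executor`, `MockExecutor`, `step`, and `decode` are hypothetical and chosen for illustration; they are not taken from the crate's API.

```rust
/// Hypothetical executor boundary: the CPU-side engine hands over a batch of
/// token-id sequences and receives one next-token id per sequence.
pub trait Executor {
    fn step(&mut self, batch: &[Vec<u32>]) -> Vec<u32>;
}

/// Illustrative mock backend: deterministic and GPU-free.
/// It "predicts" the last token of each sequence plus one.
pub struct MockExecutor {
    pub steps_run: usize,
}

impl Executor for MockExecutor {
    fn step(&mut self, batch: &[Vec<u32>]) -> Vec<u32> {
        self.steps_run += 1;
        batch
            .iter()
            .map(|seq| seq.last().copied().unwrap_or(0).wrapping_add(1))
            .collect()
    }
}

/// Engine-side code depends only on the trait, so a different backend could
/// be substituted without changing this function.
pub fn decode<E: Executor>(
    exec: &mut E,
    mut seqs: Vec<Vec<u32>>,
    new_tokens: usize,
) -> Vec<Vec<u32>> {
    for _ in 0..new_tokens {
        let next = exec.step(&seqs);
        for (seq, tok) in seqs.iter_mut().zip(next) {
            seq.push(tok);
        }
    }
    seqs
}
```

Calling `decode(&mut MockExecutor { steps_run: 0 }, vec![vec![1, 2]], 3)` extends the sequence one token per step, which is enough to exercise scheduling and cache logic in tests without a GPU.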

Quick Start

Prerequisites

Rust 1.70+ (2021 edition)
Linux (Ubuntu 20.04+ recommended) or macOS

Installation

# Clone the repository
git clone https://github.com/AICL-Lab/hetero-paged-infer.git
cd hetero-paged-infer

# Build in release mode
cargo build --release

# Run the test suite (121+ tests)
cargo test

CLI Usage

# Basic usage
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50

# With custom parameters
./target/release/hetero-infer \
  --input "Explain quantum computing" \
  --max-tokens 100 \
  --temperature 0.8 \
  --top-p 0.95

# Start OpenAI-compatible HTTP server
./target/release/hetero-infer --serve

OpenAI-Compatible Server

# Start server with default address 127.0.0.1:3000
cargo run -- --serve

# Health / readiness / metrics
curl http://127.0.0.1:3000/healthz
curl http://127.0.0.1:3000/readyz
curl http://127.0.0.1:3000/metrics

# Completions
curl http://127.0.0.1:3000/v1/completions \
  -H "content-type: application/json" \
  -d '{"model":"hetero-infer","prompt":"hello","max_tokens":8}'

# Chat completions
curl http://127.0.0.1:3000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{"model":"hetero-infer","messages":[{"role":"user","content":"say hi"}],"max_tokens":8}'

Library Usage

use hetero_infer::{EngineConfig, GenerationParams, InferenceEngine};

// The `?` operator needs a function that returns Result.
// Assumes the crate's error types implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create engine with default configuration
    let mut engine = InferenceEngine::new(EngineConfig::default())?;

    // Submit a generation request
    let request_id = engine.submit_request(
        "Hello, world!",
        GenerationParams {
            max_tokens: 100,
            temperature: 0.8,
            top_p: 0.95,
        },
    )?;

    // Run inference and collect results
    let results = engine.run();
    for result in results {
        println!("Generated: {}", result.output_text);
    }
    Ok(())
}

Configuration

Parameter	Default	Description
`--block-size`	16	Tokens per physical block
`--max-num-blocks`	1024	Total physical blocks
`--max-batch-size`	32	Max sequences per batch
`--max-num-seqs`	256	Maximum number of sequences
`--max-model-len`	2048	Maximum model context length
`--max-total-tokens`	4096	Maximum tokens per batch
`--memory-threshold`	0.9	Memory pressure threshold (0.0-1.0)
`--max-tokens`	100	Maximum tokens to generate
`--temperature`	1.0	Sampling temperature
`--top-p`	0.9	Nucleus sampling threshold
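
To see how the defaults above interact, the arithmetic can be sketched as a few helper functions. The function names are hypothetical, and the reading of `--memory-threshold` as "fraction of blocks in use" is an assumption; the engine's actual rule may differ.

```rust
/// Blocks needed to hold `num_tokens` tokens (ceiling division).
/// `block_size` must be greater than zero.
pub fn blocks_needed(num_tokens: usize, block_size: usize) -> usize {
    (num_tokens + block_size - 1) / block_size
}

/// Total token slots available in the block pool.
pub fn pool_capacity_tokens(block_size: usize, max_num_blocks: usize) -> usize {
    block_size * max_num_blocks
}

/// ASSUMPTION: the threshold is interpreted as "the pool is under pressure
/// once this fraction of its blocks is in use".
pub fn under_pressure(used_blocks: usize, max_num_blocks: usize, threshold: f64) -> bool {
    used_blocks as f64 / max_num_blocks as f64 >= threshold
}
```

With the defaults, the pool holds 16 × 1024 = 16384 token slots. A sequence at the full `--max-model-len` of 2048 tokens needs 128 blocks, so eight such sequences would fill the pool. Under the assumed threshold rule, 0.9 is crossed at 922 of 1024 blocks.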

Config file (config.json):

{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "max_num_seqs": 256,
  "max_model_len": 2048,
  "max_total_tokens": 4096,
  "memory_threshold": 0.9,
  "max_retry_attempts": 2,
  "tokenizer": {
    "kind": "simple",
    "path": null
  },
  "serving": {
    "host": "127.0.0.1",
    "port": 3000,
    "model_name": "hetero-infer",
    "backend": {
      "kind": "local_engine",
      "command": null
    }
  }
}

Load it with: ./target/release/hetero-infer --config config.json

For a HuggingFace tokenizer file:

{
  "tokenizer": {
    "kind": "huggingface",
    "path": "tokenizer.json"
  }
}

For command bridge mode:

{
  "serving": {
    "backend": {
      "kind": "command_bridge",
      "command": {
        "program": "/bin/sh",
        "args": ["-c", "printf 'bridge:%s' \"$HETERO_PROMPT\""]
      }
    }
  }
}

Documentation

Resource	Link
GitHub Pages	https://aicl-lab.github.io/hetero-paged-infer/

Local Documentation

# Build and open API documentation
cargo doc --open

# Build documentation site locally
cd docs
npm install
npm run build

Performance

Approach	Memory Waste	Throughput	Description
Static Allocation	~40-60% (reported in prior work)	Baseline	Pre-allocates the maximum context length for every request
Dynamic Allocation	~20-30% (reported in prior work)	+20% over baseline (literature)	Resizes per request, but memory still fragments
PagedAttention	<5% (literature)	+50% over baseline (literature)	Block-based allocation with copy-on-write sharing

Note: Current benchmark figures are either measured with the mock executor or derived from architecture-level estimates. Real CUDA measurements are out of scope until the GPU backend is implemented.
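
The difference in memory waste follows from simple arithmetic, sketched below for a single request. This is an illustration, not a benchmark: the function names are hypothetical, and real waste depends on the distribution of sequence lengths in the workload.

```rust
/// Fraction of reserved KV-cache slots left unused when a request reserves
/// `max_model_len` slots up front (static allocation).
/// Requires `actual_tokens <= max_model_len` and `max_model_len > 0`.
pub fn static_waste(actual_tokens: usize, max_model_len: usize) -> f64 {
    (max_model_len - actual_tokens) as f64 / max_model_len as f64
}

/// Fraction left unused when slots are reserved one block at a time.
/// Only the last block of a sequence can be partially filled, so at most
/// `block_size - 1` slots are wasted per sequence.
pub fn paged_waste(actual_tokens: usize, block_size: usize) -> f64 {
    let blocks = (actual_tokens + block_size - 1) / block_size;
    let reserved = blocks * block_size;
    if reserved == 0 {
        return 0.0; // nothing reserved, nothing wasted
    }
    (reserved - actual_tokens) as f64 / reserved as f64
}
```

For a 1000-token sequence with `max_model_len` 2048 and `block_size` 16, static allocation leaves 1048 of 2048 slots unused (about 51%), while block-based allocation reserves 63 blocks (1008 slots) and leaves 8 unused (under 1%).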

Why PagedAttention?

Traditional LLM serving allocates contiguous memory blocks for each request's KV cache, leading to significant memory fragmentation and waste. PagedAttention solves this by:

Block-based allocation: Split KV cache into fixed-size blocks
On-demand paging: Allocate blocks only when needed
Copy-on-write: Share blocks across sequences for efficient beam search
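
The three mechanisms above can be sketched together in a few dozen lines. This is a minimal illustration that tracks block ids only; it stores no tensors. `BlockPool`, `Sequence`, `append`, `fork`, and `free` are hypothetical names and need not match the crate's BlockPool and PageTable types.

```rust
/// Hypothetical pool of fixed-size physical blocks, tracked by reference count.
pub struct BlockPool {
    free: Vec<usize>,     // ids of unused physical blocks
    ref_counts: Vec<u32>, // number of sequences mapping each block
}

impl BlockPool {
    pub fn new(num_blocks: usize) -> Self {
        BlockPool {
            free: (0..num_blocks).rev().collect(),
            ref_counts: vec![0; num_blocks],
        }
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }

    fn alloc(&mut self) -> Option<usize> {
        let id = self.free.pop()?;
        self.ref_counts[id] = 1;
        Some(id)
    }

    fn release(&mut self, id: usize) {
        self.ref_counts[id] -= 1;
        if self.ref_counts[id] == 0 {
            self.free.push(id);
        }
    }
}

/// One sequence's page table: logical block i lives in physical block `blocks[i]`.
#[derive(Default)]
pub struct Sequence {
    pub blocks: Vec<usize>,
    pub len: usize,
}

impl Sequence {
    /// Append one token. Returns false if the pool has no free block.
    pub fn append(&mut self, pool: &mut BlockPool, block_size: usize) -> bool {
        if self.len % block_size == 0 {
            // On-demand paging: take a new block only when the last one is full.
            match pool.alloc() {
                Some(id) => self.blocks.push(id),
                None => return false,
            }
        } else {
            let last = *self.blocks.last().expect("a partially filled block exists");
            if pool.ref_counts[last] > 1 {
                // Copy-on-write: the block is shared, so write into a private copy.
                let copy = match pool.alloc() {
                    Some(id) => id,
                    None => return false,
                };
                pool.release(last);
                *self.blocks.last_mut().expect("checked above") = copy;
            }
        }
        self.len += 1;
        true
    }

    /// Fork (for example, a beam-search branch): share every block with the parent.
    pub fn fork(&self, pool: &mut BlockPool) -> Sequence {
        for &id in &self.blocks {
            pool.ref_counts[id] += 1;
        }
        Sequence { blocks: self.blocks.clone(), len: self.len }
    }

    /// Return this sequence's blocks; a shared block is freed by its last owner.
    pub fn free(self, pool: &mut BlockPool) {
        for id in self.blocks {
            pool.release(id);
        }
    }
}
```

Forking a 6-token sequence with a block size of 4 consumes no new blocks. The first append on the fork copies only the partially filled last block, so the full first block stays shared between parent and child.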

Testing

# Run all tests
cargo test

# Run with coverage
cargo llvm-cov --html

# Run all tests serially on one thread (includes the property-based tests)
cargo test -- --test-threads=1

Type	Coverage	Description
Unit Tests	Included in 121+	Core functionality tests
Property Tests	Included in 121+	Invariant verification with proptest
Integration Tests	Included in 121+	End-to-end workflow tests
Doc Tests	Included in 121+	Documentation examples
Overall	121+ tests	Combined automated coverage across the repository

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.

# Run all checks before submitting
cargo test && cargo fmt --check && cargo clippy

Roadmap

License

MIT License - See LICENSE.

Acknowledgments

vLLM - PagedAttention concept and inspiration
Rust - Systems programming language
Criterion - Statistical benchmarking

Made with ❤️ by AICL-Lab
