A High-Performance LLM Inference Engine with PagedAttention & Continuous Batching
⚠️ Development Status: This project is in early development (v0.1.0). It currently uses a Mock GPU executor for testing and demonstration purposes. Real CUDA kernel support is planned but not yet implemented.
Hetero-Paged-Infer is an inference engine for Large Language Models (LLMs) written in Rust. It implements techniques from vLLM, namely PagedAttention and continuous batching, in a modular, testable architecture intended for future production deployment.
| Feature | Description | Status |
|---|---|---|
| PagedAttention KV Cache | Block-based memory management; literature context often reports <5% waste | ✅ |
| Continuous Batching | Dynamic prefill/decode scheduling | ✅ |
| Memory Pressure Awareness | Configurable OOM prevention | ✅ |
| Modular Architecture | Trait-based abstractions | ✅ |
| Comprehensive Testing | 121+ tests | ✅ |
| OpenAI-Compatible Server | /v1/completions + /v1/chat/completions + SSE | ✅ |
| CUDA Kernels | Real GPU execution | 🚧 Planned |
┌──────────────────────────────────────────────────────────────────────┐
│                        InferenceEngine (CPU)                         │
├──────────────────────────────────────────────────────────────────────┤
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────────────┐  │
│  │ Tokenizer  │  │ Scheduler  │  │        KV Cache Manager        │  │
│  │            │  │            │  │     BlockPool + PageTable      │  │
│  └─────┬──────┘  └─────┬──────┘  └───────────────┬────────────────┘  │
│        │               │                         │                   │
├────────┼───────────────┼─────────────────────────────────────────────┤
│        │        ┌──────▼──────┐                                      │
│        │        │ GPU Executor│  (CUDA / Mock)                       │
│        │        └──────┬──────┘                                      │
│        │        ┌──────▼──────┐                                      │
│        └───────►│  KV Cache   │  (GPU Memory)                        │
│                 └─────────────┘                                      │
└──────────────────────────────────────────────────────────────────────┘
- Rust 1.70+ (2021 edition)
- Linux (Ubuntu 20.04+ recommended) or macOS
# Clone the repository
git clone https://github.com/AICL-Lab/hetero-paged-infer.git
cd hetero-paged-infer
# Build in release mode
cargo build --release
# Run the test suite (121+ tests)
cargo test

# Basic usage
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50
# With custom parameters
./target/release/hetero-infer \
  --input "Explain quantum computing" \
  --max-tokens 100 \
  --temperature 0.8 \
  --top-p 0.95
# Start OpenAI-compatible HTTP server
./target/release/hetero-infer --serve

# Start server with default address 127.0.0.1:3000
cargo run -- --serve
# Health / readiness / metrics
curl http://127.0.0.1:3000/healthz
curl http://127.0.0.1:3000/readyz
curl http://127.0.0.1:3000/metrics
# Completions
curl http://127.0.0.1:3000/v1/completions \
  -H "content-type: application/json" \
  -d '{"model":"hetero-infer","prompt":"hello","max_tokens":8}'
# Chat completions
curl http://127.0.0.1:3000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{"model":"hetero-infer","messages":[{"role":"user","content":"say hi"}],"max_tokens":8}'

use hetero_infer::{EngineConfig, GenerationParams, InferenceEngine};
// Create engine with default configuration
let mut engine = InferenceEngine::new(EngineConfig::default())?;
// Submit a generation request
let request_id = engine.submit_request(
    "Hello, world!",
    GenerationParams {
        max_tokens: 100,
        temperature: 0.8,
        top_p: 0.95,
    },
)?;
// Run inference and collect results
let results = engine.run();
for result in results {
    println!("Generated: {}", result.output_text);
}

| Parameter | Default | Description |
|---|---|---|
| --block-size | 16 | Tokens per physical block |
| --max-num-blocks | 1024 | Total physical blocks |
| --max-batch-size | 32 | Max sequences per batch |
| --max-num-seqs | 256 | Maximum number of sequences |
| --max-model-len | 2048 | Maximum model context length |
| --max-total-tokens | 4096 | Maximum tokens per batch |
| --memory-threshold | 0.9 | Memory pressure threshold (0.0-1.0) |
| --max-tokens | 100 | Maximum tokens to generate |
| --temperature | 1.0 | Sampling temperature |
| --top-p | 0.9 | Nucleus sampling threshold |
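
To show how these defaults interact, here is a minimal sketch of the capacity arithmetic they imply. The `CacheConfig` type and its methods are hypothetical and are not the crate's `EngineConfig`; in particular, the admission rule (reject work once the used fraction of blocks would exceed the threshold) is an assumption about what `--memory-threshold` means, not a description of the actual scheduler.

```rust
/// Hypothetical mirror of three documented flags; not the crate's EngineConfig.
pub struct CacheConfig {
    pub block_size: usize,     // --block-size
    pub max_num_blocks: usize, // --max-num-blocks
    pub memory_threshold: f64, // --memory-threshold
}

impl CacheConfig {
    /// Total number of token slots the KV cache can hold.
    pub fn token_capacity(&self) -> usize {
        self.block_size * self.max_num_blocks
    }

    /// Blocks required to store `num_tokens`, rounding up to whole blocks.
    pub fn blocks_needed(&self, num_tokens: usize) -> usize {
        (num_tokens + self.block_size - 1) / self.block_size
    }

    /// Assumed admission rule: the request fits in the pool, and the pool
    /// stays at or below the pressure threshold after it is admitted.
    pub fn can_admit(&self, used_blocks: usize, num_tokens: usize) -> bool {
        let after = used_blocks + self.blocks_needed(num_tokens);
        let limit = self.memory_threshold * self.max_num_blocks as f64;
        after <= self.max_num_blocks && (after as f64) <= limit
    }
}
```

With the defaults above (16 tokens per block, 1024 blocks), the cache holds 16,384 token slots, and a full 2048-token sequence occupies 128 blocks. Under the assumed rule, a threshold of 0.9 stops admitting work once more than 921 blocks would be in use.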
Config file (config.json):
{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "max_num_seqs": 256,
  "max_model_len": 2048,
  "max_total_tokens": 4096,
  "memory_threshold": 0.9,
  "max_retry_attempts": 2,
  "tokenizer": {
    "kind": "simple",
    "path": null
  },
  "serving": {
    "host": "127.0.0.1",
    "port": 3000,
    "model_name": "hetero-infer",
    "backend": {
      "kind": "local_engine",
      "command": null
    }
  }
}

Load: ./hetero-infer --config config.json
For a HuggingFace tokenizer file:
{
  "tokenizer": {
    "kind": "huggingface",
    "path": "tokenizer.json"
  }
}

For command bridge mode:
{
  "serving": {
    "backend": {
      "kind": "command_bridge",
      "command": {
        "program": "/bin/sh",
        "args": ["-c", "printf 'bridge:%s' \"$HETERO_PROMPT\""]
      }
    }
  }
}

| Resource | Link |
|---|---|
| GitHub Pages | https://aicl-lab.github.io/hetero-paged-infer/ |
| Architecture Guide | docs/en/architecture/overview.md |
| Contributing Guide | CONTRIBUTING.md |
| Changelog | CHANGELOG.md |
# Build and open API documentation
cargo doc --open
# Build documentation site locally
cd docs
npm install
npm run build

| Approach | Memory Waste | Throughput | Description |
|---|---|---|---|
| Static Allocation | Prior-art pattern: ~40-60% | Prior-art baseline | Pre-allocate max context for each request |
| Dynamic Allocation | Prior-art pattern: ~20-30% | Literature context: +20% | Resize per request but still fragmented |
| PagedAttention | Literature context: <5% | Literature context: +50% | Block-based sharing with copy-on-write |
Note: Current benchmark figures are either measured with the mock executor or derived from architecture-level estimates. Real CUDA measurements are out of scope until the GPU backend is implemented.
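
To make the memory-waste column more concrete, the sketch below computes the unused fraction of reserved KV slots for a single request under two reservation policies. The function names are hypothetical, and the numbers are illustrative arithmetic for one request, not measurements from this project and not a reproduction of the aggregate figures in the table.

```rust
/// Unused fraction when `max_len` token slots are reserved up front
/// and only `used_tokens` of them are ever filled.
pub fn static_waste(used_tokens: usize, max_len: usize) -> f64 {
    if max_len == 0 {
        return 0.0;
    }
    max_len.saturating_sub(used_tokens) as f64 / max_len as f64
}

/// Unused fraction when slots are reserved in whole blocks of `block_size`;
/// only the tail of the last block can be empty.
pub fn paged_waste(used_tokens: usize, block_size: usize) -> f64 {
    let blocks = (used_tokens + block_size - 1) / block_size;
    let reserved = blocks * block_size;
    if reserved == 0 {
        return 0.0;
    }
    (reserved - used_tokens) as f64 / reserved as f64
}
```

For a request that ends up using 300 tokens, reserving the full default context of 2048 slots leaves about 85% of the reservation unused, while 16-token blocks reserve 304 slots and leave about 1.3% unused. Block-based allocation never wastes more than `block_size - 1` slots per sequence.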
Traditional LLM serving allocates contiguous memory blocks for each request's KV cache, leading to significant memory fragmentation and waste. PagedAttention solves this by:
- Block-based allocation: Split KV cache into fixed-size blocks
- On-demand paging: Allocate blocks only when needed
- Copy-on-write: Share blocks across sequences for efficient beam search
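
The three mechanisms above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the crate's implementation: the `BlockPool` and `PageTable` names are borrowed from the architecture diagram, but every field and method signature here is hypothetical, and the KV data copy that a real engine performs on copy-on-write is reduced to a comment.

```rust
use std::collections::HashMap;

pub const BLOCK_SIZE: usize = 16; // tokens per block (the documented default)

/// Fixed pool of physical blocks with per-block reference counts.
pub struct BlockPool {
    free: Vec<usize>,     // indices of unallocated blocks
    ref_counts: Vec<u32>, // number of sequences referencing each block
}

impl BlockPool {
    pub fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect(), ref_counts: vec![0; num_blocks] }
    }
    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
    fn allocate(&mut self) -> Option<usize> {
        let block = self.free.pop()?;
        self.ref_counts[block] = 1;
        Some(block)
    }
    fn release(&mut self, block: usize) {
        self.ref_counts[block] -= 1;
        if self.ref_counts[block] == 0 {
            self.free.push(block);
        }
    }
}

/// Per-sequence list of physical blocks plus the number of tokens stored.
pub struct PageTable {
    tables: HashMap<u64, Vec<usize>>,
    lens: HashMap<u64, usize>,
}

impl PageTable {
    pub fn new() -> Self {
        Self { tables: HashMap::new(), lens: HashMap::new() }
    }

    /// Append one token. Returns None if the pool has no free block.
    pub fn append_token(&mut self, pool: &mut BlockPool, seq: u64) -> Option<()> {
        let len = self.lens.get(&seq).copied().unwrap_or(0);
        let table = self.tables.entry(seq).or_default();
        if len % BLOCK_SIZE == 0 {
            // On-demand paging: a new block only when the last one is full.
            table.push(pool.allocate()?);
        } else {
            let last = *table.last()?;
            if pool.ref_counts[last] > 1 {
                // Copy-on-write: the tail block is shared, so this sequence
                // gets a private block (a real engine copies the KV data here).
                let fresh = pool.allocate()?;
                pool.release(last);
                *table.last_mut()? = fresh;
            }
        }
        *self.lens.entry(seq).or_insert(0) += 1;
        Some(())
    }

    /// Fork `parent` into `child`, sharing every block (e.g. for beam search).
    pub fn fork(&mut self, pool: &mut BlockPool, parent: u64, child: u64) {
        let blocks = self.tables.get(&parent).cloned().unwrap_or_default();
        for &block in &blocks {
            pool.ref_counts[block] += 1;
        }
        let len = self.lens.get(&parent).copied().unwrap_or(0);
        self.tables.insert(child, blocks);
        self.lens.insert(child, len);
    }

    pub fn blocks(&self, seq: u64) -> &[usize] {
        self.tables.get(&seq).map(|v| v.as_slice()).unwrap_or(&[])
    }
}
```

In this sketch, forking a sequence consumes no new blocks; a block is allocated only when the child writes into a shared, partially filled tail block. Full blocks earlier in the sequence remain shared between parent and child.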
# Run all tests
cargo test
# Run with coverage
cargo llvm-cov --html
# Run property-based tests
cargo test -- --test-threads=1

| Type | Coverage | Description |
|---|---|---|
| Unit Tests | Included in 121+ | Core functionality tests |
| Property Tests | Included in 121+ | Invariant verification with proptest |
| Integration Tests | Included in 121+ | End-to-end workflow tests |
| Doc Tests | Included in 121+ | Documentation examples |
| Overall | 121+ tests | Combined automated coverage across the repository |
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
# Run all checks before submitting
cargo test && cargo fmt --check && cargo clippy

- PagedAttention KV Cache
- Continuous Batching Scheduler
- Memory Pressure Awareness
- Property-Based Testing
- Real CUDA Kernels
- Real Tokenizer Integration
- Async CPU/GPU Overlap
MIT License - See LICENSE.
- vLLM - PagedAttention concept and inspiration
- Rust - Systems programming language
- Criterion - Statistical benchmarking
Made with ❤️ by AICL-Lab