This directory contains examples for running Memlayer as an OpenAI-compatible reverse proxy server.
Memlayer Server is a FastAPI-based reverse proxy that adds persistent memory capabilities to llama-server, while maintaining full OpenAI API compatibility. This allows you to use any OpenAI-compatible client (SDKs, tools, frameworks) with your local models and get automatic memory features.
- 100% Offline: Uses local sentence-transformers for embeddings (no API calls)
- OpenAI-compatible: Drop-in replacement for OpenAI API
- Multi-user support: Per-user memory isolation via the X-User-ID header
- Tool calling: Native function calling support via llama-server
- Streaming: Full SSE streaming support
- Performance: Shared embedding model across users for efficiency
- Install llama.cpp and build llama-server:
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    make llama-server
- Download a GGUF model (e.g., from Hugging Face)
- Start llama-server with function calling support:
    ./llama-server -m /path/to/model.gguf --port 8080 -ngl 99 --chat-template llama3 --jinja
- Install Memlayer with server dependencies:
    pip install memlayer[server]
    # or for development:
    python3.12 -m pip install -e .[server]
# Using defaults (llama-server at localhost:8080, proxy at 0.0.0.0:8000)
python3.12 -m memlayer.server
# With custom settings
python3.12 -m memlayer.server \
--llama-host http://localhost:8080 \
--proxy-port 8000 \
--storage-path ./my_memories
# Enable debug mode
python3.12 -m memlayer.server --debug

from openai import OpenAI

# Point to Memlayer proxy instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # No API key required
)

# Use exactly like OpenAI API
response = client.chat.completions.create(
    model="qwen2.5:7b",  # Your llama-server model
    messages=[
        {"role": "user", "content": "My name is Alice"}
    ]
)
print(response.choices[0].message.content)

run_server.py: Simple script to start the server with custom configuration.
python3.12 examples/07_server/run_server.py

test_client.py: Demonstrates using the OpenAI SDK with the Memlayer proxy, including:
- Storing memories
- Retrieving memories
- Multi-turn conversations
python3.12 examples/07_server/test_client.py

multi_user_example.py: Shows how to use per-user memory isolation with the X-User-ID header.
python3.12 examples/07_server/multi_user_example.py

┌─────────────────┐
│  Your Client    │
│  (OpenAI SDK)   │
└────────┬────────┘
         │ HTTP POST /v1/chat/completions
         ▼
┌─────────────────────────────────────┐
│  Memlayer Proxy (FastAPI)           │
│  - Parse OpenAI request             │
│  - Extract user_id from X-User-ID   │
│  - Route to LlamaServer wrapper     │
│  - Add memory via SearchService     │
│  - Return OpenAI response           │
└────────┬────────────────────────────┘
         │
         ├──► LocalEmbeddingModel (shared, singleton)
         │    └─ sentence-transformers (384d)
         │
         ├──► ChromaDB (per user_id)
         │    └─ Vector storage for facts
         │
         ├──► NetworkX (per user_id)
         │    └─ Graph storage for relationships
         │
         ▼
┌─────────────────┐
│  llama-server   │
│  (llama.cpp)    │
└─────────────────┘
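
The request flow in the diagram can be sketched in plain Python. This is a rough illustration, not Memlayer's actual code: `UserMemory`, `resolve_user_id`, `handle_chat`, and the keyword-overlap search are all hypothetical stand-ins (the real proxy uses embeddings, ChromaDB, and NetworkX), and `call_llm` stands in for the call to llama-server.

```python
from dataclasses import dataclass, field

DEFAULT_USER_ID = "default_user"  # default user ID named in this README


@dataclass
class UserMemory:
    """Hypothetical stand-in for the per-user vector and graph stores."""
    facts: list[str] = field(default_factory=list)

    def search(self, query: str) -> list[str]:
        # Naive keyword overlap; the real server uses embedding similarity.
        words = set(query.lower().split())
        return [f for f in self.facts if words & set(f.lower().split())]


def resolve_user_id(headers: dict[str, str]) -> str:
    """Read X-User-ID case-insensitively, falling back to the default."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-user-id") or DEFAULT_USER_ID


def handle_chat(headers, body, memories, call_llm):
    """Pick the user, recall memories, forward the request, wrap the reply."""
    user_id = resolve_user_id(headers)
    memory = memories.setdefault(user_id, UserMemory())
    question = body["messages"][-1]["content"]
    recalled = memory.search(question)
    messages = list(body["messages"])
    if recalled:
        context = "Known facts: " + "; ".join(recalled)
        messages.insert(0, {"role": "system", "content": context})
    answer = call_llm(messages)  # assumption: stands in for llama-server
    return {
        "object": "chat.completion",
        "model": body["model"],
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": answer},
            "finish_reason": "stop",
        }],
    }
```

Passing `call_llm` as a parameter keeps the sketch runnable without a model; in a real deployment that step would be an HTTP call to llama-server.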
All settings can be configured via environment variables:
export MEMLAYER_LLAMA_SERVER_HOST=http://localhost:8080
export MEMLAYER_PROXY_PORT=8000
export MEMLAYER_STORAGE_PATH=./memlayer_data
export MEMLAYER_DEFAULT_USER_ID=default_user
export MEMLAYER_LOG_LEVEL=INFO
export MEMLAYER_DEBUG_MODE=false

python3.12 -m memlayer.server --help

Options:
- --llama-host: llama-server URL
- --llama-port: llama-server port
- --proxy-host: Proxy bind address
- --proxy-port: Proxy port
- --storage-path: Memory storage location
- --no-curation: Disable memory curation
- --curation-interval: Curation interval (seconds)
- --scheduler-interval: Task scheduler interval (seconds)
- --debug: Enable debug logging
- --log-level: Logging level (DEBUG, INFO, WARNING, ERROR)
- --reload: Enable auto-reload for development
- --workers: Number of worker processes
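
As an illustration of how the MEMLAYER_* environment variables could map to typed settings, here is a minimal sketch. `ServerSettings` and `load_settings` are hypothetical names, and the fallback values simply mirror the examples in this README; Memlayer's real defaults and parsing rules may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSettings:
    llama_server_host: str
    proxy_port: int
    storage_path: str
    default_user_id: str
    log_level: str
    debug_mode: bool


def load_settings(env=None) -> ServerSettings:
    """Build settings from MEMLAYER_* variables (fallbacks are assumptions)."""
    env = os.environ if env is None else env
    debug = env.get("MEMLAYER_DEBUG_MODE", "false").strip().lower()
    return ServerSettings(
        llama_server_host=env.get("MEMLAYER_LLAMA_SERVER_HOST", "http://localhost:8080"),
        proxy_port=int(env.get("MEMLAYER_PROXY_PORT", "8000")),
        storage_path=env.get("MEMLAYER_STORAGE_PATH", "./memlayer_data"),
        default_user_id=env.get("MEMLAYER_DEFAULT_USER_ID", "default_user"),
        log_level=env.get("MEMLAYER_LOG_LEVEL", "INFO").upper(),
        debug_mode=debug in {"1", "true", "yes"},
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the function easy to exercise with a plain dictionary.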
POST /v1/chat/completions: OpenAI-compatible chat completions endpoint with memory.
Headers:
- Content-Type: application/json
- X-User-ID: <user_id> (optional, defaults to "default_user")
Request Body:
{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "user", "content": "What's my name?"}
  ],
  "temperature": 0.7,
  "stream": false
}

Response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "qwen2.5:7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Your name is Alice."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

Health check endpoint.
Response:
{
  "service": "memlayer-server",
  "status": "ready",
  "llama_server": "http://localhost:8080",
  "mode": "offline (local embeddings)"
}

Use the X-User-ID header to isolate memories per user:
import httpx

response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-User-ID": "alice"
    },
    json={
        "model": "qwen2.5:7b",
        "messages": [
            {"role": "user", "content": "My favorite color is blue"}
        ]
    }
)

Each user gets their own isolated storage at {storage_path}/{user_id}/.
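
Because the user ID becomes part of a filesystem path, it is worth seeing how the {storage_path}/{user_id}/ layout might be computed safely. The sketch below is an assumption-laden illustration: `user_storage_dir` is a hypothetical helper, and the character whitelist is one possible policy, not something this README states about Memlayer.

```python
import re
from pathlib import Path

# Assumption: only these characters are allowed in a user ID.
_SAFE_ID = re.compile(r"[A-Za-z0-9_.-]+")


def user_storage_dir(storage_path: str, user_id: str) -> Path:
    """Return {storage_path}/{user_id}, rejecting IDs that could escape the root."""
    if not _SAFE_ID.fullmatch(user_id) or user_id in {".", ".."}:
        raise ValueError(f"unsafe user_id: {user_id!r}")
    return Path(storage_path) / user_id
```

For example, `user_storage_dir("./my_memories", "alice")` points at the `alice` directory under `my_memories`, while an ID such as `../bob` is rejected.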
Streaming is supported via Server-Sent Events (SSE):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
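
For clients that do not use the OpenAI SDK, the stream can be decoded by hand. The sketch below assumes the proxy emits OpenAI-style SSE lines (`data: {json}` chunks followed by `data: [DONE]`); `iter_sse_content` is a hypothetical helper, and the exact wire format should be checked against a real response.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from OpenAI-style SSE lines until the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

The function accepts any iterable of text lines, so it can be fed from a streaming HTTP response or from a list of captured lines.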
- Shared Embedding Model: The server uses a singleton embedding model shared across all users, reducing memory usage and initialization time.
- Per-User Client Caching: LlamaServer wrapper instances are cached per user_id, avoiding redundant initialization.
- Background Processing: Memory consolidation runs asynchronously, so responses are fast.
- GPU Acceleration: Use the -ngl 99 flag with llama-server to offload layers to GPU.
- Model Selection: Use smaller GGUF models (e.g., qwen2.5:3b) for faster responses, or larger models (e.g., llama3.2:8b) for better quality.
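
The first two tips describe a common pattern: load one expensive object once, and cache lightweight per-user wrappers around it. A minimal sketch of that pattern follows; `EmbeddingModel`, `UserClient`, `shared_model`, and `client_for` are hypothetical names, and Memlayer's own caching may be implemented differently.

```python
from functools import lru_cache


class EmbeddingModel:
    """Hypothetical stand-in for an embedding model that is costly to load."""
    load_count = 0

    def __init__(self):
        EmbeddingModel.load_count += 1


class UserClient:
    """Hypothetical per-user wrapper that reuses the shared model."""

    def __init__(self, user_id, model):
        self.user_id = user_id
        self.model = model


@lru_cache(maxsize=1)
def shared_model() -> EmbeddingModel:
    # Created on first use, then reused for every user.
    return EmbeddingModel()


@lru_cache(maxsize=None)
def client_for(user_id: str) -> UserClient:
    # One cached client per user_id; repeat calls skip initialization.
    return UserClient(user_id, shared_model())
```

With this layout, requests from many users construct the model only once, and a returning user gets the same client object back.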
- Check llama-server is running: curl http://localhost:8080/health
- Check port is available: lsof -i :8000
- Check Python version: Must use python3.12 (Homebrew-installed)
- Verify proxy URL: curl http://localhost:8000/
- Check firewall settings
- Try localhost instead of 0.0.0.0: --proxy-host 127.0.0.1
- Wait for consolidation: Memory extraction runs in a background thread and takes 2-3 seconds
- Check storage path exists and is writable
- Enable debug mode: Pass --debug to see detailed logs
- Ensure llama-server supports function calling: Use the --chat-template and --jinja flags
- Check model supports function calling: Not all models do
- Try explicit system prompts if the model doesn't have native support
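
The reachability checks in the lists above (is llama-server running, is the proxy port answering) can also be scripted. The following sketch uses only the standard library; `port_is_open` and `diagnose` are hypothetical helpers, and the port numbers are the defaults used throughout this README.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(llama_port: int = 8080, proxy_port: int = 8000,
             host: str = "127.0.0.1") -> list[str]:
    """Collect human-readable hints for the two ports used in this README."""
    hints = []
    if not port_is_open(host, llama_port):
        hints.append(f"llama-server is not reachable on port {llama_port}")
    if not port_is_open(host, proxy_port):
        hints.append(f"Memlayer proxy is not reachable on port {proxy_port}")
    return hints
```

An empty list from `diagnose()` means both ports accepted a connection; it does not prove that the services behind them are healthy.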
- Read the main documentation
- Try the provider examples
- Explore search tiers
- Learn about memory lifecycle
For issues, questions, or feature requests:
- GitHub Issues: https://github.com/divagr18/memlayer/issues
- Documentation: https://divagr18.github.io/memlayer/