You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A unified, local inference server for managing, running, and monitoring multiple LLM backends (Llama.cpp, OpenAI-compatible APIs, and more) with a modern web dashboard and tray integration.
4
+
5
+
## Features
6
+
7
+
-**Model Management**: Download, verify, start, stop, and delete LLM models from Hugging Face or manually.
8
+
-**Multi-Task Support**: Supports text generation, embeddings, reranking, and multimodal models.
9
+
-**Device Selection**: Run models on CPU, GPU, or NPU (if supported).
10
+
-**Web Dashboard**: Modern UI for status, logs, and model management.
11
+
-**Tray App**: System tray integration for quick access and server control.
12
+
-**OpenAI Proxy**: Exposes OpenAI-compatible endpoints for easy integration.
13
+
-**Cross-Platform**: Windows and Linux support.
14
+
15
+
## Directory Structure
16
+
17
+
```
18
+
.
19
+
├── app.py # Main FastAPI application entrypoint
20
+
├── modules/ # Core Python modules
21
+
│ ├── llamacpp/ # Llama.cpp management and GGUF downloader
22
+
│ ├── gpu_metrics.py # XPU/GPU metrics collection
23
+
│ ├── tray_app.py # System tray integration
24
+
│ └── utils.py # Utility functions
25
+
├── routers/ # FastAPI routers (API endpoints)
26
+
├── engine/ # Native binaries, licenses, and XPU headers
27
+
├── static/ # Web dashboard static files
28
+
├── tests/ # Example tests
29
+
├── config.yaml # Model/task configuration
30
+
├── verified.yaml # List of verified models
31
+
├── pyproject.toml # Python dependencies
32
+
└── README.md # This file
33
+
```
34
+
35
+
## Quick Start
36
+
37
+
1.**Install dependencies**
38
+
Python 3.12+ is required. This project uses `uv` for fast dependency management.
39
+
40
+
```sh
41
+
# Install uv (if you don't have it)
42
+
pip install uv
43
+
44
+
# Create a virtual environment and install dependencies
45
+
uv sync
46
+
```
47
+
48
+
2.**Run the server**
49
+
50
+
```sh
51
+
uv run app.py
52
+
```
53
+
54
+
3.**Support LlamaCPP backend and OVMS backend**
55
+
56
+
```sh
57
+
uv run app.py --backend ovms # for OVMS backend
58
+
uv run app.py --backend llamacpp # for LlamaCPP backend
59
+
```
60
+
61
+
4.**Access the dashboard**
62
+
Open [http://127.0.0.1:8000](http://127.0.0.1:8000) in your browser.
63
+
64
+
5.**Tray App**
65
+
The tray icon should appear automatically when running on supported platforms.
66
+
67
+
## Model Management
68
+
69
+
-**Download**: Use the dashboard or API to download models by Hugging Face repo ID.
70
+
-**Start/Stop**: Start or stop models for different tasks (text generation, embeddings, rerank, multimodal).
71
+
-**Device Selection**: Choose CPU/GPU/NPU for inference (if available).
72
+
-**Logs**: View download and runtime logs in the dashboard.
0 commit comments