Skip to content

Latest commit

 

History

History
122 lines (91 loc) · 4.26 KB

File metadata and controls

122 lines (91 loc) · 4.26 KB

LLM Serving — Dual Node · Qwen3.6-27B (alias: coder)

Two llama-server instances behind user-level nginx. Zero sudo.

                    ┌── nginx ──┐
                    │ :8888     │
Client ─────────────│ :19101    │── least_conn ──┬─▶ Node 1  GPU 0,1  :8081
                    └───────────┘                 └─▶ Node 2  GPU 2,3  :8082

Quick Start

# 1. Generate an API key and save to .env
python3 gen_key.py

#    Or create manually:
#    cp .env.example .env && vim .env

# 2. One-time setup (symlink, nginx deploy, service register)
./setup-llama.sh

# 3. Start everything
./start_service.sh

Daily Commands

./start_service.sh    # Start node1, node2, nginx
./stop_service.sh      # Stop all services
./reload_nginx.sh     # Redeploy nginx config from template + reload

# Test inference
./test_comm.sh            # local via nginx (default)
./test_comm.sh node1      # direct to node 1
./test_comm.sh external   # from outside (ddns)

Configuration

API Key (.env)

Generate a random key and save to .env:

python3 gen_key.py          # generate key → save to .env
python3 gen_key.py --no-save  # just print, don't save

Or manually create .env:

DPI_FACTORY_API_KEY=sk-your-key-here
DPI_FACTORY_API_BASE=http://125.188.35.185:19101/v1

After changing the key, apply it: ./reload_nginx.sh
The key is never committed — .env is gitignored, and nginx/user-nginx.conf.template uses ${DPI_FACTORY_API_KEY} substituted at deploy time.

Model

Model Qwen3.6-27B (Q4_K_M)
Alias coder
Context 200,000 tokens per node
Path ~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GGUF/

Call with "model": "coder" in API requests.

Endpoints

Port What Auth
8888 nginx (local) Bearer token from .env
19101 nginx (external) Bearer token from .env
8081 Node 1 direct None
8082 Node 2 direct None

How It Works

  • setup-llama.sh creates ~/llm-serving → ~/ai-serving/llama-cpp-ex symlink, generates services from llama-node.service.in template (source of truth; generated files ignored in git), deploys nginx config from .env, and registers systemd services
  • All services run as systemd user units — no root needed
  • nginx listens on high ports (8888, 19101) — no sudo needed
  • deploy_nginx.sh reads .env → substitutes into nginx/user-nginx.conf.template → writes to ~/llm-serving/nginx/user-nginx.conf

Project Structure

llama-cpp-ex/
├── .env.example                        # Copy to .env, add your key
├── setup-llama.sh                      # One-time setup (generates services from template)
├── start_service.sh                     # Start services
├── stop_service.sh                      # Stop services
├── reload_nginx.sh                      # Redeploy + reload nginx
├── deploy_nginx.sh                      # Generate nginx config from .env + template
├── gen_key.py                           # Generate API key → .env
├── test_comm.sh                         # Test inference endpoints
├── Makefile                             # make lint (shellcheck)
├── systemd/
│   ├── llama-node.service.in            # Template (source of truth; generates node1/node2)
│   └── user-nginx.service               # User-level nginx
├── nginx/
│   └── user-nginx.conf.template         # Source of truth (envsubst)
├── benchmark/                           # Benchmark scripts
├── test/                                # test_dual_mode.sh
└── llama.cpp/                           # llama.cpp source + build

Boot / GPU Initialization

After OS reboot, GPUs may take up to 30s to fully initialize. Both llama-node1 and llama-node2 services have ExecStartPre that polls nvidia-smi until all 4 GPUs are detected (120s timeout). If timeout hits, Restart=always retries every 5s.

Manual start via ./start_service.sh also runs wait_gpus.sh before launching anything.

Hardware

4× NVIDIA GeForce RTX 4060 Ti (16 GB each, 64 GB total)
llama.cpp built with CUDA 12.8 support