LLM Serving — Dual Node · Qwen3.6-27B (alias: coder)

Two llama-server instances behind user-level nginx. Zero sudo.

                    ┌── nginx ──┐
                    │ :8888     │
Client ─────────────│ :19101    │── least_conn ──┬─▶ Node 1  GPU 0,1  :8081
                    └───────────┘                 └─▶ Node 2  GPU 2,3  :8082

Quick Start

# 1. Generate an API key and save to .env
python3 gen_key.py

#    Or create manually:
#    cp .env.example .env && vim .env

# 2. One-time setup (symlink, nginx deploy, service register)
./setup-llama.sh

# 3. Start everything
./start_service.sh

Daily Commands

./start_service.sh    # Start node1, node2, nginx
./stop_service.sh      # Stop all services
./reload_nginx.sh     # Redeploy nginx config from template + reload

# Test inference
./test_comm.sh            # local via nginx (default)
./test_comm.sh node1      # direct to node 1
./test_comm.sh external   # from outside (ddns)

Configuration

API Key (`.env`)

Generate a random key and save to .env:

python3 gen_key.py          # generate key → save to .env
python3 gen_key.py --no-save  # just print, don't save

Or manually create .env:

DPI_FACTORY_API_KEY=sk-your-key-here
DPI_FACTORY_API_BASE=http://125.188.35.185:19101/v1

After changing the key, apply it: ./reload_nginx.sh
The key is never committed — .env is gitignored, and nginx/user-nginx.conf.template uses ${DPI_FACTORY_API_KEY} substituted at deploy time.

Model


Model	Qwen3.6-27B (Q4_K_M)
Alias	`coder`
Context	200,000 tokens per node
Path	`~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GGUF/`

Call with "model": "coder" in API requests.

Endpoints

Port	What	Auth
8888	nginx (local)	Bearer token from `.env`
19101	nginx (external)	Bearer token from `.env`
8081	Node 1 direct	None
8082	Node 2 direct	None

How It Works

setup-llama.sh creates ~/llm-serving → ~/ai-serving/llama-cpp-ex symlink, generates services from llama-node.service.in template (source of truth; generated files ignored in git), deploys nginx config from .env, and registers systemd services
All services run as systemd user units — no root needed
nginx listens on high ports (8888, 19101) — no sudo needed
deploy_nginx.sh reads .env → substitutes into nginx/user-nginx.conf.template → writes to ~/llm-serving/nginx/user-nginx.conf

Project Structure

llama-cpp-ex/
├── .env.example                        # Copy to .env, add your key
├── setup-llama.sh                      # One-time setup (generates services from template)
├── start_service.sh                     # Start services
├── stop_service.sh                      # Stop services
├── reload_nginx.sh                      # Redeploy + reload nginx
├── deploy_nginx.sh                      # Generate nginx config from .env + template
├── gen_key.py                           # Generate API key → .env
├── test_comm.sh                         # Test inference endpoints
├── Makefile                             # make lint (shellcheck)
├── systemd/
│   ├── llama-node.service.in            # Template (source of truth; generates node1/node2)
│   └── user-nginx.service               # User-level nginx
├── nginx/
│   └── user-nginx.conf.template         # Source of truth (envsubst)
├── benchmark/                           # Benchmark scripts
├── test/                                # test_dual_mode.sh
└── llama.cpp/                           # llama.cpp source + build

Boot / GPU Initialization

After OS reboot, GPUs may take up to 30s to fully initialize. Both llama-node1 and llama-node2 services have ExecStartPre that polls nvidia-smi until all 4 GPUs are detected (120s timeout). If timeout hits, Restart=always retries every 5s.

Manual start via ./start_service.sh also runs wait_gpus.sh before launching anything.

Hardware

4× NVIDIA GeForce RTX 4060 Ti (16 GB each, 64 GB total)
llama.cpp built with CUDA 12.8 support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM Serving — Dual Node · Qwen3.6-27B (alias: coder)

Quick Start

Daily Commands

Configuration

API Key (`.env`)

Model

Endpoints

How It Works

Project Structure

Boot / GPU Initialization

Hardware

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

LLM Serving — Dual Node · Qwen3.6-27B (alias: coder)

Quick Start

Daily Commands

Configuration

API Key (.env)

Model

Endpoints

How It Works

Project Structure

Boot / GPU Initialization

Hardware

API Key (`.env`)