Dual-node llama.cpp serving on 4 GPUs (Qwen3.6-27B, alias coder). User-level systemd + nginx reverse proxy. Zero sudo.
- Node 1: GPU 0,1 → port 8081 | Node 2: GPU 2,3 → port 8082
- nginx: ports 8888 (local) / 19101 (external), Bearer auth from
.env
- Services auto-start on boot (
WantedBy=default.target)
| Rule |
Detail |
.env is source of truth |
All scripts load it via load_env() — MODEL_PATH, keys, params |
| Template-driven |
systemd/llama-node.service.in generates node1/node2 via setup_llama.sh (sed + envsubst). Generated files are gitignored. |
| Nginx config |
deploy_nginx.sh: .env → template → ~/llm-serving/nginx/user-nginx.conf (envsubst) |
| Symlink |
~/llm-serving → ~/ai-serving/llama-cpp-ex (created by setup) |
| Lint |
make lint runs shellcheck on all .sh files |
| Script |
Purpose |
setup_llama.sh |
One-time: symlink, generate services, deploy nginx, register systemd |
start_service.sh |
Manual start: GPU wait → node1 + node2 + nginx |
stop_service.sh |
Stop all services |
reload_nginx.sh |
Redeploy nginx config + reload |
wait_gpus.sh |
Poll nvidia-smi until 4 GPUs ready (used by start_service.sh) |
test_comm.sh |
Test inference endpoints |
- Boot path ≠ start_service.sh — systemd starts nodes directly via
ExecStartPre GPU check. start_service.sh is for manual runs only.
- After
.env changes → run ./setup_llama.sh (regenerates services) then ./start_service.sh or systemctl --user restart llama-node1 llama-node2.
- After API key change →
./reload_nginx.sh is sufficient.
ref/ — do not modify unless explicitly ordered.