Goals

Dual-node llama.cpp serving on 4 GPUs (Qwen3.6-27B, alias coder). User-level systemd + nginx reverse proxy. Zero sudo.

Architecture

  • Node 1: GPU 0,1 → port 8081 | Node 2: GPU 2,3 → port 8082
  • nginx: ports 8888 (local) / 19101 (external), Bearer auth from .env
  • Services auto-start on boot (WantedBy=default.target)
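
The node layout above can be sketched as a small lookup. This is only an illustration: the real mapping lives in the generated systemd units, which are not shown here. The function names are hypothetical, and pinning GPUs with `CUDA_VISIBLE_DEVICES` and launching `llama-server` with these flags are assumptions rather than the project's confirmed command line.

```shell
# Sketch only: map a node number to its GPU pair and port.

node_gpus() {
  case "$1" in
    1) echo "0,1" ;;
    2) echo "2,3" ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}

node_port() {
  case "$1" in
    1) echo 8081 ;;
    2) echo 8082 ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) a launch command for one node.
# Assumptions: GPU pinning via CUDA_VISIBLE_DEVICES, the llama-server binary,
# and a placeholder model path when MODEL_PATH is unset.
node_command() {
  local node="$1" gpus port
  gpus="$(node_gpus "$node")" || return 1
  port="$(node_port "$node")" || return 1
  printf 'CUDA_VISIBLE_DEVICES=%s llama-server --model %s --alias coder --port %s\n' \
    "$gpus" "${MODEL_PATH:-/path/to/model.gguf}" "$port"
}
```

Calling `node_command 2` prints the command for the second node, which makes it easy to compare against what the generated unit actually runs.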

Conventions

| Rule | Detail |
| --- | --- |
| .env is source of truth | All scripts load it via load_env(): MODEL_PATH, keys, params |
| Template-driven | systemd/llama-node.service.in generates node1/node2 via setup_llama.sh (sed + envsubst). Generated files are gitignored. |
| Nginx config | deploy_nginx.sh: .env → template → ~/llm-serving/nginx/user-nginx.conf (envsubst) |
| Symlink | ~/llm-serving → ~/ai-serving/llama-cpp-ex (created by setup) |
| Lint | make lint runs shellcheck on all .sh files |
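
A minimal sketch of the two conventions that do most of the work, loading `.env` and rendering a template with sed + envsubst. The actual `load_env()` and template are not shown in this file, so the body below is an assumption. The `@NODE@`, `@GPUS@` and `@PORT@` placeholders and the `render_unit` name are hypothetical.

```shell
# Export every KEY=value assignment found in the env file.
# Assumes plain shell-compatible assignments (no spaces around '=').
load_env() {
  local env_file="${1:-.env}"
  # A bare filename would be looked up in PATH by '.', so anchor it to cwd.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  if [ ! -f "$env_file" ]; then
    echo "load_env: $env_file not found" >&2
    return 1
  fi
  set -a            # auto-export everything assigned while sourcing
  . "$env_file"
  set +a
}

# Render one node's unit: sed fills the per-node placeholders,
# envsubst fills values that come from .env.
render_unit() {
  local template="$1" node="$2" gpus="$3" port="$4"
  # The single-quoted list limits envsubst to MODEL_PATH, so systemd's own
  # specifiers such as $MAINPID pass through untouched.
  sed -e "s/@NODE@/$node/g" -e "s/@GPUS@/$gpus/g" -e "s/@PORT@/$port/g" "$template" \
    | envsubst '$MODEL_PATH'
}
```

Restricting envsubst to a named variable list is a deliberate choice here: an unrestricted run would blank out any `$NAME` in the template that is not set in the environment.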

Key Scripts

| Script | Purpose |
| --- | --- |
| setup_llama.sh | One-time: symlink, generate services, deploy nginx, register systemd |
| start_service.sh | Manual start: GPU wait → node1 + node2 + nginx |
| stop_service.sh | Stop all services |
| reload_nginx.sh | Redeploy nginx config + reload |
| wait_gpus.sh | Poll nvidia-smi until 4 GPUs ready (used by start_service.sh) |
| test_comm.sh | Test inference endpoints |
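
The GPU wait step can be sketched as a bounded polling loop. This is not the contents of wait_gpus.sh; it assumes "ready" means that `nvidia-smi --list-gpus` prints one `GPU N: ...` line per device. The function names, the timeout and the interval are all hypothetical.

```shell
# List visible GPUs, one per line. Kept separate so it can be stubbed.
list_gpus() {
  nvidia-smi --list-gpus 2>/dev/null
}

# Count lines of the form "GPU 0: ..." on stdin.
count_gpus() {
  grep -c '^GPU '
}

# wait_for_gpus EXPECTED TIMEOUT_SECONDS INTERVAL_SECONDS
# Returns 0 once at least EXPECTED GPUs are listed, 1 on timeout.
wait_for_gpus() {
  local expected="${1:-4}" timeout="${2:-120}" interval="${3:-2}"
  local waited=0 found
  while :; do
    found="$(list_gpus | count_gpus)"
    if [ "${found:-0}" -ge "$expected" ]; then
      echo "GPUs ready: $found/$expected"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out: ${found:-0}/$expected GPUs after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

A caller such as a manual start script could run `wait_for_gpus 4 120 2 || exit 1` before starting the nodes, so that a missing GPU fails fast instead of producing a half-started service.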

Gotchas

  • Boot path ≠ start_service.sh: at boot, systemd starts the nodes directly and runs the GPU check in ExecStartPre. start_service.sh is for manual runs only.
  • After .env changes → run ./setup_llama.sh (regenerates services) then ./start_service.sh or systemctl --user restart llama-node1 llama-node2.
  • After API key change → ./reload_nginx.sh is sufficient.
  • ref/ — do not modify unless explicitly ordered.
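
The two follow-up paths above (general .env change versus API key change) can be captured in a small helper. This is a sketch only; `apply_change`, `run` and the `DRY_RUN` switch are hypothetical names that do not exist in the repo. It assumes it is run from the directory that holds the scripts.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# apply_change env  -> regenerate units, then restart both nodes
# apply_change key  -> redeploy and reload nginx only
apply_change() {
  case "$1" in
    env)
      run ./setup_llama.sh &&
        run systemctl --user restart llama-node1 llama-node2
      ;;
    key)
      run ./reload_nginx.sh
      ;;
    *)
      echo "usage: apply_change env|key" >&2
      return 2
      ;;
  esac
}
```

Running `DRY_RUN=1 apply_change env` prints the commands without touching the services, which is a cheap way to confirm the sequence before applying it.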