dual-node-llm-serving/TODO.md at main · 3DAlgoLab/dual-node-llm-serving

Simplify current codebase. Just support dual mode, remove all others. (2026-04-23)

Changed target LLM from Gemma 4 26B A4B IT to Qwen3.6-27B. Centralized configuration in .env. (2026-04-24)

Set Model ID for API calls as coder, an alias. (2026-04-23)

Use environment variable for all API keys for security. (ex. .env) — via deploy_nginx.sh + envsubst from template. (2026-04-23)

For easy maintenance, keep this system not requiring sudo privilege if possible — implemented via ~/llm-serving symlink, user-level nginx on high ports, removed port 80 config. (2026-04-23)

Establish 2 node llama cpp servers

Check options

Support Flash Attention (enabled by default, auto)
Parallel Execution Slot, --parallel 2 per each node (2 GPUs per node)

Support API key (read from .env, DPI_FACTORY_API_KEY)

Using nginx with Bearer token auth

Internet Serving Feature

outer port: 19101 ✓
outer API base: http://125.188.35.185:19101/v1 ✓
API key required

Model re-naming for easy API calling (alias: coder)

Use -a coder when starting servers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TODO

Archived List

FilesExpand file tree

TODO.md

Latest commit

History

TODO.md

File metadata and controls

TODO

Archived List