| 1 | +# Start AgentJet Swarm Server via Docker |
| 2 | + |
| 3 | +This guide explains how to launch the **AgentJet Swarm Server** inside a Docker container. The Swarm Server is the GPU-side component responsible for gradient computation and weight updates. It exposes an OpenAI-compatible API that Swarm Clients connect to during training. |
| 4 | + |
| 5 | +> **Not familiar with Swarm?** Read the [Swarm Introduction](./swarm_intro.md) first. |
| 6 | + |
| 7 | + |
| 8 | +## Prerequisites |
| 9 | + |
| 10 | +| Requirement | Detail | |
| 11 | +|---|---| |
| 12 | +| Docker | With GPU support (`nvidia-container-toolkit`) | |
| 13 | +| AgentJet Docker image | `ajet:latest` (built from the AgentJet repository) | |
| 14 | +| LLM model weights | Downloaded locally (e.g., `Qwen2.5-7B-Instruct`) | |
| 15 | + |
| 16 | + |
| 17 | +## Command Template |
| 18 | + |
| 19 | +Run the command below: |
| 20 | + |
| 21 | +```bash |
| 22 | +docker run --rm -it \ |
| 23 | + -v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct \ |
| 24 | + -v ./swarmlog:/workspace/log \ |
| 25 | + -p 10086:10086 \ |
| 26 | + ajet:latest \ |
| 27 | + bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)" |
| 28 | +``` |
| 29 | + |
| 30 | +Once startup completes, you will see an interface like the one below, which indicates that the deployment succeeded: |
| 31 | + |
| 32 | +<div align="center"> |
| 33 | +<img width="640" alt="image" src="https://serve.gptacademic.cn/publish/shared/Image/swarm_overwatch.jpg"/> |
| 34 | +</div> |
| 35 | + |
| 36 | +<br/> |
| 37 | +<br/> |
| 38 | +<br/> |
| 39 | + |
| 40 | + |
| 41 | +| Flag / Argument | What it does | |
| 42 | +|---|---| |
| 43 | +| `--rm` | Automatically removes the container when it exits, so stopped containers do not accumulate. | |
| 44 | +| `-it` | Keeps stdin open and allocates a pseudo-TTY. Required for the `ajet-swarm overwatch` TUI monitor to render correctly inside the container. | |
| 45 | +| `-v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct` | **Model mount** — mounts your local model weights directory into the container. The path inside the container must match the `model` field you configure in your training job. | |
| 46 | +| `-v ./swarmlog:/workspace/log` | **Log mount** — mounts a local `./swarmlog` directory to persist server logs outside the container. The VERL training log is written here. | |
| 47 | +| `-p 10086:10086` | **Port mapping** — exposes port `10086` so that Swarm Clients on other machines can reach the server via `http://<server-ip>:10086`. | |
| 48 | +| `ajet:latest` | The AgentJet Docker image. | |
| 49 | +| `bash -c "..."` | Runs two processes concurrently inside the container (see below). | |
| 50 | + |
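
If you launch the server from a script, the flags in the table can be assembled programmatically. The sketch below is illustrative only: `build_docker_command` and its parameters are hypothetical names, not part of AgentJet. The optional `--gpus` argument is an assumption as well. It is a standard Docker flag that some hosts need in order to expose GPUs to the container, but it does not appear in the original command.

```python
import shlex


def build_docker_command(host_model_dir, container_model_dir,
                         log_dir="./swarmlog", host_port=10086, gpus=None):
    """Assemble the `docker run` argument list described in the table above."""
    # The two processes started inside the container, copied from the template.
    inner = (
        "(ajet-swarm overwatch) & "
        "(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start "
        "&>/workspace/log/swarm_server.log)"
    )
    cmd = ["docker", "run", "--rm", "-it"]
    if gpus is not None:
        # Assumption: not in the original command; add only if your Docker
        # setup requires an explicit GPU request (for example gpus="all").
        cmd += ["--gpus", gpus]
    cmd += [
        "-v", f"{host_model_dir}:{container_model_dir}",  # model mount
        "-v", f"{log_dir}:/workspace/log",                # log mount
        "-p", f"{host_port}:10086",                       # port mapping
        "ajet:latest",
        "bash", "-c", inner,
    ]
    return cmd


# Print a copy-pasteable command line instead of running it.
print(shlex.join(build_docker_command(
    "/path/to/host/Qwen/Qwen2.5-7B-Instruct",
    "/Qwen/Qwen2.5-7B-Instruct",
)))
```

Returning a list rather than a single string avoids quoting mistakes if you later pass the result to `subprocess.run`.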
| 51 | + |
| 52 | +<br/> |
| 53 | +<br/> |
| 54 | +<br/> |
| 55 | + |
| 56 | + |
| 57 | +### The Two Processes Inside `bash -c` |
| 58 | + |
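| 59 | +The command runs two processes concurrently: `&` sends the first one to the background of the shell, while the second runs in the shell's foreground until the server exits: |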
| 60 | + |
| 61 | +``` |
| 62 | +(ajet-swarm overwatch) |
| 63 | +& |
| 64 | +(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log) |
| 65 | +``` |
| 66 | + |
| 67 | +| Part | What it does | |
| 68 | +|---|---| |
| 69 | +| `ajet-swarm overwatch` | Starts the **real-time TUI monitor**, which renders in your terminal. Displays the current server state (OFFLINE / BOOTING / ROLLING / WEIGHT_SYNCING), active episodes, and rollout statistics. | |
| 70 | +| `ajet-swarm start` | Starts the **Swarm Server** itself: initializes the VERL training loop, the vLLM inference engine, and the FastAPI HTTP server on port `10086`. | |
| 71 | +| `NO_COLOR=1 LOGURU_COLORIZE=NO` | Disables ANSI color codes in the server log so the log file `swarm_server.log` is readable as plain text. | |
| 72 | +| `&>/workspace/log/swarm_server.log` | Redirects both stdout and stderr of the server process to the log file (which is persisted to your host machine via the volume mount). | |
| 73 | + |
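
For readers less familiar with shell redirection, the effect of the environment variables and `&>` can be reproduced in Python. This is a minimal sketch, not AgentJet code: `run_logged` is a hypothetical helper, and a short Python one-liner stands in for `ajet-swarm start`.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(argv, log_path):
    """Run argv with color disabled, sending stdout and stderr to log_path."""
    # Equivalent of prefixing the command with NO_COLOR=1 LOGURU_COLORIZE=NO.
    env = dict(os.environ, NO_COLOR="1", LOGURU_COLORIZE="NO")
    with open(log_path, "wb") as log:
        # stderr=STDOUT merges both streams into the file, like `&>` in bash.
        result = subprocess.run(argv, env=env, stdout=log,
                                stderr=subprocess.STDOUT)
    return result.returncode


# Stand-in child process that writes to both streams.
child = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]
log_file = os.path.join(tempfile.mkdtemp(), "swarm_server.log")
exit_code = run_logged(child, log_file)
with open(log_file, encoding="utf-8") as f:
    print(exit_code, f.read())
```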
| 74 | +<br/> |
| 75 | +<br/> |
| 76 | +<br/> |
| 77 | + |
| 78 | +## Concrete Example |
| 79 | + |
| 80 | +The following example assumes the model was downloaded to the host directory `/root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct` |
| 81 | +and mounts it at the container directory `/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct`: |
| 82 | + |
| 83 | +```bash |
| 84 | +docker run --rm -it \ |
| 85 | + -v /root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct:/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct \ |
| 86 | + -v ./swarmlog:/workspace/log \ |
| 87 | + -p 10086:10086 \ |
| 88 | + ajet:latest \ |
| 89 | + bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)" |
| 90 | +``` |
| 91 | + |
| 92 | +Make sure the container-side path matches whatever `model` path you specify in your `AgentJetJob`. |
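
A quick way to catch a mismatch before launching is to compare the two strings directly. The helpers below are a hypothetical sketch (`container_side` and `model_path_matches` are not AgentJet functions) and assume Linux-style paths without colons in them.

```python
import posixpath


def container_side(mount_spec):
    """Return the container path from a `-v host:container[:options]` value."""
    parts = mount_spec.split(":")
    if len(parts) < 2:
        raise ValueError(f"not a host:container mount: {mount_spec!r}")
    return parts[1]


def model_path_matches(mount_spec, model):
    """True if `model` equals the container side of the mount (normalized)."""
    return posixpath.normpath(container_side(mount_spec)) == posixpath.normpath(model)


mount = ("/root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct:"
         "/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct")
model = "/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct"
print(model_path_matches(mount, model))
```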
| 93 | + |
| 94 | + |
| 95 | +## What Happens After Launch |
| 96 | + |
| 97 | + |
| 98 | +<div align="center"> |
| 99 | +<img width="600" alt="image" src="https://serve.gptacademic.cn/publish/shared/Image/swarm-server.gif"/> |
| 100 | +</div> |
| 101 | + |
| 102 | +Once the container starts, you will see the `ajet-swarm overwatch` TUI in your terminal. The server begins in the **OFFLINE** state and then transitions through: |
| 103 | + |
| 104 | +``` |
| 105 | +OFFLINE → BOOTING → ROLLING → WEIGHT_SYNCING → ROLLING → ... |
| 106 | +``` |
| 107 | + |
| 108 | +The server only moves to **BOOTING** after a Swarm Client sends it a training configuration and calls `start_engine()`. Until then it waits safely in **OFFLINE**. |
| 109 | + |
| 110 | +Meanwhile, all VERL and training logs stream into `./swarmlog/swarm_server.log` on your host machine. |
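
Because the TUI occupies the terminal, it can be handy to inspect the log from a second shell on the host. The following is a minimal sketch using only the standard library; `tail` is a hypothetical helper name, and the path assumes the `./swarmlog` mount used in this guide.

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file without the trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the most recent n lines while reading.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


log_file = Path("./swarmlog/swarm_server.log")
if log_file.exists():
    for line in tail(log_file):
        print(line)
else:
    print(f"{log_file} not found; has the container been started from this directory?")
```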
| 111 | + |
| 112 | + |
| 113 | +## Connecting a Swarm Client |
| 114 | + |
| 115 | +From any machine (no GPU required) that can reach the server on port `10086`, run your Swarm Client: |
| 116 | + |
| 117 | +```python |
| 118 | +from ajet.tuner_lib.experimental.as_swarm_client import SwarmClient |
| 119 | +from ajet.copilot.job import AgentJetJob |
| 120 | + |
| 121 | +swarm_worker = SwarmClient("http://<server-ip>:10086") |
| 122 | +swarm_worker.auto_sync_train_config_and_start_engine( |
| 123 | + AgentJetJob( |
| 124 | + algorithm="grpo", |
| 125 | + n_gpu=8, |
| 126 | + model="/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct", |
| 127 | + batch_size=32, |
| 128 | + num_repeat=4, |
| 129 | + ) |
| 130 | +) |
| 131 | +``` |
| 132 | + |
| 133 | +> The `model` path here must be the **container-side** path (right-hand side of the `-v` mount), not the host path. |
| 134 | + |
| 135 | +See [Swarm Best Practices](./swarm_best_practice.md) for full client examples. |
| 136 | + |
| 137 | + |
| 138 | +## Troubleshooting |
| 139 | + |
| 140 | +| Symptom | Likely Cause | Fix | |
| 141 | +|---|---|---| |
| 142 | +| Server stays **OFFLINE** forever | No client has called `start_engine()` | Run your Swarm Client script to send the training config | |
| 143 | +| `Model not found` error in log | Container-side model path is wrong | Verify the right-hand side of your `-v` flag matches the `model` field in `AgentJetJob` | |
| 144 | +| Client cannot connect to port `10086` | Firewall or wrong IP | Check server firewall rules; use `ajet-swarm overwatch --swarm-url=http://<ip>:10086` to test connectivity | |
| 145 | +| Log file is empty | `./swarmlog` directory doesn't exist on host | Create it first: `mkdir -p ./swarmlog` | |
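
For the connectivity row above, a plain TCP check from the client machine can help separate network problems from server problems. This sketch uses only Python's standard `socket` module; `can_connect` is a hypothetical helper, and a successful connection only shows that the port is reachable, not that the Swarm Server is healthy.

```python
import socket


def can_connect(host, port=10086, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# Replace 127.0.0.1 with your server's IP address.
print(can_connect("127.0.0.1"))
```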