Commit 3e57540

docs: Add guide for starting AgentJet Swarm Server via Docker
1 parent b81f715 commit 3e57540

1 file changed: +145 -0 lines changed

docs/en/ajet-swarm-docker.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Start AgentJet Swarm Server via Docker

This guide explains how to launch the **AgentJet Swarm Server** inside a Docker container. The Swarm Server is the GPU-side component responsible for gradient computation and weight updates. It exposes an OpenAI-compatible API that Swarm Clients connect to for training.

> **Not familiar with Swarm?** Read the [Swarm Introduction](./swarm_intro.md) first.


## Prerequisites

| Requirement | Detail |
|---|---|
| Docker | With GPU support (`nvidia-container-toolkit`) |
| AgentJet Docker image | `ajet:latest` (built from the AgentJet repository) |
| LLM model weights | Downloaded locally (e.g., `Qwen2.5-7B-Instruct`) |
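
The prerequisites above can be checked before launching anything. The following is a minimal sketch, not part of AgentJet: the function name `preflight` and its checks are illustrative assumptions, and it only verifies that a `docker` executable is on `PATH` and that the model directory exists and is non-empty. It does not verify GPU support or that the `ajet:latest` image has been built.

```python
import shutil
from pathlib import Path
from typing import List


def preflight(model_dir: str, docker_binary: str = "docker") -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found.

    Hypothetical helper for illustration only; it is not an AgentJet API.
    """
    problems = []

    # Is the Docker CLI installed and on PATH?
    if shutil.which(docker_binary) is None:
        problems.append(f"'{docker_binary}' was not found on PATH")

    # Do the model weights exist locally?
    model_path = Path(model_dir)
    if not model_path.is_dir():
        problems.append(f"model directory does not exist: {model_path}")
    elif not any(model_path.iterdir()):
        problems.append(f"model directory is empty: {model_path}")

    return problems
```

Calling `preflight("/path/to/host/Qwen/Qwen2.5-7B-Instruct")` and printing each returned string gives a quick checklist before running `docker run`.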
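

## Command Template

Run the command below, replacing `/path/to/host/Qwen/Qwen2.5-7B-Instruct` with the location of the model weights on your host. The template does not pass a `--gpus` flag; if the NVIDIA runtime is not Docker's default runtime on your machine, you will likely need to add one (for example, `--gpus all`).
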
```bash
docker run --rm -it \
  -v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct \
  -v ./swarmlog:/workspace/log \
  -p 10086:10086 \
  ajet:latest \
  bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)"
```

When startup completes, you will see an interface like the one below, which indicates that the deployment succeeded:

<div align="center">
<img width="640" alt="image" src="https://serve.gptacademic.cn/publish/shared/Image/swarm_overwatch.jpg"/>
</div>

<br/>
<br/>
<br/>


| Flag / Argument | What it does |
|---|---|
| `--rm` | Automatically removes the container when it exits, so stopped containers do not accumulate. |
| `-it` | Allocates an interactive TTY. Required for the `ajet-swarm overwatch` TUI monitor to render correctly inside the container. |
| `-v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct` | **Model mount**: mounts your local model weights directory into the container. The path inside the container must match the `model` field you configure in your training job. |
| `-v ./swarmlog:/workspace/log` | **Log mount**: mounts a local `./swarmlog` directory to persist server logs outside the container. The VERL training log is written here. Relative host paths in `-v` need a recent Docker CLI (23.0 or later, to our knowledge); on older versions, use an absolute path such as `"$(pwd)/swarmlog"`. |
| `-p 10086:10086` | **Port mapping**: exposes port `10086` so that Swarm Clients on other machines can reach the server via `http://<server-ip>:10086`. |
| `ajet:latest` | The AgentJet Docker image. |
| `bash -c "..."` | Runs two processes concurrently inside the container (see below). |
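
When the launch needs to be scripted, the flags in the table can be assembled programmatically. The sketch below builds the same argument list as the command template; `build_docker_command` and its parameter names are illustrative assumptions, not part of AgentJet. It resolves the log directory to an absolute path and keeps the container-side port fixed at `10086`, which is the port this guide says the server listens on.

```python
import os
import shlex
from typing import List

CONTAINER_PORT = 10086               # port the Swarm Server listens on, per this guide
CONTAINER_LOG_DIR = "/workspace/log"  # log directory used in the command template


def build_docker_command(
    host_model_dir: str,
    container_model_dir: str,
    host_log_dir: str = "./swarmlog",
    host_port: int = CONTAINER_PORT,
    image: str = "ajet:latest",
) -> List[str]:
    """Assemble the `docker run` argument list from the command template.

    Hypothetical helper for illustration only.
    """
    # The two processes started inside the container, unchanged from the guide.
    inner = (
        "(ajet-swarm overwatch) & "
        "(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start "
        f"&>{CONTAINER_LOG_DIR}/swarm_server.log)"
    )
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{host_model_dir}:{container_model_dir}",
        # An absolute host path avoids depending on relative-path support in `-v`.
        "-v", f"{os.path.abspath(host_log_dir)}:{CONTAINER_LOG_DIR}",
        "-p", f"{host_port}:{CONTAINER_PORT}",
        image,
        "bash", "-c", inner,
    ]


if __name__ == "__main__":
    argv = build_docker_command(
        "/path/to/host/Qwen/Qwen2.5-7B-Instruct",
        "/Qwen/Qwen2.5-7B-Instruct",
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would start the container; printing it first makes it easy to review the exact command.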
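

<br/>
<br/>
<br/>


### The Two Processes Inside `bash -c`

The command runs two processes concurrently. The `&` sends `ajet-swarm overwatch` to the shell's background, although it still draws to the container's terminal, while `ajet-swarm start` runs in the shell's foreground with its output redirected to a log file:
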
```
(ajet-swarm overwatch)
&
(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)
```

| Component | What it does |
|---|---|
| `ajet-swarm overwatch` | Starts the **real-time TUI monitor**, which occupies your terminal. Displays the current server state (OFFLINE / BOOTING / ROLLING / WEIGHT_SYNCING), active episodes, and rollout statistics. |
| `ajet-swarm start` | Starts the **Swarm Server** itself: initializes the VERL training loop, the vLLM inference engine, and the FastAPI HTTP server on port `10086`. |
| `NO_COLOR=1 LOGURU_COLORIZE=NO` | Disables ANSI color codes in the server log so the log file `swarm_server.log` is readable as plain text. |
| `&>/workspace/log/swarm_server.log` | Redirects both stdout and stderr of the server process to the log file (which is persisted to your host machine via the volume mount). |
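
For readers less familiar with shell syntax, the same pattern can be expressed in Python. This is a sketch of the shell semantics only, using stand-in commands; `run_like_the_container` is a hypothetical name, and nothing here calls AgentJet. `subprocess.Popen` plays the role of `&` (start and return immediately), and passing one file for both `stdout` and `stderr` plays the role of `&>`.

```python
import subprocess
import sys
from typing import Sequence


def run_like_the_container(
    monitor_cmd: Sequence[str],
    server_cmd: Sequence[str],
    log_path: str,
) -> int:
    """Mimic `(monitor_cmd) & (server_cmd &> log_path)` and return the server's exit code."""
    # `&` in the shell: start the monitor and continue without waiting for it.
    monitor = subprocess.Popen(monitor_cmd)
    try:
        # `&>file`: stdout and stderr of the server share one log file.
        with open(log_path, "wb") as log_file:
            server = subprocess.run(
                server_cmd, stdout=log_file, stderr=subprocess.STDOUT
            )
    finally:
        # In the container, the shell exits once the foreground command ends,
        # and the container stops with it. terminate() stands in for that here.
        monitor.terminate()
        monitor.wait()
    return server.returncode


if __name__ == "__main__":
    # Stand-in commands; the real ones would be `ajet-swarm overwatch` and `ajet-swarm start`.
    exit_code = run_like_the_container(
        [sys.executable, "-c", "print('monitor running')"],
        [sys.executable, "-c", "print('server output goes to the log')"],
        "demo_server.log",
    )
    print("server exited with", exit_code)
```

One consequence worth noting: the lifetime of the whole container follows the foreground command, so if the server process exits, the monitor goes away too.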

<br/>
<br/>
<br/>

## Concrete Example

The following example assumes the model was downloaded to the host directory `/root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct` and mounts it at the container directory `/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct`:

```bash
docker run --rm -it \
  -v /root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct:/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct \
  -v ./swarmlog:/workspace/log \
  -p 10086:10086 \
  ajet:latest \
  bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)"
```

Make sure the container-side path matches whatever `model` path you specify in your `AgentJetJob`.
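
Mixing up the host-side and container-side paths is easy, so it can help to check the `model` value against the `-v` arguments before launching. The helper below is a hypothetical sketch, not an AgentJet function. It assumes POSIX-style paths and `host:container[:options]` volume strings, and it only compares paths textually; it does not inspect a running container.

```python
from pathlib import PurePosixPath
from typing import Iterable


def is_inside_a_mount(model_path: str, volume_specs: Iterable[str]) -> bool:
    """Return True if model_path equals, or lies under, the container side of any -v spec."""
    model = PurePosixPath(model_path)
    for spec in volume_specs:
        parts = spec.split(":")
        if len(parts) < 2:
            continue  # not a host:container bind mount
        container_dir = PurePosixPath(parts[1])
        if container_dir == model or container_dir in model.parents:
            return True
    return False


if __name__ == "__main__":
    volumes = [
        "/root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct:"
        "/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct",
        "./swarmlog:/workspace/log",
    ]
    model = "/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct"
    print(is_inside_a_mount(model, volumes))
```

Passing the host path instead of the container path makes the check fail, which is the mistake this example is meant to catch.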


## What Happens After Launch


<div align="center">
<img width="600" alt="image" src="https://serve.gptacademic.cn/publish/shared/Image/swarm-server.gif"/>
</div>

Once the container starts, you will see the `ajet-swarm overwatch` TUI in your terminal. The server begins in the **OFFLINE** state and transitions through:

```
OFFLINE → BOOTING → ROLLING → WEIGHT_SYNCING → ROLLING → ...
```

The server only moves to **BOOTING** after a Swarm Client sends it a training configuration and calls `start_engine()`. Until then, it waits in **OFFLINE**.
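
The state sequence above can be written down as a small transition table, which is handy when checking a list of states observed in the monitor. This sketch encodes only the transitions listed in this guide; the real server may well allow others (for example, returning to OFFLINE when stopped), and the names `ServerState`, `EXPECTED_NEXT`, and `follows_documented_order` are illustrative, not AgentJet identifiers.

```python
from enum import Enum
from typing import Sequence


class ServerState(Enum):
    OFFLINE = "OFFLINE"
    BOOTING = "BOOTING"
    ROLLING = "ROLLING"
    WEIGHT_SYNCING = "WEIGHT_SYNCING"


# Only the transitions named in this guide; an assumption, not the server's full state machine.
EXPECTED_NEXT = {
    ServerState.OFFLINE: {ServerState.BOOTING},
    ServerState.BOOTING: {ServerState.ROLLING},
    ServerState.ROLLING: {ServerState.WEIGHT_SYNCING},
    ServerState.WEIGHT_SYNCING: {ServerState.ROLLING},
}


def follows_documented_order(states: Sequence[ServerState]) -> bool:
    """Return True if every consecutive pair of states is a documented transition."""
    return all(nxt in EXPECTED_NEXT[cur] for cur, nxt in zip(states, states[1:]))
```

For instance, `OFFLINE, BOOTING, ROLLING, WEIGHT_SYNCING, ROLLING` passes the check, while a jump straight from `OFFLINE` to `ROLLING` does not.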

Meanwhile, all VERL and training logs stream into `./swarmlog/swarm_server.log` on your host machine.
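
Because the TUI occupies the terminal, the log file is the place to watch training output. Running `tail -f` on it works; the sketch below does roughly the same from Python in case the lines need filtering or forwarding. The function name `follow` and the `max_idle_polls` parameter are illustrative assumptions; the latter exists mainly so the loop can end in scripts and tests.

```python
import time
from typing import Iterator, Optional


def follow(
    path: str,
    poll_interval: float = 0.5,
    max_idle_polls: Optional[int] = None,
) -> Iterator[str]:
    """Yield lines from a log file as they are appended, similar to `tail -f`.

    Stops after max_idle_polls consecutive polls without new data;
    with None it keeps waiting indefinitely.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as log_file:
        idle = 0
        while max_idle_polls is None or idle < max_idle_polls:
            line = log_file.readline()
            if line:
                idle = 0
                yield line.rstrip("\n")
            else:
                # Nothing new yet; wait before polling again.
                idle += 1
                time.sleep(poll_interval)
```

A typical use is `for line in follow("./swarmlog/swarm_server.log"): print(line)`. A line that is still being written may be yielded in two pieces, which is usually acceptable for monitoring.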


## Connecting a Swarm Client

From any machine (no GPU required) that can reach the server on port `10086`, run your Swarm Client:

```python
from ajet.tuner_lib.experimental.as_swarm_client import SwarmClient
from ajet.copilot.job import AgentJetJob

# Replace <server-ip> with the address of the machine running the container.
swarm_worker = SwarmClient("http://<server-ip>:10086")

# Send the training configuration to the server and start its engine.
swarm_worker.auto_sync_train_config_and_start_engine(
    AgentJetJob(
        algorithm="grpo",
        n_gpu=8,
        # Container-side model path, not the host path.
        model="/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct",
        batch_size=32,
        num_repeat=4,
    )
)
```

133+
> The `model` path here must be the **container-side** path (right-hand side of the `-v` mount), not the host path.
134+
135+
See [Swarm Best Practices](./swarm_best_practice.md) for full client examples.
136+
137+
138+
## Troubleshooting
139+
140+
| Symptom | Likely Cause | Fix |
141+
|---|---|---|
142+
| Server stays **OFFLINE** forever | No client has called `start_engine()` | Run your Swarm Client script to send the training config |
143+
| `Model not found` error in log | Container-side model path is wrong | Verify the right-hand side of your `-v` flag matches the `model` field in `AgentJetJob` |
144+
| Client cannot connect to port `10086` | Firewall or wrong IP | Check server firewall rules; use `ajet-swarm overwatch --swarm-url=http://<ip>:10086` to test connectivity |
145+
| Log file is empty | `./swarmlog` directory doesn't exist on host | Create it first: `mkdir -p ./swarmlog` |
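
For the "cannot connect" row, a plain TCP check from the client machine can separate network problems from application problems before involving AgentJet at all. This is a generic sketch using Python's standard `socket` module; `can_reach` is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that the Swarm Server is healthy.

```python
import socket


def can_reach(host: str, port: int = 10086, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

If `can_reach("<server-ip>")` returns `False` from the client machine but `True` on the server itself, the firewall or the `-p` port mapping is the likely culprit.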
