docs/advanced_features/remote_training.md
14 additions & 0 deletions
@@ -50,6 +50,19 @@ Request processing flow:
- Retrieves model configuration (hidden_size, vocab_size, etc.) via POST `/setup`
- Each training step sends a request via POST `/generate` and receives results via NCCL recv
- Supports TP>1 training: only rank 0 sends requests, results are broadcast to other ranks
- After the first successful connection, a background heartbeat is started; on `close()`, a best-effort `/disconnect` is sent
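
The client-side half of this flow (start a heartbeat once the first connection succeeds, then send a best-effort `/disconnect` on `close()`) could be sketched roughly as follows. This is an illustration only, not the project's implementation: `HeartbeatClient`, `mark_connected`, the injected `post` callable, and the `interval_s` parameter are hypothetical names, and sending `/disconnect` only after a connection was established is an assumption of the sketch.

```python
import threading
from typing import Callable, Optional


class HeartbeatClient:
    """Hypothetical wrapper; `post(path)` is any callable that sends POST <path>."""

    def __init__(self, post: Callable[[str], None], interval_s: float = 15.0):
        self._post = post
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self._lock = threading.Lock()

    def mark_connected(self) -> None:
        """Call after the first successful request; starts the heartbeat once."""
        with self._lock:
            if self._thread is not None or self._stop.is_set():
                return
            self._thread = threading.Thread(target=self._beat, daemon=True)
            self._thread.start()

    def _beat(self) -> None:
        # Event.wait() returns True as soon as close() sets the stop flag,
        # so the loop ends without waiting out a full interval.
        while not self._stop.wait(self._interval_s):
            try:
                self._post("/heartbeat")
            except Exception:
                pass  # a missed beat is tolerated; the server timeout covers it

    def close(self) -> None:
        """Stop the heartbeat and send a best-effort /disconnect (idempotent)."""
        if self._stop.is_set():
            return
        self._stop.set()
        with self._lock:
            thread = self._thread
        if thread is None:
            return  # assumption: nothing to disconnect if we never connected
        thread.join(timeout=1.0)
        try:
            self._post("/disconnect")
        except Exception:
            pass  # best effort: the server may already be gone
```

In practice `post` would wrap whatever HTTP client the training code already uses; injecting it keeps the sketch independent of any particular library and easy to exercise without a server.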

### Client Lifecycle and Automatic Exit

The target server tracks client activity and shuts itself down once the client is gone, so that no server process is left holding GPUs after training completes:

- After the client's first successful request or successful NCCL initialization, a background heartbeat thread is started, sending POST `/heartbeat` every 15 seconds by default
- When the client exits normally, it sends a best-effort POST `/disconnect`; upon receiving it, the server immediately triggers shutdown
- When the client exits abnormally, the server watchdog triggers shutdown after `--client-heartbeat-timeout` is exceeded (default 60 seconds)
- The server only counts actual client API calls as active requests; `GET /health` and unrelated POSTs do not renew the watchdog timer
- `--client-heartbeat-timeout 0` disables the server-side timeout watchdog, but `/disconnect` will still trigger automatic shutdown

Because NCCL transport does not support safely disconnecting and reconnecting within the same server process, treat each target server process as a resource for a single training session. It exits automatically after training completes or the client disconnects, and a new instance is started for the next training run.
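
The server-side rules above (renew on client activity, shut down on timeout, shut down immediately on `/disconnect`, and treat a timeout of `0` as "watchdog disabled") can be sketched as a small state machine. This is a minimal sketch, not the server's actual code: `ClientWatchdog`, `touch`, `check`, the `shutdown` callback, and the injectable `clock` are hypothetical names, and arming the timer only after the first client activity is an assumption.

```python
import threading
import time
from typing import Callable, Optional


class ClientWatchdog:
    """Hypothetical server-side activity tracker (illustrative only)."""

    def __init__(self, timeout_s: float, shutdown: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s  # 0 disables the timeout check
        self._shutdown = shutdown
        self._clock = clock
        self._last_seen: Optional[float] = None  # None until a client appears
        self._fired = False
        self._lock = threading.Lock()

    def touch(self) -> None:
        """Record client activity: a client API call or a /heartbeat."""
        with self._lock:
            self._last_seen = self._clock()

    def disconnect(self) -> None:
        """Handle /disconnect: shut down at once, even if the timeout is 0."""
        self._fire()

    def check(self) -> bool:
        """Call periodically; returns True once shutdown has been triggered."""
        with self._lock:
            expired = (
                self._timeout_s > 0
                and self._last_seen is not None
                and self._clock() - self._last_seen > self._timeout_s
            )
        if expired:
            self._fire()
        return self._fired

    def _fire(self) -> None:
        # Make sure the shutdown callback runs at most once.
        with self._lock:
            if self._fired:
                return
            self._fired = True
        self._shutdown()
```

Under this sketch, the handlers for client API calls and `/heartbeat` would call `touch()`, the `/disconnect` handler would call `disconnect()`, and a handler such as `GET /health` would call neither, so health probes cannot keep an abandoned server alive.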

### NCCL Transport
@@ -198,6 +211,7 @@ export NCCL_IB_GID_INDEX=3 # RoCE GID index
|`SPECFORGE_TOPK`|`0`| Server-side target_p top-k compression (`0` = full distribution) |