Commit d11f808
[feat] Add retransmission mechanism for MooncakeStoreClient (#94)
## Background
When using `MooncakeStore` as TQ's backend, we observe occasional
transmission errors during verl e2e runs:
```
E0508 17:18:06.011560 731271 tcp_transport.cpp:708] TcpTransport::getConnection failed to create connection to 61.28.30.25:16181. Error: connect: Connection timed out
E0508 17:18:06.011600 731277 tcp_transport.cpp:886] TcpTransport::startTransfer failed to get connection to 61.28.30.25:15816
E0508 17:18:06.011888 731271 transfer_task.cpp:281] Batch 281200032997056 completed with task failures: task_ids=[0]
E0508 17:18:06.011895 731271 client_service.cpp:1100] Transfer failed for key: 68108@uid with error: -800
E0508 17:18:06.011996 731271 real_client.cpp:2253] BatchGet failed for key '68108@uid': TRANSFER_FAIL
```
These `Connection timed out` / `TRANSFER_FAIL` (`error: -800`) errors
are **transient network issues** that typically resolve on a subsequent
attempt. However, the previous client implementation had no retry logic
whatsoever:
- On the **tensor path**, any single key returning a negative status
code would trigger an immediate `RuntimeError`, failing the entire batch
and crashing the training job.
- On the **bytes path**, the failure was worse: `get_batch` returns
`b""` for keys that encountered a transfer failure, and the client
decoded each result with `pickle.loads(...) if result != b"" else None`,
so a failed transfer was indistinguishable from a legitimately stored
`None`.
**This leads to silent content corruption.** A training worker could
proceed with corrupted or missing data without ever knowing that a
transmission failure occurred, compromising model correctness.
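The ambiguity can be reproduced in isolation. The sketch below is illustrative only: `decode_old` and `decode_checked` are hypothetical helper names, not functions from the client, and the only behaviour taken from the description above is that a failed transfer surfaces as `b""`.

```python
import pickle


def decode_old(result: bytes):
    # Old pattern: an empty payload is silently mapped to None.
    return pickle.loads(result) if result != b"" else None


def decode_checked(result: bytes):
    # A pickled value is never empty (even None pickles to a few bytes),
    # so b"" can be treated as a failure marker rather than a value.
    if result == b"":
        raise RuntimeError("empty payload: transfer likely failed")
    return pickle.loads(result)


stored_none = pickle.dumps(None)  # a legitimately stored None
failed = b""                      # what a failed transfer looks like

# Under the old pattern both cases collapse to the same value.
assert decode_old(stored_none) is None
assert decode_old(failed) is None

# The checked variant still decodes a real None correctly.
assert decode_checked(stored_none) is None
```

Because `pickle.dumps(None)` is non-empty, `b""` is usable as a failure heuristic without colliding with stored `None` values.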
This PR addresses these failure modes on both the **read (`get`) and
write (`put`) paths** by adding controlled retries that isolate failed
keys and attempt retransmission before giving up.
## Summary
This PR introduces a retry mechanism for transient failures in
`MooncakeStoreClient`, covering both **read** (`get`) and **write**
(`put`) operations, for both tensor and non-tensor data paths.
Previously, the client had **zero tolerance for transient errors**:
- **Tensor read** (`_get_tensors_thread_worker`): a single key failure
(`ret < 0`) would immediately raise `RuntimeError`, causing the entire
batch to fail.
- **Non-tensor read** (`_get_bytes_thread_worker`): no failure detection
at all. Empty byte strings (`b""`), which MooncakeStore returns on
transmission failures, were silently deserialized as `None`, making it
impossible for callers to distinguish between "value is None" and
"transfer failed".
- **Tensor write** (`_put_tensors_thread_worker`): any single key
returning a non-zero status would immediately abort the entire batch
with `RuntimeError`.
- **Non-tensor write** (`_put_bytes_thread_worker`): a single
`upsert_batch` failure would immediately abort the batch with
`RuntimeError`.
This change adds **up to 3 retries with 1-second backoff** across all
four paths. For paths that expose per-key status codes, only the failed
subset of keys is retried on each attempt.
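A minimal sketch of the selective-retry idea, assuming a batch operation that returns one integer status per key. `batch_with_retry`, `batch_op`, `is_failure`, and `flaky_get` are hypothetical names used for illustration; this is not the code in the PR, and it tracks only status codes, not the tensor buffers themselves.

```python
import time
from typing import Callable, List, Sequence

MAX_RETRIES = 3      # "up to 3 retries" from the description
RETRY_DELAY_S = 1.0  # "1-second backoff" from the description


def batch_with_retry(
    keys: Sequence[str],
    batch_op: Callable[[List[str]], List[int]],
    is_failure: Callable[[int], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run batch_op over keys, re-submitting only the keys that failed."""
    pending = list(keys)
    # One initial attempt plus up to MAX_RETRIES retries.
    for attempt in range(MAX_RETRIES + 1):
        statuses = batch_op(pending)
        # Keep only the keys whose status still indicates failure.
        pending = [k for k, s in zip(pending, statuses) if is_failure(s)]
        if not pending:
            return
        if attempt < MAX_RETRIES:
            sleep(RETRY_DELAY_S)
    raise RuntimeError(
        f"{len(pending)} key(s) still failing after {MAX_RETRIES} retries: {pending}"
    )


# Usage: a fake read where key "b" fails with -800 on the first call only.
calls = []


def flaky_get(batch):
    calls.append(list(batch))
    return [-800 if (k == "b" and len(calls) == 1) else 0 for k in batch]


batch_with_retry(
    ["a", "b", "c"], flaky_get, is_failure=lambda s: s < 0, sleep=lambda _: None
)
assert calls == [["a", "b", "c"], ["b"]]  # only "b" was retried
```

The failure predicate is a parameter because the description uses different conditions per path: `ret < 0` for tensor reads and a non-zero status for tensor writes.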
## Future Work
- Replace the `b""` heuristic in `_get_bytes_thread_worker` with proper
per-key error codes once MooncakeStore exposes them, then upgrade the
exhausted-retry path from `logger.error` to `raise RuntimeError`.
- When MooncakeStore supports **per-key status codes for `upsert_batch`
and `get_batch`**, switch the bytes write/read paths from whole-batch
retry to per-key selective retry, matching the tensor-path behaviour.
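For context, the whole-batch behaviour that these items would replace might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: `get_bytes_with_retry` and `flaky_get_batch` are hypothetical names, `get_batch` is modelled as a plain callable returning one `bytes` value per key, and returning `None` for keys that never succeed is assumed from the `logger.error` behaviour mentioned above.

```python
import logging
import pickle
import time

logger = logging.getLogger(__name__)


def get_bytes_with_retry(keys, get_batch, max_retries=3, delay_s=1.0, sleep=time.sleep):
    """Fetch and unpickle values, refetching the whole batch while any payload is b""."""
    keys = list(keys)
    results = get_batch(keys)
    retries = 0
    while any(r == b"" for r in results) and retries < max_retries:
        sleep(delay_s)
        results = get_batch(keys)  # whole batch is refetched, not just the failed keys
        retries += 1
    failed = [k for k, r in zip(keys, results) if r == b""]
    if failed:
        # Assumed current behaviour: log only. With real per-key error codes,
        # this is where a RuntimeError would be raised instead.
        logger.error("empty payload for keys %s after %d retries", failed, retries)
    return [pickle.loads(r) if r != b"" else None for r in results]


# Usage: the second key comes back empty on the first call only.
attempts = {"n": 0}


def flaky_get_batch(keys):
    attempts["n"] += 1
    if attempts["n"] == 1:
        return [pickle.dumps(1), b""]
    return [pickle.dumps(1), pickle.dumps("ok")]


values = get_bytes_with_retry(["k1", "k2"], flaky_get_batch, sleep=lambda _: None)
assert values == [1, "ok"]
```

With per-key status codes, the `b""` comparison would be replaced by a status check and the refetch would be restricted to the failed keys.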
---------
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
1 parent 270ea73, commit d11f808 (#94)
1 file changed
Lines changed: 188 additions & 21 deletions