Commit cde575c
[fix,feat] Support
## New Features
- Added `MooncakeStore` Configurations: Introduced related configuration
options for `MooncakeStore` in `config.py`.
- Easy Initialization: Implemented support for `tq.init()` when using
the MooncakeStore backend.
- E2E CI Coverage: Added end-to-end continuous integration tests
specifically for the MooncakeStore backend.
## Bug Fixes
- **`KVStorageManager` Check**: Removed an outdated validation check in
`KVStorageManager` that previously caused issues during put operations.
- **Metadata Update Tracking**: Fixed a metadata update issue in
`TransferQueueController`. Now, when a field transforms between a normal
tensor and a nested tensor, the system correctly recomputes and updates
the `per_sample_shape`, `is_nested`, and `shape` information.
- **ZMQ Related**: Set `recv_multipart(copy=False)` by default.
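
The metadata fix above can be pictured with a small sketch. This is not the `TransferQueueController` code: `FieldMeta` and `recompute_field_meta` are hypothetical names, shapes are plain tuples rather than `torch.Size`, and the rule "a field is nested when its per-sample shapes differ" is an assumption made only for illustration. The point is that all three attributes are rebuilt together, so none of them can go stale when a field switches layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldMeta:
    """Hypothetical stand-in for the per-field metadata tracked by the controller."""
    per_sample_shape: list[tuple[int, ...]]
    is_nested: bool
    shape: Optional[tuple[int, ...]]  # batch-level shape; None when samples are ragged


def recompute_field_meta(sample_shapes: list[tuple[int, ...]]) -> FieldMeta:
    """Rebuild all three attributes together from the current per-sample shapes."""
    if not sample_shapes:
        raise ValueError("at least one sample shape is required")
    # Assumption: differing per-sample shapes mean the field is a nested tensor.
    is_nested = len(set(sample_shapes)) > 1
    shape = None if is_nested else (len(sample_shapes), *sample_shapes[0])
    return FieldMeta(list(sample_shapes), is_nested, shape)


# A field first written as a regular (batch, 4) tensor ...
meta = recompute_field_meta([(4,), (4,), (4,)])
# ... and later overwritten with ragged samples gets fresh metadata.
meta_after = recompute_field_meta([(4,), (2,), (7,)])
```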
## Known Issues
- **Graceful Shutdown Limitations**: We cannot gracefully shut down
`mooncake_master` because the distributed `TransferQueueClient` instances
holding `MooncakeDistributedStore()` would raise heartbeat errors. As a
workaround, we launch `mooncake_master` when `auto_init` is set to `true`
but skip shutting it down. To limit side effects, we call `remove_all()`
to delete all keys in `mooncake_master`.
- **Uneven BatchMeta Fields**: `TransferQueueController` currently
cannot handle non-uniform `BatchMeta` instances whose samples do not all
have the same fields. This prevents key-value-based backends from
accurately clearing all keys. In `MooncakeStore`, we are temporarily
using `remove_by_regex` to mitigate this issue.
- **1D Tensor Handling**: When a user inputs a 1D tensor, previous
refactoring populated an empty `torch.Size([])` which could mislead
key-value-based backends during zero-copy operations. Since these
backends must perform fine-grained splits on the input TensorDict,
distinguishing between 1D and 2D input tensors is difficult. We have now
added a warning for this type of input and manually populate the shape
with `torch.Size([1])`.
```python3
# in AsyncTransferQueueClient.async_put()
for field_name, field_data in data.items():
    if isinstance(field_data, torch.Tensor) and field_data.ndim == 1:
        logger.warning(
            f"[{self.client_id}]: Data field '{field_name}' is a tensor with only one dimension. "
            "You may receive 2D tensors in key-value based backend."
        )
```
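
The `remove_by_regex` mitigation for uneven fields can be illustrated without Mooncake. In the sketch below a plain dict stands in for the key-value store. The `"<global_index>@<field_name>"` key layout, the helper names, and the sample data are all assumptions made for illustration; they are not the real MooncakeStore key format or API.

```python
import re

# A dict stands in for the key-value store; keys use a hypothetical
# "<global_index>@<field_name>" layout.
store = {
    "0@input_ids": b"...",
    "0@attention_mask": b"...",
    "1@input_ids": b"...",
    "1@reward": b"...",  # only sample 1 has this field
    "2@input_ids": b"...",
}


def keys_from_uniform_fields(indexes, fields):
    """Key list built under the assumption that every sample has the same fields."""
    return [f"{i}@{f}" for i in indexes for f in fields]


def remove_by_regex_sketch(kv, pattern):
    """Delete every key that fully matches the pattern; return the number removed."""
    regex = re.compile(pattern)
    doomed = [k for k in kv if regex.fullmatch(k)]
    for k in doomed:
        del kv[k]
    return len(doomed)


# Exact-key deletion driven by a uniform field list would miss "1@reward".
exact = set(keys_from_uniform_fields([0, 1], ["input_ids", "attention_mask"]))
missed = sorted(k for k in store if k.split("@")[0] in {"0", "1"} and k not in exact)

# Pattern-based deletion clears samples 0 and 1 regardless of their fields.
removed = remove_by_regex_sketch(store, r"(0|1)@.*")
```

Matching on the sample index alone trades precision for completeness: it does not need to know which fields each sample carries, which is exactly the information the controller cannot yet supply for non-uniform batches.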
---
## Configuration Reference
The config structure for `MooncakeStore` looks like this:
```yml
backend:
  # Pluggable storage/transport backend of TransferQueue. Choose from:
  # SimpleStorage, Yuanrong, MooncakeStore, ...
  storage_backend: MooncakeStore
  # For MooncakeStore:
  MooncakeStore:
    # Auto init metadata_server
    auto_init: true
    # Address of the HTTP metadata server
    metadata_server: localhost:50050
    # Address of master server
    master_server_address: localhost:50051
    # Address of local host
    local_hostname: localhost
    # Protocol for transmission. Choose from: tcp, rdma. (default: tcp)
    protocol: tcp
    # Memory segment size in bytes for mounting (default: 4GB)
    global_segment_size: 4294967296
    # Local buffer size in bytes (default: 1GB)
    local_buffer_size: 1073741824
    # Network device name. Set to "" to let Mooncake auto-pick devices
    device_name: ""
```
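
As a rough illustration of how these keys might be consumed, the sketch below mirrors them in a dataclass and checks the documented constraints. `MooncakeStoreConfigSketch` is a hypothetical name and not the class added to `config.py`; the address fields simply reuse the sample values shown above rather than confirmed defaults. It also makes the byte counts explicit: 4 GiB is 4294967296 bytes and 1 GiB is 1073741824 bytes.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes in one gibibyte


@dataclass
class MooncakeStoreConfigSketch:
    """Hypothetical mirror of the keys shown above; not the class in config.py."""
    auto_init: bool = True
    metadata_server: str = "localhost:50050"        # sample value, assumed
    master_server_address: str = "localhost:50051"  # sample value, assumed
    local_hostname: str = "localhost"
    protocol: str = "tcp"
    global_segment_size: int = 4 * GIB
    local_buffer_size: int = 1 * GIB
    device_name: str = ""  # empty string: let Mooncake pick devices

    def validate(self) -> None:
        """Check only the constraints stated in the config comments."""
        if self.protocol not in ("tcp", "rdma"):
            raise ValueError(f"protocol must be 'tcp' or 'rdma', got {self.protocol!r}")
        if self.global_segment_size <= 0 or self.local_buffer_size <= 0:
            raise ValueError("segment and buffer sizes must be positive byte counts")


# The nested dict has the same shape a YAML loader would produce for the file above.
raw = {
    "backend": {
        "storage_backend": "MooncakeStore",
        "MooncakeStore": {"protocol": "tcp", "device_name": ""},
    }
}
backend = raw["backend"]
cfg = MooncakeStoreConfigSketch(**backend[backend["storage_backend"]])
cfg.validate()
```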
CC:@zhaohaidao @dpj135 @mpb159753
---------
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
MooncakeStore easy init (Ascend#45)
1 parent 0945d28 commit cde575c
18 files changed
Lines changed: 560 additions & 209 deletions
File tree
- .github/workflows
- tests
- e2e
- transfer_queue
- storage
- clients
- managers
- utils