README.md: 10 additions & 13 deletions (+10 -13)
@@ -76,7 +76,7 @@ Currently, we support the following storage backends:

 - SimpleStorage: A basic CPU memory storage with minimal data format constraints and easy usability.
 - [Yuanrong](https://gitee.com/openeuler/yuanrong-datasystem) (beta, [#PR107](https://github.com/TransferQueue/TransferQueue/pull/107), [#PR96](https://github.com/TransferQueue/TransferQueue/pull/96)): An Ascend native data system that provides hierarchical storage interfaces including HBM/DRAM/SSD.
-- [MooncakeStore](https://github.com/kvcache-ai/Mooncake) (alpha, [#PR162](https://github.com/TransferQueue/TransferQueue/pull/162)): A high-performance, KV-based hierarchical storage that supports RDMA transport between GPU and DRAM.
+- [MooncakeStore](https://github.com/kvcache-ai/Mooncake) (beta, [#PR162](https://github.com/TransferQueue/TransferQueue/pull/162)): A high-performance, KV-based hierarchical storage that supports RDMA transport between GPU and DRAM.
 - [RayRDT](https://docs.ray.io/en/master/ray-core/direct-transport.html) (alpha, [#PR167](https://github.com/TransferQueue/TransferQueue/pull/167)): Ray's new feature that allows Ray to store and pass objects directly between Ray actors.

 Among them, `SimpleStorageUnit` serves as our default storage backend, coordinated by the `AsyncSimpleStorageManager` class. Each storage unit can be deployed on a separate node, allowing for distributed data management.
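The context above says each `SimpleStorageUnit` can run on its own node. As an illustration only, here is one generic way to spread sample keys across N storage units: stable hash partitioning. This is a sketch under assumptions. `assign_unit` and `load_per_unit` are hypothetical helpers, and the hashing policy is not taken from `AsyncSimpleStorageManager`, whose real placement logic may differ. The value 16 mirrors `num_data_storage_units: 16` in the SimpleStorage perftest config further down.

```python
import hashlib
from collections import Counter
from typing import Iterable


def assign_unit(key: str, num_units: int) -> int:
    """Map a sample key to a storage-unit index in [0, num_units).

    Hypothetical placement policy, for illustration only. SHA-256 is used because
    it is stable across processes, unlike Python's built-in str hash, which is
    salted per interpreter run; every client therefore computes the same mapping.
    """
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then bucket it.
    return int.from_bytes(digest[:8], "big") % num_units


def load_per_unit(keys: Iterable[str], num_units: int) -> Counter:
    """Count how many keys land on each storage unit."""
    return Counter(assign_unit(k, num_units) for k in keys)


if __name__ == "__main__":
    counts = load_per_unit((f"sample-{i}" for i in range(10_000)), num_units=16)
    for unit in sorted(counts):
        print(f"unit {unit:2d}: {counts[unit]} keys")
```

Because the mapping depends only on the key, any client can locate a sample without asking a central coordinator. The trade-off is that changing the number of units remaps most keys; consistent hashing would reduce that churn.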
@@ -121,6 +121,8 @@ To simplify the usage of TransferQueue, we have provided a Redis-style high-leve
 - **Metadata Tags**: Lightweight metadata for status tracking
+Refer to [tutorials/basic.ipynb](https://github.com/Ascend/TransferQueue/blob/main/tutorial/basic.ipynb) and [tutorials/02_kv_interface.py](https://github.com/Ascend/TransferQueue/blob/main/tutorial/02_kv_interface.py) for detailed usage examples.
+
 #### StreamingDataLoader API

 Designed as a drop-in replacement for the standard PyTorch `DataLoader`, this API allows each rank to automatically consume data without single-controller intervention.
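The StreamingDataLoader paragraph says each rank consumes data on its own, without going through a single controller. The sketch below illustrates only that idea, using a plain Python iterable that strides over a shared index space by `rank` and `world_size`, in the style of PyTorch's `DistributedSampler`. `RankLocalLoader` is hypothetical: it is not the TransferQueue `StreamingDataLoader` API, which pulls data from TransferQueue rather than from an in-memory list.

```python
from typing import Generic, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


class RankLocalLoader(Generic[T]):
    """Hypothetical loader: each rank yields batches from its own stride of `source`.

    No coordinator hands out work. Rank r reads indices r, r + world_size,
    r + 2 * world_size, ..., so ranks never overlap and never wait on each other.
    """

    def __init__(self, source: Sequence[T], rank: int, world_size: int, batch_size: int) -> None:
        if world_size <= 0 or not 0 <= rank < world_size:
            raise ValueError("need world_size > 0 and 0 <= rank < world_size")
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.source = source
        self.rank = rank
        self.world_size = world_size
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[T]]:
        batch: List[T] = []
        for idx in range(self.rank, len(self.source), self.world_size):
            batch.append(self.source[idx])
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final partial batch instead of dropping it
            yield batch


if __name__ == "__main__":
    data = list(range(10))
    for rank in range(2):
        print(rank, list(RankLocalLoader(data, rank=rank, world_size=2, batch_size=2)))
```

With `world_size=2` and `batch_size=2`, rank 0 gets `[[0, 2], [4, 6], [8]]` and rank 1 gets `[[1, 3], [5, 7], [9]]`. Static striding keeps the example short; a streaming loader would instead fetch whatever data is ready for its rank at run time.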
@@ -147,17 +149,12 @@ Developers can leverage `TransferQueueClient` directly to implement advanced fea
 #### verl
 The primary motivation for integrating TransferQueue to verl now is to **alleviate the data transfer bottleneck of the single controller `RayPPOTrainer`**. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, resulting in a single point bottleneck of the whole post-training system.
+Official integration to verl is available at [verl/pulls/5401](https://github.com/verl-project/verl/pull/5401), with design doc at [[RFC] PPOTrainer with TransferQueue Integration](https://github.com/verl-project/verl/issues/5400). You may also refer to our [recipe](https://github.com/Ascend/TransferQueue/blob/main/recipe/simple_use_case/single_controller_demo.py), where we mimic the verl usage in a high-level manner.

-You may refer to the [recipe](https://github.com/Ascend/TransferQueue/tree/dev/recipe/simple_use_case), where we mimic the verl usage in both async & sync scenarios. Official integration to verl is also available now at [verl/pulls/3649](https://github.com/volcengine/verl/pull/3649) (with subsequent PRs to further optimize the integration).
 > Note: The above benchmark for TransferQueue is based on our naive `SimpleStorage` backend. By introducing high-performance storage backends and optimizing serialization/deserialization, we expect to achieve even better performance. Warmly welcome contributions from the community!
+> Note: Optimization for MooncakeStore and other backends are still in process. Warmly welcome contributions from the community!

-For detailed performance benchmarks, please refer to [this blog](https://www.yuque.com/haomingzi-lfse7/hlx5g0/tml8ke0zkgn6roey?singleDoc#).
+For detailed performance benchmarks, please refer to [this blog](https://www.yuque.com/haomingzi-lfse7/lhp4el/tml8ke0zkgn6roey?singleDoc#).

-We also provide a [stress test report](https://www.yuque.com/haomingzi-lfse7/hlx5g0/ydbwgo5k2umaag78?singleDoc#) that demonstrates **768 concurrent clients writing 1.4 TB of data** into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss, achieving 80% bandwidth.
+We also provide a [stress test report](https://www.yuque.com/haomingzi-lfse7/lhp4el/mt0vedqy7c337pgg?singleDoc#) that demonstrates more than **8192 concurrent clients writing 2 TB of data** into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss.
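For scale, a rough average derived from the updated stress-test figure. This assumes decimal units (1 TB = 10^12 bytes), exactly 8192 clients, and data spread evenly over clients and the 4 nodes; all of these are assumptions, and the report itself may measure differently:

$$
\frac{2 \times 10^{12}\ \text{B}}{8192\ \text{clients}} \approx 2.44 \times 10^{8}\ \text{B} \approx 244\ \text{MB per client},
\qquad
\frac{8192\ \text{clients}}{4\ \text{nodes}} = 2048\ \text{clients per node},
\qquad
\frac{2\ \text{TB}}{4\ \text{nodes}} = 500\ \text{GB per node}.
$$

Because the report says *more than* 8192 clients, 244 MB is an upper bound on the average per-client write volume. By the same arithmetic, the replaced figure (768 clients, 1.4 TB) averages about 1.82 GB per client.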
scripts/performance_test/README_PERFTEST.md: 57 additions & 11 deletions (+57 -11)
@@ -42,19 +42,63 @@ python perftest.py \
 |`--head_node_ip`| Head node IP address | - | Yes |
 |`--worker_node_ip`| Worker node IP address (required for Yuanrong) | None | No |
 |`--output_csv`| Path to output CSV file | None | No |
+|`--use_complex_case`| Use complex test case with nested tensors and NonTensorStack fields | False | No |

 ## Backend Configuration

 The script reads the backend configuration directly from the provided `--backend_config` YAML file. The backend type is determined by `backend.storage_backend` in the config file. When `--backend` is specified, it overrides the value in the config.

-For device support of each backend:
-- `SimpleStorage`: `cpu`
-- `Yuanrong`: `cpu`, `npu`
-- `MooncakeStore`: `cpu`, `gpu`
+### SimpleStorage Configuration

-## Test Data Format
+```yaml
+backend:
+  storage_backend: SimpleStorage
+  SimpleStorage:
+    total_storage_size: 100000
+    num_data_storage_units: 16
+```
+
+### Yuanrong Configuration
+
+```yaml
+backend:
+  storage_backend: Yuanrong
+  Yuanrong:
+    port: 31501
+    enable_yr_npu_transport: true
+```
+
+For Yuanrong backend, writer runs on the head node and reader runs on the worker node. `--worker_node_ip` is required.
+
+### MooncakeStore Configuration
+
+```yaml
+backend:
+  storage_backend: MooncakeStore
+  MooncakeStore:
+    auto_init: true
+    metadata_server: localhost:50050
+    master_server_address: localhost:50051
+    local_hostname: ""
+    protocol: rdma
+    global_segment_size: 86294967296
+    local_buffer_size: 86294967296
+    device_name: ""
+```
+
+## Test Scenarios
+
+### Simple Test Case (Default)
+
+When `--use_complex_case` is **not** specified (default), the test creates a `TensorDict` with only regular tensors:

-The test case creates a `TensorDict` with three types of fields to simulate real training batches:
 2. **Nested tensors** (non-NPU devices): Variable-length ragged sequences with lengths forming an arithmetic progression from 1 to `seq_length`. Average length ≈ `seq_length / 2`, so each nested field is roughly half the size of a regular field.
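The nested-tensor line above claims an average ragged length of about `seq_length / 2`, making each nested field roughly half the size of a regular one. A minimal check of that arithmetic, assuming the lengths are exactly 1, 2, ..., `seq_length` (one sample per length; the actual perftest may space lengths differently across the batch). Plain integers stand in for tensors, and `nested_lengths` and `size_ratio` are hypothetical helpers, not functions from `perftest.py`. The nested field holds seq_length(seq_length + 1)/2 elements versus seq_length² for a dense field of the same shape, a ratio of (seq_length + 1) / (2 · seq_length), which tends to 1/2.

```python
from typing import List


def nested_lengths(seq_length: int) -> List[int]:
    """Ragged lengths forming the arithmetic progression 1, 2, ..., seq_length.

    Assumption: one sample per length; perftest.py may distribute lengths differently.
    """
    if seq_length < 1:
        raise ValueError("seq_length must be >= 1")
    return list(range(1, seq_length + 1))


def size_ratio(seq_length: int) -> float:
    """Element count of the nested field divided by a dense [n, seq_length] field."""
    lengths = nested_lengths(seq_length)
    nested_elems = sum(lengths)                # seq_length * (seq_length + 1) / 2
    regular_elems = len(lengths) * seq_length  # dense field with the same batch size
    return nested_elems / regular_elems


if __name__ == "__main__":
    for n in (8, 1024, 8192):
        lengths = nested_lengths(n)
        print(n, sum(lengths) / len(lengths), round(size_ratio(n), 4))
```

For `seq_length = 1024` this prints a mean length of 512.5 and a size ratio of 0.5005, consistent with the "roughly half" claim.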
@@ -73,10 +117,6 @@ Each iteration performs a PUT → LIST → GET → DELETE cycle via TransferQueu

 The test runs `--num_test_iterations` iterations. Data creation only happens in the first iteration; subsequent iterations reuse the same TensorDict to isolate transfer overhead.

-## Yuanrong Backend
-
-For Yuanrong backend, writer runs on the head node and reader runs on the worker node. `--worker_node_ip` is required.
-
 ## Running Full Test Suite

 The `run_perf_test.sh` script automates the full test suite across all backends and data sizes, then generates a comparison chart:
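The hunk header above describes each iteration as a PUT → LIST → GET → DELETE cycle, and the context line notes that data is created once and reused so that only transfer cost is measured. The sketch below reproduces that timing structure against a hypothetical in-memory `DictStore`. The class and its `put`/`list_keys`/`get`/`delete` methods are stand-ins, not the TransferQueue client API, and an in-memory dict involves no real data transfer, so the numbers only demonstrate the harness shape.

```python
import time
from typing import Dict, List


class DictStore:
    """Hypothetical in-memory stand-in for a storage backend (not the TransferQueue API)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def list_keys(self) -> List[str]:
        return sorted(self._data)

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        del self._data[key]


def run_cycles(store: DictStore, num_iterations: int, payload_size: int) -> List[Dict[str, float]]:
    """Time PUT -> LIST -> GET -> DELETE for each iteration, creating the payload only once."""
    payload = bytes(payload_size)  # built once, outside the loop, and reused every iteration
    results: List[Dict[str, float]] = []
    for i in range(num_iterations):
        key = f"iter-{i}"
        timings: Dict[str, float] = {}

        start = time.perf_counter()
        store.put(key, payload)
        timings["put"] = time.perf_counter() - start

        start = time.perf_counter()
        keys = store.list_keys()
        timings["list"] = time.perf_counter() - start

        start = time.perf_counter()
        fetched = store.get(key)
        timings["get"] = time.perf_counter() - start

        start = time.perf_counter()
        store.delete(key)
        timings["delete"] = time.perf_counter() - start

        # Sanity checks kept outside the timed regions.
        if key not in keys or len(fetched) != payload_size:
            raise RuntimeError(f"round trip failed for {key}")
        results.append(timings)
    return results


if __name__ == "__main__":
    for i, t in enumerate(run_cycles(DictStore(), num_iterations=3, payload_size=1 << 20)):
        print(i, {op: f"{secs * 1e6:.1f} us" for op, secs in t.items()})
```

Building the payload inside the loop would fold allocation time into every measurement, which is the overhead the perftest avoids by reusing the first iteration's TensorDict.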
@@ -130,12 +170,18 @@ After running the tests, `draw_figure.py` reads all CSV files from `results/` an