docs/storage_backends/openyuanrong_datasystem.md
Lines changed: 69 additions & 33 deletions (+69 -33)
@@ -26,7 +26,7 @@ When Yuanrong backend is selected, `YuanrongStorageManager` and `YuanrongStorage
26
26
## Quick Start
27
27
28
28
### Prerequisites
29
-
- **Python Version**: $ \geq 3.10~and \leq 3.11 $
29
+
- **Python Version**: >= 3.10, <= 3.11
30
30
- **Architecture**: aarch64 or x86_64
31
31
32
32
### Installation Steps
@@ -37,7 +37,7 @@ Follow these steps to build and install:
37
37
38
38
Install PyTorch and TransferQueue
39
39
```bash
40
-
# Install Torch (matching the version specified for your hardware)
40
+
# Install Torch (recommended version: 2.8.0 or higher)
41
41
pip install torch==2.8.0
42
42
43
43
# Install TransferQueue from pypi
@@ -86,7 +86,7 @@ pip install torch-npu==2.8.0
86
86
87
87
After installation, you can run TransferQueue with Yuanrong backend.
88
88
89
-
First, start a local Ray cluster. Yuanrong backend relies on Ray for distributed management:
89
+
First, start a local Ray cluster. TransferQueue relies on Ray for distributed management:
90
90
```bash
91
91
ray start --head
92
92
```
@@ -120,12 +120,13 @@ tq.close()
120
120
121
121
## Deployment
122
122
123
+
Yuanrong datasystem is deployed **per-host** (one worker per node), managing all TransferQueue clients on the same node. It is not a per-client deployment.
124
+
123
125
When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:
124
126
125
127
1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
126
-
2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
127
-
3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
128
-
4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
128
+
2. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
129
+
3. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
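The node-detection and role-assignment steps above can be sketched roughly as follows. This is an illustration only, not TransferQueue's actual implementation: `alive_node_ips` and `plan_workers` are hypothetical helper names, and the `"metastore"`/`"worker"` role labels are made up for the example. The only real API assumed is `ray.nodes()`, which returns one dict per node including the `Alive` and `NodeManagerAddress` keys.

```python
from typing import Any, Dict, List


def alive_node_ips(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the IPs of nodes reported as alive, de-duplicated and sorted."""
    ips = {node["NodeManagerAddress"] for node in nodes if node.get("Alive")}
    return sorted(ips)


def plan_workers(nodes: List[Dict[str, Any]], head_ip: str) -> List[Dict[str, str]]:
    """Plan one datasystem worker per alive node (hypothetical helper).

    The head node is marked as hosting the metastore; every other node
    is marked as a plain worker that connects to it.
    """
    return [
        {"ip": ip, "role": "metastore" if ip == head_ip else "worker"}
        for ip in alive_node_ips(nodes)
    ]
```

With a running Ray cluster, the input would typically come from `ray.nodes()`, for example `plan_workers(ray.nodes(), head_ip="192.168.0.1")`.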
129
130
130
131
### Configuration
131
132
@@ -146,12 +147,12 @@ backend:
146
147
- `metastore_port`: Port for metastore service on the head node.
147
148
- `worker_args`: Additional arguments passed to `dscli start` command:
148
149
- `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
149
-
- `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.
150
+
- `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. **Please allocate huge pages before starting datasystem** - refer to [Huge Page Guide](https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html).
150
151
151
152
**NPU Transfer Options:**
152
153
- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
153
-
- `worker_args` (recommended when `enable_yr_npu_transport: true`):
154
-
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
154
+
- `worker_args` (**mandatory** when `enable_yr_npu_transport: true`):
155
+
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`). Yuanrong manages all specified devices - to put/get tensors on NPU `X`, device ID `X` must be included in this argument.
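Because a tensor on NPU `X` can only be put or fetched when `X` appears in `--remote_h2d_device_ids`, it may help to validate the value before starting the workers. The sketch below is a minimal illustration; the helper names are hypothetical and not part of TransferQueue or `dscli`. It only assumes the comma-separated format described above.

```python
from typing import Iterable, Set


def parse_device_ids(value: str) -> Set[int]:
    """Parse a comma-separated device ID string such as "0,1,2,3"."""
    return {int(part) for part in value.split(",") if part.strip()}


def format_device_ids(ids: Iterable[int]) -> str:
    """Build the comma-separated value, de-duplicated and sorted."""
    return ",".join(str(i) for i in sorted(set(ids)))


def missing_device_ids(arg_value: str, used: Iterable[int]) -> Set[int]:
    """Return the device IDs the job intends to use that are not listed."""
    return set(used) - parse_device_ids(arg_value)
```

An empty result from `missing_device_ids` means every device the job plans to use is covered by the argument.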
155
156
156
157
> For more configuration parameters for deploying the datasystem, refer to [dscli config](https://gitcode.com/openeuler/yuanrong-datasystem/blob/master/docs/source_zh_cn/deployment/dscli.md).
157
158
@@ -169,6 +170,8 @@ ray start --head --resources='{"node:192.168.0.1": 1}'
169
170
ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'
170
171
```
171
172
173
+
The `--resources` parameter defines node-specific resources. It can be used to control Ray actor placement across nodes. For NPU environments, you may also add `--resources='{"NPU": 4}'` or configure `ASCEND_RT_VISIBLE_DEVICES`.
174
+
172
175
#### Multi-Node Configuration
173
176
174
177
```yaml
@@ -186,6 +189,8 @@ TransferQueue will detect all Ray nodes and deploy datasystem workers automatica
186
189
187
190
#### Multi-Node Demo
188
191
192
+
> **Note**: Before running the demo below, modify `HEAD_NODE_IP` and `WORKER_NODE_IP` to match your actual node IPs.
193
+
189
194
```python
190
195
import torch
191
196
import ray
@@ -311,60 +316,91 @@ Note: In manual startup mode, you need to manage the lifecycle of Yuanrong worke
311
316
312
317
## FAQ
313
318
314
-
### Port Conflict
319
+
### Failed to Start Datasystem Worker
315
320
316
-
If `worker_port` or `metastore_port` is already in use, initialization will fail:
321
+
If initialization fails with `RuntimeError: Failed to start datasystem worker...`, check the following possible causes:
317
322
318
-
```
319
-
RuntimeError: Failed to start datasystem worker...
320
-
```
323
+
**1. Port Conflict**
321
324
322
-
Check port usage:
325
+
Check if `worker_port` or `metastore_port` is already in use:
323
326
```bash
324
327
netstat -tlnp | grep 31501
325
328
netstat -tlnp | grep 2379
326
329
```
327
-
328
330
Solution: Change the port or clean up the occupying process.
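Where `netstat` is not available, the same check can be approximated from Python by trying to bind the port. This is a minimal sketch: `port_in_use` is a hypothetical helper, and the ports `31501` and `2379` are taken from the `netstat` example above, so substitute your configured `worker_port` and `metastore_port`. A failed bind can also have other causes (for example, insufficient privileges on ports below 1024), so treat a `True` result as a hint rather than proof.

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding a TCP socket to (host, port) fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Most commonly EADDRINUSE: another process holds the port.
            return True
        return False


if __name__ == "__main__":
    # Ports from the netstat example; adjust to your configuration.
    for port in (31501, 2379):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```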
329
331
330
-
> If a TransferQueue task terminates abnormally without calling `tq.close()`, the datasystem will become a defunct process and occupy the port.
332
+
> If a TransferQueue task terminates abnormally without calling `tq.close()`, the datasystem may become a defunct process and occupy the port.
333
+
334
+
**2. Shared Memory Allocation Failure**
335
+
336
+
If you encounter an error like:
337
+
```
338
+
Runtime error: failed to mmap shared memory: Cannot allocate memory
339
+
```
340
+
Check the following:
341
+
- Docker container shared memory limit (the default is 64MB and may need to be increased)
342
+
- System available memory for shared memory allocation
343
+
- Huge page configuration if `--enable_huge_tlb true` is enabled
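The first and last checks in the list above can be scripted on Linux. The sketch below is an assumption-laden illustration: it assumes shared memory is backed by `/dev/shm` and that huge page counters are read from `/proc/meminfo`, and both helper names are hypothetical.

```python
import shutil
from typing import Dict


def free_space_mb(path: str = "/dev/shm") -> int:
    """Return the free space, in MB, of the filesystem containing `path`."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def parse_hugepages(meminfo_text: str) -> Dict[str, int]:
    """Extract huge page counters from the text of /proc/meminfo.

    HugePages_Total and HugePages_Free are page counts;
    Hugepagesize is reported in kB.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    result: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[key] = int(rest.split()[0])
    return result
```

Comparing `free_space_mb()` with the configured `--shared_memory_size_mb`, and `HugePages_Free * Hugepagesize` with the same value when huge pages are enabled, gives a quick indication of whether the allocation can succeed.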
HTTP/HTTPS proxy settings may interfere with Yuanrong's internal communication, causing metastore connection timeout errors.
350
+
351
+
Yuanrong datasystem uses IP addresses directly for internal node communication. If proxy environment variables (`http_proxy`, `https_proxy`, `HTTP_PROXY`, `HTTPS_PROXY`) are set, they may route internal traffic through the proxy instead of direct connections.
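One way to rule out proxy interference is to remove these variables from the environment of the process before it initializes TransferQueue. The sketch below is illustrative only; `clear_proxy_env` is a hypothetical helper. It affects only the current process and children it spawns afterwards, so a Ray cluster that was started earlier keeps the environment it was launched with.

```python
import os
from typing import Dict, MutableMapping

# The four variables named above.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def clear_proxy_env(env: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Remove proxy variables from `env` and return the removed entries."""
    removed: Dict[str, str] = {}
    for name in PROXY_VARS:
        if name in env:
            removed[name] = env.pop(name)
    return removed
```

The returned dict makes it possible to restore the original values later if other parts of the program need the proxy.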
If the previous run did not close properly, datasystem worker processes may remain:
362
+
If the previous run did not close properly (e.g., task crashed without `tq.close()`), datasystem worker processes may remain:
335
363
336
364
```bash
337
365
# Check residual processes
338
-
ps aux | grep dscli
366
+
ps aux | grep datasystem_worker
339
367
340
-
# Clean up
341
-
dscli stop --worker_address <IP>:31501
342
-
# Or force cleanup
343
-
pkill -f dscli
368
+
# Clean up gracefully
369
+
dscli stop --worker_address <IP>:<PORT>
370
+
371
+
# Force cleanup (use with caution)
372
+
pkill -f datasystem_worker
344
373
```
345
374
346
375
### Multi-Process Initialization
347
376
348
-
Each process must call `tq.init()` to obtain a TransferQueue client before using `tq.get_client()`:
349
-
- The first process initializes the TransferQueueController and Yuanrong backend
350
-
- Other processes automatically connect to the existing TransferQueueController
377
+
In multi-process scenarios, each process must call `tq.init()` before using TransferQueue APIs:
378
+
- The first process initializes the `TransferQueueController` and Yuanrong backend
379
+
- Subsequent processes automatically connect to the existing controller
351
380
352
-
Recommendation: Let the first process (which initialized the backend) call `tq.close()` to cleanup Yuanrong workers. Other processes only need to close their clients.
381
+
Best practice: Let the process that initialized the backend (typically the main/driver process) call `tq.close()` for cleanup. Other processes can simply close their clients without affecting the shared backend.
353
382
354
383
355
384
### NPU Transfer Issues
356
385
357
-
When enabling `enable_yr_npu_transport: true`, ensure:
When using `enable_yr_npu_transport: true`, ensure:
387
+
- CANN toolkit is properly installed
388
+
- `torch-npu` version matches `torch` version
389
+
- `--remote_h2d_device_ids` includes all device IDs you intend to use
390
+
391
+
Common errors and solutions:
392
+
- `Device not found`: Check if device ID is included in `--remote_h2d_device_ids`
393
+
- `CANN error`: Verify CANN installation path and environment variables
361
394
362
395
### Out of Memory Error
363
-
If you encounter an OutOfMemoryError (OOM) thrown by DataSystems during operation, please increase the value of the configuration option `--shared_memory_size_mb`.
396
+
397
+
If Yuanrong throws an OOM error during operation:
364
398
```
365
399
RuntimeError: code: [Out of memory], msg: [Shared memory no space in arena: ...]
366
400
```
367
401
402
+
Solution: Increase `--shared_memory_size_mb` in `worker_args`, or reduce the data volume being cached.
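To pick a new value for `--shared_memory_size_mb`, a rough lower bound can be computed from the tensors expected to be cached at the same time. The sketch below is a back-of-the-envelope estimate under stated assumptions: the helper names are hypothetical, the `headroom` factor of 1.5 is an arbitrary illustration, and any internal overhead of the datasystem is not modeled.

```python
import math
from typing import Iterable, Sequence, Tuple


def tensor_mb(shape: Sequence[int], bytes_per_element: int) -> float:
    """Size of a dense tensor in MB (e.g. 4 bytes per float32 element)."""
    return math.prod(shape) * bytes_per_element / (1024 * 1024)


def suggested_shared_memory_mb(
    tensors: Iterable[Tuple[Sequence[int], int]],
    headroom: float = 1.5,  # assumed safety factor, not a documented value
) -> int:
    """Sum the tensor sizes and scale by a headroom factor, rounded up."""
    total = sum(tensor_mb(shape, nbytes) for shape, nbytes in tensors)
    return math.ceil(total * headroom)
```

For example, one `(1024, 1024)` float32 tensor plus one `(512, 1024)` float16 tensor occupy 5 MB in total, so the function suggests 8 MB with the default headroom.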
403
+
368
404
369
405
## Datasystem Logs
370
406
@@ -390,7 +426,7 @@ First, select the appropriate [CANN image](https://hub.docker.com/r/ascendai/can