Commit bcd05d4

Fixed comments
Signed-off-by: dpj135 <958208521@qq.com>
1 parent 0c31647 commit bcd05d4

1 file changed

Lines changed: 69 additions & 33 deletions

docs/storage_backends/openyuanrong_datasystem.md

@@ -26,7 +26,7 @@ When Yuanrong backend is selected, `YuanrongStorageManager` and `YuanrongStorage
2626
## Quick Start
2727

2828
### Prerequisites
29-
- **Python Version**: $ \geq 3.10~and \leq 3.11 $
29+
- **Python Version**: >= 3.10, <= 3.11
3030
- **Architecture**: aarch64 or x86_64
3131

3232
### Installation Steps
@@ -37,7 +37,7 @@ Follow these steps to build and install:
3737

3838
Install PyTorch and TransferQueue
3939
```bash
40-
# Install Torch (matching the version specified for your hardware)
40+
# Install Torch (recommended version: 2.8.0 or higher)
4141
pip install torch==2.8.0
4242

4343
# Install TransferQueue from pypi
@@ -86,7 +86,7 @@ pip install torch-npu==2.8.0
8686

8787
After installation, you can run TransferQueue with Yuanrong backend.
8888

89-
First, start a local Ray cluster. Yuanrong backend relies on Ray for distributed management:
89+
First, start a local Ray cluster. TransferQueue relies on Ray for distributed management:
9090
```bash
9191
ray start --head
9292
```
@@ -120,12 +120,13 @@ tq.close()
120120

121121
## Deployment
122122

123+
Yuanrong datasystem is deployed **per-host** (one worker per node), managing all TransferQueue clients on the same node. It is not a per-client deployment.
124+
123125
When `auto_init: True` is set in the configuration, TransferQueue automatically initializes the Yuanrong backend during `tq.init()`. The deployment process:
124126

125127
1. **Detects Ray cluster nodes** - identifies all alive nodes in the Ray cluster
126-
2. **Creates placement group** - uses `STRICT_SPREAD` strategy to ensure workers are distributed across nodes
127-
3. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
128-
4. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
128+
2. **Launches YuanrongWorkerActor** - creates one actor per node to manage the datasystem worker
129+
3. **Sets up metastore service** - the head node (driver node) starts the metastore service, other nodes connect as workers
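
The deployment steps above can be sketched without a running cluster. This is a minimal illustration only, assuming node records shaped like the dictionaries returned by `ray.nodes()` (with `Alive` and `NodeManagerAddress` keys). `plan_worker_roles` and the role labels `"head"` and `"worker"` are hypothetical names, not part of TransferQueue or Yuanrong.

```python
def plan_worker_roles(nodes, head_ip):
    """Map each alive node IP to a role: one entry per node.

    `nodes` is assumed to look like the output of `ray.nodes()`.
    The node at `head_ip` is labelled "head" (it would start the
    metastore service); every other alive node is labelled "worker".
    """
    roles = {}
    for node in nodes:
        if not node.get("Alive", False):
            continue  # dead nodes get no datasystem worker
        ip = node["NodeManagerAddress"]
        roles[ip] = "head" if ip == head_ip else "worker"
    if head_ip not in roles:
        raise ValueError(f"head node {head_ip} is not alive in the cluster")
    return roles


# Hypothetical cluster snapshot: two alive nodes and one dead node.
sample_nodes = [
    {"Alive": True, "NodeManagerAddress": "192.168.0.1"},
    {"Alive": True, "NodeManagerAddress": "192.168.0.2"},
    {"Alive": False, "NodeManagerAddress": "192.168.0.3"},
]
print(plan_worker_roles(sample_nodes, "192.168.0.1"))
```

Because the result is keyed by node IP, it holds at most one entry per host, which mirrors the per-host deployment model described above.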
129130

130131
### Configuration
131132

@@ -146,12 +147,12 @@ backend:
146147
- `metastore_port`: Port for metastore service on the head node.
147148
- `worker_args`: Additional arguments passed to `dscli start` command:
148149
- `--shared_memory_size_mb`: Shared memory size in MB for datasystem worker.
149-
- `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. Required for >21GB shared memory on Ascend 910B.
150+
- `--enable_huge_tlb`: Configure huge page memory to reduce TLB misses and improve memory access efficiency. Note: may cause system memory shortage, kernel OOM, or system instability. **Please allocate huge pages before starting datasystem** - refer to [Huge Page Guide](https://pages.openeuler.openatom.cn/openyuanrong-datasystem/docs/zh-cn/latest/appendix/hugepage_guide.html).
150151

151152
**NPU Transfer Options:**
152153
- `enable_yr_npu_transport`: Enable NPU transport for high-performance device-to-device data transfer. Set to `true` when using NPU tensors.
153-
- `worker_args` (recommended when `enable_yr_npu_transport: true`):
154-
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`).
154+
- `worker_args` (**mandatory** when `enable_yr_npu_transport: true`):
155+
- `--remote_h2d_device_ids`: Enable RH2D (Remote Host-to-Device) for efficient cross-node NPU data transfer. Specify NPU device IDs as comma-separated values (e.g., `0,1,2,3`). Yuanrong manages all specified devices - to put/get tensors on NPU `X`, device ID `X` must be included in this argument.
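
The device ID rule above can be checked before a tensor is put or fetched. The sketch below assumes the argument value is a plain comma-separated string such as `0,1,2,3`; `parse_device_ids` and `is_device_managed` are hypothetical helpers, not part of Yuanrong or TransferQueue.

```python
def parse_device_ids(arg_value):
    """Parse a comma-separated device ID string into a set of ints."""
    return {int(part) for part in arg_value.split(",") if part.strip()}


def is_device_managed(arg_value, device_id):
    """Return True if `device_id` is listed in the argument value."""
    return device_id in parse_device_ids(arg_value)


remote_h2d_device_ids = "0,1,2,3"
for npu in (2, 4):
    status = "managed" if is_device_managed(remote_h2d_device_ids, npu) else "NOT managed"
    print(f"NPU {npu}: {status}")
```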
155156

156157
> More configuration parameters for deploying the data system can refer to [dscli config](https://gitcode.com/openeuler/yuanrong-datasystem/blob/master/docs/source_zh_cn/deployment/dscli.md).
157158

@@ -169,6 +170,8 @@ ray start --head --resources='{"node:192.168.0.1": 1}'
169170
ray start --address="192.168.0.1:6379" --resources='{"node:192.168.0.2": 1}'
170171
```
171172

173+
The `--resources` parameter defines node-specific resources. It can be used to control Ray actor placement across nodes. For NPU environments, you may also add `--resources='{"NPU": 4}'` or configure `ASCEND_RT_VISIBLE_DEVICES`.
174+
172175
#### Multi-Node Configuration
173176

174177
```yaml
@@ -186,6 +189,8 @@ TransferQueue will detect all Ray nodes and deploy datasystem workers automatica
186189

187190
#### Multi-Node Demo
188191

192+
> **Note**: Before running the demo below, modify `HEAD_NODE_IP` and `WORKER_NODE_IP` to match your actual node IPs.
193+
189194
```python
190195
import torch
191196
import ray
@@ -311,60 +316,91 @@ Note: In manual startup mode, you need to manage the lifecycle of Yuanrong worke
311316

312317
## FAQ
313318

314-
### Port Conflict
319+
### Failed to Start Datasystem Worker
315320

316-
If `worker_port` or `metastore_port` is already in use, initialization will fail:
321+
If initialization fails with `RuntimeError: Failed to start datasystem worker...`, check the following possible causes:
317322

318-
```
319-
RuntimeError: Failed to start datasystem worker...
320-
```
323+
**1. Port Conflict**
321324

322-
Check port usage:
325+
Check if `worker_port` or `metastore_port` is already in use:
323326
```bash
324327
netstat -tlnp | grep 31501
325328
netstat -tlnp | grep 2379
326329
```
327-
328330
Solution: Change the port or clean up the occupying process.
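
The same check can be scripted from Python before calling `tq.init()`. This is a minimal sketch using only the standard library. It assumes the service listens on TCP and is reachable at the given host, and `is_port_in_use` is a hypothetical helper. The port numbers are the ones used in the `netstat` example above.

```python
import socket


def is_port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


for port in (31501, 2379):
    print(port, "in use" if is_port_in_use(port) else "free")
```

A connection test only detects listeners reachable on `host`, so `netstat` remains the more complete check when a process is bound to a different interface.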
329331

330-
> If a TransferQueue task terminates abnormally without calling `tq.close()`, the datasystem will become a defunct process and occupy the port.
332+
> If a TransferQueue task terminates abnormally without calling `tq.close()`, the datasystem may become a defunct process and occupy the port.
333+
334+
**2. Shared Memory Allocation Failure**
335+
336+
If you encounter an error like:
337+
```
338+
Runtime error: failed to mmap shared memory: Cannot allocate memory
339+
```
340+
Check the following:
341+
- Docker container shared memory limit (the default is 64 MB and may need to be increased)
342+
- System available memory for shared memory allocation
343+
- Huge page configuration if `--enable_huge_tlb true` is enabled
344+
345+
Solution: Increase container shared memory (`--shm-size` flag), or reduce `--shared_memory_size_mb` value.
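
A quick way to compare the requested size against what the container actually offers is sketched below. It assumes the datasystem worker's shared memory is backed by `/dev/shm`, which is the mount that Docker's `--shm-size` controls; `shm_has_room` is a hypothetical helper and `2048` is an illustrative value, not a recommended setting.

```python
import os
import shutil


def shm_has_room(required_mb, path="/dev/shm"):
    """Return (fits, free_mb) for the filesystem that contains `path`."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return free_mb >= required_mb, free_mb


if os.path.isdir("/dev/shm"):
    # Compare against the value you plan to pass as --shared_memory_size_mb.
    fits, free_mb = shm_has_room(2048)
    print(f"/dev/shm free: {free_mb} MB, 2048 MB fits: {fits}")
```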
346+
347+
**3. Proxy Configuration**
348+
349+
HTTP/HTTPS proxy settings may interfere with Yuanrong's internal communication, causing metastore connection timeout errors.
350+
351+
Yuanrong datasystem uses IP addresses directly for internal node communication. If proxy environment variables (`http_proxy`, `https_proxy`, `HTTP_PROXY`, `HTTPS_PROXY`) are set, they may route internal traffic through the proxy instead of direct connections.
352+
353+
Solution: unset proxy variables before running:
354+
```bash
355+
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
356+
```
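
If the job is launched from Python, the same variables can be cleared in-process before `tq.init()` is called. This is a minimal sketch; `clear_proxy_env` is a hypothetical helper. It only affects the current process and the children it spawns afterwards, so Ray processes that were already started with `ray start` keep the environment they were launched with.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def clear_proxy_env(environ=os.environ):
    """Remove proxy variables from `environ` and return what was removed."""
    removed = {}
    for name in PROXY_VARS:
        if name in environ:
            removed[name] = environ.pop(name)
    return removed


# Demonstration on a copy, so the real environment is left untouched here.
demo_env = {"http_proxy": "http://proxy.example:3128", "PATH": "/usr/bin"}
print(clear_proxy_env(demo_env))
print(demo_env)
```

Returning the removed values makes it possible to restore them after the TransferQueue job has finished.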
357+
358+
331359

332360
### Residual Worker Process
333361

334-
If the previous run did not close properly, datasystem worker processes may remain:
362+
If the previous run did not close properly (e.g., task crashed without `tq.close()`), datasystem worker processes may remain:
335363

336364
```bash
337365
# Check residual processes
338-
ps aux | grep dscli
366+
ps aux | grep datasystem_worker
339367
340-
# Clean up
341-
dscli stop --worker_address <IP>:31501
342-
# Or force cleanup
343-
pkill -f dscli
368+
# Clean up gracefully
369+
dscli stop --worker_address <IP>:<PORT>
370+
371+
# Force cleanup (use with caution)
372+
pkill -f datasystem_worker
344373
```
345374

346375
### Multi-Process Initialization
347376

348-
Each process must call `tq.init()` to obtain a TransferQueue client before using `tq.get_client()`:
349-
- The first process initializes the TransferQueueController and Yuanrong backend
350-
- Other processes automatically connect to the existing TransferQueueController
377+
In multi-process scenarios, each process must call `tq.init()` before using TransferQueue APIs:
378+
- The first process initializes the `TransferQueueController` and Yuanrong backend
379+
- Subsequent processes automatically connect to the existing controller
351380

352-
Recommendation: Let the first process (which initialized the backend) call `tq.close()` to cleanup Yuanrong workers. Other processes only need to close their clients.
381+
Best practice: Let the process that initialized the backend (typically the main/driver process) call `tq.close()` for cleanup. Other processes can simply close their clients without affecting the shared backend.
353382

354383

355384
### NPU Transfer Issues
356385

357-
When enabling `enable_yr_npu_transport: true`, ensure:
358-
- CANN is properly installed
359-
- torch-npu version matches torch version
360-
- `--remote_h2d_device_ids` parameter correctly specifies NPU device IDs
386+
When using `enable_yr_npu_transport: true`, ensure:
387+
- CANN toolkit is properly installed
388+
- `torch-npu` version matches `torch` version
389+
- `--remote_h2d_device_ids` includes all device IDs you intend to use
390+
391+
Common errors and solutions:
392+
- `Device not found`: Check if device ID is included in `--remote_h2d_device_ids`
393+
- `CANN error`: Verify CANN installation path and environment variables
361394

362395
### Out of Memory Error
363-
If you encounter an OutOfMemoryError (OOM) thrown by DataSystems during operation, please increase the value of the configuration option `--shared_memory_size_mb`.
396+
397+
If Yuanrong throws an OOM error during operation:
364398
```
365399
RuntimeError: code: [Out of memory], msg: [Shared memory no space in arena: ...]
366400
```
367401

402+
Solution: Increase `--shared_memory_size_mb` in `worker_args`, or reduce the data volume being cached.
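
To pick a new value, it can help to estimate the raw size of the cached tensors. The sketch below is a rough lower bound only: it counts element bytes and ignores any metadata or allocator overhead inside the datasystem, which this document does not specify. `tensor_size_mb` and `fits_in_shared_memory` are hypothetical helpers.

```python
import math


def tensor_size_mb(shape, bytes_per_element):
    """Raw size of a dense tensor in MB (element bytes only)."""
    return math.prod(shape) * bytes_per_element / (1024 * 1024)


def fits_in_shared_memory(tensors, shared_memory_size_mb):
    """Return (fits, total_mb) for a list of (shape, bytes_per_element)."""
    total_mb = sum(tensor_size_mb(shape, size) for shape, size in tensors)
    return total_mb <= shared_memory_size_mb, total_mb


# Three float32 tensors of shape (1024, 1024): 4 MB each, 12 MB in total.
batch = [((1024, 1024), 4)] * 3
print(fits_in_shared_memory(batch, shared_memory_size_mb=8))
```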
403+
368404

369405
## Datasystem Logs
370406

@@ -390,7 +426,7 @@ First, select the appropriate [CANN image](https://hub.docker.com/r/ascendai/can
390426
| ------------ | --------------- | ------------ | -------------- | ------------------------------------ |
391427
| 8.2.rc1 | A3 | Ubuntu 22.04 | 3.11 | cann:8.2.rc1-a3-ubuntu22.04-py3.11 |
392428
| 8.2.rc1 | 910B | Ubuntu 22.04 | 3.11 | cann:8.2.rc1-910b-ubuntu22.04-py3.11 |
393-
429+
---
394430
Pull the image:
395431

396432
```bash
