[feat] Support metastore mode for Yuanrong backend init (Ascend#74)
## Description
Simplify Yuanrong backend initialization to exclusively support
metastore mode, removing the external etcd dependency. This refactor
uses Ray's native cluster discovery and placement groups to manage
distributed Yuanrong datasystem workers.
## Changes
- transfer_queue/config.yaml: Remove etcd and metastore mode
  configuration
  - Removed: etcd_address, host, port, metastore_mode, metastore_address
  - Kept: auto_init, worker_port, metastore_port
  - Added: worker_args for additional dscli start arguments
  - Host IPs are now auto-detected from ray.nodes() via NodeManagerAddress
- transfer_queue: Add new file transfer_queue/utils/yuanrong_utils.py
  - YuanrongWorkerActor (Ray actor class):
    - Determines its node via IP intersection with provided node_ips
    - Starts metastore service on head node (rank 0)
    - Provides start() and stop() methods for lifecycle management
  - initialize_yuanrong_backend(): Complete initialization logic
    - Gets Ray cluster information via ray.nodes()
    - Creates placement group with STRICT_SPREAD strategy (0.1 CPU per
      bundle)
    - Creates YuanrongWorkerActor instances on each bundle
    - Starts head worker first, then parallel starts remaining workers
    - Returns dict with worker_actors, metastore_address, placement_group
    - Handles exceptions with proper cleanup
  - cleanup_yuanrong_resources(): Complete cleanup logic
    - Stops all workers concurrently, collecting exceptions
    - Kills actors and removes placement group
  - start_datasystem_worker() / stop_datasystem_worker(): dscli wrapper
    functions
  - get_local_ip_addresses(): IP discovery for node self-determination
- transfer_queue/interface.py: Simplified Yuanrong backend integration
  - Replace ~100 lines of inline initialization with single function call
    to initialize_yuanrong_backend(conf)
  - Simplify close() to single call to cleanup_yuanrong_resources(value)
  - Remove unused imports: shutil, get_local_ip_addresses, etcd-related
    functions
- tests/: Update test configurations to use worker_port instead of
  host/port
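
The control flow listed above can be sketched in plain Python. This is an illustrative sketch only, not the code in yuanrong_utils.py: the helper names (alive_node_ips, resolve_rank, start_head_then_rest, stop_all) are hypothetical, a thread pool stands in for Ray actor calls, and treating index 0 of the node list as the head node is an assumption. The only detail taken from Ray itself is that each ray.nodes() entry is a dict carrying "NodeManagerAddress" and "Alive".

```python
from concurrent.futures import ThreadPoolExecutor


def alive_node_ips(nodes):
    """Return the IPs of alive nodes from ray.nodes()-shaped dicts, in input order."""
    return [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]


def resolve_rank(node_ips, local_ips):
    """Return the index in node_ips of the single IP that also belongs to this host."""
    local = set(local_ips)
    matches = [i for i, ip in enumerate(node_ips) if ip in local]
    if len(matches) != 1:
        raise RuntimeError(f"expected exactly one matching IP, found {len(matches)}")
    return matches[0]


def start_head_then_rest(start_fns):
    """Run start_fns[0] to completion, then run the remaining callables in parallel."""
    if not start_fns:
        return
    # Assumption: index 0 is the head worker, which hosts the metastore.
    start_fns[0]()
    rest = start_fns[1:]
    if rest:
        with ThreadPoolExecutor(max_workers=len(rest)) as pool:
            for future in [pool.submit(fn) for fn in rest]:
                future.result()  # re-raises the first failure, if any


def stop_all(stop_fns):
    """Call every stop callable concurrently and return the exceptions they raised."""
    if not stop_fns:
        return []
    with ThreadPoolExecutor(max_workers=len(stop_fns)) as pool:
        futures = [pool.submit(fn) for fn in stop_fns]
    # Leaving the with-block waits for all futures, so exception() does not block.
    return [f.exception() for f in futures if f.exception() is not None]
```

Collecting exceptions in stop_all instead of raising on the first one lets cleanup continue for the remaining workers, which matches the "collecting exceptions" behaviour described for cleanup_yuanrong_resources().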
## Related issues
Fixes Ascend#50
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
-OpenYuanrong-datasystem relies on etcd for cluster coordination.
-Download and install etcd from the official releases: [ETCD GitHub Releases](https://github.com/etcd-io/etcd/releases)
-
-```bash
-# Example for Linux ARM64 (adjust for your architecture)
-# Unpack and install etcd
-ETCD_VERSION="v3.6.5"  # Replace with the desired version
-tar -xvf etcd-${ETCD_VERSION}-linux-arm64.tar.gz
-cd etcd-${ETCD_VERSION}-linux-arm64
-
-# Copy the executable file to the system path
-sudo cp etcd etcdctl /usr/local/bin/
-
-# Verify installation
-etcd --version
-etcdctl version
-```
-
-#### 4. (Optional) Install CANN and torch-npu
+#### 3. (Optional) Install CANN and torch-npu

 If you have NPU devices and want to accelerate the transmission of NPU tensor,
 you can install **Ascend-cann-toolkit** and **torch-npu**.
@@ -106,19 +86,36 @@ pip install torch-npu==2.8.0
 Next, we will provide deployment and code examples for single-node scenarios.
 For multi-node scenarios, please refer to [Appendix B](#B-deploy-multi-node-datasystem-for-multi-node-training-and-inference-scenarios).

-Unlike using TransferQueue with its default backend, integrating OpenYuanrong-Datasystem requires **pre-launching** the datasystem services before running your Python application.
TransferQueue automatically initializes Yuanrong datasystem workers across all Ray cluster nodes. Just set `auto_init: True` in the configuration and TransferQueue will handle the multi-node deployment.

+Let's take two nodes (for instance, 192.168.0.1 and 192.168.0.2) as an example.

-#### Deploy multi-nodes datasystem
-On each node, you need to connect to the etcd service on the head node using your local node's IP address.
Now you can use datasystem on head-node and work-node.

 > For more detailed deployment instructions, please refer to [yuanrong documents](https://gitcode.com/openeuler/yuanrong-datasystem/blob/master/README.md#%E9%83%A8%E7%BD%B2-openyuanrong-datasystem).
 > The configuration parameters for deploying the data system are described in [dscli config](https://gitcode.com/openeuler/yuanrong-datasystem/blob/master/docs/source_zh_cn/deployment/dscli.md#%E9%85%8D%E7%BD%AE%E9%A1%B9%E8%AF%B4%E6%98%8E).

 There is a demo with multi-node scenarios as follows.

-#### Deploy ray
-```bash
-# on head node
-ray start --head --resources='{"node:10.170.27.24": 1}'
-
-# on worker node (assume ray port of head_node is 6379)
-ray start --address="10.170.27.24:6379" --resources='{"node:10.170.27.33": 1}'
-```
-
 #### Run demo
-In the demo below, we use ray actors to implement distributed deployment of processes.
+In the demo below, we use ray actors to implement distributed deployment of processes.
 The actor writer writes data to the head node, and the actor reader reads data from the worker nodes.