1- Getting Started with DeviceMesh
1+ Getting Started with DeviceMesh
22=====================================================
33
4- **Author**: `Iris Zhang <https://github.com/wz337>`__, `Wanchao Liang <https://github.com/wanchaol>`__
4+ **Author**: `Iris Zhang <https://github.com/wz337>`__, `Wanchao Liang <https://github.com/wanchaol>`__
5+ **Translator**: `ehdtjr <https://github.com/ehdtjr>`_
56
67.. note::
7- |edit| View and edit this tutorial in `github <https://github.com/pytorchkorea/tutorials-kr/blob/main/recipes_source/distributed_device_mesh.rst>`__.
8+ |edit| View and edit this tutorial in `github <https://github.com/PyTorchKorea/tutorials-kr/blob/master/recipes_source/distributed_device_mesh.rst>`__.
89
9- Prerequisites:
10+ Prerequisites:
1011
11- - `Distributed Communication Package - torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__
12+ - `Distributed Communication Package - torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__
1213- Python 3.8 - 3.11
1314- PyTorch 2.2
1415
1516
16- Setting up distributed communicators, i.e. NVIDIA Collective Communication Library (NCCL) communicators, for distributed training can pose a significant challenge. For workloads where users need to compose different parallelisms,
17- users would need to manually set up and manage NCCL communicators (for example, :class:`ProcessGroup`) for each parallelism solution. This process could be complicated and susceptible to errors.
18- :class:`DeviceMesh` can simplify this process, making it more manageable and less prone to errors.
17+ Setting up distributed communicators, that is, NVIDIA Collective Communication Library (NCCL) communicators, for distributed training can be a significant challenge. For workloads that need to compose different parallelisms,
18+ users must manually set up and manage NCCL communicators (for example, :class:`ProcessGroup`) for each parallelism solution. This process is complicated and error-prone.
19+ :class:`DeviceMesh` can simplify this process, making it more manageable and less prone to errors.
1920
20- What is DeviceMesh
21- ------------------
22- :class:`DeviceMesh` is a higher level abstraction that manages :class:`ProcessGroup`. It allows users to effortlessly
23- create inter-node and intra-node process groups without worrying about how to set up ranks correctly for different sub process groups.
24- Users can also easily manage the underlying process_groups/devices for multi-dimensional parallelism via :class:`DeviceMesh`.
21+ What is DeviceMesh
22+ ------------------
23+ :class:`DeviceMesh` is a higher-level abstraction that manages :class:`ProcessGroup`.
24+ It lets users easily create inter-node and intra-node process groups without worrying about how to set up ranks correctly for the different sub process groups.
25+ Users can also easily manage the underlying process groups and devices used for multi-dimensional parallelism via :class:`DeviceMesh`.
2526
2627.. figure:: /_static/img/distributed/device_mesh.png
2728 :width: 100%
2829 :align: center
2930 :alt: PyTorch DeviceMesh
3031
31- Why DeviceMesh is Useful
32+ Why DeviceMesh is Useful
3233------------------------
33- DeviceMesh is useful when working with multi-dimensional parallelism (i.e. 3-D parallel) where parallelism composability is required. For example, when your parallelism solutions require both communication across hosts and within each host.
34- The image above shows that we can create a 2D mesh that connects the devices within each host, and connects each device with its counterpart on the other hosts in a homogeneous setup.
34+ DeviceMesh is useful when working with multi-dimensional parallelism (for example, 3-D parallel) where several parallelisms must be composed. One such case is when your parallelism solutions require communication both across hosts and within each host.
35+ The image above shows that, in a homogeneous setup, we can create a 2D mesh that connects the devices within each host and connects each device with its counterpart on the other hosts.
3536
36- Without DeviceMesh, users would need to manually set up NCCL communicators, cuda devices on each process before applying any parallelism, which could be quite complicated.
37- The following code snippet illustrates a hybrid sharding 2-D Parallel pattern setup without :class:`DeviceMesh`.
38- First, we need to manually calculate the shard group and replicate group. Then, we need to assign the correct shard and
39- replicate group to each rank.
37+ Without DeviceMesh, users need to manually set up NCCL communicators and CUDA devices on each process before applying any parallelism, which is quite complicated.
38+ The following code snippet illustrates a hybrid sharding 2-D parallel pattern setup without :class:`DeviceMesh`.
39+ First, we manually calculate the shard group and replicate group, and then assign the correct group to each rank.
4040
4141.. code-block :: python
4242
@@ -45,17 +45,17 @@ replicate group to each rank.
4545 import torch
4646 import torch.distributed as dist
4747
48- # Understand world topology
48+ # Understand world topology
4949 rank = int(os.environ["RANK"])
5050 world_size = int(os.environ["WORLD_SIZE"])
5151 print(f"Running example on {rank=} in a world with {world_size=}")
5252
53- # Create process groups to manage 2-D like parallel pattern
53+ # Create process groups to manage a 2-D-like parallel pattern
5454 dist.init_process_group("nccl")
5555 torch.cuda.set_device(rank)
5656
57- # Create shard groups (e.g. (0, 1, 2, 3), (4, 5, 6, 7))
58- # and assign the correct shard group to each rank
57+ # Create shard groups (e.g. (0, 1, 2, 3), (4, 5, 6, 7))
58+ # and assign the correct shard group to each rank
5959 num_node_devices = torch.cuda.device_count()
6060 shard_rank_lists = list(range(0, num_node_devices // 2)), list(range(num_node_devices // 2, num_node_devices))
6161 shard_groups = (
@@ -66,8 +66,8 @@ replicate group to each rank.
6666 shard_groups[0] if rank in shard_rank_lists[0] else shard_groups[1]
6767 )
6868
69- # Create replicate groups (for example, (0, 4), (1, 5), (2, 6), (3, 7))
70- # and assign the correct replicate group to each rank
69+ # Create replicate groups (for example, (0, 4), (1, 5), (2, 6), (3, 7))
70+ # and assign the correct replicate group to each rank
7171 current_replicate_group = None
7272 shard_factor = len(shard_rank_lists[0])
7373 for i in range(num_node_devices // 2):
@@ -76,44 +76,44 @@ replicate group to each rank.
7676 if rank in replicate_group_ranks:
7777 current_replicate_group = replicate_group
7878
79- To run the above code snippet, we can leverage PyTorch Elastic. Let's create a file named ``2d_setup.py``.
80- Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
79+ To run the above code snippet, we can leverage PyTorch Elastic. Create a file named ``2d_setup.py``,
80+ then run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
8181
8282.. code-block:: python
8383
8484 torchrun --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 2d_setup.py
8585
8686 .. note::
87- For simplicity of demonstration, we are simulating 2D parallel using only one node. Note that this code snippet can also be used when running on multi hosts setup.
87+ For simplicity of demonstration, we simulate 2D parallel using only one node. This code snippet can also be used as-is in a multi-host setup.
8888
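The manual rank bookkeeping described above can be checked without any GPUs. The following is a minimal sketch in plain Python, not part of the tutorial's code: the helper name ``hybrid_shard_rank_lists`` and its parameters are hypothetical, and it only computes the rank lists that would be passed to process-group creation.

```python
def hybrid_shard_rank_lists(world_size, num_shard_groups):
    """Return (shard_rank_lists, replicate_rank_lists) for a 2-D layout.

    Hypothetical helper for illustration only; it does not create any
    process groups, it just computes which ranks belong together.
    """
    if world_size % num_shard_groups != 0:
        raise ValueError("world_size must be divisible by num_shard_groups")
    shard_size = world_size // num_shard_groups

    # Each shard group is a contiguous block of ranks.
    shard_rank_lists = [
        list(range(g * shard_size, (g + 1) * shard_size))
        for g in range(num_shard_groups)
    ]
    # Each replicate group takes the i-th rank from every shard group.
    replicate_rank_lists = [
        [shard[i] for shard in shard_rank_lists] for i in range(shard_size)
    ]
    return shard_rank_lists, replicate_rank_lists


shard_lists, replicate_lists = hybrid_shard_rank_lists(world_size=8, num_shard_groups=2)
print(shard_lists)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replicate_lists)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The two printed lists match the example groups given in the code comments of the manual setup.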
89- With the help of :func:`init_device_mesh`, we can accomplish the above 2D setup in just two lines, and we can still
90- access the underlying :class:`ProcessGroup` if needed.
89+ With the help of :func:`init_device_mesh`, we can accomplish the above 2D setup in just two lines, and we can still
90+ access the underlying :class:`ProcessGroup` if needed.
9191
9292
9393.. code-block :: python
9494
9595 from torch.distributed.device_mesh import init_device_mesh
9696 mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
9797
98- # Users can access the underlying process group thru `get_group` API.
98+ # Users can access the underlying process group through the `get_group` API.
9999 replicate_group = mesh_2d.get_group(mesh_dim="replicate")
100100 shard_group = mesh_2d.get_group(mesh_dim="shard")
101101
102- Let's create a file named ``2d_setup_with_device_mesh.py``.
103- Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
102+ Create a file named ``2d_setup_with_device_mesh.py``,
103+ then run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
104104
105105.. code-block:: python
106106
107107 torchrun --nproc_per_node=8 2d_setup_with_device_mesh.py
108108
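To see which position a given rank occupies in the ``(2, 4)`` mesh, the mapping can be sketched in plain Python. This is an illustration only: it assumes the mesh lays out ranks ``0..7`` in row-major order, which is consistent with the example groups shown earlier, and ``mesh_coordinate`` is a hypothetical helper, not a PyTorch API.

```python
def mesh_coordinate(rank, mesh_shape, mesh_dim_names):
    """Map a global rank to its coordinate along each named mesh dimension.

    Assumption: ranks are laid out in row-major order over `mesh_shape`.
    """
    total = 1
    for size in mesh_shape:
        total *= size
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} is outside a mesh of {total} devices")

    coords = {}
    remaining = rank
    # Peel off the fastest-varying (last) dimension first.
    for size, name in zip(reversed(mesh_shape), reversed(mesh_dim_names)):
        remaining, coords[name] = divmod(remaining, size)
    # Return the coordinates in the original dimension order.
    return {name: coords[name] for name in mesh_dim_names}


for rank in range(8):
    print(rank, mesh_coordinate(rank, (2, 4), ("replicate", "shard")))
```

Under that assumption, ranks with the same ``replicate`` coordinate form one shard group, and ranks with the same ``shard`` coordinate form one replicate group.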
109109
110- How to use DeviceMesh with HSDP
110+ How to use DeviceMesh with HSDP
111111-------------------------------
112112
113- Hybrid Sharding Data Parallel(HSDP) is 2D strategy to perform FSDP within a host and DDP across hosts.
113+ Hybrid Sharding Data Parallel (HSDP) is a 2D strategy that performs FSDP within a host and DDP across hosts.
114114
115- Let's see an example of how DeviceMesh can assist with applying HSDP to your model with a simple setup. With DeviceMesh,
116- users would not need to manually create and manage shard group and replicate group.
115+ Let's look at an example of how DeviceMesh can help apply HSDP to your model with a simple setup. With DeviceMesh,
116+ you do not need to manually create and manage the shard group and replicate group.
117117
118118.. code-block :: python
119119
@@ -141,39 +141,39 @@ users would not need to manually create and manage shard group and replicate gro
141141 ToyModel(), device_mesh=mesh_2d
142142 )
143143
144- Let's create a file named ``hsdp.py``.
145- Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
144+ Create a file named ``hsdp.py``,
145+ then run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
146146
147147.. code-block:: python
148148
149149 torchrun --nproc_per_node=8 hsdp.py
150150
151- How to use DeviceMesh for your custom parallel solutions
151+ How to use DeviceMesh for your custom parallel solutions
152152--------------------------------------------------------
153- When working with large scale training, you might have more complex custom parallel training composition. For example, you may need to slice out sub-meshes for different parallelism solutions.
154- DeviceMesh allows users to slice child mesh from the parent mesh and re-use the NCCL communicators already created when the parent mesh is initialized.
153+ In large-scale training, you may have to deal with a more complex custom parallel training composition. For example, you may need to slice out sub-meshes for different parallelism solutions.
154+ DeviceMesh lets you slice a child mesh from the parent mesh and reuse the NCCL communicators already created when the parent mesh was initialized.
155155
156156.. code-block :: python
157157
158158 from torch.distributed.device_mesh import init_device_mesh
159159 mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("replicate", "shard", "tp"))
160160
161- # Users can slice child meshes from the parent mesh.
161+ # Users can slice child meshes from the parent mesh.
162162 hsdp_mesh = mesh_3d["replicate", "shard"]
163163 tp_mesh = mesh_3d["tp"]
164164
165- # Users can access the underlying process group thru `get_group` API.
165+ # Users can access the underlying process group through the `get_group` API.
166166 replicate_group = hsdp_mesh["replicate"].get_group()
167167 shard_group = hsdp_mesh["shard"].get_group()
168168 tp_group = tp_mesh.get_group()
169169
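To reason about which ranks end up together in each group of the ``(2, 2, 2)`` mesh, the membership can be worked out in plain Python. This is a sketch under the assumption that ranks are laid out in row-major order over the mesh shape; ``group_ranks`` is a hypothetical helper for illustration and does not call any PyTorch API.

```python
def group_ranks(rank, mesh_shape, dim):
    """Ranks that share every mesh coordinate with `rank` except along `dim`.

    Assumption: ranks are laid out in row-major order over `mesh_shape`.
    """
    # Row-major strides: the last dimension varies fastest.
    strides = []
    stride = 1
    for size in reversed(mesh_shape):
        strides.append(stride)
        stride *= size
    strides.reverse()

    # Coordinate of `rank` along every mesh dimension.
    coord = [(rank // s) % n for s, n in zip(strides, mesh_shape)]

    # Vary only the requested dimension and convert back to global ranks.
    members = []
    for i in range(mesh_shape[dim]):
        candidate = list(coord)
        candidate[dim] = i
        members.append(sum(c * s for c, s in zip(candidate, strides)))
    return members


mesh_shape = (2, 2, 2)
dim_names = ("replicate", "shard", "tp")
for dim, name in enumerate(dim_names):
    print(name, group_ranks(5, mesh_shape, dim))
# replicate [1, 5]
# shard [5, 7]
# tp [4, 5]
```

Under that layout assumption, rank 5 would communicate with rank 1 for replication, rank 7 for sharding, and rank 4 for tensor parallelism.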
170170
171- Conclusion
171+ Conclusion
172172----------
173- In conclusion, we have learned about :class:`DeviceMesh` and :func:`init_device_mesh`, as well as how
174- they can be used to describe the layout of devices across the cluster.
173+ We have looked at :class:`DeviceMesh` and :func:`init_device_mesh`,
174+ and seen how they can be used to describe the layout of devices across the cluster.
175175
176- For more information, please see the following:
176+ For more information, please see the following:
177177
178178- `2D parallel combining Tensor/Sequence Parallel with FSDP <https://github.com/pytorch/examples/blob/main/distributed/tensor_parallelism/fsdp_tp_example.py>`__
179179- `Composable PyTorch Distributed with PT2 <https://static.sched.com/hosted_files/pytorch2023/d1/%5BPTC%2023%5D%20Composable%20PyTorch%20Distributed%20with%20PT2.pdf>`__