
Commit 14f7a1c

Translate recipes_source/amx.rst (#1131)
* Translate recipes_source/amx.rst
* Improve the wording of the recipes_source/amx.rst translation
* Apply review feedback to recipes_source/amx.rst
1 parent 72b8dd6 commit 14f7a1c

1 file changed

Lines changed: 41 additions & 40 deletions


recipes_source/amx.rst

@@ -1,62 +1,63 @@
11
==============================================
2-
Leverage Intel® Advanced Matrix Extensions
2+
Leveraging Intel® Advanced Matrix Extensions
33
==============================================
4+
**Translated by**: `김지현 <https://github.com/hijyun>`_
45

5-
Introduction
6+
Introduction
67
============
78

8-
Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an x86 extension,
9-
which introduce two new components: a 2-dimensional register file called 'tiles' and an accelerator of Tile Matrix Multiplication (TMUL) that is able to operate on those tiles.
10-
AMX is designed to work on matrices to accelerate deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
9+
Advanced Matrix Extensions (AMX), also called Intel® Advanced Matrix Extensions (Intel® AMX), is an x86 extension.
10+
The extension introduces two new components: a 2-dimensional register file called 'tiles', and a Tile Matrix Multiplication (TMUL) accelerator that operates on those tiles.
11+
AMX is optimized for matrix operations. It accelerates deep-learning training and inference on the CPU and is well suited to workloads such as natural-language processing, recommendation systems, and image recognition.
1112

12-
Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
13-
Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI),
14-
4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle, see page 4 of `Accelerate AI Workloads with Intel® AMX`_.
15-
For more detailed information of AMX, see `Intel® AMX Overview`_.
1613

14+
With 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, Intel advances AI capabilities and delivers 3x to 10x higher inference and training performance than the previous generation. See `Accelerate AI Workloads with Intel® AMX`_.
15+
Compared with 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI),
16+
4th Gen Intel Xeon Scalable processors with Intel AMX can perform 2,048 INT8 operations per cycle instead of 256. They can also perform 1,024 BF16 operations per cycle, compared with 64 FP32 operations per cycle. See page 4 of `Accelerate AI Workloads with Intel® AMX`_. For more details about AMX, see `Intel® AMX Overview`_.
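
The per-cycle figures quoted above imply the following peak ratios. This is only a back-of-the-envelope sketch: the constant names are illustrative, the numbers are the ones cited from the solution brief, and measured speedups also depend on memory bandwidth, clock frequency, and which operators are covered.

```python
# Peak operations per cycle, as quoted in the paragraph above.
AMX_INT8_OPS = 2048       # 4th Gen Xeon with Intel AMX, INT8
VNNI_INT8_OPS = 256       # 3rd Gen Xeon with AVX-512 VNNI, INT8
AMX_BF16_OPS = 1024       # 4th Gen Xeon with Intel AMX, BF16
PREV_GEN_FP32_OPS = 64    # FP32 baseline cited for comparison

int8_ratio = AMX_INT8_OPS / VNNI_INT8_OPS
bf16_vs_fp32_ratio = AMX_BF16_OPS / PREV_GEN_FP32_OPS

print(f"INT8: {int8_ratio:.0f}x per cycle")
print(f"BF16 vs FP32: {bf16_vs_fp32_ratio:.0f}x per cycle")
```

These are theoretical per-cycle ratios (8x and 16x), which is why they differ from the 3x to 10x end-to-end range quoted for real workloads.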
1717

18-
AMX in PyTorch
19-
==============
2018

21-
PyTorch leverages AMX for computing intensive operators with BFloat16 and quantization with INT8 by its backend oneDNN
22-
to get higher performance out-of-box on x86 CPUs with AMX support.
23-
For more detailed information of oneDNN, see `oneDNN`_.
19+
AMX in PyTorch
20+
==================
2421

25-
The operation is fully handled by oneDNN according to the execution code path generated. For example, when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
26-
Since oneDNN is the default acceleration library for PyTorch CPU, no manual operations are required to enable the AMX support.
22+
Through its oneDNN backend, PyTorch leverages AMX for compute-intensive operators with BFloat16 and for quantization with INT8,
23+
so that x86 CPUs with AMX support deliver higher performance out of the box.
24+
For more information about oneDNN, see `oneDNN`_.
2725

28-
Guidelines of leveraging AMX with workloads
29-
-------------------------------------------
26+
The operation is handled entirely by oneDNN according to the generated execution code path. For example, when an operation supported by the oneDNN implementation runs on a hardware platform with AMX support, AMX instructions are invoked automatically inside oneDNN.
27+
Because oneDNN is the default acceleration library for PyTorch on CPU, no manual steps are required to enable AMX support.
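
Although nothing needs to be enabled by hand, it can help to check whether the CPU exposes AMX at all. The sketch below is not part of PyTorch. It assumes a Linux system, where the kernel lists the feature flags ``amx_tile``, ``amx_bf16`` and ``amx_int8`` in ``/proc/cpuinfo``. The helper names are hypothetical.

```python
from pathlib import Path

# Linux kernel flag names for the AMX feature bits (assumption: Linux only).
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_flags_in(cpuinfo_text: str) -> dict:
    """Map each AMX flag name to whether it appears in a cpuinfo 'flags' line."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return {name: name in flags for name in AMX_FLAGS}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Return the cpuinfo text, or an empty string where the file is missing."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


# On a machine without AMX (or without /proc/cpuinfo) every value is False.
print(amx_flags_in(read_cpuinfo()))
```

A ``True`` for all three flags only means the hardware and kernel expose AMX. Whether a given operator actually uses it is still decided inside oneDNN.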
3028

31-
This section provides guidelines on how to leverage AMX with various workloads.
29+
Guidelines for leveraging AMX with workloads
30+
---------------------------------------------------
3231

33-
- BFloat16 data type:
32+
This section provides guidelines on how to leverage AMX with various workloads.
3433

35-
- Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
34+
- BFloat16 data type:
35+
36+
- Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` enables AMX acceleration for supported operators.
3637

3738
::
3839

3940
model = model.to(memory_format=torch.channels_last)
4041
with torch.cpu.amp.autocast():
4142
    output = model(input)
4243

43-
.. note:: Use ``torch.channels_last`` memory format to get better performance.
44+
.. note:: Use the ``torch.channels_last`` memory format to get better performance.
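
The snippet above assumes that ``model`` and ``input`` already exist. A self-contained variant might look like the sketch below. The toy model and the tensor shape are hypothetical and chosen only for illustration. It uses ``torch.autocast`` with an explicit ``dtype=torch.bfloat16``, and runs on any CPU, although AMX kernels are only selected on hardware that supports them.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; any module built from supported operators would do.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU()).eval()
model = model.to(memory_format=torch.channels_last)

# Dummy NCHW batch, converted to the channels_last memory format.
x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)

# Autocast runs eligible operators (here the convolution) in BFloat16.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(x)

print(output.dtype, tuple(output.shape))
```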
4445

45-
- Quantization:
46+
- Quantization:
4647

47-
- Applying quantization would utilize AMX acceleration for supported operators.
48+
- Applying quantization enables AMX acceleration for supported operators.
4849

4950
- torch.compile:
5051

51-
- When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
52+
- When the generated graph model runs on operations supported by oneDNN, AMX acceleration is activated.
5253

53-
.. note:: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default. This means that PyTorch will attempt to leverage the AMX feature whenever possible to speed up matrix multiplication operations. However, it's important to note that the decision to dispatch to the AMX kernel ultimately depends on the internal optimization strategy of the oneDNN library and the quantization backend, which PyTorch relies on for performance enhancements. The specific details of how AMX utilization is handled internally by PyTorch and the oneDNN library may be subject to change with updates and improvements to the framework.
54+
.. note:: When you use PyTorch on a CPU that supports AMX, the framework enables AMX automatically by default. That is, PyTorch attempts to use AMX wherever possible to speed up matrix multiplication. Note, however, that whether an operation is dispatched to an AMX kernel ultimately depends on the internal optimization strategy of the oneDNN library and the quantization backend, which PyTorch relies on for performance. The specifics of how PyTorch and the oneDNN library handle AMX internally may change as the framework is updated and improved.
5455

5556

56-
CPU operators that can leverage AMX:
57+
CPU operators that can leverage AMX:
5758
------------------------------------
5859

59-
BF16 CPU ops that can leverage AMX:
60+
BF16 CPU operators that can leverage AMX:
6061

6162
- ``conv1d``
6263
- ``conv2d``
@@ -72,7 +73,7 @@ BF16 CPU ops that can leverage AMX:
7273
- ``linear``
7374
- ``matmul``
7475

75-
Quantization CPU ops that can leverage AMX:
76+
Quantization CPU operators that can leverage AMX:
7677

7778
- ``conv1d``
7879
- ``conv2d``
@@ -84,18 +85,18 @@ Quantization CPU ops that can leverage AMX:
8485

8586

8687

87-
Confirm AMX is being utilized
88-
------------------------------
88+
Confirm AMX is being utilized
89+
------------------------------------
8990

90-
Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to enable oneDNN to dump verbose messages.
91+
ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ``export ONEDNN_VERBOSE=1`` ์„ ์„ค์ •ํ•˜๊ฑฐ๋‚˜, ``torch.backends.mkldnn.verbose`` ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ oneDNN์ด ์ƒ์„ธ ๋ฉ”์‹œ์ง€๋ฅผ ์ถœ๋ ฅํ•˜๋„๋ก ํ™œ์„ฑํ™”ํ•˜์„ธ์š”.
9192

9293
::
9394

9495
with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
9596
    with torch.cpu.amp.autocast():
9697
        model(input)
9798

98-
For example, get oneDNN verbose:
99+
For example, you can inspect the oneDNN verbose output as follows:
99100

100101
::
101102

@@ -111,20 +112,20 @@ For example, get oneDNN verbose:
111112
onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
112113
...
113114

114-
If you get the verbose of ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated.
115+
If the verbose output contains ``avx512_core_amx_bf16`` for BFloat16, or ``avx512_core_amx_int8`` for INT8 quantization, AMX is activated.
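
For long logs, this check can be scripted. The following is a minimal sketch, not an official tool: ``count_amx_kernels`` is a hypothetical helper that scans captured verbose text for the two ISA names mentioned above. It deliberately avoids relying on fixed field positions, since the verbose line layout may differ between oneDNN versions.

```python
import re
from collections import Counter

# ISA names that indicate an AMX kernel, as described in the text above.
AMX_ISA = re.compile(r"avx512_core_amx_(?:bf16|int8)")


def count_amx_kernels(verbose_log: str) -> Counter:
    """Count AMX ISA names on oneDNN verbose execution lines."""
    counts = Counter()
    for line in verbose_log.splitlines():
        # Only primitive execution lines are of interest.
        if not line.startswith("onednn_verbose") or ",exec," not in line:
            continue
        counts.update(AMX_ISA.findall(line))
    return counts


# First line: taken from the verbose output shown above.
# Second line: a shortened, hypothetical non-AMX line for contrast.
sample_log = (
    "onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,"
    "src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,"
    "attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382\n"
    "onednn_verbose,exec,cpu,matmul,brg:avx512_core,undef\n"
)

counts = count_amx_kernels(sample_log)
print(dict(counts))
```

An empty result suggests that no AMX kernel was dispatched in the captured run. That can be expected on hardware without AMX, or for operators and data types that oneDNN does not route to AMX.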
115116

116117

117-
Conclusion
118+
Conclusion
118119
----------
119120

120121

121-
In this tutorial, we briefly introduced AMX, how to utilize AMX in PyTorch to accelerate workloads, and how to confirm that AMX is being utilized.
122+
In this tutorial, we briefly introduced AMX, how to use AMX in PyTorch to accelerate workloads, and how to confirm that AMX is being utilized.
122123

123-
With the improvements and updates of PyTorch and oneDNN, the utilization of AMX may be subject to change accordingly.
124+
As PyTorch and oneDNN are improved and updated, the way AMX is utilized may change accordingly.
124125

125-
As always, if you run into any problems or have any questions, you can use
126-
`forum <https://discuss.pytorch.org/>`_ or `GitHub issues
127-
<https://github.com/pytorch/pytorch/issues>`_ to get in touch.
126+
If you run into any problems or have any questions, feel free to reach out through the
127+
`forum <https://discuss.pytorch.org/>`_ or `GitHub issues
128+
<https://github.com/pytorch/pytorch/issues>`_.
128129

129130

130131
.. _Accelerate AI Workloads with Intel® AMX: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html
