Vector Calculation Utilizing AMX

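This document explains vector computation using **AMX (Advanced Matrix Extensions)**, with a focus on operations on the BF16 (bfloat16) data type. It covers how AMX works, implementation details, theoretical peak performance, and optimization techniques for efficient computation.

What is AMX?

**AMX (Advanced Matrix Extensions)** is a matrix acceleration technology introduced by Intel: an x86 instruction set extension designed to run large matrix operations efficiently, directly in hardware. It represents data in two dimensions using structures called **tiles**, which cuts the overhead of issuing many small repeated operations.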
Intel® AMX is a dedicated hardware block found on the Intel® Xeon® Scalable processor core that helps optimize and accelerate deep learning training and inferencing workloads that rely on matrix math.
Intel® AMX enables AI workloads to run on the CPU instead of offloading them to a discrete accelerator, providing a significant performance boost.
Its architecture supports BF16 (training/inference) and INT8 (inference) data types and includes two main components: tiles and Tile Matrix Multiplication (TMUL), both described in the next section.
Together, these components enable Intel® AMX to store more data in each core and compute larger matrices in a single operation. Additionally, Intel® AMX is architected to be fully extensible and scalable.

AMX components

Intel AMX stores larger chunks of data in each core and then computes larger matrices in a single operation. The first implementation is called TMUL (tile matrix multiply unit), which comprises a grid of fused multiply-add units capable of operating on tiles. In its initial form, it implements a set of up to eight tiles (named TMM0 … TMM7), each an array of 16 rows of 64 bytes. The typical pattern is to load a tile representing a small section of a larger matrix in memory (for example, part of an image), operate on that tile, and repeat with the next tile that represents the next portion. When done, store the resultant tile to memory.

- Tiles: eight two-dimensional registers, each 1 kilobyte in size, that store large chunks of data.
- Tile Matrix Multiplication (TMUL): an accelerator engine attached to the tiles that performs matrix-multiply computations for AI.
The matrix multiplications in the TMUL instruction set compute C[M][N] += A[M][K] * B[K][N].
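
The load-a-tile, operate, repeat pattern and the update C[M][N] += A[M][K] * B[K][N] can be illustrated with a portable scalar sketch. This is not AMX code: the function name `tiled_matmul_accumulate` and the block edge `kTile` are assumptions made for this example, and a real AMX kernel would replace the innermost loops with tile loads, one TMUL instruction, and a tile store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Block edge used by this sketch (an assumption); it mirrors the 16-row tile height.
constexpr std::size_t kTile = 16;

// Computes C[M][N] += A[M][K] * B[K][N] on row-major arrays,
// visiting one kTile x kTile block at a time.
void tiled_matmul_accumulate(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::vector<float>& C,
                             std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t m0 = 0; m0 < M; m0 += kTile) {
        for (std::size_t n0 = 0; n0 < N; n0 += kTile) {
            for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
                const std::size_t m1 = std::min(m0 + kTile, M);
                const std::size_t n1 = std::min(n0 + kTile, N);
                const std::size_t k1 = std::min(k0 + kTile, K);
                // "Operate on that tile": accumulate one block product into C.
                for (std::size_t m = m0; m < m1; ++m)
                    for (std::size_t n = n0; n < n1; ++n)
                        for (std::size_t k = k0; k < k1; ++k)
                            C[m * N + n] += A[m * K + k] * B[k * N + n];
            }
        }
    }
}
```

Blocking does not change the result; it only changes the order of the accumulations so that each block fits the tile-sized working set.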

Tile loads, tile stores, and accelerator commands are sent to multi-cycle execution units such as TMUL.

Key features of AMX

Optimized for the INT8 and BF16 data types.
- For AMX to accelerate your deep learning model, it needs to be in BF16 or INT8 format. You can convert your model to this optimized form using auto-mixed precision for BF16 or quantization for INT8, either natively in your framework (e.g. PyTorch or TensorFlow) or with open-source tools from Intel which have additional features. BF16 is an easy conversion and will generally preserve accuracy. INT8 is a more efficient data type, and you can use Intel’s open-source compression tools to preserve accuracy.
- INT8 is a data type used for inferencing when the precision of FP32, a single-precision floating-point format often used in AI, isn’t needed. Because the INT8 data type is lower precision, more INT8 operations can be processed per compute cycle, which is ideal for real-time applications and matrix multiplication tasks for which speed and efficiency are a priority.
- BF16 is a data type that delivers sufficient accuracy for most training. It can also deliver higher accuracy for inferencing if needed. It enables the training of machine learning (ML) models with nearly the same accuracy as achieved with FP32, yet it incurs only a fraction of the computational cost.
- Intel AMX only supports BF16 and INT8 data types, whereas FP32 data types are still supported through Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions as found in 3rd Gen Intel Xeon processors.
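
What is BF16?
- **BF16 (bfloat16)** is a 16-bit floating-point format widely used in machine learning and numerical computing. It resembles the IEEE 754 FP32 format: it keeps the same 8 exponent bits, so it covers the same range as FP32, but it has only 7 fraction bits instead of 23.

Advantages of BF16
- Small memory footprint: a value takes 16 bits, half the memory of FP32.
- Hardware acceleration: AMX enables high-performance matrix operations on BF16.
- Sufficient precision: suitable for machine learning and vector computation.

Mixed precision is a technique for training a large neural network in which the model's parameters are stored in datatypes of different precision (most commonly floating point 16 and floating point 32) to allow it to run faster and use less memory.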

Most models today use the single precision floating-point (FP32) data type, which requires 32 bits of memory. There are, however, two lower-precision data types, float16 and bfloat16 (BF16), which each require only 16 bits of memory. Bfloat16 is a floating-point format that occupies 16 bits of computer memory but represents the approximate dynamic range of 32-bit floating-point numbers. Bfloat16 format is as follows:

- 1 bit - sign,
- 8 bits - exponent,
- 7 bits - fraction.
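
Because this layout is simply the upper 16 bits of an FP32 value, the conversion can be sketched with plain bit manipulation. The helper names `fp32_to_bf16` and `bf16_to_fp32` are assumptions for this example, and the rounding shown (round to nearest, ties to even) is one common choice; hardware conversion instructions may treat NaNs and denormals differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// FP32 -> BF16: keep the sign, the 8 exponent bits, and the top 7 fraction bits,
// rounding to nearest with ties to even.
std::uint16_t fp32_to_bf16(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: truncate and force a quiet-NaN fraction bit so it stays a NaN.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// BF16 -> FP32 is exact: the 16 bits become the upper half of the FP32 word.
float bf16_to_fp32(std::uint16_t half) {
    const std::uint32_t bits = static_cast<std::uint32_t>(half) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, 1.0f maps to 0x3F80, and converting 0x3F80 back yields exactly 1.0f; values that need more than 7 fraction bits lose their low-order bits on the way in.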

Intel AMX accelerates deep learning workloads such as recommender systems, natural language processing, and image detection. [Classical machine learning](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/training/machine-learning.html) workloads which use tabular data will use existing Intel AVX-512 instructions. **Many deep learning workloads are *mixed precision*, and 4th Gen Intel Xeon processors can seamlessly transition between Intel AMX and Intel AVX-512 to run code using the most efficient instruction set**.

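- Matrix data is organized into tiles so that the hardware can process it efficiently.
- Operations such as matrix multiplication and vector dot products see large performance gains.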

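How AMX operates on INT8

For INT8 data, AMX optimizes dense matrix multiplication through tile-based operations. It works as follows:

- Tile configuration: tiles are configured to represent sub-matrices and hold data as a 2D grid.
- Data load: the input matrices (for example, weights and activations) are loaded into tiles.
- Matrix multiplication: an instruction such as _tile_dpbusd() (unsigned × signed bytes; _tile_dpbssd() is the signed × signed variant detailed below) computes dot products between the rows and columns of the tiles.
- Result store: the computed result is kept in a tile or stored to memory.

This lets AMX maximize parallelism on INT8 data.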
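
The tile configuration step can be sketched as filling in the 64-byte configuration block that the hardware reads. The field layout below follows the layout documented for the LDTILECFG instruction; the struct name `TileConfig`, the helper `make_int8_config`, and the choice of which tile holds which matrix are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// 64-byte configuration block, laid out as documented for LDTILECFG.
struct TileConfig {
    std::uint8_t  palette_id;    // byte 0: 1 selects the 8-tile, 16-row x 64-byte palette
    std::uint8_t  start_row;     // byte 1: used to resume interrupted tile loads/stores
    std::uint8_t  reserved[14];  // bytes 2-15: must be zero
    std::uint16_t colsb[16];     // bytes 16-47: bytes per row, one entry per tile
    std::uint8_t  rows[16];      // bytes 48-63: row count, one entry per tile
};
static_assert(sizeof(TileConfig) == 64, "tile configuration must be 64 bytes");

// Hypothetical helper: tile 0 = INT32 accumulator C (M x N),
// tile 1 = INT8 matrix A (M x K), tile 2 = INT8 matrix B packed 4 bytes per dword.
TileConfig make_int8_config(unsigned M, unsigned K, unsigned N) {
    assert(M >= 1 && M <= 16 && N >= 1 && N <= 16);
    assert(K >= 4 && K <= 64 && K % 4 == 0);
    TileConfig cfg{};  // value-initialization zeroes every field
    cfg.palette_id = 1;
    cfg.rows[0]  = static_cast<std::uint8_t>(M);       // C: M rows
    cfg.colsb[0] = static_cast<std::uint16_t>(N * 4);  // C: N dwords per row
    cfg.rows[1]  = static_cast<std::uint8_t>(M);       // A: M rows
    cfg.colsb[1] = static_cast<std::uint16_t>(K);      // A: K bytes per row
    cfg.rows[2]  = static_cast<std::uint8_t>(K / 4);   // B: one row per group of 4 k-values
    cfg.colsb[2] = static_cast<std::uint16_t>(N * 4);  // B: 4 bytes for each of N columns
    return cfg;
}
```

On AMX hardware, a block like this would then be passed to `_tile_loadconfig()` from `<immintrin.h>` before any tile load; this sketch stops short of that call so it can be compiled and inspected on any machine.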

Each INT8 step takes dot products four pairs at a time: a 32-bit element (dword) of a source tile packs 4 adjacent 8-bit integers, and every accumulator update consumes 4 byte pairs, as the description and pseudocode below show.

The __tile_dpbssd intrinsic is described as follows. Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in src0 with corresponding signed 8-bit integers in src1, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the __tile1024i struct. The tile register is allocated by the compiler.

DEFINE DPBD(c, x, y) {
	tmp1 := SignExtend32(x.byte[0]) * SignExtend32(y.byte[0])
	tmp2 := SignExtend32(x.byte[1]) * SignExtend32(y.byte[1])
	tmp3 := SignExtend32(x.byte[2]) * SignExtend32(y.byte[2])
	tmp4 := SignExtend32(x.byte[3]) * SignExtend32(y.byte[3])
	RETURN c + tmp1 + tmp2 + tmp3 + tmp4
}
FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (src0.colsb / 4) - 1
		FOR n := 0 TO (dst.colsb / 4) - 1
			tmp.dword[n] := DPBD(tmp.dword[n], src0.row[m].dword[k], src1.row[k].dword[n])
		ENDFOR
	ENDFOR
	write_row_and_zero(dst, m, tmp, dst.colsb)
ENDFOR
zero_upper_rows(dst, dst.rows)
zero_tileconfig_start()

The instruction above computes the dot product of bytes in tiles with a source/destination accumulator. Specifically, it multiplies groups of 4 adjacent pairs of signed 8-bit integers in tile 2 (src0) with corresponding signed 8-bit integers in tile 3 (src1), producing 4 intermediate 32-bit results. These 4 results are added to the corresponding 32-bit integer in tile 1 (dst) and the 32-bit result is stored back to tile 1.
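
The same arithmetic can be written as a scalar reference routine, which is handy for checking the output of a real AMX kernel. This is a sketch, not the instruction itself: the function name `dpbssd_reference` and the flat row-major vectors standing in for tiles are assumptions made for this example, and the tile zeroing steps of the pseudocode are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the signed x signed byte dot-product accumulate.
//   dst : rows x n      INT32 accumulators
//   src0: rows x (4*kd) INT8 values (kd dwords per row)
//   src1: kd   x (4*n)  INT8 values (n dwords per row)
void dpbssd_reference(std::vector<std::int32_t>& dst,
                      const std::vector<std::int8_t>& src0,
                      const std::vector<std::int8_t>& src1,
                      std::size_t rows, std::size_t kd, std::size_t n) {
    assert(dst.size() == rows * n);
    assert(src0.size() == rows * 4 * kd && src1.size() == kd * 4 * n);
    for (std::size_t m = 0; m < rows; ++m) {
        for (std::size_t k = 0; k < kd; ++k) {
            for (std::size_t j = 0; j < n; ++j) {
                // One DPBD step: 4 byte pairs from dword k of src0 row m
                // and dword j of src1 row k, sign-extended to 32 bits.
                std::int32_t sum = dst[m * n + j];
                for (std::size_t b = 0; b < 4; ++b) {
                    const std::int32_t x = src0[m * 4 * kd + 4 * k + b];
                    const std::int32_t y = src1[k * 4 * n + 4 * j + b];
                    sum += x * y;
                }
                dst[m * n + j] = sum;
            }
        }
    }
}
```

With src0 = {1, 2, 3, 4}, src1 = {5, 6, 7, 8}, and an accumulator of 10, a single step yields 10 + 5 + 12 + 21 + 32 = 80.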

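How AMX operates on BF16

For BF16 data, AMX uses instructions optimized for floating-point arithmetic. It works as follows:

- Tile configuration: tile sizes are set so that the tiles can hold BF16 sub-matrices.
- Data conversion: where needed, FP32 data is converted to BF16 before being loaded into tiles.
- Matrix multiplication: the _tile_dpbf16ps() instruction performs fast floating-point dot products.
- Result store: the final FP32 result is stored, converting it to BF16 first if needed.

BF16 operations are particularly well suited to tasks such as large-scale vector distance computation and neural network inference.

For BF16, dot products are taken two pairs at a time: a 32-bit element of a source tile packs 2 adjacent BF16 values, and every accumulator update consumes 2 BF16 pairs, as the description and pseudocode below show.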

The __tile_dpbf16ps intrinsic is described as follows. Compute dot-product of BF16 (16-bit) floating-point pairs in tiles src0 and src1, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the __tile1024i struct. The tile register is allocated by the compiler.

FOR m := 0 TO dst.rows - 1
	tmp := dst.row[m]
	FOR k := 0 TO (src0.colsb / 4) - 1
		FOR n := 0 TO (dst.colsb / 4) - 1
			tmp.fp32[n] += FP32(src0.row[m].bf16[2*k+0]) * FP32(src1.row[k].bf16[2*n+0])
			tmp.fp32[n] += FP32(src0.row[m].bf16[2*k+1]) * FP32(src1.row[k].bf16[2*n+1])
		ENDFOR
	ENDFOR
	write_row_and_zero(dst, m, tmp, dst.colsb)
ENDFOR
zero_upper_rows(dst, dst.rows)
zero_tileconfig_start()
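
The BF16 pseudocode above can likewise be modeled in scalar code as a reference for testing. This is a sketch under stated assumptions: the names `bf16_to_fp32` and `dpbf16ps_reference` and the flat row-major vectors are hypothetical, the tile zeroing steps are omitted, and the hardware's exact rounding and denormal behavior is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen a BF16 bit pattern to FP32 (exact: BF16 is the upper half of FP32).
float bf16_to_fp32(std::uint16_t half) {
    const std::uint32_t bits = static_cast<std::uint32_t>(half) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Scalar model of the BF16 pair dot-product accumulate.
//   dst : rows x n      FP32 accumulators
//   src0: rows x (2*kd) BF16 values (kd dwords per row)
//   src1: kd   x (2*n)  BF16 values (n dwords per row)
void dpbf16ps_reference(std::vector<float>& dst,
                        const std::vector<std::uint16_t>& src0,
                        const std::vector<std::uint16_t>& src1,
                        std::size_t rows, std::size_t kd, std::size_t n) {
    assert(dst.size() == rows * n);
    assert(src0.size() == rows * 2 * kd && src1.size() == kd * 2 * n);
    for (std::size_t m = 0; m < rows; ++m) {
        for (std::size_t k = 0; k < kd; ++k) {
            for (std::size_t j = 0; j < n; ++j) {
                // Two BF16 pairs per step, accumulated in FP32.
                for (std::size_t b = 0; b < 2; ++b) {
                    dst[m * n + j] += bf16_to_fp32(src0[m * 2 * kd + 2 * k + b]) *
                                      bf16_to_fp32(src1[k * 2 * n + 2 * j + b]);
                }
            }
        }
    }
}
```

For instance, with src0 holding the BF16 patterns for (1.0, 2.0), src1 holding (3.0, 4.0), and an accumulator of 0.5, one step gives 0.5 + 3.0 + 8.0 = 11.5.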

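Why use the Intel compiler (icpx)

This document builds AMX code with the **Intel® oneAPI DPC++/C++ Compiler (icpx)**, which supports the AMX instructions and hardware acceleration features. Recent GCC and Clang releases also expose the AMX intrinsics (enabled with -mamx-tile, -mamx-int8, and -mamx-bf16), so icpx is a recommendation rather than a hard requirement.

Why icpx?

- Direct support for AMX instructions.
- Optimized for Intel processors that support AMX.
- Provides tools for tile configuration and memory alignment management.

Compile example: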

icpx -march=native -o amx_program amx_vector_calculation.cpp

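Arranging data for AMX vector computation

To obtain every required vector dot product, the data must be laid out in groups of 2 adjacent elements (BF16) or 4 adjacent elements (INT8). In the BF16 layout sketched here, each row of matrix A holds one vector, while each row of matrix B interleaves the vectors so that every 32-bit element holds 2 consecutive components of the same vector.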


matrix A

v1 v1 v1 v1 v1 v1 v1 v1
v2 v2 v2 v2 v2 v2 v2 v2
...

matrix B

v1 v1 v2 v2 v3 v3 v4 v4
v1 v1 ...
v1 v1
v1 v1
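
The matrix B arrangement drawn above can be produced by a small packing routine. This is a minimal sketch: the function name `pack_vectors_for_b` and its parameters are assumptions for this example, with `group` set to 2 for BF16 or 4 for INT8, and the input assumed to store one vector per row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges `count` vectors of length `dim` (stored one per row) into the
// interleaved layout used for matrix B: row k holds, for every vector n,
// the `group` consecutive components n[k*group .. k*group + group - 1].
template <typename T>
std::vector<T> pack_vectors_for_b(const std::vector<T>& vectors,
                                  std::size_t count, std::size_t dim,
                                  std::size_t group) {
    assert(group > 0 && dim % group == 0);
    assert(vectors.size() == count * dim);
    std::vector<T> packed(vectors.size());
    const std::size_t row_len = count * group;  // elements per packed row
    for (std::size_t n = 0; n < count; ++n) {
        for (std::size_t d = 0; d < dim; ++d) {
            const std::size_t k = d / group;  // packed row index
            const std::size_t b = d % group;  // position inside the group
            packed[k * row_len + n * group + b] = vectors[n * dim + d];
        }
    }
    return packed;
}
```

With two vectors {10, 11, 12, 13} and {20, 21, 22, 23} and `group` = 2, the packed rows are {10, 11, 20, 21} and {12, 13, 22, 23}, which matches the "v1 v1 v2 v2" pattern. Multiplying matrix A (one vector per row) by this packed B then yields the dot product of every A vector with every B vector.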

AMX performance

Theoretical peak

With this new tiled architecture and support for INT8 and BF16 data formats, Intel AMX generation-on-generation performance gains are significant. Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX-512 VNNI), 4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle, as shown in Figure 2.
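
The per-cycle figures can be reproduced from the tile geometry with a little arithmetic. The sketch below assumes that one full-size tile multiply completes every 16 cycles per core; that issue rate, the function names, and the clock and core-count inputs to `peak_tops` are assumptions for illustration, not measured values.

```cpp
#include <cassert>

// Assumption for this sketch: one full-size tile multiply completes
// every 16 cycles per core.
constexpr long kCyclesPerTileMultiply = 16;

// Operations per cycle for a full 16 x 16 dword destination tile whose sources
// have 64-byte rows; element_bytes is 1 for INT8 and 2 for BF16.
constexpr long ops_per_cycle(long element_bytes) {
    // 16 rows x 16 dword columns x (64 / element_bytes) products,
    // counted as 2 operations each (multiply + add).
    return 2 * (16 * 16 * (64 / element_bytes)) / kCyclesPerTileMultiply;
}
static_assert(ops_per_cycle(1) == 2048, "matches the INT8 figure");
static_assert(ops_per_cycle(2) == 1024, "matches the BF16 figure");

// Peak tera-operations per second for a hypothetical clock (GHz) and core count.
double peak_tops(long ops, double clock_ghz, int cores) {
    assert(ops > 0 && clock_ghz > 0.0 && cores > 0);
    return static_cast<double>(ops) * clock_ghz * 1e9 * cores / 1e12;
}
```

As a usage example, `peak_tops(ops_per_cycle(2), clock_ghz, cores)` gives a BF16 upper bound for a given machine; real kernels fall short of it because of tile loads, stores, and memory bandwidth.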

Handling of tile row and column limits

Intel AMX operations will zero any rows and any columns beyond the dimensions specified by TILECFG. Tile operations will zero the data beyond the configured number of column bytes as each row is written. For example, with 64-byte rows and a tile configured with 10 rows and 48 column bytes (12 dword columns), an operation writing dword elements would write each of the first 10 rows with 48 bytes of output/result data and zero the remaining 16 bytes in each row. Tile operations also fully zero any rows after the first 10 configured rows. When using a 1 KByte tile with 64-byte rows, there would be 16 rows, so in this example, the last 6 rows would also be zeroed. Intel AMX instructions will always obey the metadata on reads and the zeroing rules on writes, and so a subsequent XSAVE would see zeros in the appropriate locations. Tiles that are not written by Intel AMX instructions between XRSTOR and XSAVE will write back with the same image they were loaded with, regardless of the value of TILECFG.
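
The zeroing rule can be modeled on a plain byte array to make the 10-row, 48-byte example concrete. This is only an emulation for illustration: the type alias `Tile` and the functions `filled_tile` and `write_tile_with_zeroing` are assumptions for this sketch and do not touch real tile registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxRows = 16;   // rows in a 1 KByte tile
constexpr std::size_t kRowBytes = 64;  // bytes per row
using Tile = std::array<std::array<std::uint8_t, kRowBytes>, kMaxRows>;

// Returns a tile with every byte set to `value`.
Tile filled_tile(std::uint8_t value) {
    Tile tile{};
    for (auto& row : tile) row.fill(value);
    return tile;
}

// Emulates a tile write under the zeroing rules: bytes inside the configured
// rows x column-bytes window take the result data, everything else becomes 0.
void write_tile_with_zeroing(Tile& dst, const Tile& result,
                             std::size_t cfg_rows, std::size_t cfg_colsb) {
    assert(cfg_rows <= kMaxRows && cfg_colsb <= kRowBytes);
    for (std::size_t r = 0; r < kMaxRows; ++r) {
        for (std::size_t b = 0; b < kRowBytes; ++b) {
            const bool inside = r < cfg_rows && b < cfg_colsb;
            dst[r][b] = inside ? result[r][b] : std::uint8_t{0};
        }
    }
}
```

Writing with 10 rows and 48 column bytes leaves result data in bytes 0 to 47 of rows 0 to 9, zeroes bytes 48 to 63 of those rows, and zeroes rows 10 to 15 entirely, whatever the tile held before.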