Commit e26609c

adds benchmarking tests for tensor transpose and gemm along with documentation
1 parent f6d6bc2 commit e26609c

7 files changed

Lines changed: 1153 additions & 0 deletions

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
1+
.. _test_gemm_benchmark:
2+
3+
=======================
4+
GEMM Benchmark Test
5+
=======================
6+
7+
Overview
8+
========
9+
10+
Test_GEMM measures the performance of General Matrix Multiply (GEMM) operations using the TAMM framework. It reads test configurations from CSV files and reports timing, GFLOPS, and throughput metrics for both GPU and CPU execution.
11+
12+
Basic Usage
13+
-----------
14+
15+
.. code-block:: bash
16+
17+
# Run with default settings
18+
mpirun -n 2 ./Test_GEMM test_cases.csv
19+
20+
# Specify custom iterations
21+
mpirun -n 2 ./Test_GEMM test_cases.csv 50
22+
23+
Command Line Arguments
24+
======================
25+
26+
.. code-block:: bash
27+
28+
mpirun -n 2 ./Test_GEMM <csv_file> [iterations]
29+
30+
**Required Arguments:**
31+
32+
* ``csv_file``: Path to CSV file containing GEMM test case definitions
33+
34+
**Optional Arguments:**
35+
36+
* ``iterations``: Number of benchmark runs per test case (default: 100)
37+
38+
**Examples:**
39+
40+
.. code-block:: bash
41+
42+
# Test with default 100 iterations
43+
mpirun -n 2 ./Test_GEMM gemm_cases.csv
44+
45+
# Test with 50 iterations
46+
mpirun -n 2 ./Test_GEMM gemm_cases.csv 50
47+
48+
# Test with 200 iterations
49+
mpirun -n 2 ./Test_GEMM gemm_cases.csv 200
50+
51+
Input File Format
52+
=================
53+
54+
The program expects a CSV file with GEMM configurations using the following 7-column format:
55+
56+
CSV Header
57+
----------
58+
59+
.. code-block:: text
60+
61+
contraction_size,output_a_size,output_b_size,total_output_size,batch_size,reduction_a_size,reduction_b_size
62+
63+
Column Definitions
64+
------------------
65+
66+
* **contraction_size**: Size of the contraction dimension (K in GEMM)
67+
* **output_a_size**: Output size for matrix A (M dimension)
68+
* **output_b_size**: Output size for matrix B (N dimension)
69+
* **total_output_size**: Total output matrix size (M×N)
70+
* **batch_size**: Batch size for batched operations
71+
* **reduction_a_size**: Reduction size for matrix A operations
72+
* **reduction_b_size**: Reduction size for matrix B operations
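The column definitions above can be turned into a small row parser. The following is a minimal sketch, not the benchmark's own loader: ``GemmCase`` and ``parse_gemm_row`` are hypothetical names, and rejecting rows where ``total_output_size`` differs from M×N is an assumption based on the column description.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical record type; the fields follow the CSV column order.
struct GemmCase {
  std::int64_t K, M, N, total_output, batch, AR, BR;
};

// Parses one data row. Returns std::nullopt when the row does not have
// exactly 7 integer fields, contains a non-positive value, or when
// total_output_size != M * N (an assumed consistency check).
inline std::optional<GemmCase> parse_gemm_row(const std::string& line) {
  std::array<std::int64_t, 7> v{};
  std::istringstream          row(line);
  std::string                 field;
  std::size_t                 n = 0;
  while(std::getline(row, field, ',')) {
    if(n == v.size()) return std::nullopt; // more than 7 columns
    try {
      std::size_t used = 0;
      v[n]             = std::stoll(field, &used);
      if(used != field.size()) return std::nullopt; // trailing characters
    } catch(const std::exception&) {
      return std::nullopt; // not a number (for example, the header row)
    }
    if(v[n] <= 0) return std::nullopt;
    ++n;
  }
  if(n != v.size()) return std::nullopt; // fewer than 7 columns
  GemmCase c{v[0], v[1], v[2], v[3], v[4], v[5], v[6]};
  if(c.total_output != c.M * c.N) return std::nullopt;
  return c;
}
```

Because the header row fails the integer conversion, a caller can pass every line of the file through this function and simply skip the ones that return ``std::nullopt``.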
73+
74+
Sample Data
75+
-----------
76+
77+
.. code-block:: text
78+
79+
contraction_size,output_a_size,output_b_size,total_output_size,batch_size,reduction_a_size,reduction_b_size
80+
1,2655,526,1396530,1,1,1
81+
1,2655,1000,2655000,1,1,1
82+
45,59,45,2655,1,1,1
83+
45,59,59,3481,1,1,1
84+
59,45,45,2025,1,1,1
85+
67,75,67,5025,1,1,1
86+
87+
Understanding the Format
88+
========================
89+
90+
GEMM Dimensions
91+
---------------
92+
93+
* **M (output_a_size)**: Number of rows in matrix A and result matrix C
94+
* **N (output_b_size)**: Number of columns in matrix B and result matrix C
95+
* **K (contraction_size)**: Shared dimension between matrices A and B
96+
* **B (batch_size)**: Number of matrices in batched operations
97+
98+
Matrix Operation
99+
----------------
100+
101+
The GEMM operation computes: **C = α × A × B + β × C**
102+
103+
* Matrix A dimensions: M × K
104+
* Matrix B dimensions: K × N
105+
* Matrix C dimensions: M × N
106+
* All dimensions must be positive integers
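For checking results by hand, the operation above can be written as a plain triple loop. This is an unoptimized reference sketch under stated assumptions: row-major storage, double precision, and the common convention of 2·M·N·K floating-point operations per GEMM. The function names are hypothetical, and the benchmark's own FLOP accounting may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: C = alpha * A * B + beta * C.
// A is M x K, B is K x N, C is M x N, all stored row-major (assumed layout).
inline void gemm_reference(std::size_t M, std::size_t N, std::size_t K, double alpha,
                           const std::vector<double>& A, const std::vector<double>& B,
                           double beta, std::vector<double>& C) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for(std::size_t i = 0; i < M; ++i) {
    for(std::size_t j = 0; j < N; ++j) {
      double acc = 0.0;
      for(std::size_t k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}

// Assumed FLOP convention: one multiply and one add per inner-product term.
inline double gemm_flops(std::size_t M, std::size_t N, std::size_t K, std::size_t batch = 1) {
  return 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K) *
         static_cast<double>(batch);
}

// Converts a FLOP count and an average time in milliseconds to GFLOPS.
inline double gflops(double flops, double avg_ms) { return flops / (avg_ms * 1.0e6); }
```

Comparing a small case against this reference is a cheap way to confirm that M, N, and K were read from the right CSV columns.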
107+
108+
Batch Operations
109+
----------------
110+
111+
* **batch_size**: Defines number of independent GEMM operations
112+
* **reduction_a_size**: Additional reduction factor for matrix A
113+
* **reduction_b_size**: Additional reduction factor for matrix B
114+
* Example: batch_size=4 performs 4 separate GEMM operations
115+
116+
Sample Test Cases
117+
=================
118+
119+
.. code-block:: text
120+
121+
contraction_size,output_a_size,output_b_size,total_output_size,batch_size,reduction_a_size,reduction_b_size
122+
1,2655,526,1396530,1,1,1
123+
1,2655,1000,2655000,1,1,1
124+
1,5025,526,2643150,1,1,1
125+
45,59,45,2655,1,1,1
126+
45,59,59,3481,1,1,1
127+
59,45,45,2025,1,1,1
128+
67,75,67,5025,1,1,1
129+
526,1,1,1,1,1,1
130+
526,2025,1,2025,1,1,1
131+
526,2025,2025,4100625,1,1,1
132+
133+
Sample Output
134+
-------------
135+
136+
.. code-block:: text
137+
138+
Loaded 10 GEMM test cases (double precision)
139+
140+
Testing: GEMM_double_59x45x45_B1_AR1_BR1
141+
Matrix dimensions: A(59x45) × B(45x45) = C(59x45)
142+
Batch size: 1
143+
Reduction dimensions: AR=1, BR=1
144+
Data type: double (8 bytes per element)
145+
Buffer sizes: A=2655, B=2025, C=2655
146+
Allocating buffers...
147+
Running 100 timing iterations...
148+
Total FLOPs: 2.385e+05
149+
Data size: 0.056 MB
150+
Iterations: 100
151+
Average time: 0.045123 ms
152+
Performance: 5.284 GFLOPS
153+
Throughput: 2.484 GB/s
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
.. _test_tensor_transpose:
2+
3+
===============================
4+
Tensor Transpose Benchmark Test
5+
===============================
6+
7+
Overview
8+
========
9+
10+
Test_Tensor_Transpose measures the performance of tensor transpose operations. It reads test configurations from CSV files and reports timing and throughput metrics for both GPU and CPU execution.
11+
12+
Basic Usage
13+
-----------
14+
15+
.. code-block:: bash
16+
17+
# Run with default settings
18+
mpirun -n 2 ./transpose_benchmark test_cases.csv
19+
20+
# Specify iterations and data type
21+
mpirun -n 2 ./transpose_benchmark test_cases.csv 50 float
22+
23+
Command Line Arguments
24+
======================
25+
26+
.. code-block:: bash
27+
28+
mpirun -n <processes> ./transpose_benchmark <csv_file> [iterations] [data_type]
29+
30+
**Required Arguments:**
31+
32+
* ``csv_file``: Path to CSV file containing test case definitions
33+
34+
**Optional Arguments:**
35+
36+
* ``iterations``: Number of benchmark runs per test case (default: 100)
37+
* ``data_type``: Data precision - ``float``, ``double``, or ``all`` (default: ``double``)
38+
39+
**Examples:**
40+
41+
.. code-block:: bash
42+
43+
# Test double precision with 100 iterations
44+
mpirun -n 2 ./transpose_benchmark cases.csv
45+
46+
# Test single precision with 50 iterations
47+
mpirun -n 2 ./transpose_benchmark cases.csv 50 float
48+
49+
# Test both precisions with 200 iterations
50+
mpirun -n 2 ./transpose_benchmark cases.csv 200 all
51+
52+
Input File Format
53+
=================
54+
55+
The program expects a CSV file with tensor transpose configurations using the following 9-column format:
56+
57+
CSV Header
58+
----------
59+
60+
.. code-block:: text
61+
62+
block_size_dim_0,block_size_dim_1,block_size_dim_2,block_size_dim_3,permutation_map_idx_0,permutation_map_idx_1,permutation_map_idx_2,permutation_map_idx_3,original_transpose_string
63+
64+
Column Definitions
65+
------------------
66+
67+
* **block_size_dim_0 to block_size_dim_3**: Tensor dimension sizes (use ``1`` for unused dimensions)
68+
* **permutation_map_idx_0 to permutation_map_idx_3**: Target position for each dimension (use ``-1`` for unused positions)
69+
* **original_transpose_string**: Mathematical description of the transpose operation
70+
71+
Sample Data
72+
-----------
73+
74+
.. code-block:: text
75+
76+
block_size_dim_0,block_size_dim_1,block_size_dim_2,block_size_dim_3,permutation_map_idx_0,permutation_map_idx_1,permutation_map_idx_2,permutation_map_idx_3,original_transpose_string
77+
45,45,1,1,1,0,-1,-1,C( 0 1 ) <- C'( 1 0 )
78+
100,45,1,1,1,0,-1,-1,C( 0 1 ) <- C'( 1 0 )
79+
75,67,2,1,0,2,1,-1,B( 2 1 3 ) -> B'( 2 3 1 )
80+
75,67,3,1,1,0,2,-1,C( 0 1 2 ) <- C'( 1 0 2 )
81+
82+
Understanding the Format
83+
========================
84+
85+
Dimension Specification
86+
-----------------------
87+
88+
* Active dimensions have a size greater than 1
89+
* Use ``1`` for unused/singleton dimensions
90+
* Example: ``75,67,4,1`` represents a 3D tensor of size 75×67×4
91+
92+
Permutation Mapping
93+
-------------------
94+
95+
* Entry ``i`` names the input dimension that is placed at output position ``i`` (this reading reproduces all sample rows, including the non-symmetric 3D cases)
96+
* Use ``-1`` for unused positions
97+
* Example: ``1,0,-1,-1`` means dimension 0 → position 1, dimension 1 → position 0
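The mapping can be made concrete with a naive reference transpose. This sketch assumes row-major storage and the convention that output position ``i`` takes input dimension ``perm[i]``, which is the reading that matches the sample rows in this document. All function names are hypothetical and this is not the TAMM kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Drops unused slots: entries of -1 in the 4-entry map mark unused positions.
inline std::vector<int> active_perm(const std::vector<int>& perm4) {
  std::vector<int> p;
  for(int v: perm4) {
    if(v >= 0) p.push_back(v);
  }
  return p;
}

// Output extents: output position i has the extent of input dimension perm[i].
inline std::vector<std::size_t> permuted_dims(const std::vector<std::size_t>& dims,
                                              const std::vector<int>&         perm) {
  std::vector<std::size_t> out(perm.size());
  for(std::size_t i = 0; i < perm.size(); ++i) out[i] = dims[perm[i]];
  return out;
}

// Naive row-major transpose used only as a correctness reference.
inline std::vector<double> transpose_reference(const std::vector<double>&      in,
                                               const std::vector<std::size_t>& dims,
                                               const std::vector<int>&         perm) {
  const std::size_t rank = perm.size();
  // Row-major strides of the input tensor (last dimension is contiguous).
  std::vector<std::size_t> in_stride(rank, 1);
  for(std::size_t d = rank; d-- > 1;) in_stride[d - 1] = in_stride[d] * dims[d];
  const std::vector<std::size_t> odims = permuted_dims(dims, perm);
  std::vector<double>            out(in.size());
  std::vector<std::size_t>       idx(rank, 0); // multi-index into the output
  for(std::size_t o = 0; o < out.size(); ++o) {
    std::size_t src = 0;
    for(std::size_t r = 0; r < rank; ++r) src += idx[r] * in_stride[perm[r]];
    out[o] = in[src];
    // Advance the output multi-index like an odometer, last index fastest.
    for(std::size_t r = rank; r-- > 0;) {
      if(++idx[r] < odims[r]) break;
      idx[r] = 0;
    }
  }
  return out;
}
```

For a 2D swap such as ``1,0,-1,-1`` both possible readings of the map give the same result, so the convention only matters for three or more active dimensions.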
98+
99+
Transpose String Notation
100+
-------------------------
101+
102+
The mathematical notation describes the operation:
103+
104+
* ``C( 0 1 ) <- C'( 1 0 )``: Matrix transpose (swap dimensions 0 and 1)
105+
* ``B( 2 1 3 ) -> B'( 2 3 1 )``: Swap the last two dimensions of a 3D tensor, keep the first in place
106+
* ``C( 0 1 2 ) <- C'( 1 0 2 )``: Transpose first two dimensions, keep third unchanged
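The permutation columns can be derived from the two label lists in the transpose string. The sketch below treats the left-hand list as the input labels and the right-hand list as the output labels, then looks up where each output label sits in the input. ``perm_from_labels`` is a hypothetical helper, and extracting the label lists from the string itself is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// For every output label, finds its position in the input label list.
// Returns an empty vector if the lists differ in length or a label is missing.
inline std::vector<int> perm_from_labels(const std::vector<int>& in_labels,
                                         const std::vector<int>& out_labels) {
  if(in_labels.size() != out_labels.size()) return {};
  std::vector<int> perm;
  for(int label: out_labels) {
    auto it = std::find(in_labels.begin(), in_labels.end(), label);
    if(it == in_labels.end()) return {};
    perm.push_back(static_cast<int>(it - in_labels.begin()));
  }
  return perm;
}
```

Applied to ``B( 2 1 3 ) -> B'( 2 3 1 )`` this yields ``0,2,1``, the same values that appear in the permutation columns of the corresponding sample row.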
107+
108+
Sample Test Cases
109+
=================
110+
111+
.. code-block:: text
112+
113+
block_size_dim_0,block_size_dim_1,block_size_dim_2,block_size_dim_3,permutation_map_idx_0,permutation_map_idx_1,permutation_map_idx_2,permutation_map_idx_3,original_transpose_string
114+
45,45,1,1,1,0,-1,-1,C( 0 1 ) <- C'( 1 0 )
115+
100,45,1,1,1,0,-1,-1,C( 0 1 ) <- C'( 1 0 )
116+
55,45,1,1,1,0,-1,-1,C( 0 1 ) <- C'( 1 0 )
117+
75,67,2,1,0,2,1,-1,B( 2 1 3 ) -> B'( 2 3 1 )
118+
75,67,3,1,1,0,2,-1,C( 0 1 2 ) <- C'( 1 0 2 )
119+
75,67,4,1,1,2,0,-1,B( 0 2 3 ) -> B'( 2 3 0 )
120+
75,67,5,1,2,0,1,-1,B( 1 3 4 ) -> B'( 4 1 3 )
121+
122+
Sample Output
123+
-------------
124+
125+
.. code-block:: text
126+
127+
========== TESTING WITH TYPE: d ==========
128+
=== GPU TESTS ===
129+
Testing (d): C( 0 1 ) <- C'( 1 0 )
130+
Dimensions: [45,45] -> [45,45]
131+
Permutation: [0,1] -> [1,0]
132+
Elements: 2025
133+
Data size: 0.015488 MB
134+
Element size: 8 bytes
135+
Iterations: 100
136+
Average time: 0.032145 ms
137+
Throughput: 0.964 GB/s
138+
Hardware: GPU

src/tamm/kernels/multiply.hpp

Lines changed: 48 additions & 0 deletions
@@ -39,6 +39,18 @@ void copy_data_to_gpu(ExecutionHW hw, gpuStream_t& thandle, const T2* ainter_buf
 #endif
 }

+template<typename T>
+void copy_data_to_gpu(ExecutionHW hw, gpuStream_t& thandle, const T* ainter_buf, size_t asize,
+                      T* ainter_buf_dev) {
+  if(hw == ExecutionHW::CPU) return;
+
+  auto& oprof = tamm::OpProfiler::instance();
+  TimerGuard tg_copy{&oprof.multOpCopyTime};
+#if defined(USE_CUDA) || defined(USE_HIP) || defined(USE_DPCPP)
+  gpuMemcpyAsync<T>(ainter_buf_dev, ainter_buf, asize, gpuMemcpyHostToDevice, thandle);
+#endif
+}
+
 template<typename T, typename T1, typename T2, typename T3>
 void gemm_wrapper(ExecutionHW hw, gpuStream_t& thandle, int AR, int BR, int B, int M, int N, int K,
                   T alpha, T beta, const T2* ainter_buf, const T2* ainter_buf_dev,
@@ -209,6 +221,42 @@ void transpose_output(ExecutionHW hw, gpuStream_t& thandle, bool gpu_trans, T1*
   assign<T1>(cbuf, cdims, clabels, T1{1}, cinter_buf, cinter_dims, cinter_labels, is_assign);
 }

+template<typename T>
+bool transpose_tensor(ExecutionHW hw, gpuStream_t& thandle, T* output_buf,
+                      const SizeVec& output_dims, const IntLabelVec& output_labels,
+                      const T* input_buf, size_t input_size, const SizeVec& input_dims,
+                      const IntLabelVec& input_labels, T*& output_buf_dev) {
+  bool gpu_trans = false;
+
+#if defined(USE_CUDA) || defined(USE_HIP) || defined(USE_DPCPP)
+  if(hw == ExecutionHW::GPU) {
+    gpu_trans = true;
+
+    // Allocate temporary device buffer for input
+    T* input_buf_dev{nullptr};
+    allocate_device_buffers(hw, input_buf_dev, input_size);
+
+    // Copy input data to GPU
+    copy_data_to_gpu(hw, thandle, input_buf, input_size, input_buf_dev);
+
+    // Perform GPU transpose
+    assign_gpu<T>(thandle, output_buf_dev, output_dims, output_labels, T{1}, input_buf_dev,
+                  input_dims, input_labels, true);
+
+    // Clean up temporary input buffer
+    free_device_buffers(hw, input_buf_dev, input_size);
+
+    return gpu_trans;
+  }
+#endif
+
+  // CPU transpose
+  assign<T>(output_buf, output_dims, output_labels, T{1}, input_buf, input_dims, input_labels,
+            true);
+
+  return gpu_trans;
+}
+
 template<typename T, typename T1, typename T2, typename T3>
 void block_multiply(
 #if defined(USE_CUDA) || defined(USE_HIP) || defined(USE_DPCPP)

src/tamm/multop.hpp

Lines changed: 25 additions & 0 deletions
@@ -490,6 +490,31 @@ class MultOp: public Op {
     std::cout << "\t Reduction A size = " << AR << std::endl;
     std::cout << "\t Reduction B size = " << BR << std::endl;

+    std::cout << "\t LHS_block sizes = ";
+    for(auto& d: lhs_dims) { std::cout << d << " "; }
+    std::cout << std::endl;
+    std::cout << "\t RHS1_block sizes = ";
+    for(auto& d: rhs1_dims) { std::cout << d << " "; }
+    std::cout << std::endl;
+    std::cout << "\t RHS2_block sizes = ";
+    for(auto& d: rhs2_dims) { std::cout << d << " "; }
+    std::cout << std::endl;
+    std::cout << "\t Transpose order LHS = C( ";
+    for(auto& d: lhs) { std::cout << d << " "; }
+    std::cout << ") <- C'( ";
+    for(auto& d: lhs_transpose_order) { std::cout << d << " "; }
+    std::cout << ")" << std::endl;
+    std::cout << "\t Transpose order RHS1 = A( ";
+    for(auto& d: rhs1) { std::cout << d << " "; }
+    std::cout << ") -> A'( ";
+    for(auto& d: rhs1_transpose_order) { std::cout << d << " "; }
+    std::cout << ")" << std::endl;
+    std::cout << "\t Transpose order RHS2 = B( ";
+    for(auto& d: rhs2) { std::cout << d << " "; }
+    std::cout << ") -> B'( ";
+    for(auto& d: rhs2_transpose_order) { std::cout << d << " "; }
+    std::cout << ")" << std::endl;
+
     task_id++;
   }
 }
