mm-test-library

A small C++17 study of dense matrix multiplication (GEMM) in row-major layout. It implements all six loop orderings (ijp, ipj, jip, jpi, pij, pji) using a simple layered structure so the connection between GEMM, BLAS-2, and BLAS-1 is explicit.

All kernels compute the accumulating product:

C(MxN) += A(MxK) * B(KxN)

For tightly packed row-major matrices:

ldA = K, ldB = N, ldC = N

The kernels intentionally use +=, so callers should zero C first when they want plain C = A * B.
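The accumulate contract and the leading-dimension indexing can be sketched as a plain triple loop. This is only an illustration: the function name `gemm_accumulate_sketch`, its signature, and the `double` element type are assumptions, not the library's actual API.

```cpp
#include <cstddef>

// Hypothetical reference kernel (name, signature, and element type are
// assumptions): C(MxN) += A(MxK) * B(KxN), all matrices row-major.
// Element (r, c) of a matrix with leading dimension ld lives at r * ld + c.
void gemm_accumulate_sketch(int M, int N, int K,
                            const double* A, int ldA,
                            const double* B, int ldB,
                            double* C, int ldC) {
    if (M <= 0 || N <= 0 || K <= 0) return;  // nothing to compute
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int p = 0; p < K; ++p) {
                sum += A[static_cast<std::size_t>(i) * ldA + p] *
                       B[static_cast<std::size_t>(p) * ldB + j];
            }
            // Accumulate: the previous contents of C are kept, not overwritten.
            C[static_cast<std::size_t>(i) * ldC + j] += sum;
        }
    }
}
```

Calling such a kernel twice on the same C without zeroing it in between doubles the result, which is why callers who want plain C = A * B must clear C first.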

Layering

The implementation is organized as:

GEMM  (6 loop orders)          src/mm/gemm.cpp
  -> BLAS-2 (gemv / ger)       src/mm/blas2.cpp
       -> BLAS-1 (dot / axpy)  src/mm/blas1.cpp

Each loop order maps to one BLAS-2 view:

| Variant | Outer loop | BLAS-2 body | BLAS-1 primitive | Row-major expectation |
| --- | --- | --- | --- | --- |
| `ijp` | `i` | `gemv_row_dot` | `dot` | middle |
| `ipj` | `i` | `gemv_row_axpy` | `axpy` | fast |
| `jip` | `j` | `gemv_col_dot` | `dot` | middle |
| `jpi` | `j` | `gemv_col_axpy` | `axpy` | slow |
| `pij` | `p` | `ger_row` rank-1 update | `axpy` | fast |
| `pji` | `p` | `ger_col` rank-1 update | `axpy` | slow |

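For row-major storage, ipj and pij are expected to be fastest because their inner axpy walks rows of B and C with unit stride. ijp and jip sit in the middle: their inner dot reads a row of A with unit stride but a column of B with stride ldB. jpi and pji are expected to be slowest because their inner axpy runs over i, walking columns of A and C with strides ldA and ldC.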
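To make the layering and the stride argument concrete, here is a minimal sketch of one fast and one slow variant built on the same BLAS-1 `axpy`. The identifiers follow the table, but every signature here (argument order, explicit `incx`/`incy` strides, `double` elements) is an assumption for illustration and may differ from the code in src/mm/.

```cpp
#include <cstddef>

namespace sketch {

// BLAS-1: y += alpha * x over n elements, with explicit element strides.
inline void axpy(int n, double alpha,
                 const double* x, std::size_t incx,
                 double* y, std::size_t incy) {
    for (int t = 0; t < n; ++t) y[t * incy] += alpha * x[t * incx];
}

// BLAS-2: c_row(1xN) += a_row(1xK) * B(KxN), expressed as K axpy calls.
// Each axpy reads row p of B and updates c_row, both with unit stride.
inline void gemv_row_axpy(int N, int K, const double* a_row,
                          const double* B, int ldB, double* c_row) {
    for (int p = 0; p < K; ++p)
        axpy(N, a_row[p], B + static_cast<std::size_t>(p) * ldB, 1, c_row, 1);
}

// GEMM, ipj order: one gemv_row_axpy per row i of A and C.
inline void mm_ipj(int M, int N, int K, const double* A, int ldA,
                   const double* B, int ldB, double* C, int ldC) {
    for (int i = 0; i < M; ++i)
        gemv_row_axpy(N, K, A + static_cast<std::size_t>(i) * ldA, B, ldB,
                      C + static_cast<std::size_t>(i) * ldC);
}

// GEMM, jpi order, for contrast: the inner axpy runs over i, so it walks
// column p of A (stride ldA) and column j of C (stride ldC).
inline void mm_jpi(int M, int N, int K, const double* A, int ldA,
                   const double* B, int ldB, double* C, int ldC) {
    for (int j = 0; j < N; ++j)
        for (int p = 0; p < K; ++p)
            axpy(M, B[static_cast<std::size_t>(p) * ldB + j],
                 A + p, static_cast<std::size_t>(ldA),
                 C + j, static_cast<std::size_t>(ldC));
}

}  // namespace sketch
```

Both functions compute the same result; only the memory access pattern of the innermost loop differs, which is the effect the benchmark is meant to expose.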

Layout

include/mm/        public headers
src/mm/            library implementation
test/              GoogleTest correctness tests
bench/             Google Benchmark harness and benchmark outputs
examples/mm/       one runnable sample per loop ordering
scripts/           plotting scripts
cmake/             dependency setup via FetchContent
third_party/       fetched GoogleTest / Google Benchmark sources, git-ignored

Dependencies

The project uses CMake FetchContent for:

GoogleTest v1.17.0
Google Benchmark v1.9.5

The first configure needs network access. Downloaded sources are placed under third_party/ and are ignored by git.

For plotting, install matplotlib:

python3 -m pip install matplotlib

or:

sudo apt install python3-matplotlib

Build

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DMM_LOG_LEVEL=0
cmake --build build -j

To see exact compiler commands:

cmake --build build --verbose

Tests

ctest --test-dir build --output-on-failure

Or run GoogleTest directly:

./build/test_mm_correctness
./build/test_mm_correctness --gtest_list_tests
./build/test_mm_correctness --gtest_filter='*M64_N64_K64*'

The correctness suite checks all six variants against mm_ref, plus exact small examples, identity, zero matrix, accumulation semantics, padded leading dimensions, and non-positive dimensions.
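As an illustration of what a padded-leading-dimension check involves, the sketch below stores each matrix with extra columns filled with a sentinel and multiplies through the padded storage. It is not taken from test/; the helper names `pad_rows_sketch` and `mm_padded_sketch` are hypothetical, and `double` elements are assumed.

```cpp
#include <cstddef>
#include <vector>

// Copy a tightly packed rows x cols matrix into storage with leading
// dimension ld >= cols, filling the padding columns with a sentinel.
std::vector<double> pad_rows_sketch(const std::vector<double>& packed,
                                    int rows, int cols, int ld,
                                    double sentinel) {
    std::vector<double> out(static_cast<std::size_t>(rows) * ld, sentinel);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            out[static_cast<std::size_t>(r) * ld + c] =
                packed[static_cast<std::size_t>(r) * cols + c];
    return out;
}

// Plain ijp kernel that indexes only through the leading dimensions, so
// padding columns are never read or written.
void mm_padded_sketch(int M, int N, int K, const double* A, int ldA,
                      const double* B, int ldB, double* C, int ldC) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int p = 0; p < K; ++p)
                C[static_cast<std::size_t>(i) * ldC + j] +=
                    A[static_cast<std::size_t>(i) * ldA + p] *
                    B[static_cast<std::size_t>(p) * ldB + j];
}
```

A check in this style multiplies with ldA > K, ldB > N, and ldC > N, then verifies both that the M x N block of C is correct and that every sentinel in C's padding is unchanged.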

Examples

Each sample performs the same hand-verifiable multiplication:

A = [  1  -2   3 ]      B = [ 2   1 ]
    [ -1   0   2 ]          [ 0  -1 ]
                             [ 1   3 ]

C = A * B = [ 5  12 ]
            [ 0   5 ]
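The same product can be followed by hand as three rank-1 updates, which is the view the pij ordering takes: p = 0 adds [2 1; -2 -1], p = 1 adds [0 2; 0 0], and p = 2 adds [3 9; 2 6], summing to [5 12; 0 5]. A minimal sketch of that ordering follows; `ger_row` and `mm_pij` are named after the table entries, but the signatures and the `double` element type are assumptions rather than the library's API.

```cpp
#include <cstddef>

namespace sketch {

// Rank-1 update: C(MxN) += x(M) * y(N)^T.
// x is read with stride incx; each row of C gets one unit-stride axpy.
inline void ger_row(int M, int N, const double* x, std::size_t incx,
                    const double* y, double* C, int ldC) {
    for (int i = 0; i < M; ++i) {
        const double alpha = x[i * incx];
        double* c_row = C + static_cast<std::size_t>(i) * ldC;
        for (int j = 0; j < N; ++j) c_row[j] += alpha * y[j];
    }
}

// GEMM, pij order: for each p, add (column p of A) * (row p of B) to C.
inline void mm_pij(int M, int N, int K, const double* A, int ldA,
                   const double* B, int ldB, double* C, int ldC) {
    for (int p = 0; p < K; ++p)
        ger_row(M, N, A + p, static_cast<std::size_t>(ldA),
                B + static_cast<std::size_t>(p) * ldB, C, ldC);
}

}  // namespace sketch
```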

Run any loop ordering:

./build/sample_mm_ijp
./build/sample_mm_ipj
./build/sample_mm_jip
./build/sample_mm_jpi
./build/sample_mm_pij
./build/sample_mm_pji

Benchmark

The benchmark target compares the six variants over square sizes:

N = 64, 128, 256, 512, 1024
M = N, K = N

Run:

./build/mm_bench

For cleaner single-thread measurements on Linux, pin the benchmark to one logical CPU (CPU 7 below is only an example; pick a core that exists on your machine):

taskset -c 7 ./build/mm_bench

For more stable output and JSON export:

taskset -c 7 ./build/mm_bench \
  --benchmark_repetitions=5 \
  --benchmark_report_aggregates_only=true \
  --benchmark_format=json \
  --benchmark_out=bench/results.json

Google Benchmark reports average time per iteration and the custom GFLOP/s counter:

GFLOP/s = (2 * M * N * K) / (seconds_per_iteration * 1e9)

The benchmark currently resets C with std::fill inside each timed iteration because the kernels accumulate into C.

Plotting

After creating bench/results.json, generate a plot:

python3 scripts/plot_bench_loop_ordering.py bench/results.json

This writes:

bench/loop_ordering_gflops.png

If the JSON was generated with repetitions and aggregate rows, plot the median:

python3 scripts/plot_bench_loop_ordering.py bench/results.json --aggregate median

Choose a custom output path:

python3 scripts/plot_bench_loop_ordering.py bench/results.json \
  -o bench/my_loop_ordering_plot.png

Generated benchmark data and plots (*.json, *.csv, *.png) are ignored by git by default.

Logging

The library has a compile-time log level. Logs go to stderr.

| `-DMM_LOG_LEVEL` | Output |
| --- | --- |
| `0` | off, default |
| `1` | one line per GEMM call and a summary |
| `2` | plus each BLAS-2 call |
| `3` | plus each BLAS-1 `dot` / `axpy` call |

Example:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DMM_LOG_LEVEL=1
cmake --build build --clean-first -j
./build/sample_mm_ipj 2>log.txt

A level-1 summary looks like:

[mm] mm_ipj: done M=2 N=2 K=3 flops=24 gemv_row_axpy=2 axpy=6
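The counters in that line follow from the layering: 2 * M * N * K flops, one `gemv_row_axpy` per row of C, and one `axpy` per (i, p) pair. The sketch below reproduces the numbers for the sample sizes. The struct and function names are hypothetical, the count formulas assume the ipj structure described in the Layering section, and the format string is inferred from the single sample line rather than taken from the library.

```cpp
#include <cstdio>
#include <string>

// Expected call counts for the ipj ordering (assumed structure: the outer
// loop over i issues M gemv calls, each of which issues K axpy calls).
struct IpjCountsSketch {
    long long flops;
    long long gemv_row_axpy;
    long long axpy;
};

IpjCountsSketch expected_ipj_counts(long long M, long long N, long long K) {
    return IpjCountsSketch{2 * M * N * K, M, M * K};
}

// Format the counts in the style of the sample summary line.
std::string format_ipj_summary(long long M, long long N, long long K) {
    const IpjCountsSketch c = expected_ipj_counts(M, N, K);
    char buf[192];
    std::snprintf(buf, sizeof buf,
                  "[mm] mm_ipj: done M=%lld N=%lld K=%lld flops=%lld "
                  "gemv_row_axpy=%lld axpy=%lld",
                  M, N, K, c.flops, c.gemv_row_axpy, c.axpy);
    return std::string(buf);
}
```

For M=2, N=2, K=3 this gives flops=24, gemv_row_axpy=2, and axpy=6, matching the sample above.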
