This is just a query to clarify whether I am doing something wrong when compiling the examples.
With Pluto:
➜ matmul git:(master) ✗ make tiled
../../polycc matmul.c --noparallel --second-level-tile -o matmul.tiled.c
[pluto] compute_deps (isl)
[pluto] Number of statements: 1
[pluto] Total number of loops: 3
[pluto] Number of deps: 3
[pluto] Maximum domain dimensionality: 3
[pluto] Number of parameters: 3
[pluto] Diamond tiling not possible/useful
[pluto] Affine transformations [<iter coeff's> <param> <const>]
T(S1): (i, j, k)
loop types (loop, loop, loop)
[Pluto] After tiling:
T(S1): (zT3/16, zT4/2, zT5/16, zT3, zT4, zT5, i, j, k)
loop types (loop, loop, loop, loop, loop, loop, loop, loop, loop)
[Pluto] After intra-tile optimize
T(S1): (zT3/16, zT4/2, zT5/16, zT3, zT4, zT5, i, k, j)
loop types (loop, loop, loop, loop, loop, loop, loop, loop, loop)
[pluto] using statement-wise -fs/-ls options: S1(4,9),
[Pluto] Output written to matmul.tiled.c
[pluto] Timing statistics
[pluto] SCoP extraction + dependence analysis time: 0.000710s
[pluto] Auto-transformation time: 0.002295s
[pluto] Tile size selection time: 0.000000s
[pluto] Total constraint solving time (LP/MIP/ILP) time: 0.000482s
[pluto] Code generation time: 0.023760s
[pluto] Other/Misc time: 0.075358s
[pluto] Total time: 0.102123s
[pluto] All times: 0.000710 0.002295 0.023760 0.075358
gcc -O3 -march=native -mtune=native -ffast-math -DTIME matmul.tiled.c -o tiled -lm
➜ matmul git:(master) ✗ ./tiled
3.056028s
5.62 GFLOPS
With plain gcc -O3:
➜ matmul git:(master) ✗ gcc matmul.c -o matmul.gcc -ffast-math -lm -DTIME -O3 -march=native -mtune=native
➜ matmul git:(master) ✗ ./matmul.gcc
3.340085s
5.14 GFLOPS
And just to get the peak performance, below is the result of using OpenBLAS on the same machine:
➜ matmul git:(master) ✗ ./openblas
0.080084s
214.52 GFLOPS
I was following this tutorial: https://github.com/bondhugula/llvm-project/blob/hop/mlir/docs/HighPerfCodeGen.md. In the section comparing OpenBLAS/MKL with GCC, Clang, and Pluto, I was expecting a similar improvement of roughly 5x to 10x from the tiled schedule, but I do not see any performance improvement.

Please let me know in case any additional information is needed.