Skip to content

Matrix multiply algorithm

Myasuka edited this page Sep 7, 2014 · 2 revisions

The algorithm we use in the two large distributed matrices multiplication refers to the HAMA 0.1, here you can find more about the algorithm.

Basically, the algorithm is to split a large matrix into smaller ones, and after multiply these submatrices, combine them together.

It's MapReduce version interpretation can be seen below:

We collect the blocks to 'collectionTable' firstly using map/reduce. Rows are named as c(i, j) with sequential number ((N^2 * i) + ((j * N) + k) to avoid duplicated records. CollectionTable:

                        matrix A         matrix B
------------------------+-------------------------------             
block(0, 0)-0               block(0, 0)      block(0, 0)
block(0, 0)-1               block(0, 1)      block(1, 0)
block(0, 0)-2               block(0, 2)      block(2, 0)
...         N               ...
block(N-1, n-1)-(N^3-1)     block(N-1, N-1)  block(N-1, N-1)

Each row has a two sub matrices of a(i, k) and b(k, j) so that minimized data movement and network cost. Blocking jobs:

Collect the blocks to 'collectionTable' from A and B.

- A map task receives a row n as a key, and vector of each row as its value
- emit (blockID, sub-vector) pairs
- Reduce task merges block structures based on the information of blockID

Multiplication job:

- A map task receives a blockID n as a key, and two sub-matrices of A and B as its value
- Multiply two sub-matrices: a[i][j] * b[j][k]
- Reduce task computes sum of blocks 
- c[i][k] += multiplied blocks

We transform these two steps into Spark.

Clone this wiki locally