Matrix multiply algorithm
Myasuka edited this page Sep 7, 2014
The algorithm we use to multiply two large distributed matrices is taken from HAMA 0.1, where you can find more details about the algorithm.
Basically, the algorithm splits a large matrix into smaller sub-matrices, multiplies those sub-matrices, and then combines the results. Its MapReduce interpretation is shown below:
First we collect the blocks into 'collectionTable' using map/reduce. Rows are named c(i, j) with the sequential number (N^2 * i) + (j * N) + k to avoid duplicate records. collectionTable:
                          matrix A          matrix B
------------------------+-----------------+-----------------
block(0, 0)-0           | block(0, 0)     | block(0, 0)
block(0, 0)-1           | block(0, 1)     | block(1, 0)
block(0, 0)-2           | block(0, 2)     | block(2, 0)
...                     | ...             | ...
block(N-1, N-1)-(N^3-1) | block(N-1, N-1) | block(N-1, N-1)
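The sequential numbering can be checked with a short sketch (plain Python for illustration; the `block_seq` name is hypothetical, not from the implementation). It shows that the formula assigns each (i, j, k) triple a distinct ID in [0, N^3), so no two rows of 'collectionTable' collide.

```python
def block_seq(i, j, k, n):
    """Sequential number (N^2 * i) + (j * N) + k of the row pairing a(i, k) with b(k, j)."""
    return n * n * i + j * n + k

N = 3
ids = {block_seq(i, j, k, N) for i in range(N) for j in range(N) for k in range(N)}
assert len(ids) == N ** 3          # all N^3 IDs are distinct
assert ids == set(range(N ** 3))   # and they cover exactly 0 .. N^3 - 1
```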
Each row holds a pair of sub-matrices, a(i, k) and b(k, j), which minimizes data movement and network cost. Blocking job:
Collect the blocks from A and B into 'collectionTable'.
- A map task receives a row index n as its key, and the vector of that row as its value
- emit (blockID, sub-vector) pairs
- Reduce task merges the sub-vectors into block structures based on the blockID
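The blocking job above can be sketched in plain Python, with the map/reduce shuffle simulated by a dict (the function and variable names here are hypothetical; the actual implementation runs on Spark). Each row is cut into sub-vectors of `block_size` columns, the map step emits (blockID, sub-vector) pairs, and the reduce step assembles the sub-vectors sharing a blockID into a sub-matrix.

```python
from collections import defaultdict

def block_matrix(rows, block_size):
    emitted = defaultdict(list)                  # reduce-side grouping by blockID
    for n, row in enumerate(rows):               # map: one task per row n
        bi, local = divmod(n, block_size)        # block row / row position inside the block
        for bj in range(len(row) // block_size):
            sub = row[bj * block_size:(bj + 1) * block_size]
            emitted[(bi, bj)].append((local, sub))   # emit (blockID, sub-vector)
    # reduce: merge sub-vectors into block structures, ordered by local row
    return {bid: [sub for _, sub in sorted(parts)] for bid, parts in emitted.items()}

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
blocks = block_matrix(A, 2)
# blocks[(0, 1)] is the top-right 2x2 sub-matrix [[3, 4], [7, 8]]
```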
Multiplication job:
- A map task receives a blockID n as its key, and the two sub-matrices of A and B as its value
- Multiply the two sub-matrices: a[i][j] * b[j][k]
- Reduce task computes the sum of the multiplied blocks
- c[i][k] += multiplied blocks
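The multiplication job can likewise be sketched in plain Python (helper names are hypothetical): each map task multiplies one a(i, k), b(k, j) pair of dense sub-matrices, and the reduce step sums the partial products that share the same output blockID c(i, j).

```python
from collections import defaultdict
from functools import reduce

def matmul(a, b):
    """Multiply two dense sub-matrices given as nested lists."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[r][t] * b[t][c] for t in range(m)) for c in range(p)]
            for r in range(n)]

def add(x, y):
    """Element-wise sum of two sub-matrices of the same shape."""
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]

def multiply_blocks(a_blocks, b_blocks, n_blocks):
    partial = defaultdict(list)
    for i in range(n_blocks):
        for j in range(n_blocks):
            for k in range(n_blocks):          # map: one task per collectionTable row
                partial[(i, j)].append(matmul(a_blocks[(i, k)], b_blocks[(k, j)]))
    # reduce: sum the partial products for each output block c(i, j)
    return {bid: reduce(add, parts) for bid, parts in partial.items()}
```

With a single block (n_blocks = 1) this degenerates to one ordinary matrix product, which makes the sketch easy to check by hand.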
We implement these two steps in Spark.