


Question

Matrix multiplication plays an important role in a number of applications. Two matrices can only be multiplied if the number of columns of the first matrix is equal to the number of rows of the second. Let's assume we have an m × n matrix A and we want to multiply it by an n × p matrix B. We can express their product as an m × p matrix denoted by AB (or A·B). If we assign C = AB and c_{i,j} denotes the entry of C at position (i, j), then c_{i,j} = Σ_{k=1}^{n} a_{i,k} · b_{k,j} for each pair i and j with 1 ≤ i ≤ m and 1 ≤ j ≤ p. Now we want to see if we can parallelize the computation of C. Assume that the matrices are laid out in memory sequentially as follows: a_{1,1}, a_{2,1}, a_{3,1}, a_{4,1}, ..., etc. Assume that we are going to compute C on both a single-core shared-memory machine and a 4-core shared-memory machine. Compute the speedup we would expect to obtain on the 4-core machine, ignoring any memory issues. Repeat Exercise 7.6.1, assuming that updates to C incur a cache miss due to false sharing when consecutive elements in a row (i.e., index i) are updated. How would you fix the false sharing issue that can occur?

Explanation / Answer

7.6.1 This problem presents an "embarrassingly parallel" computation and asks for the speed-up obtained on a 4-core system. The computations involved are (m × p × n) multiplications and (m × p × (n − 1)) additions. The multiplications and additions associated with a single element of C are dependent (we cannot start summing up the results of the multiplications for an element until at least two products are available), but the m × p elements of C are completely independent of one another, so each core can compute roughly (m × p)/4 of them with no communication. The speed-up should therefore be very close to 4.

7.6.2 This question asks how the speed-up is affected by cache misses caused by the 4 cores all working on different matrix elements that map to the same cache line. Each update would incur the cost of a cache miss, and so the speed-up obtained will be reduced by a factor of 3 times the cost of servicing a cache miss.

7.6.3 In this question, we are asked how to fix this problem. The easiest way to solve the false sharing problem is to compute the elements of C by traversing the matrix across columns instead of rows (i.e., using index j instead of index i); these elements are mapped to different cache lines. Then we just need to make sure that elements (i, j) and (i + 1, j) of C are computed on the same core. This eliminates the false sharing.

Hope this helps!
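
To make the fix in 7.6.3 concrete, here is a minimal sketch in C with OpenMP (the language, the matrix sizes, and the use of OpenMP are my assumptions for illustration; they are not part of the original exercise). The matrices are stored column-major, exactly as described in the question (a_{1,1}, a_{2,1}, a_{3,1}, ...), and the parallelized outer loop runs over the column index j, so each core updates a contiguous block of C and does not share cache lines with the other cores.

/* Minimal sketch of the 7.6.3 fix: column-major storage, parallelized
 * over columns of C so each core writes its own contiguous region.
 * Sizes and OpenMP usage are illustrative assumptions.
 * Compile with OpenMP enabled (e.g., gcc -O2 -fopenmp).
 */
#include <stdio.h>
#include <stdlib.h>

#define M 256   /* rows of A and C      */
#define N 256   /* cols of A, rows of B */
#define P 256   /* cols of B and C      */

/* column-major index: element (i, j) of a matrix with m rows */
#define IDX(i, j, m) ((j) * (m) + (i))

void matmul(const double *A, const double *B, double *C)
{
    /* Parallelize over j: each core owns whole columns of C, so the
     * elements it updates, c(1,j)..c(m,j), live in its own cache lines.
     * Parallelizing over i instead would interleave cores within a
     * column and cause the false sharing described in 7.6.2.          */
    #pragma omp parallel for
    for (int j = 0; j < P; j++) {
        for (int i = 0; i < M; i++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[IDX(i, k, M)] * B[IDX(k, j, N)];
            C[IDX(i, j, M)] = sum;
        }
    }
}

int main(void)
{
    double *A = malloc(sizeof(double) * M * N);
    double *B = malloc(sizeof(double) * N * P);
    double *C = malloc(sizeof(double) * M * P);
    if (!A || !B || !C) return 1;

    for (int x = 0; x < M * N; x++) A[x] = 1.0;
    for (int x = 0; x < N * P; x++) B[x] = 1.0;

    matmul(A, B, C);
    printf("C[0][0] = %f (expected %d)\n", C[IDX(0, 0, M)], N);

    free(A); free(B); free(C);
    return 0;
}

If the parallel pragma were placed on the i loop instead, adjacent values of i would be handed to different cores while their entries of C sit in the same cache line, which is exactly the false-sharing scenario described in 7.6.2.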