# cade.site -- Matrix Multiplication (#0) - Introduction

# Matrix Multiplication (#0) - Introduction

This is the first post covering matrix multiplication, and how to implement it.

### Usefulness

First of all, we should start by saying that tons of areas of science and math come down to a multiplication. As some of the people at ICL say, everything is a GEMM. The term “GEMM” means GEneral Matrix Multiplicatoin. It comes from BLAS. BLAS and LAPACK routines are sometimes hard to remember, but we will focus on GEMM for this series. Specifically, there are:

- SGEMM:
`float`

matrix multiplication - DGEMM:
`double`

matrix multiplication - CGEMM:
`complex float`

matrix multiplication - ZGEMM:
`complex double`

matrix multiplication

These refer to different precisions of elements within a matrix. For all our tests, we will be using “DGEMM”, which is for double precision arguments.

Lots of algorithms use GEMM to actually perform important computations. It is used in chemistry, physics, climate science, machine learning, and many more areas. But, we won’t focus on the applications, just the algorithm(s) themselves.

### Definition

Let’s define what the matrix product is, from a mechanical point of view. A matrix product is an operation between 2 matrices. The first, let’s call $A$, has a size of $M$ rows by $N$ columns (denoted `MxN`

, or `[M, N]`

). The second matrix, called $B$ must have the same number of rows as $A$ has columns. Therefore, it must be of size $N$ rows by $K$ columns (`NxK`

or `[N, K]`

). The result of the operation is another matrix, $C$, that has the number of rows of $A$ by the number of columns of $B$, so it has size `MxK`

or `[M, K]`

.

The actual contents of $C$ are: $C_{ij} = \sum_{k=0}^{N}{A_{ik} B_{kj}}$, which is to say a dot product of the corresponding row of $A$ with the corresponding column of $B$

In the figure above (it is 1-indexed, we will be using 0-indexing), the matrix product $C=AB$ is presented graphically. The resulting yellow element in $C$ is the dot product between a row of $A$ and the column of $B$

We may therefore define matrix multiplication in psuedo code as:

```
function matmul(Matrix A, Matrix B) {
Matrix C = new Matrix[A.rows, B.cols]
for i in [0, A.rows) {
for j in [0, B.cols) {
C[i, j] = 0
for k in [0, A.cols) {
C[i, j] += A[i, k] * B[k, j]
}
}
}
return C
}
```

All of our implementations will be in `C`

, and we’ll use OpenBLAS for reference. To install it on Ubuntu, you can run `sudo apt install libopenblas-dev`

We’ll only implement row-major based algorithms. This means, in memory, a matrix like:

\[\begin{bmatrix}a & b\\c & d\end{bmatrix}\]Will be stored as a 1 dimensional array like:

\[\begin{bmatrix}a & b & c & d\end{bmatrix}\]Therefore, to get element `M[i, j]`

, we calculate an index $i * cols + j$, and use that as the linear index into the 1 dimensional array.

We will generalize this and use `lda`

(leading dimension-of $A$), or in other words a “stride” between rows, such that each column is seperated by 1 element, and each row is seperated by `lda`

elements. So, the index of `M[i, j]`

will be $i * lda + j$. Matrices of the same size can have different strides, and this will be useful when we move into blocking algorithms.

### Implementation 0: Naive Kernel

```
/* mm0.c - */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/* OpenBLAS */
#include <cblas.h>
/* Naive kernel, O(N^3) */
void my_dgemm(int M, int N, int K, double* A, int lda, double* B, int ldb, double* C, int ldc) {
int i, j, k;
for (i = 0; i < M; i++) {
for (j = 0; j < K; j++) {
int ij = i * ldc + j;
C[ij] = 0.0;
for (k = 0; k < N; ++k) {
C[ij] += A[i * lda + k] * B[k * ldb + j];
}
}
}
}
int main(int argc, char** argv) {
int i, N = argc >= 2 ? atoi(argv[1]) : 256, docheck = argc >= 3 ? atoi(argv[2]) : 1;
double *A = malloc(sizeof(double) * (N * N));
double *B = malloc(sizeof(double) * (N * N));
double *Cmine = malloc(sizeof(double) * (N * N));
double *Cref = malloc(sizeof(double) * (N * N));
/* initialize elements to random numbers between 0 and 1 */
srand(time(NULL));
for (i = 0; i < N * N; ++i) {
A[i] = rand() / (double)RAND_MAX;
B[i] = rand() / (double)RAND_MAX;
}
/* Use our implementation */
my_dgemm(N, N, N, A, N, B, N, Cmine, N);
if (docheck) {
/* Use reference (OpenBLAS) implementation */
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0, A, N, B, N, 0.0, Cref, N);
/* Compute error, |Mine_ij - Real_ij|^2 */
double sum_e2 = 0.0;
for (i = 0; i < N * N; ++i) {
double d = Cref[i] - Cmine[i];
sum_e2 += d * d;
}
printf("avgerr: %.2e\n", sum_e2 / (N * N));
}
}
```

Above is the first, naive implementation I will provide. It is not very fast, nor is it very elegant, but it does compute the correct result. It also includes a `main`

which uses OpenBLAS to calculate the true value for us to check our error with. I’ll be using the same `main`

function for all implementations in this post, so I’ll just include the kernels going forth. You can call the resulting binary with `./program [size] [check]`

, if `check==0`

, then it won’t run the OpenBLAS function and instead only time our function.

Let’s take a look at the kernel:

```
/* Naive kernel, O(N^3) */
void my_dgemm(int M, int N, int K, double* A, int lda, double* B, int ldb, double* C, int ldc) {
int i, j, k;
for (i = 0; i < M; i++) {
for (j = 0; j < K; j++) {
int ij = i * ldc + j;
C[ij] = 0.0;
for (k = 0; k < N; ++k) {
C[ij] += A[i * lda + k] * B[k * ldb + j];
}
}
}
}
```

This is a pretty straight forward implementation of the psuedo code given earlier. We just loop through the result elements of `C`

, and then calculate a dot product of the row of `A`

and column of `B`

.

Let’s compile and run this (I’ve put it in a file called `mm0.c`

):

```
$ cc -Ofast mm0.c -lopenblas -o mm0
$ time ./mm0 256
avgerr: 4.08e-29
./mm0 256 0.19s user 0.35s system 1376% cpu 0.039 total
$ time ./mm0 512
avgerr: 6.04e-27
./mm0 512 1.30s user 0.84s system 215% cpu 0.998 total
$ time ./mm0 512 0
./mm0 512 0 1.22s user 0.88s system 211% cpu 0.996 total
```

As you can see, the average error per entry is well below $10^{-20}$, which is definitely precise enough for a `double`

in C. However, this algorithm is slow; we are using $N^3$ multiplications and $N^3$ additions, so our FLOPS are $\frac{2 * 512^3}{0.996} \approx 269513510.040 \approx 270 Mflop/s$. This is very low, and as you can see by the time difference between the last two calls, OpenBLAS doesn’t add any significant amount of running time (so it is probably running 100x faster!).

The reason this implementation is so bad is because of a few different reasons:

- This code exhibits terrible cache locality; specifically the operation
`B[k * ldb + j]`

. Since`k`

changes every iteration, the memories accessed are`B+j`

, then`B+j+ldb`

, then`B+j+ldb*2`

, and so forth. Computers are optimized to read sequentially - We are using an $O(N^3)$ algorithm (although OpenBLAS is too). There do exist faster asymptotic algorithms (we’ll cover these in a future post), which can also take the running time down

#### Cache Locality

Cache locality will be extremely important for extracting performance in the rest of this series, so let’s explain it.

In (most) computers, memory (RAM) can be thought of as a big array of bytes (not counting segmented architectures, in which there are multiple arrays of bytes that are seperated). Accessing any random value in memory is a quite costly operation, so most CPUs implement caching. Typically, there are multiple layers of cache (`L1`

, `L2`

, etc), and each layer of cache is a different size, with the smaller ones tending to be faster and closer to the CPU/FPU. So, when you read a value from the main memory, it copies in all the bytes around it (normally in multiples of 4096 bytes) into one of these smaller caches. This way, if you access any values near it, it will only read from the cache instead of going on the costly journey out to the main memory.

In the diagram above, the `Core 1`

and `Core 2`

are CPU cores, and there are `L1`

, `L2`

, and `L3`

caches, as well as the main memory. When a memory address is requested, it checks each layer of cache, returning the value early if found.

The reason caching works is because it assumes that programs will use memory that is around other memory that was just used – normally this is true. Take a simple vector program where you iterate through a list of bytes (at address `x`

). The first byte you see is at `x`

, then `x+1`

, and then `x+2`

, and so on. So, once you request `x`

, the memory manager will load `x`

through `x+4096`

into a shared cache. So, whenever you request any of the next addresses in the list, it will return the value very fast.

However, in the above kernel, you do not loop through consecutive memory addresses, but rather through multiples of `ldb`

, which is 512 in our case. Therefore, instead of having to go out to main memory every `4096 / 8 == 512`

iterations like the code `A[i * lda + k]`

(`sizeof(double) == 8`

, so there are 8 bytes per element), it has to re-load a page every every `4096 / (8 * 512) == 8`

iterations. This is very bad indeed, and will hamper the performance as seen.

### Implementation 1: Loop Reorganization

To somewhat alleviate this problem, we’ll use a simple solution: re-organize the loops. Let’s take a look at the triply-nested for loop used in our first implementation:

```
for (i = 0; i < M; i++) {
for (j = 0; j < K; j++) {
for (k = 0; k < N; ++k) {
C[ij] += A[i * lda + k] * B[k * ldb + j];
}
}
}
```

Notice that the index operations `A[i * lda + k]`

and `B[k * ldb + j]`

are very different in their strides. Wouldn’t it be nice if both of those 2D index operations were more cache friendly? Remember that all we need to change is the sequence of operations so that offset `0`

, `1`

, and so forth are loaded relative to the first address to benefit from cache efficiency.

The index operation `A[i * lda + k]`

already is cache friendly, since `i`

varies in the outermost for loop and `k`

varies in the innermost, `k`

varies more quickly and therefore the expression `i * lda + k`

generates sequential indices between `i * lda`

and `i * (lda + 1)`

. On the other hand, `B[k * ldb + j]`

generates the indices `j`

, `j + ldb`

, `j + ldb * 2`

, and so forth. This is because `k`

is multiplied by `ldb`

, but it varies quicker than `j`

(since, again, `k`

is in the innermost loop).

We would like it so that both index expressions (which are of the form `M[x * y + z]`

) maintain the property that the loop that varies `z`

is inside of the loop that varies `x`

, so that way the values generated are consecutive and thus more cache friendly.

Thankfully, we can do exactly that by simply switching around the order of the for loop such that the code resembles:

```
for (i ... )
for (k ...)
for (j ...)
```

This way, `depth(iloop) < depth(kloop)`

and `depth(kloop) < depth(jloop)`

.

Here’s the new kernel, with a few adjustments (like initializing `C`

to zero since element accesses will no longer be in order for it)

```
/* Slightly better kernel, which reorganizes the loops */
void my_dgemm(int M, int N, int K, double* A, int lda, double* B, int ldb, double* C, int ldc) {
int i, j, k;
/* Initialize to zero (since now, accessing elements of 'C' will be out of order) */
for (i = 0; i < M * K; ++i) C[i] = 0.0;
for (i = 0; i < M; i++) {
for (k = 0; k < N; ++k) {
for (j = 0; j < K; j++) {
int ij = i * ldc + j;
C[ij] += A[i * lda + k] * B[k * ldb + j];
}
}
}
}
```

Here’s how it runs:

```
$ cc -Ofast mm1.c -lopenblas -o mm1
$ time ./mm1 256
avgerr: 4.00e-29
./mm1 256 0.07s user 0.15s system 1085% cpu 0.020 total
$ time ./mm1 512
avgerr: 6.06e-27
./mm1 512 0.35s user 0.91s system 1436% cpu 0.088 total
$ time ./mm1 512 0
./mm1 512 0 0.34s user 0.83s system 1482% cpu 0.079 total
```

Again, we are within the acceptable levels of error in our function, so the result is still correct.

Obviously, performance-wise, this is much better, but let’s graph it to see just how much better in comparison.

### Performance Review

Let’s actually graph the performance of these methods for sizes between 64 and 3072:

And here’s the graph of times under 3 seconds (more zoomed in, so it’s easier to read)

We can see that while the reference implementation is much faster, our `mm1`

implementation is much faster than `mm0`

, starting at `N=512`

. And, we quickly see a factor of 10x-20x speed up. So, we’ve successfully optimized it! But, we’re still a long way away

**matmul**post →

*
Cade Brown's personal blog, licensed under CC-BY 4.0.
*