cade.site -- Matrix Multiplication (#1) - Blocked Algorithms

Home | About | Contact | Archive

Matrix Multiplication (#1) - Blocked Algorithms

Block matrices are often used in linear algebra to simply formulae or to efficiently break a problem down into components (like, for example, recursive subdivision in FFTs). In this post, we implement blocked (sometimes called tiled) matrix multiplication, which will

If you’ve never heard of block matrices as a representation, you should read the above Wikipedia article and try it out on your own – you’ll notice things, like for example that if:

\(A = \begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix}\) \(B = \begin{bmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{bmatrix}\)

Then the following is still true, whether the elements in each matrix are scalars or block matrices:

\[C = AB = \begin{bmatrix} A_{00} B_{00} + A_{01} B_{10} & A_{00} B_{01} + A_{01} B_{11} \\ A_{10} B_{00} + A_{11} B_{10} & A_{10} B_{01} + A_{11} B_{11} \end{bmatrix}\]

(where multiplication is either scalar multiplication or full matrix multiplication of the sub blocks).

Blocked GEMM

Here’s a wonderful resource explaining the algorithm we’ll be implementing: https://bt.nitk.ac.in/c/17b/co332/notes/6-Tiled-MM.pdf. (that link describes implementation on a GPU, which we may get to later in this series. But, CPU implementations may also have some benefits)

Essentially, the idea of this algorithm is to pre-load 2D tiles into local memory (in C, we’ll declare them as local variables, but in GPU programming, you’d declare them as shared between threads in a workgroup).


Cade Brown's personal blog, licensed under CC-BY 4.0.