The first step towards higher performance is employing blocking to optimize cache access patterns. By using a straightforward square partitioning of the input matrices (without resorting to specialized assembly kernels and instead relying on the native BQN idiom) speed-ups of approximately sixfold are achievable for matrices that exceed the machine's cache size:
mat‿mbt ← ⟨⋈˜2⥊500, ⋈˜5⥊600⟩ /¨⊸⊔¨ ma‿mb ← •rand.Range⟜0¨1e3×⟨1‿1, 3‿3⟩ >⟨ma‿ma‿mat, mb‿mb‿mbt⟩ {𝕎˜•_timed𝕩}¨¨˜ <⟨Dgemm, +˝∘×⎉1‿∞, ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞⟩
┌─ ╵ 0.008988871 0.646108393 0.37081367400000004 0.16528436400000002 45.110128999000004 7.460860705000001 ┘
This performance gain requires only a modest 10-character leap in the code, from +˝∘×⎉1‿∞ to ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞ . Let's abstract this logic into reusable code. For instance, the function below computes powers of a square matrix 𝕩 using blocks of size 𝕨 , padding with zeros as needed. This operation is particularly useful in domains like graph theory or analyzing Markov chains:
MPB ← {𝕩≢⊸↑∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞˜𝕩(⥊⟜𝕨¨∘⊢/¨⊸⊔𝕨⊸×↑⊣)⌈𝕨÷˜≢𝕩}
An empirical (naïve, really) search for the optimal block size yields:
(300+50×↕8) {𝕨⊸MPB•_timed𝕩}¨ <3e3‿3e3 •rand.Range 0
⟨ 8.30279774 10.112563361000001 9.781014477000001 9.670085717000001 7.556631647000001 10.970897867000001 7.570657628 10.231164773000001 ⟩
One might hypothesize that further performance could be gained by applying this blocking principle recursively to accommodate multiple levels of cache. This technique, known as nested tiling, can also be implemented easily, though experimentation shows it yields no improvement:
MPB2 ← {∾∾×_p¨_p¨(_p←{+˝∘𝔽⎉1‿∞})˜𝕩{𝕩⊔˜/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩}´𝕨} ⟨10‿60, 4‿250, 3‿500⟩ {𝕨⊸MPB2•_timed𝕩}¨ <3e3‿3e3•rand.Range 0
⟨ 14.096323785000001 9.16644102 7.668334754000001 ⟩
Having seemingly reached the limits of performance gains by optimizing memory access patterns, the next logical step is to attack the problem from a different axis: reducing the algorithm's asymptotic complexity. Here is a little divide-and-conquer (and cache-oblivious) algorithm in its classic radix-2 form. It works for any square matrix, regardless of dimension: if it is odd, we pad with an extra row and column, and then take back the original.
_strassen_ ← {𝕘≥≠𝕩 ? 𝕨𝔽𝕩; [a‿b,c‿d]‿[e‿f,g‿h] ← (2⊸⥊¨∘⊢/¨⊸⊔2⊸×↑⊣)¨⟜(⌈2÷˜≢¨)𝕨‿𝕩 p1‿p2‿p3‿p4‿p5‿p6‿p7 ← 𝕊´¨⟨a+d,e+h⟩‿⟨c+d,e⟩‿⟨a,f-h⟩‿⟨d,g-e⟩‿⟨a+b,h⟩‿⟨c-a,e+f⟩‿⟨b-d,g+h⟩ 𝕩≢⊸↑∾⟨p1+p4+p7-p5, p3+p5⟩≍⟨p2+p4, p1+p3+p6-p2⟩ }
Let's go somewhat big for a solid 9x speed-up over the naive implementation:
⟨+˝∘×⎉1‿∞, 600⊸MPB, +˝∘×⎉1‿∞ _strassen_ 256, Dgemm _strassen_ 256, Dgemm⟩ {𝕎˜•_timed𝕩}¨ <4096‿4096•rand.Range 0
⟨ 121.21441014300001 23.299975492 13.688074838 2.1399266160000003 0.400549596 ⟩
To the best of my ability, this marks the limit of what can be achieved with a pure, single-threaded BQN implementation.