The first step towards higher performance is employing blocking to optimize cache access patterns. Using a straightforward square partitioning of the input matrices (no specialized assembly kernels, just the native BQN idiom), speed-ups of roughly sixfold are achievable for matrices that exceed the machine's cache size:
mat‿mbt ← ⟨⋈˜2⥊500, ⋈˜5⥊600⟩ /¨⊸⊔¨ ma‿mb ← •rand.Range⟜0¨1e3×⟨1‿1, 3‿3⟩  # random inputs and pre-blocked copies
>⟨ma‿ma‿mat, mb‿mb‿mbt⟩ {𝕎˜•_timed𝕩}¨¨˜ <⟨Dgemm, +˝∘×⎉1‿∞, ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞⟩  # rows: 1000² and 3000²; columns: Dgemm, idiom, blocked idiom
┌─
╵         0.008988871        0.646108393 0.37081367400000004
  0.16528436400000002 45.110128999000004   7.460860705000001
                                                            ┘
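The only new machinery here is the partitioning idiom /¨⊸⊔: the left argument lists the block lengths along each axis, /¨ expands them into group indices, and ⊔ cuts the matrix accordingly. A minimal sketch on a toy matrix (my own illustration, not part of the benchmark):

(⋈˜2⥊2) /¨⊸⊔ 4‿4⥊↕16   # a 2×2 arrangement of 2×2 blocks of the 4×4 matrix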
This performance gain requires only a modest 10-character leap in the code, from +˝∘×⎉1‿∞ to ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞. Let's abstract the logic into reusable code. For instance, the function below squares a matrix 𝕩 using blocks of size 𝕨, padding with zeros as needed; applied repeatedly it yields matrix powers, which are useful in domains like graph theory or the analysis of Markov chains:
MPB ← {𝕩≢⊸↑∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞˜𝕩(⥊⟜𝕨¨∘⊢/¨⊸⊔𝕨⊸×↑⊣)⌈𝕨÷˜≢𝕩}  # pad 𝕩 up to a multiple of 𝕨, cut into 𝕨×𝕨 blocks, square blockwise, trim back to ≢𝕩
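As a quick sanity check (my own, not part of the original timings), the blocked square agrees with the plain idiom up to floating-point effects:

m ← 6‿6 •rand.Range 0
⌈´⥊|(4 MPB m)-+˝∘×⎉1‿∞˜ m   # maximum absolute deviation; tiny residual from reordered sums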
An empirical (naïve, really) search for the optimal block size yields:
(300+50×↕8) {𝕨⊸MPB•_timed𝕩}¨ <3e3‿3e3 •rand.Range 0
⟨ 8.30279774 10.112563361000001 9.781014477000001 9.670085717000001 7.556631647000001 10.970897867000001 7.570657628 10.231164773000001 ⟩
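The winner can also be picked mechanically with Grade rather than by eye (a small helper of my own; with the times above it selects 500, though a rerun may land elsewhere given the noise):

sizes ← 300+50×↕8
sizes ⊑˜ ⊑⍋ sizes {𝕨⊸MPB•_timed𝕩}¨ <3e3‿3e3 •rand.Range 0   # block size with the lowest measured time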
One might hypothesize that further performance could be gained by applying this blocking principle recursively to accommodate multiple levels of cache. This technique, known as nested tiling, can also be implemented easily, though experimentation shows it yields no improvement:
MPB2 ← {∾∾×_p¨_p¨(_p←{+˝∘𝔽⎉1‿∞})˜𝕩{𝕩⊔˜/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩}´𝕨}  # _p lifts a product by one blocking level; the fold over 𝕨 nests the tiles
⟨10‿60, 4‿250, 3‿500⟩ {𝕨⊸MPB2•_timed𝕩}¨ <3e3‿3e3•rand.Range 0