Blazing Matrix Products
The first step towards higher performance is employing blocking to optimize cache access patterns. By using a straightforward square partitioning of the input matrices (without resorting to specialized assembly kernels, relying instead on the native BQN idiom), speed-ups of approximately sixfold are achievable for matrices that exceed the machine's cache size:

mat‿mbt ← ⟨⋈˜2⥊500, ⋈˜5⥊600⟩ /¨⊸⊔¨ ma‿mb ← •rand.Range⟜0¨1e3×⟨1‿1, 3‿3⟩   # split ma into 2×2 blocks of 500×500, mb into 5×5 blocks of 600×600
>⟨ma‿ma‿mat, mb‿mb‿mbt⟩ {𝕎˜•_timed𝕩}¨¨˜ <⟨Dgemm, +˝∘×⎉1‿∞, ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞⟩   # time X×X for each: Dgemm, the flat idiom, and the blocked idiom
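These one-liners are dense, so here is a minimal sketch of the same machinery on a toy 4×4 matrix; the matrix, the 2×2 block size, and the names m, blocks, and p are illustrative only, not part of the benchmark. The partition spec lists the block lengths per axis, /¨ expands them into group indices, ⊔ cuts the matrix into an array of blocks, and ∾ joins a block array back into a flat matrix; the blocked product applies the flat idiom block-wise and sums along the shared block axis.

m ← 4‿4 ⥊ ↕16                          # toy 4×4 matrix (illustrative only)
blocks ← ⟨2‿2, 2‿2⟩ /¨⊸⊔ m             # 2×2 arrangement of 2×2 blocks
m ≡ ∾ blocks                            # ∾ reassembles the original: 1
p ← blocks (+˝+˝∘×⎉1‿∞¨)⎉1‿∞ blocks    # blocked product, one block row at a time
(m +˝∘×⎉1‿∞ m) ≡ ∾ p                    # agrees with the flat idiom: 1

Because each block fits comfortably in cache, every inner +˝∘×⎉1‿∞ call streams over data that stays resident, which is where the speed-up over the flat idiom comes from.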