Matrix Orthogonalization Improves Memory in Recurrent Models

Why This Matters

This article highlights how matrix orthogonalization techniques, inspired by the Muon optimizer, can significantly enhance the memory capabilities of recurrent neural networks. Improving associative recall in RNNs offers a promising alternative to transformers for long-horizon tasks, especially where computational efficiency is critical. This advancement could lead to more effective and resource-efficient models in applications like reinforcement learning and long-term sequence modeling.

06-30-2026

This work was funded by Paradigm.

Transformers exhibit remarkable associative recall (AR) abilities: attention provides each token direct access to those preceding it, a mechanism that has been hard for other architectures, like recurrent neural networks (RNNs), to match.

But in some domains, we can't afford the quadratic cost of attention in transformers. One example is long-horizon RL, in the style of Dreamer. For these kinds of applications, we need to make recurrent neural networks work, but we don't want to give up associative recall.

The best-known RNN for associative recall is mLSTM, a variant of LSTM that maintains a matrix memory. mLSTMs demonstrate substantially improved recall over baselines on one benchmark, MQAR. But pure recall may not be a sufficient measure of recurrent performance. In fields where environment transitions can be noisy, a useful proxy test is noisy associative recall (NAR).
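To make "matrix memory" concrete, here is a minimal sketch of an outer-product associative memory in plain Python. It is a simplification, not the mLSTM cell itself: the real cell also applies input and forget gates and a normalizer, which are omitted here. The helper names (`write`, `read`, `one_hot`) and the one-hot keys are assumptions chosen for illustration; one-hot keys are mutually orthogonal, so reading back a stored key returns its value exactly.

```python
# Simplified outer-product matrix memory (illustrative only; the full mLSTM
# cell adds gating and normalization, which this sketch leaves out).

def outer(value, key):
    """Outer product value * key^T as a list of rows."""
    return [[v * k for k in key] for v in value]

def write(memory, key, value):
    """Store a key-value pair by adding value * key^T to the memory matrix."""
    update = outer(value, key)
    return [[m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)]

def read(memory, query):
    """Retrieve memory @ query."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]

def one_hot(index, size):
    """Hypothetical key encoding: orthogonal one-hot vectors."""
    return [1.0 if i == index else 0.0 for i in range(size)]

DIM = 4
memory = [[0.0] * DIM for _ in range(DIM)]

key_a, key_b = one_hot(0, DIM), one_hot(1, DIM)
value_a = [0.5, -1.0, 2.0, 0.0]
value_b = [1.0, 1.0, 0.0, -3.0]

memory = write(memory, key_a, value_a)
memory = write(memory, key_b, value_b)

recalled_a = read(memory, key_a)  # equals value_a because the keys are orthogonal
recalled_b = read(memory, key_b)  # equals value_b
```

With keys that are not orthogonal, each read would also pick up a share of the other stored values, so recall would only be approximate.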

Since MQAR doesn't measure NAR, we can look at MAD's noisy AR task suite. Here's an example of what a task looks like:

0 9 3 10 12 13 15 14 0 9 5 8 2 9

Here, key 0 maps to value 9, key 3 maps to value 10, etc. The MAD generator uses distinct token ranges for keys, values, and distractors. So if keys are 0-5, then tokens 12-15 are distractors. A model good at NAR should predict 9 in the 10th position, having seen 0 -> 9 at the start, while ignoring the interleaved distractor tokens.
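The example can be checked mechanically. The sketch below walks the sequence, skips distractors, and records the value a model should predict each time a key reappears. It is not MAD's generator or evaluation code: `recall_targets` is a hypothetical helper, and the value range 6-11 is an assumption inferred from the values in the example (the text gives only the key and distractor ranges).

```python
# Token ranges: keys and distractors follow the text above; the value range
# 6-11 is an assumption inferred from the example sequence.
KEYS = range(0, 6)
VALUES = range(6, 12)
DISTRACTORS = range(12, 16)

def recall_targets(tokens):
    """Return {1-indexed position: expected value} for every repeated key."""
    seen = {}           # key -> value bound at the key's first appearance
    targets = {}
    pending_key = None  # most recent key still waiting for its value
    for position, token in enumerate(tokens, start=1):
        if token in DISTRACTORS:
            continue  # noise tokens carry no key-value information
        if token in KEYS:
            pending_key = token
        elif token in VALUES and pending_key is not None:
            if pending_key in seen:
                # The key was seen before, so this slot is a recall target.
                targets[position] = seen[pending_key]
            else:
                seen[pending_key] = token
            pending_key = None
    return targets

example = [0, 9, 3, 10, 12, 13, 15, 14, 0, 9, 5, 8, 2, 9]
targets = recall_targets(example)  # {10: 9}
```

The only recall target in this sequence is position 10, where key 0 reappears and the expected value is 9.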
