Matrix Orthogonalization Improves Memory in Recurrent Models


06-30-2026

This work was funded by Paradigm.

Transformers exhibit remarkable associative recall (AR) abilities: attention provides each token direct access to those preceding it, a mechanism that has been hard for other architectures, like recurrent neural networks (RNNs), to match.

But for some domains, we can't afford the cost of attention, which grows quadratically with sequence length. One example is long-horizon RL, in the style of Dreamer. For these kinds of applications, we need to make recurrent neural networks work without giving up associative recall.

The best-known RNN for associative recall is mLSTM, a variant of LSTM that maintains a matrix memory. mLSTMs demonstrate substantially improved recall over baselines on one benchmark, MQAR. But pure recall may not be sufficient to measure recurrent performance. In domains where environment transitions can be noisy, a useful proxy test is noisy associative recall (NAR).

Since MQAR doesn't measure NAR, we can look at MAD's noisy AR task suite. Here's an example of what a task looks like:

0 9 3 10 12 13 15 14 0 9 5 8 2 9

Here, key 0 maps to value 9, key 3 maps to value 10, and so on. The MAD generator uses distinct token ranges for keys, values, and distractors. So if keys are 0-5, then tokens 12-15 are distractors. A model good at NAR should predict 9 in the 10th position: key 0 reappears in the 9th position, and the model has already seen 0 -> 9 at the start, so it has to recall that pair while ignoring the interleaved distractor tokens.
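To make the scoring rule concrete, here is a minimal sketch that walks a sequence like the one above and lists the positions where a model would be expected to recall a value. This is not the MAD generator's code. The function name `recall_targets` and its parameters are hypothetical, and it assumes keys are tokens 0-5, distractors are tokens 12 and up, and every key is immediately followed by its value, as in the example.

```python
from typing import Dict, List, Tuple


def recall_targets(
    tokens: List[int],
    num_keys: int = 6,            # assumption: keys are tokens 0-5, as in the example
    distractor_start: int = 12,   # assumption: tokens >= 12 are distractors
) -> List[Tuple[int, int]]:
    """Return (1-indexed position, expected value) for every repeated key."""
    memory: Dict[int, int] = {}   # first value seen for each key
    targets: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token >= distractor_start:
            # Distractor: carries no key-value information, skip it.
            i += 1
            continue
        if token < num_keys and i + 1 < len(tokens):
            # Key followed by its value.
            if token in memory:
                # Key seen before: the value slot (1-indexed i + 2)
                # should be predicted from memory.
                targets.append((i + 2, memory[token]))
            else:
                memory[token] = tokens[i + 1]
            i += 2
            continue
        i += 1  # stray token that fits neither case
    return targets


example = [0, 9, 3, 10, 12, 13, 15, 14, 0, 9, 5, 8, 2, 9]
print(recall_targets(example))  # [(10, 9)]
```

In the example, only key 0 repeats, so the only scored prediction is the value 9 at the 10th position. Keys 3, 5, and 2 each appear once and contribute nothing to recall accuracy.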
