Show HN: I Parallelized RNN Training from O(T) to O(log T) Using CUDA

Resource Link Project Repository GitHub

For my final project in CS179: GPU Programming, I decided to implement the paper “Were RNNs All We Needed?” by Feng et al. The paper’s core claim is that by making minor simplifications to LSTMs and GRUs, their recurrence can be expressed in a form amenable to the parallel scan algorithm. This changes their training and inference from an $O(T)$ sequential process into an $O(\log T)$ parallel one, which helps with GPU acceleration.

My goal was to verify this claim by building both the simplified models (minGRU and minLSTM) and a custom CUDA implementation of the parallel scan to see how much of a speedup was actually achievable. The focus was less on the machine learning application and more on the raw computational performance and the experience of parallelizing a traditionally sequential algorithm.

The Sequential Bottleneck in Standard RNNs

Recurrent Neural Networks, by their very nature, process sequences one step at a time. The hidden state at time step $t$, denoted $h_t$, is a function of the input $x_t$ and the previous hidden state, $h_{t-1}$. This dependency is the fundamental barrier to parallelization.

Let’s look at a standard GRU. The update equations are: $r_t = \sigma(W_{ir}x_t + b_{ir} + W_{hr}h_{t-1} + b_{hr})$ $z_t = \sigma(W_{iz}x_t + b_{iz} + W_{hz}h_{t-1} + b_{hz})$ $n_t = \tanh(W_{in}x_t + b_{in} + r_t \odot (W_{hn}h_{t-1} + b_{hn}))$ $h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$

The reset gate ($r_t$) and update gate ($z_t$) both explicitly depend on $h_{t-1}$. You simply cannot compute the gates for the entire sequence in one shot, because each step requires the output from the one before it. This forces a sequential loop, which is notoriously inefficient on parallel hardware like GPUs.

GRU and LSTM –> minGRU and minLSTM

The crux of the paper is to remove this direct dependency. The simplified models, minGRU and minLSTM, redefine the gates to depend only on the current input, $x_t$.

For minGRU, the gates are simplified to:

... continue reading