Show HN: I Parallelized RNN Training from O(T) to O(log T) Using CUDA
Project repository: GitHub

For my final project in CS179: GPU Programming, I decided to implement the paper "Were RNNs All We Needed?" by Feng et al. The paper's core claim is that, with minor simplifications to LSTMs and GRUs, their recurrences can be rewritten in a form amenable to the parallel scan algorithm. This turns training and inference from an $O(T)$ sequential process into an $O(\log T)$ parallel one, making these models far better suited to GPU acceleration. My goal was to verify t
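To make the idea concrete, here is a minimal sketch (in plain Python, not the CUDA implementation) of why the simplified recurrence parallelizes. The key observation from the paper is that once the gates no longer depend on $h_{t-1}$, the update has the linear form $h_t = a_t h_{t-1} + b_t$, and composing two such steps yields another step of the same form. That composition is associative, so the whole sequence can be solved with a parallel scan in $O(\log T)$ steps. The variable names and the toy gate values below are illustrative, not from the paper:

```python
import math

# One step of the simplified recurrence: h_t = a_t * h_{t-1} + b_t.
# Composing step (a1, b1) followed by (a2, b2) gives
#   h = a2 * (a1 * h + b1) + b2 = (a1 * a2) * h + (a2 * b1 + b2),
# i.e. another step of the same form. This operator is associative,
# which is exactly what a parallel scan requires.
def combine(x, y):
    a1, b1 = x
    a2, b2 = y
    return (a1 * a2, a2 * b1 + b2)

def sequential(a, b, h0):
    """Baseline O(T) loop, as in a classic RNN forward pass."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def inclusive_scan(pairs):
    """Inclusive scan by recursive doubling (Hillis–Steele style):
    O(log T) rounds, each of which is fully parallel on a GPU."""
    pairs = list(pairs)
    n, step = len(pairs), 1
    while step < n:
        new = pairs[:]
        for i in range(step, n):
            new[i] = combine(pairs[i - step], pairs[i])
        pairs = new
        step *= 2
    return pairs

# Toy minGRU-style coefficients: a_t = 1 - z_t, b_t = z_t * htilde_t
# (z and htilde values below are made up for illustration).
z = [0.3, 0.7, 0.5, 0.9]
htilde = [1.0, -2.0, 0.5, 3.0]
a = [1 - zt for zt in z]
b = [zt * ht for zt, ht in zip(z, htilde)]
h0 = 0.0

seq = sequential(a, b, h0)
par = [at * h0 + bt for (at, bt) in inclusive_scan(zip(a, b))]
assert all(math.isclose(s, p) for s, p in zip(seq, par))
```

The CUDA version replaces the inner Python loop of each round with one thread per position, so the depth of the computation is the number of rounds, $\log_2 T$, rather than $T$.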