
How Neural Super Sampling Works: Architecture, Training, and Inference


This blog post is the second in our Neural Super Sampling (NSS) series. The post explores why we introduced NSS and explains its architecture, training, and inference components.

In August 2025, we announced Arm neural technology that will ship in Arm GPUs in 2026. The first use case of the technology is Neural Super Sampling (NSS). NSS is a next-generation, AI-powered upscaling solution. Developers can already start experimenting with NSS today, as discussed in the first post of this two-part series.

In this blog post, we take a closer look at how NSS works, covering everything from training and network architecture to post-processing and inference. The deep dive is aimed at ML engineers and mobile graphics developers who want to understand the technique and how it can be deployed on mobile hardware.

Why we replaced heuristics with Neural Super Sampling

Temporal super sampling (TSS), also known as temporal anti-aliasing (TAA), has become an industry-standard solution for anti-aliasing over the last decade. TSS offers several benefits: it addresses all types of aliasing, is compute-efficient for deferred rendering, and is extensible to upscaling. However, it is not without its challenges. The hand-tuned heuristics commonly used in TSS approaches today are difficult to scale and require continual adjustment across varied content. Issues like ghosting, disocclusion artifacts, and temporal instability remain, and these problems worsen when combined with upscaling.

NSS overcomes these limitations by using a trained neural model. Instead of relying on static rules, it learns from data. It generalizes across conditions and content types, adapting to motion dynamics and identifying aliasing patterns more effectively. These capabilities help NSS handle edge cases more reliably than approaches such as AMD’s FSR 2 and Arm ASR.

Training the NSS network: Recurrent learning with feedback

NSS is trained on sequences of 540p frames rendered at 1 sample per pixel (1spp). Each frame is paired with a 1080p ground-truth image rendered at 16spp. Sequences are about 100 frames long, which helps the model learn how image content changes over time.
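
To make the data setup concrete, here is a minimal sketch of how one training sequence could be organized. The field names, tensor shapes, and PyTorch types below are our own assumptions for illustration, not the actual NSS data format.

```python
from dataclasses import dataclass
import torch

@dataclass
class TrainingSequence:
    """One training sequence of paired low-res inputs and high-res references.

    Shapes are illustrative assumptions: 540p (960x540) inputs at 1spp,
    1080p (1920x1080) ground truth at 16spp, T frames per sequence.
    """
    color:  torch.Tensor   # (T, 3, 540, 960)  rendered color
    depth:  torch.Tensor   # (T, 1, 540, 960)  depth buffer
    motion: torch.Tensor   # (T, 2, 540, 960)  screen-space motion vectors
    target: torch.Tensor   # (T, 3, 1080, 1920) 16spp ground-truth frames

# Sequences are ~100 frames in practice; a short T keeps this example light.
T = 4
seq = TrainingSequence(
    color=torch.zeros(T, 3, 540, 960),
    depth=torch.zeros(T, 1, 540, 960),
    motion=torch.zeros(T, 2, 540, 960),
    target=torch.zeros(T, 3, 1080, 1920),
)
```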

Inputs include rendered images, such as color, motion vectors, and depth, alongside engine metadata such as jitter vectors and camera matrices. The model is trained recurrently: it runs forward across a sequence of multiple frames before performing each backpropagation step. This approach lets the network propagate gradients through time and learn how to accumulate information across frames.
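
A minimal sketch of this recurrent training pattern is shown below, reusing the TrainingSequence fields from the earlier sketch. It assumes a PyTorch-style model that takes the current frame's inputs plus a recurrent state and returns the upscaled frame and an updated state; the model, optimizer, and loss function are placeholders, not Arm's implementation.

```python
import torch

def train_on_sequence(model, optimizer, seq, loss_fn, device="cpu"):
    """Run the model forward over a whole frame sequence, then backpropagate once.

    `model(inputs, state)` is assumed to return (upscaled_frame, new_state),
    mirroring the recurrent, feedback-driven training described above.
    """
    model.train()
    state = None                      # recurrent feedback (hidden state / history)
    outputs, targets = [], []

    T = seq.color.shape[0]
    for t in range(T):
        inputs = {
            "color":  seq.color[t:t + 1].to(device),
            "depth":  seq.depth[t:t + 1].to(device),
            "motion": seq.motion[t:t + 1].to(device),
        }
        pred, state = model(inputs, state)   # gradients flow through `state` over time
        outputs.append(pred)
        targets.append(seq.target[t:t + 1].to(device))

    # One backward pass through the whole unrolled sequence (backprop through time).
    loss = loss_fn(torch.cat(outputs), torch.cat(targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```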

The network is trained using a spatiotemporal loss function, which simultaneously penalizes errors in spatial fidelity and temporal consistency. The spatial term keeps each frame sharp, detailed, and visually accurate, preserving edges, textures, and fine structures. The temporal term discourages flickering, jittering, and other forms of temporal noise across consecutive frames.
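
The post does not spell out the exact loss formulation, but one common way to combine these two objectives looks like the sketch below: a per-frame L1 term against the 16spp reference, plus a penalty on frame-to-frame changes in the prediction that the reference does not exhibit. The specific terms and weighting are assumptions, not Arm's published loss.

```python
import torch
import torch.nn.functional as F

def spatiotemporal_loss(pred, target, temporal_weight=0.5):
    """Combine a spatial-fidelity term with a temporal-stability term.

    pred, target: (T, 3, H, W) sequences of predicted and reference frames.
    """
    # Spatial fidelity: per-frame error against the ground-truth frames.
    spatial = F.l1_loss(pred, target)

    # Temporal stability: penalize predicted frame-to-frame changes that
    # the reference does not show (flicker, jitter, temporal noise).
    pred_delta = pred[1:] - pred[:-1]
    target_delta = target[1:] - target[:-1]
    temporal = F.l1_loss(pred_delta, target_delta)

    return spatial + temporal_weight * temporal
```

This function can be passed as `loss_fn` to the training loop sketched above.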
