
Basic Facts about GPUs



last updated: 2025-06-18

I’ve been trying to get a better sense of how GPUs work. I’ve read a lot online, but the following posts were particularly helpful:

This post collects various facts I learned from these resources.

Acknowledgements: Thanks to Alex McKinney for comments on independent thread scheduling.


Compute and memory hierarchy

A GPU can compute much faster than it can read from its main memory, and that imbalance shapes most of what follows. An NVIDIA A100 GPU, for example, can perform 19.5 trillion 32-bit floating-point operations per second (19.5 TFLOPS), but its memory bandwidth is only about 1.5 terabytes per second (TB/s). At 4 bytes per value, that bandwidth delivers roughly 375 billion 32-bit numbers per second, so in the time it takes to read a single one, the GPU could have performed about 52 calculations.
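The ratio above is plain arithmetic, and a minimal Python sketch makes it easy to redo for other hardware. The constants are the approximate A100 figures quoted in the paragraph; the function and variable names are illustrative only.

```python
# Approximate A100 figures quoted in the text.
PEAK_FP32_FLOPS = 19.5e12        # FP32 operations per second
MEM_BANDWIDTH_BYTES = 1.5e12     # bytes per second from global memory
BYTES_PER_FP32 = 4               # size of one 32-bit float


def flops_per_value_loaded(peak_flops, bandwidth_bytes, bytes_per_value):
    """Operations the GPU could perform in the time it takes to load one value."""
    values_per_second = bandwidth_bytes / bytes_per_value
    return peak_flops / values_per_second


ratio = flops_per_value_loaded(PEAK_FP32_FLOPS, MEM_BANDWIDTH_BYTES, BYTES_PER_FP32)
print(f"about {ratio:.0f} FP32 operations per 4-byte value loaded")
```

Swapping in another GPU's peak throughput and bandwidth gives the same kind of estimate for that device.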

Below is a diagram of the compute and memory hierarchy for an NVIDIA A100 GPU. The FLOPS and TB/s figures I quote are specific to the A100.

+-----------------------------------------------------------------------------------+
|                               Global Memory (VRAM)                                |
|                            (~40 GB, ~1.5 TB/s on A100)                            |
+-----------------------------------------+-----------------------------------------+
                                          | (Slow off-chip bus)
+-----------------------------------------v-----------------------------------------+
|                           Streaming Multiprocessor (SM)                           |
|                (1 of 108 SMs on an A100, each ~(19.5/108) TFLOPS)                 |
|                        (2048 threads, 64 warps, 32 blocks)                        |
| +-------------------------------------------------------------------------------+ |
| |                        Shared Memory (SRAM) / L1 Cache                        | |
| |                    (~192 KB on-chip workbench, 19.5 TB/s)                     | |
| +-------------------------------------------------------------------------------+ |
| |                        Register File (~256 KB, ? TB/s)                        | |
| +-------------------------------------------------------------------------------+ |
| |                                                                               | |
| |  //-- A "Block" of threads runs on one SM --//                                | |
| | +-------------------------+ +-------------------------+                       | |
| | | Warp 0 (32 thr)         | | Warp 1 (32 thr)         | ... (up to 32 warps)  | |
| | | +---------------------+ | | +---------------------+ |                       | |
| | | | Thread 0 Registers  | | | | Thread 32 Registers | |                       | |
| | | | [reg0: float]       | | | | [reg0: float]       | |                       | |
| | | | [reg1: float] ...   | | | | [reg1: float] ...   | |                       | |
| | | +---------------------+ | | +---------------------+ |                       | |
| | +-------------------------+ +-------------------------+                       | |
| |                                                                               | |
| +-------------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
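The per-SM numbers in the diagram are all derived by simple division, which a short Python sketch can check. Everything here comes from the diagram's own labels (108 SMs, 2048 threads per SM, 32 threads per warp, up to 32 warps per block); the variable names are my own.

```python
# Figures read off the A100 diagram.
PEAK_FP32_TFLOPS = 19.5          # whole-GPU FP32 throughput
NUM_SMS = 108                    # streaming multiprocessors on an A100
MAX_THREADS_PER_SM = 2048        # resident threads per SM
WARP_SIZE = 32                   # threads per warp
MAX_WARPS_PER_BLOCK = 32         # "up to 32 warps" in one block

# Each SM contributes an equal share of the peak throughput.
per_sm_tflops = PEAK_FP32_TFLOPS / NUM_SMS

# 2048 resident threads grouped into warps of 32.
warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE

# The largest block is 32 warps of 32 threads each.
max_threads_per_block = MAX_WARPS_PER_BLOCK * WARP_SIZE

print(f"per-SM throughput: ~{per_sm_tflops:.2f} TFLOPS")
print(f"warps per SM: {warps_per_sm}")
print(f"max threads per block: {max_threads_per_block}")
```

So a single full-size block occupies half of an SM's 64 warp slots, which is consistent with the diagram showing more than one block per SM.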
