Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

Why This Matters

NanoEuler shows that a GPT-2-scale language model can be built and trained entirely from scratch in C/CUDA, with transparency and educational value as the main goals. The whole pipeline, from tokenizer to fine-tuning, is hand-written and readable at a low level. It is not aimed at production; it is a resource for researchers and enthusiasts who want to understand the fundamentals of neural network training and optimization.

nanoeuler

A GPT-2-class language model built entirely from scratch in C/CUDA: no PyTorch, no autograd, no ML libraries. The forward and backward passes are written and verified by hand, and the whole training pipeline lives in this repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model (RLHF/DPO planned). A small showcase model runs on CPU (libm + OpenMP). A full from-scratch CUDA engine, with cuBLAS matmuls and a hand-written FlashAttention, is validated against a CPU reference by a full-model gradient check and trains a ~116M-parameter model on a single RTX 4070.

Status & honesty. This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant: the chat model demonstrates that the pretrain→SFT pipeline works end to end, but it is not a useful chatbot. The point of the project is the from-scratch engineering and the complete, understandable training pipeline.

    make check             # verify the backward pass (gradient check, double precision)
    make                   # build the training binary
    ./nanoeuler train      # train the small showcase model (~0.76M params)
    ./nanoeuler train big  # train the larger model (~10M params; meant for a GPU)
    ./nanoeuler chat       # REPL: type a prompt, the model continues it

Why "Euler"?

A residual block computes

x = x + f(x)

Read it as a step of numerical integration. The forward-Euler method advances an ordinary differential equation dx/dt = f(x) by

x(t+Δt) = x(t) + Δt · f(x(t))

With step size Δt = 1 this is exactly the residual update. So a deep residual network is a discretized ODE: depth is integration time, and each layer integrates the hidden state forward by one Euler step. This is the view behind work like Neural ODEs (a ResNet is the Euler discretization of a continuous flow). The project is named after Leonhard Euler, who gave us that integration method.
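The correspondence can be made concrete with a small sketch in C. Nothing here is taken from the repository: `euler_step`, `residual_stack`, and the vector field `f_decay` are hypothetical names, and `f_decay` (f(x) = -x) is chosen only because dx/dt = -x has the known solution x(t) = x(0)·exp(-t). In a real transformer each layer has its own f with its own weights; the sketch reuses one f across layers to keep the ODE comparison simple.

```c
#include <stddef.h>

/* Hypothetical signature for the function f inside a residual block. */
typedef void (*vector_field)(const double *x, double *out, size_t n);

/* Illustrative choice of f: f(x) = -x (exponential decay). */
void f_decay(const double *x, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = -x[i];
}

/* One forward-Euler step: x <- x + dt * f(x).
 * With dt = 1 this is the residual update x = x + f(x).
 * scratch must hold n doubles; it receives f(x). */
void euler_step(double *x, double *scratch, size_t n, double dt,
                vector_field f) {
    f(x, scratch, n);
    for (size_t i = 0; i < n; i++) x[i] += dt * scratch[i];
}

/* A stack of `depth` residual layers sharing the same f:
 * equivalent to integrating dx/dt = f(x) for time depth * dt. */
void residual_stack(double *x, double *scratch, size_t n, int depth,
                    double dt, vector_field f) {
    for (int layer = 0; layer < depth; layer++) {
        euler_step(x, scratch, n, dt, f);
    }
}
```

Starting from x = 1 and running 100 layers with dt = 0.01 integrates to t = 1 and lands close to exp(-1), while a single layer with dt = 1 reproduces the plain residual update. Shrinking dt while adding layers is the limit in which the discrete network approaches the continuous flow.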
