Rust LLM from Scratch
Demo video: RustGPT-demo-zoon.mp4
A complete Large Language Model implementation in pure Rust with no external ML frameworks. Built from the ground up using only ndarray for matrix operations.
What This Is
This project demonstrates how to build a transformer-based language model from scratch in Rust, including:
Pre-training on factual text completion
Instruction tuning for conversational AI
Interactive chat mode for testing
Full backpropagation with gradient clipping
Modular architecture with clean separation of concerns
Key Files to Explore
Start with these two core files to understand the implementation:
src/main.rs - Training pipeline, data preparation, and interactive mode
src/llm.rs - Core LLM implementation with forward/backward passes and training logic
Architecture
The model uses a transformer-based architecture with the following components:
Input Text → Tokenization → Embeddings → Transformer Blocks → Output Projection → Predictions
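The sketch below traces that pipeline using only ndarray, with illustrative names and the transformer blocks stubbed out; it is not the project's actual API (see src/llm.rs for that).

```rust
use ndarray::{Array2, Axis};

/// Toy forward pass showing the data flow only: token ids -> embedding rows
/// -> (transformer blocks, stubbed here) -> logits over the vocabulary.
/// Names and shapes are illustrative, not the project's real structs.
fn forward(
    token_ids: &[usize],
    embedding_table: &Array2<f32>, // (vocab_size, embedding_dim)
    output_weights: &Array2<f32>,  // (embedding_dim, vocab_size)
) -> Array2<f32> {
    // Gather one embedding row per token: (seq_len, embedding_dim)
    let hidden = embedding_table.select(Axis(0), token_ids);
    // The real model runs `hidden` through its transformer blocks here.
    // Final projection to vocabulary logits: (seq_len, vocab_size)
    hidden.dot(output_weights)
}

fn main() {
    let (vocab_size, dim) = (10, 4);
    let embedding_table = Array2::<f32>::ones((vocab_size, dim));
    let output_weights = Array2::<f32>::ones((dim, vocab_size));
    let logits = forward(&[1, 3, 7], &embedding_table, &output_weights);
    println!("logits shape: {:?}", logits.dim()); // (3, 10)
}
```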
Project Structure
src/
├── main.rs               # Training pipeline and interactive mode
├── llm.rs                # Core LLM implementation and training logic
├── lib.rs                # Library exports and constants
├── transformer.rs        # Transformer block (attention + feed-forward)
├── self_attention.rs     # Multi-head self-attention mechanism
├── feed_forward.rs       # Position-wise feed-forward networks
├── embeddings.rs         # Token embedding layer
├── output_projection.rs  # Final linear layer for vocabulary predictions
├── vocab.rs              # Vocabulary management and tokenization
├── layer_norm.rs         # Layer normalization
└── adam.rs               # Adam optimizer implementation

tests/
├── llm_test.rs               # Tests for core LLM functionality
├── transformer_test.rs       # Tests for transformer blocks
├── self_attention_test.rs    # Tests for attention mechanisms
├── feed_forward_test.rs      # Tests for feed-forward layers
├── embeddings_test.rs        # Tests for embedding layers
├── vocab_test.rs             # Tests for vocabulary handling
├── adam_test.rs              # Tests for optimizer
└── output_projection_test.rs # Tests for output layer
What The Model Learns
The implementation includes two training phases:
Pre-training: Learns basic world knowledge from factual statements
"The sun rises in the east and sets in the west"
"Water flows downhill due to gravity"
"Mountains are tall and rocky formations"

Instruction Tuning: Learns conversational patterns
"User: How do mountains form? Assistant: Mountains are formed through tectonic forces..."
Handles greetings, explanations, and follow-up questions
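In code, the two phases differ mainly in the strings they train on. A minimal sketch of how such data might be declared; the constant names below are hypothetical, and the actual training strings live in src/main.rs.

```rust
// Hypothetical constant names; the real training strings are defined in src/main.rs.
const PRETRAINING_DATA: &[&str] = &[
    "The sun rises in the east and sets in the west",
    "Water flows downhill due to gravity",
    "Mountains are tall and rocky formations",
];

const INSTRUCTION_DATA: &[&str] = &[
    "User: How do mountains form? Assistant: Mountains are formed through tectonic forces...",
];
```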
Quick Start
# Clone and run
git clone <your-repo>
cd llm
cargo run

# The model will:
# 1. Build vocabulary from training data
# 2. Pre-train on factual statements (100 epochs)
# 3. Instruction-tune on conversational data (100 epochs)
# 4. Enter interactive mode for testing
Interactive Mode
After training, test the model interactively:
Enter prompt: How do mountains form?
Model output: Mountains are formed through tectonic forces or volcanism over long geological time periods

Enter prompt: What causes rain?
Model output: Rain is caused by water vapor in clouds condensing into droplets that become too heavy to remain airborne
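Under the hood this is a simple read-generate-print loop. A minimal sketch, where `generate` is a hypothetical stand-in for the trained model's prediction call:

```rust
use std::io::{self, BufRead, Write};

// Hypothetical stand-in for running the trained model on a prompt.
fn generate(prompt: &str) -> String {
    format!("(model output for: {prompt})")
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    loop {
        print!("Enter prompt: ");
        io::stdout().flush()?;
        let mut line = String::new();
        if stdin.lock().read_line(&mut line)? == 0 {
            break; // EOF ends the session
        }
        println!("Model output: {}", generate(line.trim()));
    }
    Ok(())
}
```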
Technical Implementation
Model Configuration
Vocabulary Size: Dynamic (built from training data)
Embedding Dimension: 128
Hidden Dimension: 256
Max Sequence Length: 80 tokens
Architecture: 3 Transformer blocks + embeddings + output projection
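As constants, this configuration might look like the sketch below. The names are hypothetical (see src/lib.rs for the real definitions), and the vocabulary size is computed at runtime rather than fixed.

```rust
// Hypothetical constant names for illustration; see src/lib.rs for the real ones.
pub const EMBEDDING_DIM: usize = 128;
pub const HIDDEN_DIM: usize = 256;
pub const MAX_SEQ_LEN: usize = 80;
pub const NUM_TRANSFORMER_BLOCKS: usize = 3;
// The vocabulary size is not a constant: it is built from the training data at startup.
```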
Training Details
Optimizer: Adam with gradient clipping
Pre-training LR: 0.0005 (100 epochs)
Instruction Tuning LR: 0.0001 (100 epochs)
Loss Function: Cross-entropy loss
Gradient Clipping: L2 norm capped at 5.0
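The clipping step rescales gradients whenever their L2 norm exceeds 5.0, which keeps a single bad batch from destabilizing training. A minimal sketch with ndarray, shown here as one global-norm clip; the project's actual code in src/llm.rs may differ in detail.

```rust
use ndarray::Array2;

/// Clip a set of gradient matrices so their combined L2 norm is at most `max_norm`.
/// A sketch of the idea only, not the exact code in src/llm.rs.
fn clip_gradients(grads: &mut [Array2<f32>], max_norm: f32) {
    let total_norm = grads
        .iter()
        .map(|g| g.iter().map(|x| x * x).sum::<f32>())
        .sum::<f32>()
        .sqrt();
    if total_norm > max_norm {
        let scale = max_norm / total_norm;
        for g in grads.iter_mut() {
            g.mapv_inplace(|x| x * scale);
        }
    }
}

fn main() {
    let mut grads = vec![Array2::<f32>::from_elem((2, 2), 10.0)];
    clip_gradients(&mut grads, 5.0);
    println!("{grads:?}"); // rescaled so the global norm is 5.0
}
```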
Key Features
Custom tokenization with punctuation handling
Greedy decoding for text generation (sketched after this list)
Gradient clipping for training stability
Modular layer system with clean interfaces
Comprehensive test coverage for all components
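Greedy decoding picks the highest-scoring token at each step. A self-contained sketch of that argmax step (illustrative only; the real generation loop is in src/llm.rs):

```rust
use ndarray::Array1;

/// Greedy decoding: pick the argmax token id from a row of logits.
fn argmax(logits: &Array1<f32>) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let logits = Array1::from(vec![0.1_f32, 2.5, -1.0, 0.7]);
    assert_eq!(argmax(&logits), 1);
    println!("next token id: {}", argmax(&logits));
}
```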
Development
# Run all tests
cargo test

# Test specific components
cargo test --test llm_test
cargo test --test transformer_test
cargo test --test self_attention_test

# Build optimized version
cargo build --release

# Run with verbose output
cargo test -- --nocapture
Learning Resources
This implementation demonstrates key ML concepts:
Transformer architecture (attention, feed-forward, layer norm)
Backpropagation through neural networks
Language model training (pre-training + fine-tuning)
Tokenization and vocabulary management
Gradient-based optimization with Adam
Perfect for understanding how modern LLMs work under the hood!
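To make the attention item above concrete, here is a generic single-head scaled dot-product attention, softmax(QKᵀ/√d)·V, written with ndarray. It illustrates the concept only and is not the code in src/self_attention.rs.

```rust
use ndarray::{Array2, Axis};

/// Scaled dot-product attention for a single head: softmax(Q Kᵀ / sqrt(d)) V.
/// Generic illustration; the project's own version lives in src/self_attention.rs.
fn attention(q: &Array2<f32>, k: &Array2<f32>, v: &Array2<f32>) -> Array2<f32> {
    let d = q.ncols() as f32;
    let mut scores = q.dot(&k.t()) / d.sqrt();
    // Row-wise softmax over the attention scores.
    for mut row in scores.axis_iter_mut(Axis(0)) {
        let max = row.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        row.mapv_inplace(|x| (x - max).exp());
        let sum = row.sum();
        row.mapv_inplace(|x| x / sum);
    }
    scores.dot(v)
}

fn main() {
    let q = Array2::<f32>::ones((3, 4));
    let k = Array2::<f32>::ones((3, 4));
    let v = Array2::<f32>::ones((3, 4));
    println!("{:?}", attention(&q, &k, &v).dim()); // (3, 4)
}
```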
Dependencies
ndarray - N-dimensional arrays for matrix operations
rand + rand_distr - Random number generation for initialization
No PyTorch, TensorFlow, or Candle - just pure Rust and linear algebra!
Contributing
Contributions are welcome! This project is perfect for learning and experimentation.
High Priority Features Needed
Model Persistence - Save/load trained parameters to disk (currently all in-memory)
Performance optimizations - SIMD, parallel training, memory efficiency
Better sampling - Beam search, top-k/top-p, temperature scaling
Evaluation metrics - Perplexity, benchmarks, training visualizations
Areas for Improvement
Advanced architectures (multi-head attention, positional encoding, RoPE)
Training improvements (different optimizers, learning rate schedules, regularization)
Data handling (larger datasets, tokenizer improvements, streaming)
Model analysis (attention visualization, gradient analysis, interpretability)
Getting Started
1. Fork the repository
2. Create a feature branch: git checkout -b feature/model-persistence
3. Make your changes and add tests
4. Run the test suite: cargo test
5. Submit a pull request with a clear description
Code Style
Follow standard Rust conventions (cargo fmt)
Add comprehensive tests for new features
Update documentation and README as needed
Keep the "from scratch" philosophy - avoid heavy ML dependencies
Ideas for Contributions
Beginner: Model save/load, more training data, config files
Intermediate: Beam search, positional encodings, training checkpoints
Advanced: Multi-head attention, layer parallelization, custom optimizations
Questions? Open an issue or start a discussion!