# 🦀 Rust LLM from Scratch

A complete Large Language Model implementation in pure Rust with no external ML frameworks. Built from the ground up using only `ndarray` for matrix operations.

## 🚀 What This Is

This project demonstrates how to build a transformer-based language model from scratch in Rust, including:

- **Pre-training** on factual text completion
- **Instruction tuning** for conversational AI
- **Interactive chat mode** for testing
- **Full backpropagation** with gradient clipping
- **Modular architecture** with clean separation of concerns

## 🔍 Key Files to Explore

Start with these two core files to understand the implementation:

- `src/main.rs` - Training pipeline, data preparation, and interactive mode
- `src/llm.rs` - Core LLM implementation with forward/backward passes and training logic

## 🏗️ Architecture

The model uses a transformer-based architecture with the following components:

```
Input Text → Tokenization → Embeddings → Transformer Blocks → Output Projection → Predictions
```

### Project Structure

```
src/
├── main.rs               # 🎯 Training pipeline and interactive mode
├── llm.rs                # 🧠 Core LLM implementation and training logic
├── lib.rs                # 📚 Library exports and constants
├── transformer.rs        # 🔄 Transformer block (attention + feed-forward)
├── self_attention.rs     # 👀 Multi-head self-attention mechanism
├── feed_forward.rs       # ⚡ Position-wise feed-forward networks
├── embeddings.rs         # 📊 Token embedding layer
├── output_projection.rs  # 🎰 Final linear layer for vocabulary predictions
├── vocab.rs              # 📝 Vocabulary management and tokenization
├── layer_norm.rs         # 🧮 Layer normalization
└── adam.rs               # 🏃 Adam optimizer implementation

tests/
├── llm_test.rs               # Tests for core LLM functionality
├── transformer_test.rs       # Tests for transformer blocks
├── self_attention_test.rs    # Tests for attention mechanisms
├── feed_forward_test.rs      # Tests for feed-forward layers
├── embeddings_test.rs        # Tests for embedding layers
├── vocab_test.rs             # Tests for vocabulary handling
├── adam_test.rs              # Tests for optimizer
└── output_projection_test.rs # Tests for output layer
```
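To make the shape flow in the architecture diagram above concrete, here is a minimal sketch of the two ends of that pipeline (token-embedding lookup and the final output projection) written against plain `ndarray`. The toy dimensions, zero-initialised weights, and the names `embed` and `output_projection` are assumptions for this sketch, not the repo's actual API; the transformer blocks that sit in the middle are elided.

```rust
// Illustrative sketch only: toy sizes and made-up function names, not the repo's API.
use ndarray::Array2;

const VOCAB_SIZE: usize = 16; // toy size; the real vocabulary is built from the training data
const EMBED_DIM: usize = 8;   // toy size; the real model uses 128

/// Look up one embedding row per token id: result shape (seq_len, EMBED_DIM).
fn embed(token_ids: &[usize], table: &Array2<f32>) -> Array2<f32> {
    let mut out = Array2::<f32>::zeros((token_ids.len(), table.ncols()));
    for (i, &id) in token_ids.iter().enumerate() {
        out.row_mut(i).assign(&table.row(id));
    }
    out
}

/// Project hidden states to vocabulary logits: result shape (seq_len, VOCAB_SIZE).
fn output_projection(hidden: &Array2<f32>, w_out: &Array2<f32>) -> Array2<f32> {
    hidden.dot(w_out)
}

fn main() {
    // Zeros keep the example deterministic; the real code initialises weights randomly (rand_distr).
    let embedding_table = Array2::<f32>::zeros((VOCAB_SIZE, EMBED_DIM));
    let w_out = Array2::<f32>::zeros((EMBED_DIM, VOCAB_SIZE));

    let token_ids = [1usize, 4, 2]; // pretend output of the tokenizer
    let hidden = embed(&token_ids, &embedding_table); // transformer blocks would refine this
    let logits = output_projection(&hidden, &w_out);
    assert_eq!(logits.shape(), &[3, VOCAB_SIZE]);
    println!("logits shape: {:?}", logits.shape());
}
```

In the real model, three transformer blocks transform the embedded sequence before the output projection, as described in the configuration below.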
## 🧪 What The Model Learns

The implementation includes two training phases:

1. **Pre-training**: Learns basic world knowledge from factual statements
   - "The sun rises in the east and sets in the west"
   - "Water flows downhill due to gravity"
   - "Mountains are tall and rocky formations"
2. **Instruction Tuning**: Learns conversational patterns
   - "User: How do mountains form? Assistant: Mountains are formed through tectonic forces..."
   - Handles greetings, explanations, and follow-up questions

## 🚀 Quick Start

```bash
# Clone and run
git clone <your-repo>
cd llm
cargo run

# The model will:
# 1. Build vocabulary from training data
# 2. Pre-train on factual statements (100 epochs)
# 3. Instruction-tune on conversational data (100 epochs)
# 4. Enter interactive mode for testing
```

## 🎮 Interactive Mode

After training, test the model interactively:

```
Enter prompt: How do mountains form?
Model output: Mountains are formed through tectonic forces or volcanism over long geological time periods

Enter prompt: What causes rain?
Model output: Rain is caused by water vapor in clouds condensing into droplets that become too heavy to remain airborne
```

## 🧮 Technical Implementation

### Model Configuration

- **Vocabulary Size**: Dynamic (built from training data)
- **Embedding Dimension**: 128
- **Hidden Dimension**: 256
- **Max Sequence Length**: 80 tokens
- **Architecture**: 3 Transformer blocks + embeddings + output projection

### Training Details

- **Optimizer**: Adam with gradient clipping
- **Pre-training LR**: 0.0005 (100 epochs)
- **Instruction Tuning LR**: 0.0001 (100 epochs)
- **Loss Function**: Cross-entropy loss
- **Gradient Clipping**: L2 norm capped at 5.0 (see the sketch below, after the key-features list)

### Key Features

- **Custom tokenization** with punctuation handling
- **Greedy decoding** for text generation
- **Gradient clipping** for training stability
- **Modular layer system** with clean interfaces
- **Comprehensive test coverage** for all components
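The gradient-clipping step from the training details above fits in a few lines. This is a hedged illustration that assumes clipping by the combined (global) L2 norm with the 5.0 cap mentioned above; the actual code in `src/llm.rs` may clip per parameter instead, and the name `clip_gradients` is made up for the example.

```rust
// Sketch of gradient clipping by global L2 norm, assuming a 5.0 cap as in the README.
use ndarray::Array2;

const MAX_GRAD_NORM: f32 = 5.0;

/// Scale every gradient matrix in place so their combined L2 norm is at most MAX_GRAD_NORM.
fn clip_gradients(grads: &mut [Array2<f32>]) {
    // Global norm across all parameters' gradients, not per matrix.
    let total_sq: f32 = grads.iter().map(|g| g.mapv(|x| x * x).sum()).sum();
    let norm = total_sq.sqrt();
    if norm > MAX_GRAD_NORM {
        let scale = MAX_GRAD_NORM / norm;
        for g in grads.iter_mut() {
            g.mapv_inplace(|x| x * scale);
        }
    }
}

fn main() {
    // A single 2x2 gradient full of 10s has L2 norm 20, so it gets scaled down to 5.
    let mut grads = vec![Array2::from_elem((2, 2), 10.0_f32)];
    clip_gradients(&mut grads);
    let clipped: f32 = grads
        .iter()
        .map(|g| g.mapv(|x| x * x).sum())
        .sum::<f32>()
        .sqrt();
    assert!((clipped - MAX_GRAD_NORM).abs() < 1e-4);
    println!("clipped global norm = {clipped}");
}
```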
## 🔧 Development

```bash
# Run all tests
cargo test

# Test specific components
cargo test --test llm_test
cargo test --test transformer_test
cargo test --test self_attention_test

# Build optimized version
cargo build --release

# Run with verbose output
cargo test -- --nocapture
```

## 🧠 Learning Resources

This implementation demonstrates key ML concepts:

- **Transformer architecture** (attention, feed-forward, layer norm)
- **Backpropagation** through neural networks
- **Language model training** (pre-training + fine-tuning)
- **Tokenization** and vocabulary management
- **Gradient-based optimization** with Adam

Perfect for understanding how modern LLMs work under the hood!

## 📊 Dependencies

- `ndarray` - N-dimensional arrays for matrix operations
- `rand` + `rand_distr` - Random number generation for initialization

No PyTorch, TensorFlow, or Candle - just pure Rust and linear algebra!

## 🤝 Contributing

Contributions are welcome! This project is perfect for learning and experimentation.

### High Priority Features Needed

- **Model Persistence** - Save/load trained parameters to disk (currently all in-memory); see the sketch at the end of this README
- ⚡ **Performance optimizations** - SIMD, parallel training, memory efficiency
- 🎯 **Better sampling** - Beam search, top-k/top-p, temperature scaling
- 📊 **Evaluation metrics** - Perplexity, benchmarks, training visualizations

### Areas for Improvement

- **Advanced architectures** (multi-head attention, positional encoding, RoPE)
- **Training improvements** (different optimizers, learning rate schedules, regularization)
- **Data handling** (larger datasets, tokenizer improvements, streaming)
- **Model analysis** (attention visualization, gradient analysis, interpretability)

### Getting Started

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/model-persistence`
3. Make your changes and add tests
4. Run the test suite: `cargo test`
5. Submit a pull request with a clear description

### Code Style

- Follow standard Rust conventions (`cargo fmt`)
- Add comprehensive tests for new features
- Update documentation and README as needed
- Keep the "from scratch" philosophy - avoid heavy ML dependencies

### Ideas for Contributions

- 🚀 **Beginner**: Model save/load, more training data, config files
- 🔥 **Intermediate**: Beam search, positional encodings, training checkpoints
- ⚡ **Advanced**: Multi-head attention, layer parallelization, custom optimizations

Questions? Open an issue or start a discussion!
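As a starting point for the Model Persistence item above, here is a minimal sketch of saving and loading a single weight matrix with nothing but `std` I/O, in keeping with the "from scratch" philosophy. The file format, file name, and function names are hypothetical and not an existing API in this repo.

```rust
// Hypothetical save/load round-trip for one ndarray weight matrix (not repo API).
use ndarray::Array2;
use std::fs::File;
use std::io::{self, Read, Write};

fn save_matrix(path: &str, m: &Array2<f32>) -> io::Result<()> {
    let mut f = File::create(path)?;
    // Header: rows and cols as little-endian u64, then the values in logical row order.
    f.write_all(&(m.nrows() as u64).to_le_bytes())?;
    f.write_all(&(m.ncols() as u64).to_le_bytes())?;
    for &x in m.iter() {
        f.write_all(&x.to_le_bytes())?;
    }
    Ok(())
}

fn load_matrix(path: &str) -> io::Result<Array2<f32>> {
    let mut f = File::open(path)?;
    let mut buf8 = [0u8; 8];
    f.read_exact(&mut buf8)?;
    let rows = u64::from_le_bytes(buf8) as usize;
    f.read_exact(&mut buf8)?;
    let cols = u64::from_le_bytes(buf8) as usize;
    let mut data = Vec::with_capacity(rows * cols);
    let mut buf4 = [0u8; 4];
    for _ in 0..rows * cols {
        f.read_exact(&mut buf4)?;
        data.push(f32::from_le_bytes(buf4));
    }
    Array2::from_shape_vec((rows, cols), data)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e.to_string()))
}

fn main() -> io::Result<()> {
    let w = Array2::from_shape_fn((2, 3), |(i, j)| (i * 3 + j) as f32);
    save_matrix("weights.bin", &w)?;
    let restored = load_matrix("weights.bin")?;
    assert_eq!(w, restored);
    println!("round-trip OK: {:?}", restored.shape());
    Ok(())
}
```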