A CPU that runs entirely on GPU — registers, memory, flags, and program counter are all tensors.
Every ALU operation is a trained neural network.
Addition uses Kogge-Stone carry-lookahead. Multiplication uses a learned byte-pair lookup table.
Bitwise ops use neural truth tables. Shifts use attention-based bit routing. No hardcoded arithmetic.
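The Kogge-Stone structure mentioned above can be sketched directly in tensor ops. This is an exact reference implementation of the carry-lookahead prefix network, not the trained `arithmetic.pt` model; it just shows the dataflow the neural version learns to reproduce (bit width and helper names here are illustrative):

```python
import torch

def int_to_bits(x, n=8):
    # LSB-first bit decomposition of an integer tensor
    return ((x.unsqueeze(-1) >> torch.arange(n)) & 1).bool()

def bits_to_int(bits):
    n = bits.shape[-1]
    return (bits.long() << torch.arange(n)).sum(-1)

def kogge_stone_add(a, b, n=8):
    """Kogge-Stone carry-lookahead addition on bit tensors: log2(n)
    combine passes over (generate, propagate) pairs, all positions
    updated in parallel at each pass."""
    ab, bb = int_to_bits(a, n), int_to_bits(b, n)
    g = ab & bb          # generate
    p = ab ^ bb          # propagate
    d = 1
    while d < n:
        pad = torch.zeros_like(g[..., :d])
        g = g | (p & torch.cat([pad, g[..., :-d]], -1))
        p = p & torch.cat([pad, p[..., :-d]], -1)
        d *= 2
    # g[i] is now the carry out of position i; shift to get carry in
    carry_in = torch.cat([torch.zeros_like(g[..., :1]), g[..., :-1]], -1)
    return bits_to_int((ab ^ bb) ^ carry_in)
```

Each pass doubles the carry-propagation distance, which is why the combine step runs a logarithmic number of times rather than once per bit.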
## Quick Start
```shell
pip install -e ".[dev]"

# Run a program — all arithmetic through trained neural networks
python main.py --program programs/sum_1_to_10.asm

# Run with execution trace
python main.py --program programs/fibonacci.asm --trace

# Inline assembly
python main.py --inline "MOV R0, 42; HALT"

# GPU tensor mode (maximum speed, native tensor ops)
python main.py --binary firmware.bin --fast
```
## How It Works
The entire CPU lives on GPU. Registers, memory, flags, and the program counter are PyTorch tensors. Instruction decode, ALU dispatch, and state updates all happen on-device — nothing round-trips to the host CPU. Every ALU operation routes through a trained .pt model:
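A minimal sketch of what "the whole CPU is tensors" could look like. The field and function names below are illustrative, not the project's actual API; the point is that an instruction step is nothing but tensor reads and writes, so no value ever leaves the device:

```python
import torch

class CPUState:
    """All architectural state as tensors on one device."""
    def __init__(self, n_regs=16, mem_words=65536, device="cpu"):
        self.regs  = torch.zeros(n_regs, dtype=torch.int64, device=device)
        self.mem   = torch.zeros(mem_words, dtype=torch.int64, device=device)
        self.flags = torch.zeros(3, dtype=torch.bool, device=device)  # N, Z, C
        self.pc    = torch.zeros((), dtype=torch.int64, device=device)

def exec_mov_imm(state, rd, imm):
    # MOV Rd, #imm as pure tensor writes; the pc advances on-device too
    state.regs[rd] = imm
    state.pc += 1
```

Swapping `device="cuda"` keeps every step on the GPU; the host only ever sees tensors when it explicitly asks for them.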
| Instruction | Neural Model | How It Works |
|---|---|---|
| `ADD R0, R1, R2` | `arithmetic.pt` + `carry_combine.pt` | Kogge-Stone CLA (8 neural passes) |
| `SUB R0, R1, R2` | `arithmetic.pt` + `carry_combine.pt` | Two's complement + CLA |
| `MUL R0, R1, R2` | `multiply.pt` | Byte-pair LUT lookups (up to 64 pairs for 64-bit) |
| `DIV R0, R1, R2` | `arithmetic.pt` | Restoring division via neural subtraction |
| `AND R0, R1, R2` | `logical.pt` | Vectorized truth table (all 32 bits at once) |
| `OR R0, R1, R2` | `logical.pt` | Vectorized truth table |
| `XOR R0, R1, R2` | `logical.pt` | Vectorized truth table |
| `SHL R0, R1, 4` | `lsl.pt` | Attention-based bit routing per output position |
| `SHR R0, R1, 2` | `lsr.pt` | Attention-based bit routing |
| `CMP R0, R1` | `arithmetic.pt` | Neural subtraction → derive N/Z/C flags |
| `INC R0` | `arithmetic.pt` | Neural add 1 |
| `DEC R0` | `arithmetic.pt` | Neural subtract 1 |
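The "vectorized truth table" mechanic behind the bitwise rows can be sketched as follows. The tables below are exact stand-ins for whatever `logical.pt` learned, purely for illustration: a 2×2 table indexed by the two input bits, applied to all 32 bit positions in a single gather:

```python
import torch

# table[a_bit][b_bit] -> output bit
AND_TABLE = torch.tensor([[0, 0], [0, 1]])
XOR_TABLE = torch.tensor([[0, 1], [1, 0]])

def bitwise_op(a, b, table, n=32):
    """Apply a 2-input truth table to every bit position at once."""
    bits = torch.arange(n)
    a_bits = (a.unsqueeze(-1) >> bits) & 1
    b_bits = (b.unsqueeze(-1) >> bits) & 1
    out_bits = table[a_bits, b_bits]     # one parallel lookup per bit
    return (out_bits << bits).sum(-1)
```

Because the lookup is a tensor gather, AND, OR, and XOR differ only in the table contents, which matches a single `logical.pt` model serving all three instructions.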
Math functions (sin, cos, sqrt, exp, log, atan2) are also routed through trained models rather than native math ops.
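How a learned math function can slot in behind the same interface is easy to sketch. The stand-in below uses a dense lookup table with linear interpolation instead of a trained model (the table, grid size, and function name are illustrative, not the project's actual implementation):

```python
import math
import torch

# Dense table standing in for a trained sin model
N = 4096
GRID = torch.linspace(0, 2 * math.pi, N)
SIN_TABLE = torch.sin(GRID)

def neural_sin(x):
    """Approximate sin(x) by table lookup + linear interpolation."""
    x = torch.remainder(x, 2 * math.pi)      # wrap into [0, 2*pi)
    idx = x / (2 * math.pi) * (N - 1)
    lo = idx.floor().long().clamp(max=N - 2)
    frac = idx - lo
    return SIN_TABLE[lo] * (1 - frac) + SIN_TABLE[lo + 1] * frac
```

Either way, the caller sees a tensor-in, tensor-out function, so a trained `.pt` model and a lookup table are interchangeable at the instruction-dispatch level.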