
Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU


NTransformer

High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
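The excerpt doesn't show how the NVMe direct I/O path is implemented. One common way to move bytes from an SSD straight into VRAM without a CPU bounce buffer is NVIDIA's cuFile (GPUDirect Storage) API; the sketch below only illustrates that idea. The function name, error handling, and the assumption that NTransformer works this way are all mine, and it needs a GDS-capable driver and filesystem plus `-lcufile` to build.

```cpp
// Hypothetical sketch of an NVMe-to-GPU read with cuFile (GPUDirect Storage).
// Not NTransformer's actual code; cuFileDriverOpen()/Close() omitted for brevity.
#define _GNU_SOURCE            // for O_DIRECT on glibc
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Read `nbytes` at `file_offset` of `path` directly into the device buffer `d_dst`.
static bool nvme_read_to_gpu(const char* path, off_t file_offset,
                             void* d_dst, size_t nbytes) {
    int fd = open(path, O_RDONLY | O_DIRECT);   // O_DIRECT: skip the page cache
    if (fd < 0) return false;

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &descr).err != CU_FILE_SUCCESS) {
        close(fd);
        return false;
    }

    // Register the destination VRAM buffer so the NVMe controller can DMA into it.
    cuFileBufRegister(d_dst, nbytes, 0);

    // DMA from the SSD into GPU memory; no CPU-side staging buffer is touched.
    ssize_t got = cuFileRead(fh, d_dst, nbytes, file_offset, /*devPtr_offset=*/0);

    cuFileBufDeregister(d_dst);
    cuFileHandleDeregister(fh);
    close(fd);
    return got == (ssize_t)nbytes;
}
```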

Key Results

| Model | Mode | Decode | VRAM | Notes |
|---|---|---|---|---|
| Llama 3.1 8B Q8_0 | Resident | 48.9 tok/s | 10.0 GB | All layers in VRAM |
| Llama 3.1 8B Q8_0 | Tiered (auto) | 48.8 tok/s | 10.3 GB | 32/32 layers auto-promoted to VRAM |
| Llama 3.1 70B Q6_K | Streaming (mmap) | 0.006 tok/s | 7.3 GB | Page-cache thrashing (53 GB model > 48 GB RAM) |
| Llama 3.1 70B Q6_K | Tiered (auto) | 0.2 tok/s | 23.1 GB | 29 layers in VRAM + 51 in RAM + 0 on NVMe |

3-tier adaptive caching auto-sizes from the available hardware: VRAM-resident layers (zero I/O), pinned RAM (H2D copy only), and an NVMe/mmap fallback. This achieves a 33x speedup over the mmap baseline for 70B on consumer hardware (RTX 3090 + 48 GB RAM).
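As a rough illustration of what "auto-sizes from hardware" can mean, here is a minimal greedy tier planner: query free VRAM, keep some headroom, and push whatever doesn't fit down to pinned RAM and then to NVMe/mmap. The tier names, the 4 GiB headroom constant, and the greedy policy are assumptions for the sketch, not NTransformer's actual algorithm.

```cpp
// Minimal sketch of 3-tier auto-sizing (illustrative only).
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

enum class Tier { VramResident, PinnedRam, NvmeOrMmap };

// Assign each transformer layer to the fastest tier that still has room:
// tier A = stays in VRAM (zero I/O), tier B = pinned host RAM (one H2D copy per
// token), tier C = NVMe / mmap fallback.
std::vector<Tier> plan_tiers(const std::vector<size_t>& layer_bytes,
                             size_t host_ram_budget) {
    size_t vram_free = 0, vram_total = 0;
    cudaMemGetInfo(&vram_free, &vram_total);

    // Leave headroom for activations, KV cache, and the CUDA context (assumed 4 GiB).
    const size_t headroom = 4ull << 30;
    size_t vram_budget = vram_free > headroom ? vram_free - headroom : 0;

    std::vector<Tier> plan;
    for (size_t bytes : layer_bytes) {
        if (bytes <= vram_budget) {
            vram_budget -= bytes;
            plan.push_back(Tier::VramResident);
        } else if (bytes <= host_ram_budget) {
            host_ram_budget -= bytes;
            plan.push_back(Tier::PinnedRam);
        } else {
            plan.push_back(Tier::NvmeOrMmap);
        }
    }
    // e.g. 70B Q6_K on a 3090 + 48 GB RAM box lands near 29 VRAM / 51 RAM / 0 NVMe.
    return plan;
}
```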

The bottleneck is PCIe host-to-device (H2D) bandwidth at Gen3 x8 (~6.5 GB/s). With Gen4 x16 (B550/X570), tier-B layers would become compute-bound, yielding ~0.5 tok/s.
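The measured 0.2 tok/s is consistent with being transfer-bound at Gen3 x8. A back-of-envelope check, using assumed numbers (~53 GB of Q6_K weights spread evenly over 80 layers, 51 tier-B layers copied per token):

```cpp
// Rough check of why decode lands near 0.2 tok/s on this box (assumed figures).
#include <cstdio>

int main() {
    const double model_gb      = 53.0;  // 70B Q6_K weight file
    const double layers_total  = 80.0;  // transformer layers in Llama 3.1 70B
    const double layers_stream = 51.0;  // tier-B layers copied H2D every token
    const double h2d_gbps      = 6.5;   // effective PCIe Gen3 x8 bandwidth

    double gb_per_token  = model_gb / layers_total * layers_stream; // ~33.8 GB
    double sec_per_token = gb_per_token / h2d_gbps;                 // ~5.2 s
    std::printf("~%.2f tok/s (transfer-bound)\n", 1.0 / sec_per_token); // ~0.19
    return 0;
}
```

This lands near 0.19 tok/s, matching the table. At Gen4 x16 the transfer time alone would no longer dominate, which is why the projection above is compute-bound at ~0.5 tok/s.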

Features

- Zero external dependencies beyond the CUDA Toolkit (no PyTorch, no cuBLAS)
- GGUF model format with Q4_0, Q8_0, Q4_K_M, Q6_K, F16, F32 quantization
- 3-Tier Adaptive Caching: auto-sized VRAM-resident + pinned RAM + NVMe/mmap tiers
