TL;DR: Imported the full Linux kernel history into pgit. 1,428,882 commits, 24.4 million file versions, 20 years of development, stored in PostgreSQL with delta compression. Actual data: 2.7 GB (git gc --aggressive gets 1.95 GB). The import took 2 hours on a dedicated server. Then I started asking questions. 7 f-bombs in 1.4 million commit messages (all from 2 people). 665 bug fixes pointing at a single commit. A filesystem that took 13 years to merge. Here's what the Linux kernel looks like as a SQL database.
The import
This post builds on "pgit: What If Your Git History Was a SQL Database?". If you haven't read it, start there. Short version: pgit is a Git-like CLI where everything lives in PostgreSQL instead of the filesystem. It uses pg-xpatch for transparent delta compression and makes your entire commit history SQL-queryable. After the pgit post hit the HN front page and got picked up by TLDR, console.dev, and dailydev, I teased that I was importing the Linux kernel. Here's what happened.
The Linux kernel is one of the largest actively developed repositories in the world: 1.4 million commits spanning 20 years, 171,000 files, 38,000 contributors. From what I've found, only a handful of version control systems besides Git have ever managed a full import of the kernel's history. Fossil (SQLite-based, by the SQLite team) never did. Darcs and Monotone attempted it and ran into severe performance problems. Mercurial can do it. Correct me if I'm wrong on any of this.
pgit handled it.
| Metric | Value |
|---|---|
| Commits | 1,428,882 |
| File versions (file refs) | 24,384,844 |
| Unique blobs | 3,089,589 |
| Unique paths | 171,525 |
| Path groups (delta chains) | 137,600 |
| Import time | 2h 0m 48s |
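The numbers in the table imply a few rates that are easier to reason about than the raw totals. A small Python sketch, using only the figures reported above (the variable and function names are mine, chosen for illustration):

```python
# Figures copied from the import table; nothing here is measured independently.
commits = 1_428_882
file_versions = 24_384_844
unique_blobs = 3_089_589
unique_paths = 171_525
path_groups = 137_600
import_seconds = 2 * 3600 + 0 * 60 + 48  # 2h 0m 48s


def derived_metrics():
    """Return simple ratios computed from the reported totals."""
    return {
        "commits_per_second": commits / import_seconds,
        "file_versions_per_second": file_versions / import_seconds,
        "file_versions_per_commit": file_versions / commits,
        "file_versions_per_blob": file_versions / unique_blobs,
        "paths_per_group": unique_paths / path_groups,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.2f}")
```

That works out to roughly 197 commits per second over the whole import, and about 17 file versions per commit on average.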
The import ran on a Hetzner dedicated server in Finland: AMD EPYC 7401P (24 cores / 48 threads), 512 GB DDR4 ECC RAM, 2×1.92 TB SSD in RAID 0. With a 350 GB xpatch content cache, the entire decoded repository fits in memory.
Full server setup, git baseline, and pgit configuration

The server

Hetzner Dedicated "Server Auction" from their Finland datacenter (HEL1):

| Component | Spec |
|---|---|
| CPU | AMD EPYC 7401P (24 cores / 48 threads) |
| RAM | 16×32 GB DDR4 ECC reg. (512 GB total) |
| Storage | 2×Micron SSD SATA 1.92 TB Datacenter (RAID 0) |
| NIC | 1 Gbit Intel I350 |
| Cost | ~€272/month |

OS installation

Hetzner installimage with Ubuntu 24.04 LTS. Two changes from the default config: RAID 0 (SWRAIDLEVEL 0) for maximum throughput (no redundancy needed for ephemeral analysis work), and a simple partition layout:

    PART /boot ext3 1024M
    PART swap  swap 4G
    PART /     ext4 all

This gives ~3.5 TB usable storage across the two 1.92 TB SSDs.

OS tuning

After booting into the installed image:

    # Update the system and install tooling
    apt update && apt upgrade -y
    apt install -y \
      tmux btop htop iotop \
      cpufrequtils numactl \
      git curl wget unzip \
      build-essential \
      ufw \
      linux-tools-common linux-tools-$(uname -r)

    # Set the CPU frequency governor to "performance", now and on boot
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$cpu"
    done
    cat > /etc/default/cpufrequtils << 'EOF'
    GOVERNOR="performance"
    EOF
    systemctl enable cpufrequtils
    systemctl restart cpufrequtils

    # Boot with CPU vulnerability mitigations turned off
    sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0"/GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 mitigations=off"/' /etc/default/grub.d/hetzner.cfg
    update-grub

    # Kernel VM and NUMA settings
    cat >> /etc/sysctl.conf << 'EOF'
    vm.swappiness = 1
    vm.dirty_ratio = 5
    vm.dirty_background_ratio = 2
    kernel.numa_balancing = 1
    EOF
    sysctl -p

    # Disable transparent huge pages, now and on boot
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    cat > /etc/systemd/system/disable-thp.service << 'EOF'
    [Unit]
    Description=Disable Transparent Huge Pages
    DefaultDependencies=no
    After=sysinit.target local-fs.target
    Before=basic.target

    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

    [Install]
    WantedBy=basic.target
    EOF
    systemctl daemon-reload
    systemctl enable disable-thp

    # Mount the root filesystem with noatime
    sed -i 's|relatime|noatime|g' /etc/fstab
    mount -o remount,noatime /

    # Firewall: allow SSH in, everything out
    ufw default deny incoming
    ufw default allow outgoing
    ufw allow ssh
    ufw --force enable

    # Install Go
    wget https://go.dev/dl/go1.26.0.linux-amd64.tar.gz
    rm -rf /usr/local/go && tar -C /usr/local -xzf go1.26.0.linux-amd64.tar.gz
    rm go1.26.0.linux-amd64.tar.gz
    echo 'export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin' >> ~/.bashrc
    source ~/.bashrc

    # Install Docker
    apt install -y docker.io
    systemctl enable docker
    systemctl start docker

    reboot

pg-xpatch container

Pulled the standard latest pg-xpatch Docker image:

    docker pull ghcr.io/imgajeed76/pg-xpatch:latest

pgit version

pgit v4 with a few local changes that weren't released at the time of the import. By the time you're reading this, they should be included in the latest version, so everything here is reproducible with a normal go install.

The main change is a seq ordering fix that replaces a monotonic timestamp hack with an explicit seq INTEGER NOT NULL column for commit ordering. This makes delta chain decompression significantly faster for sequential scans. Full changelist:

- db/schema.go: Added seq INTEGER NOT NULL column, order_by => 'seq' in xpatch.configure()
- db/commits.go: Added Seq field to struct, updated all INSERT/COPY statements
- cli/import.go: Populates Seq (1-indexed), removed monotonic timestamp hack
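To make the seq change concrete, here is a minimal sketch in Python (pgit itself is written in Go, and this is not its code). Git commit timestamps are not guaranteed to be unique or increasing, so ordering by time needs some workaround. The sketch assumes the old hack bumped colliding or out-of-order timestamps forward until they sorted uniquely; that is a guess at its shape, and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def monotonic_timestamps(timestamps):
    """Assumed shape of the old hack: rewrite timestamps so each one is
    strictly greater than the previous one, in import order."""
    fixed = []
    for ts in timestamps:
        if fixed and ts <= fixed[-1]:
            ts = fixed[-1] + timedelta(seconds=1)
        fixed.append(ts)
    return fixed


def assign_seq(commits):
    """Explicit ordering: pair each commit with its 1-indexed position
    in import order, leaving the commit data itself untouched."""
    return [(seq, commit) for seq, commit in enumerate(commits, start=1)]


if __name__ == "__main__":
    t0 = datetime(2005, 4, 16, 15, 20, 36)
    # Three commits: two share a timestamp, one is dated earlier than its parent.
    stamps = [t0, t0, t0 - timedelta(hours=1)]
    print(monotonic_timestamps(stamps))
    print(assign_seq(["commit-a", "commit-b", "commit-c"]))
```

The practical difference under these assumptions: the timestamp approach has to alter stored dates to get a unique order, while an explicit integer gives a total order and keeps the original dates intact.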