Learn how to optimize Go data structures for modern CPU architectures. We'll explore cache lines, false sharing, and data-oriented design to achieve significant performance improvements in real-world applications.
Key Takeaways
Cache misses can slow down your code by roughly 60x compared to L1 cache hits
False sharing occurs when multiple cores update different variables in the same cache line
Proper data structure padding can improve performance by 5-10x in specific scenarios
Data-oriented design often beats object-oriented layouts for high-performance, batch-heavy systems
Always measure with benchmarks - cache effects are hardware-specific
The Numbers That Matter
L1 cache:   ~4 cycles  (~1 ns)   32 KB
L2 cache:  ~12 cycles  (~3 ns)  256 KB
L3 cache:  ~40 cycles (~10 ns)    8 MB
RAM:       200+ cycles (~60 ns)  32 GB
Cache line size: 64 bytes (on x86_64)
These are typical figures for a modern x86_64 machine; exact numbers vary by CPU.
Reading from RAM is approximately 60x slower than reading from L1 cache, so one cache miss costs roughly as much as 60 L1 hits. This is why cache-friendly code can run significantly faster - often 5-10x in specific scenarios.
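You can see the gap on your own hardware with a pair of micro-benchmarks (a sketch with hypothetical names): a sequential sum lets the prefetcher hide latency, while a pointer-chasing walk makes every load wait for the previous one, so the full miss cost shows up in the per-element time.

```go
import (
	"math/rand"
	"testing"
)

const n = 1 << 20 // 8 MiB of int64s - larger than L3 on many CPUs

// Sequential walk: the hardware prefetcher hides most of the memory latency.
func BenchmarkSequentialSum(b *testing.B) {
	data := make([]int64, n)
	var sum int64
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for j := range data {
			sum += data[j]
		}
	}
	_ = sum
}

// Dependent, randomized walk: each load needs the previous result, so the
// full cache-miss latency appears in the per-element time.
func BenchmarkRandomChase(b *testing.B) {
	next := make([]int64, n)
	for i := range next {
		next[i] = int64(i)
	}
	// Sattolo's algorithm builds a single-cycle permutation, so the chase
	// visits every element once per pass instead of hitting a short loop.
	for i := n - 1; i > 0; i-- {
		j := rand.Intn(i)
		next[i], next[j] = next[j], next[i]
	}
	b.ResetTimer()
	idx := int64(0)
	for i := 0; i < b.N; i++ {
		for j := 0; j < n; j++ {
			idx = next[idx]
		}
	}
	_ = idx
}
```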
False Sharing: The Silent Killer
False sharing occurs when multiple CPU cores modify different variables that happen to share the same cache line. This forces cache line invalidation across cores, causing significant performance degradation.
The problem is subtle: your variables might be logically independent, but if they're physically adjacent in memory (within 64 bytes), updating one invalidates the cache for all others on that line.
In our metrics collection system, we noticed 10x slower performance during high concurrency. The issue was multiple goroutines updating different counters that were packed in the same cache line.
Detection requires careful benchmarking with concurrent access patterns. The performance drop isn't visible in single-threaded tests, only under parallel load.
```go
// BROKEN: false sharing destroys performance.
type Counters struct {
	requests uint64 // 8 bytes
	errors   uint64 // 8 bytes - SAME cache line!
	latency  uint64 // 8 bytes - SAME cache line!
	timeouts uint64 // 8 bytes - SAME cache line!
}
// Result: 45ns/op under contention
```
```go
// FIXED: padding keeps each hot counter on its own cache line.
type Counters struct {
	requests uint64
	_        [56]byte // pad to 64 bytes (one cache line)
	errors   uint64
	_        [56]byte
	latency  uint64
	_        [56]byte
	timeouts uint64
	_        [56]byte
}
// Result: 7ns/op - 6.4x faster!
```
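The [56]byte pads hard-code a 64-byte line. If pulling in golang.org/x/sys/cpu is acceptable, its exported CacheLinePad type is sized per architecture; a sketch (PaddedCounters is a hypothetical name):

```go
import (
	"sync/atomic"

	"golang.org/x/sys/cpu"
)

// PaddedCounters uses CacheLinePad instead of hand-written byte arrays,
// so the padding tracks the target architecture's cache line size.
type PaddedCounters struct {
	requests uint64
	_        cpu.CacheLinePad
	errors   uint64
	_        cpu.CacheLinePad
	latency  uint64
	_        cpu.CacheLinePad
	timeouts uint64
	_        cpu.CacheLinePad
}

func (c *PaddedCounters) IncRequests() {
	atomic.AddUint64(&c.requests, 1)
}
```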
Measure Cache Misses (Linux)
```bash
# Profile cache misses
perf stat -e cache-misses,cache-references ./myapp

# Detailed cache analysis
perf record -e cache-misses ./myapp
perf report

# Go benchmark with perf
go test -bench=. -benchtime=10s &
pid=$!
perf stat -p $pid -e L1-dcache-load-misses,L1-dcache-loads
```
Data-Oriented Design: Array of Structs vs Struct of Arrays
```go
// Array of Structs (AoS) - cache UNFRIENDLY
type Entity struct {
	ID       uint64  // 8 bytes
	X, Y, Z  float64 // 24 bytes
	Velocity float64 // 8 bytes
	Health   int     // 8 bytes
	Type     string  // 16 bytes (string header)
	// Total: 64+ bytes per entity
}

type World struct {
	Entities []Entity // each entity spans a full cache line
}

func (w *World) UpdatePositions(dt float64) {
	for i := range w.Entities {
		// Loads 64+ bytes per entity but uses only 16 (X and Velocity):
		// most of each cache line is wasted.
		w.Entities[i].X += w.Entities[i].Velocity * dt
	}
}
// Benchmark: 85ns per entity
```
```go
// Struct of Arrays (SoA) - cache FRIENDLY
type World struct {
	IDs        []uint64
	Positions  [][3]float64 // X, Y, Z together
	Velocities []float64
	Healths    []int
	Types      []string
}

func (w *World) UpdatePositions(dt float64) {
	// Positions and Velocities are each contiguous in memory: a 64-byte
	// cache line holds ~2-3 positions and 8 velocities, and nothing
	// irrelevant is dragged in.
	for i := range w.Positions {
		w.Positions[i][0] += w.Velocities[i] * dt
	}
}
// Benchmark: 12ns per entity - 7x faster!
```
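If a pass only ever touches one component, it can pay to split further (WorldSoA2 is a hypothetical variant): with each component in its own slice, an X-only update streams through pure X data at 8 float64 values per 64-byte cache line.

```go
// Fully columnar layout: one slice per component.
type WorldSoA2 struct {
	Xs, Ys, Zs []float64
	Velocities []float64
}

func (w *WorldSoA2) UpdateX(dt float64) {
	for i := range w.Xs {
		w.Xs[i] += w.Velocities[i] * dt // two dense streams, nothing wasted
	}
}
```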
Prefetching: Help the CPU Predict
```go
// Linear access - the CPU prefetcher works great.
func SumLinear(data []int) int {
	sum := 0
	for i := 0; i < len(data); i++ {
		sum += data[i] // CPU prefetches data[i+1], data[i+2], ...
	}
	return sum
}
// 2.1ns per element

// Random access - the prefetcher can't help.
func SumRandom(data []int, indices []int) int {
	sum := 0
	for _, idx := range indices {
		sum += data[idx] // likely cache miss
	}
	return sum
}
// 45ns per element - 20x slower!

// Manual "prefetching" for known patterns: touch data a few iterations early.
func ProcessWithPrefetch(nodes []*Node) {
	for i := 0; i < len(nodes)-4; i++ {
		// Touch a future node so it is in cache when we reach it.
		_ = nodes[i+4].Value
		// Process the current node (touched 4 iterations ago).
		expensive(nodes[i])
	}
	// Note: the final four nodes still need a plain pass.
}
```
Hot/Cold Data Splitting
```go
// BROKEN: hot and cold data mixed
type User struct {
	ID      uint64 // HOT: accessed frequently
	Score   int    // HOT: accessed frequently
	Name    string // COLD: rarely accessed
	Email   string // COLD: rarely accessed
	Address string // COLD: rarely accessed
	Bio     string // COLD: rarely accessed
	// One User = 80+ bytes, but we often only need 16 bytes.
}

func TopUsers(users []User) []uint64 {
	// The sort loads 80+ bytes per user but only compares Score.
	sort.Slice(users, func(i, j int) bool {
		return users[i].Score > users[j].Score // cache thrashing!
	})
	ids := make([]uint64, len(users))
	for i := range users {
		ids[i] = users[i].ID
	}
	return ids
}

// FIXED: separate hot and cold data
type UserHot struct {
	ID    uint64
	Score int
	Cold  *UserCold // pointer to cold data
}

type UserCold struct {
	Name    string
	Email   string
	Address string
	Bio     string
}

// Now sorting only touches 24 bytes per user:
// roughly 4x fewer cache misses.
```
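If the data arrives as full User records, a one-time split into the hot/cold layout (SplitUsers is a hypothetical helper) usually costs less than thrashing the cache on every comparison of a large sort:

```go
// SplitUsers copies the hot fields into a compact slice and parks the
// rarely used fields behind a pointer.
func SplitUsers(users []User) []UserHot {
	hot := make([]UserHot, len(users))
	for i := range users {
		hot[i] = UserHot{
			ID:    users[i].ID,
			Score: users[i].Score,
			Cold: &UserCold{
				Name:    users[i].Name,
				Email:   users[i].Email,
				Address: users[i].Address,
				Bio:     users[i].Bio,
			},
		}
	}
	return hot
}
```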
NUMA-Aware Allocation
```go
// Check NUMA topology:
// $ numactl --hardware
// node 0 cpus: 0-23
// node 1 cpus: 24-47

// Pin the calling goroutine's OS thread to a specific CPU
// (uses golang.org/x/sys/unix).
func PinToCPU(cpuID int) {
	runtime.LockOSThread()
	var cpuSet unix.CPUSet
	cpuSet.Zero()
	cpuSet.Set(cpuID)
	tid := unix.Gettid()
	unix.SchedSetaffinity(tid, &cpuSet)
}

// NUMA-aware worker pool
type NUMAPool struct {
	workers [][]chan Task // workers[numaNode][workerID]
}

func (p *NUMAPool) Submit(task Task) {
	// Hash the task to a NUMA node for data locality.
	node := hash(task.Key) % len(p.workers)
	worker := rand.Intn(len(p.workers[node]))
	p.workers[node][worker] <- task
}

func (p *NUMAPool) StartWorker(numaNode, workerID int) {
	// Pin to a CPU in the chosen NUMA node (24 CPUs per node here).
	cpuID := numaNode*24 + workerID
	PinToCPU(cpuID)
	for task := range p.workers[numaNode][workerID] {
		processTask(task)
	}
}
```
Branch Prediction Friendly Code
```go
// BROKEN: unpredictable branches
func CountCondition(data []int) int {
	count := 0
	for _, v := range data {
		if v > 128 { // random data = ~50% misprediction
			count++
		}
	}
	return count
}
// With random data: 8.2ns per element

// FIXED: sort first so branches become predictable
func CountConditionSorted(data []int) int {
	sort.Ints(data) // now the branch is all-false, then all-true
	count := 0
	for _, v := range data {
		if v > 128 {
			count++
		}
	}
	return count
}
// Even with the sort overhead: 3.1ns per element

// BETTER: branchless version (assumes values in [0, 255])
func CountConditionBranchless(data []int) int {
	count := 0
	for _, v := range data {
		// No branch, just arithmetic: bit 7 is set exactly when v >= 128.
		count += int((uint(v) >> 7) & 1)
	}
	return count
}
// Always fast: 2.3ns per element
```
Cache-Conscious Hash Table
```go
// Standard map - random memory access
m := make(map[uint64]uint64) // cache misses everywhere

// Cache-friendly Robin Hood hashing with open addressing
type RobinHoodMap struct {
	buckets []bucket
	mask    uint64 // len(buckets) - 1; bucket count is a power of two
}

type bucket struct {
	key      uint64
	value    uint64
	distance uint8 // distance from the ideal position
	occupied bool
}

func (m *RobinHoodMap) Get(key uint64) (uint64, bool) {
	idx := key & m.mask
	distance := uint8(0)
	// Linear probing = cache-friendly access pattern.
	for {
		b := &m.buckets[idx]
		if !b.occupied {
			return 0, false
		}
		if b.key == key {
			return b.value, true
		}
		// Robin Hood invariant: poor keys don't travel far, so stop early.
		if b.distance < distance {
			return 0, false
		}
		idx = (idx + 1) & m.mask
		distance++
	}
}

// Benchmarks (1M lookups):
//   map[uint64]uint64: 95ns/op
//   RobinHoodMap:      32ns/op - 3x faster!
```
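To round out the example, here is a matching Set sketch (it assumes a power-of-two bucket count, a table that never fills up, and no growth logic):

```go
func (m *RobinHoodMap) Set(key, value uint64) {
	idx := key & m.mask
	distance := uint8(0)
	for {
		b := &m.buckets[idx]
		if !b.occupied {
			*b = bucket{key: key, value: value, distance: distance, occupied: true}
			return
		}
		if b.key == key {
			b.value = value
			return
		}
		// Robin Hood step: if the resident entry is closer to its ideal slot
		// than we are to ours, it is "richer" - swap and keep inserting the
		// displaced entry further along the probe sequence.
		if b.distance < distance {
			key, b.key = b.key, key
			value, b.value = b.value, value
			distance, b.distance = b.distance, distance
		}
		idx = (idx + 1) & m.mask
		distance++
	}
}
```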
Real Production Example: Analytics Pipeline
```go
// BEFORE: object-oriented, row-at-a-time design
type Event struct {
	Timestamp time.Time
	UserID    uint64
	Action    string
	Value     float64
	Tags      map[string]string
}

func ProcessEvents(events []Event) {
	for _, e := range events {
		// Each event access is a potential cache miss.
		if e.Action == "purchase" {
			updateRevenue(e.Value)
		}
	}
}

// AFTER: data-oriented, columnar design
type EventBatch struct {
	Timestamps []int64 // Unix timestamps
	UserIDs    []uint64
	Actions    []uint8 // enum instead of string
	Values     []float64
	// Tags stored separately (cold data)
	TagIndices []uint32
	TagKeys    []string
	TagValues  []string
}

func ProcessEventsBatch(batch *EventBatch) {
	// Process actions in a cache-friendly, columnar sweep.
	for i, action := range batch.Actions {
		if action == ActionPurchase {
			updateRevenue(batch.Values[i])
		}
	}
}

// Results:
//   Before: 450ns per event, 78% cache misses
//   After:   31ns per event, 12% cache misses
//   14.5x speedup!
```
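Converting from the row layout is a single up-front pass; a sketch (the action enum values and ToBatch are assumptions, and the cold tag columns are left out for brevity):

```go
const (
	ActionUnknown uint8 = iota
	ActionPurchase
	ActionView
)

var actionCodes = map[string]uint8{"purchase": ActionPurchase, "view": ActionView}

// ToBatch turns row-oriented events into the columnar layout, mapping the
// string action to a compact enum once instead of comparing strings in the
// hot loop.
func ToBatch(events []Event) *EventBatch {
	b := &EventBatch{
		Timestamps: make([]int64, len(events)),
		UserIDs:    make([]uint64, len(events)),
		Actions:    make([]uint8, len(events)),
		Values:     make([]float64, len(events)),
	}
	for i, e := range events {
		b.Timestamps[i] = e.Timestamp.Unix()
		b.UserIDs[i] = e.UserID
		b.Actions[i] = actionCodes[e.Action] // unknown actions map to ActionUnknown (0)
		b.Values[i] = e.Value
	}
	return b
}
```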
SIMD-Friendly Layouts
```go
// Give slice elements a SIMD-friendly, uniform 16-byte stride.
type Vec3 struct {
	X, Y, Z float32
	_       float32 // pad the struct size to 16 bytes
}

// A simple elementwise loop over the uniform layout. Note that the standard
// gc compiler currently does little auto-vectorization; layouts like this pay
// off most with hand-written assembly or a vectorizing toolchain.
func AddVectors(a, b, result []Vec3) {
	for i := 0; i < len(a); i++ {
		result[i].X = a[i].X + b[i].X
		result[i].Y = a[i].Y + b[i].Y
		result[i].Z = a[i].Z + b[i].Z
	}
}
```
Go gives no general guarantee of 64-byte alignment for a struct or array (a `[0]byte` field does not provide it). If a buffer truly must start on a cache-line boundary, over-allocate and slice at an aligned offset, as sketched below.
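A minimal sketch of that over-allocation approach (alignedFloat64s is a hypothetical helper; it relies on unsafe and on the fact that the current Go GC does not move heap objects, so the offset computed at allocation stays valid):

```go
import "unsafe"

// alignedFloat64s returns a slice of n float64s whose first element starts on
// a 64-byte boundary, by allocating a little extra and skipping ahead.
func alignedFloat64s(n int) []float64 {
	const align = 64
	buf := make([]float64, n+align/8)
	addr := uintptr(unsafe.Pointer(&buf[0]))
	offset := 0
	if rem := addr % align; rem != 0 {
		offset = int((align - rem) / 8) // float64s are 8-byte aligned, so this divides evenly
	}
	return buf[offset : offset+n]
}
```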
Benchmarking Cache Performance
```go
func BenchmarkCacheLineSize(b *testing.B) {
	// Detect the cache line size experimentally.
	data := make([]int64, 1024*1024)
	for stride := 1; stride <= 128; stride *= 2 {
		b.Run(fmt.Sprintf("stride_%d", stride), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				// Touch memory at stride intervals.
				for j := 0; j < len(data); j += stride {
					data[j]++
				}
			}
		})
	}
	// Normalize by the number of elements touched: the cost per touch jumps
	// once the stride reaches 8 int64s (64 bytes), because every access then
	// lands on a fresh cache line - that knee marks the cache line size.
}
```
Rules for Cache-Friendly Go
Pack hot data together: keep frequently accessed fields in the same cache line
Pad for concurrent access: separate per-goroutine data by cache lines
Prefer arrays over linked lists: sequential access beats random access
Use smaller types: int32 instead of int64 when the range allows
Sort before processing: predictable branches and access patterns
Pool allocations: reusing memory keeps caches warm (see the sketch below)
Profile, don't guess: use perf, pprof, and benchmarks
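For the pooling rule, a small sketch with sync.Pool (render and its buffer use are illustrative): a recently released buffer often still has its cache lines warm, and reusing it also avoids fresh allocations the GC would otherwise have to track.

```go
import (
	"bytes"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(msg string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // reuse the warm buffer from a previous call
	buf.WriteString("event: ")
	buf.WriteString(msg)
	out := buf.String()
	bufPool.Put(buf)
	return out
}
```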
The Performance Optimization Recipe
1. Profile: find cache misses with perf
2. Restructure: AoS → SoA for hot paths
3. Pad: eliminate false sharing
4. Pack: group frequently accessed data
5. Prefetch: favor linear access patterns
6. Measure: verify with benchmarks

Real results from production (specific workloads):
- Analytics pipeline: up to 14x faster for batch processing
- Game physics engine: 8x faster for collision detection
- Database indexing: 11x faster for sequential scans
- These gains are workload-specific and vary by hardware
Remember: Modern CPUs are fast. Memory is slow. The gap grows every year. Cache-friendly code isn't premature optimization—it's necessary optimization.
Security Considerations
Be careful with memory alignment tricks - they can expose timing side channels
Cache-timing attacks, and speculative-execution attacks such as Spectre and Meltdown, exploit the fact that access latencies reveal which memory was touched
Consider security vs performance trade-offs in sensitive applications
Use constant-time algorithms for cryptographic operations (see the sketch after this list)
Validate all array bounds to prevent buffer overflows
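For the constant-time point above, a minimal sketch using the standard library (validToken and validMAC are illustrative names): crypto/subtle and crypto/hmac compare secrets in a way whose timing does not depend on where the inputs differ.

```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
)

// Constant-time equality for secrets such as tokens.
func validToken(got, want []byte) bool {
	return subtle.ConstantTimeCompare(got, want) == 1
}

// MAC verification; hmac.Equal is constant time.
func validMAC(message, gotMAC, key []byte) bool {
	mac := hmac.New(sha256.New, key)
	mac.Write(message)
	return hmac.Equal(gotMAC, mac.Sum(nil))
}
```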
Testing Strategy
Use micro-benchmarks to measure each optimization in isolation
Test on different CPU architectures (Intel, AMD, ARM)
Measure with realistic data sizes and access patterns
Use performance counters to verify cache hit rates
Run tests with different core counts to detect false sharing
```go
func TestCacheFriendlyStructure(t *testing.T) {
	tests := []struct {
		name     string
		size     int
		expected time.Duration
	}{
		{"small_data", 1000, 100 * time.Microsecond},
		{"medium_data", 100000, 10 * time.Millisecond},
		{"large_data", 10000000, 1 * time.Second},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			data := generateTestData(tt.size)

			start := time.Now()
			processCacheFriendly(data)
			elapsed := time.Since(start)

			if elapsed > tt.expected {
				t.Errorf("Processing took %v, expected < %v", elapsed, tt.expected)
			}

			// Verify correctness, not just speed.
			result := verifyCacheFriendlyResult(data)
			assert.True(t, result.IsValid())
		})
	}
}

// CountersUnpadded and CountersPadded correspond to the two Counters layouts
// shown earlier. Each parallel worker bumps a different field, so only the
// layout (padded vs packed) distinguishes the two sub-benchmarks.
func BenchmarkFalseSharing(b *testing.B) {
	b.Run("with_false_sharing", func(b *testing.B) {
		c := &CountersUnpadded{}
		fields := [4]*uint64{&c.requests, &c.errors, &c.latency, &c.timeouts}
		var next uint64
		b.RunParallel(func(pb *testing.PB) {
			f := fields[atomic.AddUint64(&next, 1)%4]
			for pb.Next() {
				atomic.AddUint64(f, 1)
			}
		})
	})
	b.Run("with_padding", func(b *testing.B) {
		c := &CountersPadded{}
		fields := [4]*uint64{&c.requests, &c.errors, &c.latency, &c.timeouts}
		var next uint64
		b.RunParallel(func(pb *testing.PB) {
			f := fields[atomic.AddUint64(&next, 1)%4]
			for pb.Next() {
				atomic.AddUint64(f, 1)
			}
		})
	})
	// The padded version should be significantly faster with multiple cores.
}
```
| Optimization Technique | Typical Improvement | Best For | Complexity |
|---|---|---|---|
| False sharing elimination | 2-10x | Concurrent counters | Low |
| Data structure packing | 1.5-3x | Hot data paths | Medium |
| Array of Structs → Struct of Arrays | 3-15x | Batch processing | High |
| Prefetching | 1.2-2x | Predictable access | Low |
| Branch prediction | 2-5x | Conditional logic | Medium |
Language Comparison
These optimizations aren't unique to Go, but Go's runtime characteristics affect their impact:
| Optimization | Go Impact | C/C++ Impact | Java Impact | Rust Impact |
|---|---|---|---|---|
| False sharing | High (GC scanning) | Medium | High (GC pressure) | Medium |
| Data packing | High (GC overhead) | Low (manual memory) | High (object overhead) | Low (zero-cost) |
| Branch prediction | Medium (similar CPU) | High (direct control) | Low (JIT optimizes) | High (LLVM) |
Go's garbage collector adds overhead to memory access patterns. Cache-friendly code reduces GC pressure by improving locality, making the collector's job easier. This double benefit is why these optimizations often have a higher impact in Go than in languages with manual memory management.
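As a concrete illustration of that GC effect (EventPtrs and EventColumns are hypothetical names): the collector has to trace every pointer it can see, but slices whose element type contains no pointers live in "noscan" memory that the mark phase skips entirely.

```go
// Pointer-rich layout: during marking the GC follows every *Event and the
// string/map headers inside each Event.
type EventPtrs struct {
	Events []*Event
}

// Pointer-free, columnar layout: these backing arrays contain no pointers,
// so the GC does not scan them at all.
type EventColumns struct {
	Timestamps []int64
	UserIDs    []uint64
	Values     []float64
}
```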