Wild performance tricks

Last week I had the pleasure of attending RustForge in Wellington, New Zealand. I gave a talk titled “Wild performance tricks”. You can watch a recording of my talk. If you’d prefer to read rather than watch, the rest of this post will cover more or less the same material. The talk shows some linker benchmarks, which I’ll skip here and focus instead on the optimisations, which I think are the more interesting part of the talk.

The tricks here are a few of my favourites that I’ve used in the Wild linker.

Mutable slicing for sharing between threads

In the linker, we have a type SymbolId defined as:

struct SymbolId ( u32 );

We need a way to store resolutions, where one SymbolId resolves (maps) to another SymbolId . If we need to look up which symbol SymbolId(5) maps to, we then look at index 5 in the Vec . Because every symbol maps to some other symbol (possibly itself), this means that we make use of the entire Vec . i.e. it’s dense, not sparse. For a sparse mapping, a HashMap might be preferable.

The Wild linker is very multi-threaded, so we want to be able to process symbols for our input objects in parallel. To achieve this, we make sure that all symbols for a given object get allocated adjacent to each other. i.e. each object has SymbolId s in a contiguous range. This is good for cache locality because when a thread is working with an object, all its symbols will be nearby in memory, so more likely to be in cache. It also lets us do things like this:

fn parallel_process_resolutions ( mut resolutions : & mut [ SymbolId ], objects : & [ Object ]) { objects .iter () .map (| obj | ( obj , resolutions .split_off_mut ( .. obj .num_symbols ) .unwrap ())) .par_bridge () .for_each (|( obj , object_resolutions )| { obj .process_resolutions ( object_resolutions ); }); }

Here, we’re using the Rayon crate to process the resolutions for all our objects in parallel from multiple threads. We start by iterating over our objects, then for each object, we use split_off_mut to split off a mutable slice of resolutions that contains the resolutions for that object. par_bridge converts this regular Rust iterator into a Rayon parallel iterator. The closure passed to for_each then runs in parallel on multiple threads, with each thread getting access to the object and a mutable slice of that object’s resolutions.

Parallel initialisation of the Vec

... continue reading