Going faster than memcpy
While profiling Shadesmar a couple of weeks ago, I noticed that for large unserialized binary messages (>512kB), most of the execution time is spent copying the message (using memcpy) between process memory and shared memory, and back.
I had a few hours to kill last weekend, so I tried to implement a faster way to do memory copies.
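To make the bottleneck concrete, here's a minimal sketch (my own simplification, not Shadesmar's actual code) of the two copies every message makes on the hot path: once into shared memory on publish, and once back out into process memory on subscribe. shm_buf stands in for a pointer into a mapped shared-memory segment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical hot path: each message goes through memcpy twice.
void publish(uint8_t *shm_buf, const std::vector<uint8_t> &msg) {
  std::memcpy(shm_buf, msg.data(), msg.size());  // process -> shared memory
}

void subscribe(std::vector<uint8_t> &msg, const uint8_t *shm_buf,
               std::size_t n) {
  msg.resize(n);
  std::memcpy(msg.data(), shm_buf, n);           // shared memory -> process
}
```

For a 1MB message, that's 2MB moved through memcpy per round trip, which is why it dominates the profile below.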
Autopsy of memcpy
Here’s the dump of perf when running pub-sub for message sizes between 512kB and 2MB.
```
  Children      Self  Shared Object   Symbol
+   99.86%     0.00%  libc-2.27.so    [.] __libc_start_main
+   99.86%     0.00%  [unknown]       [k] 0x4426258d4c544155
+   99.84%     0.02%  raw_benchmark   [.] main
+   98.13%    97.12%  libc-2.27.so    [.] __memmove_avx_unaligned_erms
+   51.99%     0.00%  raw_benchmark   [.] shm::PublisherBin<16u>::publish
+   51.98%     0.01%  raw_benchmark   [.] shm::Topic<16u>::write
+   47.64%     0.01%  raw_benchmark   [.] shm::Topic<16u>::read
```
__memmove_avx_unaligned_erms is an implementation of memcpy for unaligned memory blocks that uses AVX to copy 32 bytes at a time. Digging into the glibc source code, I found this:
```c
#if IS_IN (libc)
# define VEC_SIZE 32
# define VEC(i)   ymm##i
# define VMOVNT   vmovntdq
# define VMOVU    vmovdqu
# define VMOVA    vmovdqa
# define SECTION(p)           p##.avx
# define MEMMOVE_SYMBOL(p,s)  p##_avx_##s
# include "memmove-vec-unaligned-erms.S"
#endif
```
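The macros map directly onto AVX: VEC_SIZE 32 and the ymm registers mean 256-bit vectors, and VMOVU is the unaligned load/store instruction vmovdqu. As a rough illustration of what "copy 32 bytes at a time" looks like, here's a sketch using AVX intrinsics (my own, not glibc's implementation; compile with -mavx):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstring>

// Copy n bytes, 32 at a time, through a 256-bit ymm register using the
// unaligned vmovdqu load/store that VMOVU expands to above.
void avx_copy(void *dst, const void *src, std::size_t n) {
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  std::size_t i = 0;
  for (; i + 32 <= n; i += 32) {
    __m256i chunk =
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(s + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(d + i), chunk);
  }
  std::memcpy(d + i, s + i, n - i);  // scalar tail for the final <32 bytes
}
```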
Breaking down this function:
memmove: glibc implements memcpy as a memmove instead. The reason one routine can back both symbols is that memmove also handles overlapping buffers.
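Here's a minimal sketch of that idea (my own illustration, not glibc's actual source): check whether the destination overlaps the tail of the source and pick the copy direction accordingly, so the same routine is safe for both memcpy and memmove.

```cpp
#include <cstddef>

// Hypothetical memmove: choose the copy direction so that overlapping
// regions are handled correctly; a non-overlapping memcpy call simply
// takes the forward fast path of the same routine.
void *my_memmove(void *dst, const void *src, std::size_t n) {
  unsigned char *d = static_cast<unsigned char *>(dst);
  const unsigned char *s = static_cast<const unsigned char *>(src);
  if (d < s || d >= s + n) {
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i];          // forward
  } else {
    for (std::size_t i = n; i > 0; --i) d[i - 1] = s[i - 1];  // backward
  }
  return dst;
}
```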