Going faster than memcpy
While profiling Shadesmar a couple of weeks ago, I noticed that for large unserialized binary messages (>512kB), most of the execution time is spent copying the message (using memcpy) between process memory and shared memory, and back.
I had a few hours to kill last weekend, so I tried to implement a faster way to do memory copies.
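To make the bottleneck concrete, here's a minimal sketch (my own simplification, not Shadesmar's actual code) of the two copies every message makes on the hot path: once into shared memory on publish, and once back out into process memory on subscribe. shm_buf stands in for a pointer into a mapped shared-memory segment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical hot path: each message goes through memcpy twice.
void publish(uint8_t *shm_buf, const std::vector<uint8_t> &msg) {
  std::memcpy(shm_buf, msg.data(), msg.size());  // process -> shared memory
}

void subscribe(std::vector<uint8_t> &msg, const uint8_t *shm_buf,
               std::size_t n) {
  msg.resize(n);
  std::memcpy(msg.data(), shm_buf, n);           // shared memory -> process
}
```

For a 1MB message, that's 2MB moved through memcpy per round trip, which is why it dominates the profile below.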
Autopsy of memcpy
Here’s the dump of perf when running pub-sub for message sizes between 512kB and 2MB.
```
  Children      Self  Shared Object   Symbol
+   99.86%     0.00%  libc-2.27.so    [.] __libc_start_main
+   99.86%     0.00%  [unknown]       [k] 0x4426258d4c544155
+   99.84%     0.02%  raw_benchmark   [.] main
+   98.13%    97.12%  libc-2.27.so    [.] __memmove_avx_unaligned_erms
+   51.99%     0.00%  raw_benchmark   [.] shm::PublisherBin<16u>::publish
+   51.98%     0.01%  raw_benchmark   [.] shm::Topic<16u>::write
+   47.64%     0.01%  raw_benchmark   [.] shm::Topic<16u>::read
```
__memmove_avx_unaligned_erms is an implementation of memcpy for unaligned memory blocks that uses AVX to copy 32 bytes at a time. Digging into the glibc source code, I found this:
```c
#if IS_IN (libc)
# define VEC_SIZE 32
# define VEC(i)   ymm##i
# define VMOVNT   vmovntdq
# define VMOVU    vmovdqu
# define VMOVA    vmovdqa
# define SECTION(p)           p##.avx
# define MEMMOVE_SYMBOL(p,s)  p##_avx_##s
# include "memmove-vec-unaligned-erms.S"
#endif
```
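The macros map directly onto AVX: VEC_SIZE 32 and the ymm registers mean 256-bit vectors, and VMOVU is the unaligned load/store instruction vmovdqu. As a rough illustration of what "copy 32 bytes at a time" looks like, here's a sketch using AVX intrinsics (my own, not glibc's implementation; compile with -mavx):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstring>

// Copy n bytes, 32 at a time, through a 256-bit ymm register using the
// unaligned vmovdqu load/store that VMOVU expands to above.
void avx_copy(void *dst, const void *src, std::size_t n) {
  char *d = static_cast<char *>(dst);
  const char *s = static_cast<const char *>(src);
  std::size_t i = 0;
  for (; i + 32 <= n; i += 32) {
    __m256i chunk =
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(s + i));
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(d + i), chunk);
  }
  std::memcpy(d + i, s + i, n - i);  // scalar tail for the final <32 bytes
}
```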
Breaking down this function:
memmove: glibc implements memcpy as a memmove instead. The reason one routine can back both symbols is that memmove also handles overlapping buffers.
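Here's a minimal sketch of that idea (my own illustration, not glibc's actual source): check whether the destination overlaps the tail of the source and pick the copy direction accordingly, so the same routine is safe for both memcpy and memmove.

```cpp
#include <cstddef>

// Hypothetical memmove: choose the copy direction so that overlapping
// regions are handled correctly; a non-overlapping memcpy call simply
// takes the forward fast path of the same routine.
void *my_memmove(void *dst, const void *src, std::size_t n) {
  unsigned char *d = static_cast<unsigned char *>(dst);
  const unsigned char *s = static_cast<const unsigned char *>(src);
  if (d < s || d >= s + n) {
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i];          // forward
  } else {
    for (std::size_t i = n; i > 0; --i) d[i - 1] = s[i - 1];  // backward
  }
  return dst;
}
```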