
Memory optimizations to reduce CPU costs


Imagine that you are given the following task, starting with a file like this:

Name,Department,Salary,JoinDate
John Smith,Marketing,75000,2023-01-15
Alice Johnson,Finance,82000,2022-06-22
Bob Lee,Sales,68000,2024-03-10
Emma Davis,HR,71000,2021-09-01

You want to turn that into a single list of all the terms in the (potentially very large) file.

In other words, you want to turn it into something like this:

[
  { "term": "Name",       "position": 0,   "length": 4  },
  { "term": "Department", "position": 5,   "length": 10 },
  ...
  { "term": "2021-09-01", "position": 160, "length": 10 }
]

The result is a single contiguous array that references the entire data, and working with it is pretty efficient. Why we do that doesn’t actually matter; the critical aspect is that we observed poor performance and high memory usage when using this approach.
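To make the task concrete, here is a minimal sketch of how such a term list could be produced. This is illustrative code only, not the original implementation; the Item shape (Field, Position, Length) is an assumption based on the layout described below.

using System.Collections.Generic;

// Illustrative only: scan the raw text once and record each comma/newline
// separated token as (term, start offset, length).
public readonly record struct Item(string Field, long Position, int Length);

public static class Tokenizer
{
    public static List<Item> Tokenize(string text)
    {
        var items = new List<Item>();
        int start = 0;
        for (int i = 0; i <= text.Length; i++)
        {
            // A token ends at a comma, a newline, or the end of the input.
            if (i == text.Length || text[i] == ',' || text[i] == '\n' || text[i] == '\r')
            {
                if (i > start)
                    items.Add(new Item(text[start..i], start, i - start));
                start = i + 1;
            }
        }
        return items;
    }
}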

Let’s assume that we have a total of 10 million rows with four fields each, or 40,000,000 items. Each item costs us 24 bytes (8 bytes for the Field reference, 8 bytes for the Position, 4 bytes for the Length, and 4 bytes of padding). So we end up with about 1GB in memory just to store things.
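You can sanity-check the 24-byte figure on a 64-bit runtime. The Item struct here is the same assumed shape as in the sketch above (a string reference, a long position, and an int length), not the article's actual type:

using System;
using System.Runtime.CompilerServices;

// Assumed layout: 8-byte string reference + 8-byte long + 4-byte int,
// which the runtime pads to a multiple of 8, i.e. 24 bytes per item.
public readonly record struct Item(string Field, long Position, int Length);

public static class SizeCheck
{
    public static void Main() =>
        Console.WriteLine(Unsafe.SizeOf<Item>()); // typically prints 24 on a 64-bit runtime
}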

We can use Data-Oriented programming and split the data into individual arrays, like so:

public string[] Fields;
public long[] Positions;
public int[] Lengths;

public Item Get(int i) => new(Fields[i], Positions[i], Lengths[i]);

This saves us about 200 MB of memory, because we can now skip the padding costs by splitting the Item into its component parts.
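As a usage sketch (again assuming the hypothetical Tokenizer and Item from earlier, not the article's code), populating the split arrays is a straightforward copy, and Get(i) reassembles an item on demand:

// Populate the parallel arrays once the item count is known. Element i now
// lives in three densely packed arrays instead of one array of padded
// 24-byte structs.
var items = Tokenizer.Tokenize(System.IO.File.ReadAllText("employees.csv")); // hypothetical input file

var fields = new string[items.Count];
var positions = new long[items.Count];
var lengths = new int[items.Count];

for (int i = 0; i < items.Count; i++)
{
    fields[i] = items[i].Field;
    positions[i] = items[i].Position;
    lengths[i] = items[i].Length;
}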
