Columnar Storage Is Normalization

Something I didn't understand for a while is that turning row-oriented data into column-oriented data isn't a totally bespoke, foreign concept in the realm of databases. It's still part of the relational abstraction. Or can be.

As an example, say we have this data:

data = [
    {"name": "Smudge", "colour": "black"},
    {"name": "Sissel", "colour": "grey"},
    {"name": "Hamlet", "colour": "black"},
]

This represents a table in a relational database. Let's assume it really is stored that way on disk, so that getting at any particular part of the data costs us disk accesses. This representation has some nice properties.

It's easy to add a new row. We can just construct one:

{"name": "Petee", "colour": "black"}

and add it to the end of our already-existing list. On disk, we probably only have to touch a couple pages to do it. And if our row were really wide, in that it had a whole bunch of columns, that wouldn't really change. It would still have that nice property.
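To make the "only a couple pages" claim concrete, here is a minimal sketch. It assumes rows are serialized as JSON lines into one contiguous byte buffer that stands in for the file; `PAGE_SIZE`, `encode_row`, and `append_row` are hypothetical names for illustration, not how any particular database does it.

```python
import json

PAGE_SIZE = 64  # hypothetical page size in bytes, kept tiny so the effect is visible


def encode_row(row):
    """Serialize one row as a JSON line (an assumed on-disk format)."""
    return (json.dumps(row) + "\n").encode("utf-8")


def append_row(heap, row):
    """Append a row to the end of the buffer and return the page numbers written."""
    record = encode_row(row)
    start = len(heap)
    heap.extend(record)
    end = len(heap) - 1
    return list(range(start // PAGE_SIZE, end // PAGE_SIZE + 1))


heap = bytearray()  # stands in for the table's file on disk
for row in [
    {"name": "Smudge", "colour": "black"},
    {"name": "Sissel", "colour": "grey"},
    {"name": "Hamlet", "colour": "black"},
]:
    append_row(heap, row)

# Adding Petee only writes to the tail of the buffer.
touched = append_row(heap, {"name": "Petee", "colour": "black"})
total_pages = (len(heap) + PAGE_SIZE - 1) // PAGE_SIZE
print("pages touched:", touched, "of", total_pages)
```

A row with more columns makes the record longer, but it is still one contiguous write at the end, so the pages touched depend on the size of the row and not on how many rows the table already holds.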

This is also true of looking up a row. Since all of a row's columns are stored next to each other, it's very fast to just pull that row out from wherever it's stored.
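As a rough illustration of why the lookup is cheap, the sketch below keeps a small directory of where each row starts and how long it is. The `slots` list and `fetch_row` helper are made up for this example, and the JSON encoding is again just an assumption.

```python
import json

rows = [
    {"name": "Smudge", "colour": "black"},
    {"name": "Sissel", "colour": "grey"},
    {"name": "Hamlet", "colour": "black"},
]

heap = bytearray()  # stands in for the table's file on disk
slots = []          # hypothetical slot directory: (offset, length) per row

for row in rows:
    record = json.dumps(row).encode("utf-8")
    slots.append((len(heap), len(record)))
    heap.extend(record)


def fetch_row(row_id):
    """Read one whole row with a single contiguous slice of the buffer."""
    offset, length = slots[row_id]
    return json.loads(heap[offset:offset + length].decode("utf-8"))


print(fetch_row(1))
```

Every column of the row comes back from that one slice, which is the property the row-oriented layout is buying us.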

Conversely, if we were to want to, say, compute a histogram of the different pet colours, we have to read quite a lot of data we don't care about in order to do so.
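Here is a small sketch of that histogram over the row-oriented data. The byte counts are only a stand-in for disk reads, assuming each row is stored as its JSON text; the point is just that the names come along for the ride even though we never look at them.

```python
import json
from collections import Counter

data = [
    {"name": "Smudge", "colour": "black"},
    {"name": "Sissel", "colour": "grey"},
    {"name": "Hamlet", "colour": "black"},
]

histogram = Counter()
bytes_read = 0    # everything we had to pull in, whole rows at a time
bytes_needed = 0  # the part we actually cared about: the colour values

for row in data:
    record = json.dumps(row)  # assumed stored form of the whole row
    bytes_read += len(record)
    bytes_needed += len(row["colour"])
    histogram[row["colour"]] += 1

print(dict(histogram))
print("read", bytes_read, "bytes to use", bytes_needed)
```

With only two columns the waste is modest, but it grows with every extra column the table has.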

This is a row-oriented representation of the data. A column-oriented representation would look something like this:

... continue reading