SQL Needed Structure

Published 2025-09-04 Here are two pages from the internet movie database: There are two things to note about these pages. The data on the page is presented in a hierarchichal structure. The movie page contains a director, a list of genres, a list of actors, and each actor in the list contains a list of characters they played in the movie. You can't sensibly fit all of this into a single flat structure like a relation. The order of the hierarchy isn't the same on both pages. On one page we have movie->actors and on the other page we have actor->movies. So you can't just directly store the hierarchy in your database - you need to be able to traverse relationships in both directions. So we store all our data in a relational database in flat tables and then whenever we need to render some UI we transform the flat data into whatever hierarchy we need. Doing this transformation by hand is tedious and error-prone. We call this tedium "the object-relational mismatch" but it isn't really about objects or relations. The fundamental problem is that fitting complex relationships to human vision usually requires constructing some visual hierarchy, but different tasks require different hierarchies. Whatever database and programming language you use, you will have to deal with this. But it's particularly painful in sql because sql wasn't designed to produce hierarchical data. sql wasn't built to yield structure Let's grab the imdb public dataset and try to reproduce the source data for that movie page (or at least a subset of it, because I didn't bother importing all the tables). We want to see an output that looks like this: { "title" : "Baby Driver" , "director" : [ "Edgar Wright" ] , "writer" : [ "Edgar Wright" ] "genres" : [ "Action" , "Crime" , "Drama" ] , "actors" : [ { "name" : "Ansel Elgort" , "characters" : [ "Baby" ]} , { "name" : "Jon Bernthal" , "characters" : [ "Griff" ]} , { "name" : "Jon Hamm" , "characters" : [ "Buddy" ]} , { "name" : "Eiza González" , "characters" : [ "Darling" ]} , { "name" : "Micah Howard" , "characters" : [ "Barista" ]} , { "name" : "Lily James" , "characters" : [ "Debora" ]} , { "name" : "Morgan Brown" , "characters" : [ "Street Preacher" ]} , { "name" : "Kevin Spacey" , "characters" : [ "Doc" ]} , { "name" : "Morse Diggs" , "characters" : [ "Morse Diggs" ]} , { "name" : "CJ Jones" , "characters" : [ "Joseph" ]} ] , } Let's grab the title first: postgres=# select primaryTitle from title where tconst = 'tt3890160'; primarytitle -------------- Baby Driver And now we need the director: postgres=# select primaryTitle, person.primaryName from title, principal, person where title.tconst = 'tt3890160' and title.tconst = principal.tconst and principal.nconst = person.nconst and principal.category = 'director'; primarytitle | primaryname --------------+-------------- Baby Driver | Edgar Wright And the writer: postgres=# select primaryTitle, director.primaryName as director, writer.primaryName as writer from title, principal as principal_director, person as director, principal as principal_writer, person as writer where title.tconst = 'tt3890160' and title.tconst = principal_director.tconst and principal_director.nconst = director.nconst and principal_director.category = 'director' and title.tconst = principal_writer.tconst and principal_writer.nconst = writer.nconst and principal_writer.category = 'writer'; primarytitle | director | writer --------------+--------------+-------------- Baby Driver | Edgar Wright | Edgar Wright We're already in trouble. If this movie had 2 directors and 2 writers, this query would return 4 rows: primarytitle | director | writer --------------+--------------+-------------- Baby Driver | Edgar Wright | Edgar Wright Baby Driver | Edgar Wright | A. Writer Baby Driver | A. Director | Edgar Wright Baby Driver | A. Director | A. Writer If there was no director in the database then this query would return 0 rows, no matter how many writers there were. Now we don't even know what the movie is called. primarytitle | director | writer --------------+--------------+-------------- We can't sensibly fit the data we want into a single relation, and we can't return more than one relation per query. So we have to issue multiple queries: postgres=# select primaryTitle from title where tconst = 'tt3890160'; primarytitle -------------- Baby Driver postgres=# select person.primaryName from title, principal, person where title.tconst = 'tt3890160' and title.tconst = principal.tconst and principal.nconst = person.nconst and principal.category = 'director'; primaryname -------------- Edgar Wright postgres=# select person.primaryName from title, principal, person where title.tconst = 'tt3890160' and title.tconst = principal.tconst and principal.nconst = person.nconst and principal.category = 'writer'; primaryname -------------- Edgar Wright postgres=# select person.nconst, person.primaryName from title, principal, person where title.tconst = 'tt3890160' and title.tconst = principal.tconst and principal.nconst = person.nconst and principal.category = 'actor' limit 10; nconst | primaryname -----------+--------------- nm5052065 | Ansel Elgort nm1256532 | Jon Bernthal nm0358316 | Jon Hamm nm2555462 | Eiza González nm8328714 | Micah Howard nm4141252 | Lily James nm3231814 | Morgan Brown nm0000228 | Kevin Spacey nm1065096 | Morse Diggs nm1471085 | CJ Jones postgres=# select principal_character.nconst, principal_character.character from title, principal, principal_character where title.tconst = 'tt3890160' and title.tconst = principal.tconst and principal.nconst = person.nconst and principal.category = 'actor' and principal_character.tconst = principal.tconst and principal_character.nconst = principal.nconst; nconst | character -----------+--------------------- nm5052065 | Baby nm8328714 | Barista nm0358316 | Buddy nm2555462 | Darling nm4141252 | Debora nm0000228 | Doc nm1256532 | Griff nm1471085 | Joseph nm1065096 | Morse Diggs nm3231814 | Street Preacher Through the magic of joins we have retrieved all the data we need and it only required holding a transaction open for 4 network roundtrips. All that's left to do now is... the same joins, but inside the backend web server. Because we have to re-assemble these flat outputs into the structure of the page. Also note that fully half of the data returned is the nconst column which we didn't even want in the output. We only returned it because we need it as a key so we can repeat the joins that we already did in the database. The more paths you traverse, the more useless join keys you need to send to the backend web server. All of this is pretty tedious so we invented ORMs to automate it. But: Almost all ORMs end up sending multiple queries for the output that we want. If you have a good ORM and you use it carefully it'll send one query per path in the output, like the raw sql above. If you're less careful you might get one query per actor in the film. Many ORMs also make a mess of consistency by lazily loading data in separate transactions. So we might generate a page where different parts of the data come from different points in time, which is confusing for users. Using an ORM locks you into only using one specific programming language. What if you need to query your data from a different language? You'll probably end up talking to the same ORM through a microservice. old dogs can sort of learn new tricks These days sql actually can produce structured data from queries. A lot of people are mad about this. Whenever I talk about it they reflexively yell things like "structured data doesn't belong in the database" as if there was a universal system of morality that uniquely determined the locations of various data processing tasks. But I can't help but note again that the structure has to happen somewhere because that's what the output page looks like and that doing it outside the database isn't working very well. Whenever we're building a UI for humans, whether on the web or native, the main use of the query language is to turn relational data into structured data for the client to render. So it would be really nice if the query language was actually able to produce structured data. Like this: select jsonb_agg(result) from ( select primaryTitle as title, genres, ( select jsonb_agg(actor) from ( select (select primaryName from person where person.nconst = principal.nconst) as name, ( select jsonb_agg(character) from principal_character where principal_character.tconst = principal.tconst and principal_character.nconst = principal.nconst ) as characters from principal where principal.tconst = title.tconst and category = 'actor' order by ordering limit 10 ) as actor ) as actors, ( select jsonb_agg(primaryName) from principal, person where principal.tconst = title.tconst and person.nconst = principal.nconst and category = 'director' ) as director, ( select jsonb_agg(primaryName) from principal, person where principal.tconst = title.tconst and person.nconst = principal.nconst and category = 'writer' ) as writer from title where tconst = $1 ) as result; It's not perfect. You can definitely see the duct tape, and the query plan often suffers from the lack of decorrelation. But we can grab all the data needed for the entire page in a single query, with one network roundtrip. Whether you use these features directly or as the output of your ORM, this is a sizable improvement for one of the main usecases of relational databases! It doesn't matter that this isn't the way things have always worked. Sql is not relational algebra and relational algebra is not math, and neither was carved into stone tablets handed down from Codd. We make tools to serve our purposes, and our purposes have changed a hell of a lot since the 70s, when the main client of a database was a human typing sql character by character into an interactive transaction on a teletype connected to a mainframe with 500kb of RAM, almost 20 years before the invention of the world wide web. Maybe it's ok for our tools to evolve to meet new demands. And we can evolve with them.

SQL Needed Structure

Share this article

Related Articles