Ok, not going to lie, I rarely find anything of value in the dregs of r/dataengineering, mostly, I fear, because it's 90% freshers with little to no experience. These green-behind-the-ears know-it-all engineers who've never written a line of Perl, never SSH'd into a server, and have no idea what a LAMP stack is. Weak. Sad. We used to program our way to glory, uphill both ways in the snow. All you do is script-kiddy some Python code through Cursor.

A recent post on Data Modeling, specifically that data modeling is dead, caught my eye. A rare piece of gold mixed in the usual pile of crap. There is some truth being spoken on the interwebs, so hold onto your panties, you bright-eyed data zealot. I agree 100% with this sentiment. DATA MODELING IS DEAD.

How is Data Modeling Dead?

Well, because this generation of milquetoast Data Engineers was raised on a diet of Data Lakes and Lake Houses by our uncaring tyrant mothers called Databricks, Snowflake, and AWS. You think those purveyors of compute and ideals were academically interested in Data Modeling as a fundamental truth of life (unless you were talking about an RDS instance, and then only maybe)? Data Modeling died a slow and painful death, ignored by the community, all the while they pandered and fawned over their Modern Data Stack.

What say you? Oh, give me my beautiful Notebook attached to a never-ending stream of Spark Compute. Oh, the joys of the Lake House architecture, that sweet and delicious Medallion Architecture into which we can dump a never-ending stream of data without a second thought about nit-picky things like "normalization" or joining tables in a snowflake or star schema way.

Overwhelming Noise from SaaS Vendors and Missing Voices

Why is Data Modeling dead? Because in the era of the Data Warehouse we (collectively) worshiped at the feet of Kimball and the Data Warehouse Toolkit.
Sure, we still argued about Facts and Dimensions, but overall we (collectively) agreed with a spit on the hand that Kimball data modeling was the north star towards which to march. As well, there were volumes published on the technical details of HOW to implement the ideals from the Data Warehouse Toolkit. Technical details about which you could say "this is right" or "this is wrong," or maybe "you should use third normal form."

In the last decade or two, the only thing we have to look to as a north star is Joe Reis and Fundamentals of Data Engineering. This has been one of the only unbiased, reasonable voices in "modern" Data Engineering. But we still lack a voice of clarity (without Vendors messing around in the mix) when it comes to Modern Data Modeling. There is no bible. There is no truth. Technologies are shifting fast and hard, AND those technologies have rightly changed HOW some of the technical details of data modeling should be done. Should we still follow Kimball? Should we drink the Databricks Medallion juice? Are there even any other options? There are not. You just wing it these days. Most of the recommended AND technical information published on data modeling relates to Relational Databases (a la Postgres, MySQL, SQL Server, etc.), and only minimally overlaps with a modern Data Lake.

You think I lie? Have at thee, knave. How dare you say I don't know about what I speak. What, you just want to throw some ideas in the air and see where they land, just the ole' "make it work" mentality? Look, friend, I DO use Kimball-style data modeling in my multi-TB Lake House. It's not the same as it used to be; there are a lot of different concepts. I DO make it work. But, for example, let's just pull a random fundamental concept from Kimball about data modeling and see if it has changed or morphed in the Lake House world.

Fact Table Granularity

Same idea, but data is partitioned/clustered instead of purely indexed.
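To make that shift concrete, here is a minimal sketch in plain Python rather than Spark (the sales rows, dates, and column layout are all made up for illustration): partitioning buckets rows by a key so a query can skip whole buckets entirely, where the Kimball-era warehouse would have leaned on a B-tree index over a primary key.

```python
from collections import defaultdict

# Hypothetical sales fact rows at the grain of one sale: (sale_date, customer_id, amount).
rows = [
    ("2024-01-01", 7, 19.99),
    ("2024-01-01", 3, 5.00),
    ("2024-01-02", 7, 42.50),
    ("2024-01-03", 1, 9.99),
]

# "Partition by date": bucket rows by sale_date, the way Hive/Delta lay out
# one directory of files per partition value.
partitions = defaultdict(list)
for row in rows:
    partitions[row[0]].append(row)

def sales_on(day):
    # A date-filtered query reads only the matching bucket ("partition pruning");
    # the other partitions are never touched, no index required.
    return partitions.get(day, [])

print(sales_on("2024-01-01"))  # only the two 2024-01-01 rows come back
```

Same grain as Kimball would define it, but the physical layout, not a primary key index, is what makes date-sliced queries cheap.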
Example: a sales fact table partitioned by date and Z-ORDERed by customer_id in Delta Lake. In the old days, we lived and died (and some still do) by ACTUAL primary keys, composite or not, to discuss things like the grain of a table. Heck, I still do that with my engineers. I require them to have a calculated primary key on every table that at least logically declares the grain, even though it is NOT ENFORCED by the technology. This is a big deal technically. Details matter. And clustering and partitioning were not a thing for Kimball, at least not in the form they appear today in places like Apache Iceberg and Delta Lake.

We Wait for the Promised Data Modeling Messiah

So, here I sit, in a world upside down. In a data world where people on Reddit believe that Medallion Architecture is directly related to Data Contracts. I am waiting for the prophesied Data Modeling messiah to come and save us from our collective Lake House sins. Someone needs to enter the SaaS temple and overturn the tables of those wicked money changers, preaching their slippery, made-up marketing-speak data modeling. We need truth.

What is the answer to "data modeling is dead"? There is no answer. As a technical skill it died about 8 years ago. It's been relegated to the same importance as Data Quality: given lip service, nothing more. I want someone smart to come along and lay down the law. I want the right answers written in a book that is worshipped and paraded around the data community like Caesar Augustus. I want marble statues hewn of this book, the laws and commandments written in it used to beat those freshers down into the dust when they transgress.
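One parting technical footnote, on that calculated primary key I make my engineers add: a minimal, hand-rolled sketch of the idea (the grain columns order_id and line_number here are hypothetical), hashing the grain columns into a deterministic surrogate key. It documents the table's grain in code even though the Lake House engine will never enforce uniqueness for you.

```python
import hashlib

def grain_key(*grain_values) -> str:
    """Deterministic surrogate key hashed from the grain columns.

    This only *declares* the grain; nothing in Delta Lake or Iceberg
    enforces it, so duplicate keys must be checked downstream.
    """
    raw = "||".join(str(v) for v in grain_values)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# One row per order line: the grain is (order_id, line_number).
row = {"order_id": 1001, "line_number": 2, "amount": 42.50}
row["pk"] = grain_key(row["order_id"], row["line_number"])

# Same grain values always produce the same key, so a duplicate-grain
# bug shows up as a key collision you can actually test for.
assert grain_key(1001, 2) == row["pk"]
```

The separator between values matters (without one, (12, 3) and (1, 23) would hash the same), which is exactly the kind of nit-picky detail this post is mourning.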