I recently tried to light the tinder for what I hoped would be a revolt (the Single Node Rebellion) but, of course, it sputtered out immediately. Truth be told, it was one of the most popular articles I’ve written in some time, purely based on the stats.
The fact that I even sold t-shirts tells me I have borne a few acolytes into this troubled Lake House world.
Without rehashing the entire article, it’s clear that there is what I would call “cluster fatigue.” We all know it, but we never talk about it … much. Running SaaS Lake Houses is expensive, both emotionally and financially. That was all well and good during the peak Covid days when we had our mini dot-com bubble, but the air has gone out of that one.
Crunching 650 GB of data on a Spark cluster isn’t cheap (the DBUs pile up, truth be told), and it isn’t complicated to do either; they’ve made it easy to spend money. Especially when you simply don’t need a cluster anymore for most datasets and workloads.
Sure, back in the days when Pandas was our only non-Spark option, we didn’t have a choice, but DuckDB, Polars, and Daft (also known as D.P.D., because why not) … have laid that argument to rest in a shallow grave.
- Cluster fatigue is real.
- D.P.D. can work on LTM (larger-than-memory) datasets.
- D.P.D. is extremely fast.
Sometimes I feel like I must overcome skepticism with a little bit of show-and-tell; the proof is in the pudding, as they say. If you want proof, I will provide it.
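As a small taste of what that proof looks like, here’s a minimal sketch of the kind of single-node query I’m talking about, with DuckDB and Polars each grinding through a pile of Parquet files with no cluster in sight. The trips/*.parquet path and the payment_type / fare_amount columns are hypothetical stand-ins, and the knobs shown (memory limit, spill directory, streaming collect) assume reasonably recent versions of both libraries.

```python
# Hypothetical sketch: the trips/*.parquet path and the
# payment_type / fare_amount columns are made up for illustration.
import duckdb
import polars as pl

# --- DuckDB: single-node SQL over Parquet, spilling to disk if it must ---
con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")                # cap RAM usage
con.execute("SET temp_directory = '/tmp/duck_spill'")  # let intermediate state spill to disk
duck_result = con.execute(
    """
    SELECT payment_type, SUM(fare_amount) AS total_fare
    FROM read_parquet('trips/*.parquet')
    GROUP BY payment_type
    """
).fetchall()
print(duck_result)

# --- Polars: lazy scan plus streaming collect, nothing fully materialized ---
polars_result = (
    pl.scan_parquet("trips/*.parquet")   # builds a query plan, loads nothing yet
    .filter(pl.col("fare_amount") > 0)
    .group_by("payment_type")
    .agg(pl.col("fare_amount").sum().alias("total_fare"))
    .collect(streaming=True)             # process in chunks, larger-than-memory friendly
)
print(polars_result)
```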
Look, it ain’t always easy, but it’s always rewarding.