Originally posted: 2025-03-16.
Over the past few years, I've found myself using DuckDB more and more for data processing, to the point where I now use it almost exclusively, usually from within Python.
We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets.2
This post sets out some of my favourite features of DuckDB that set it apart from other SQL-based tools. In a nutshell, it's simpler to install, more ergonomic, faster, and more fully featured than the alternatives.
An earlier post explains why I favour SQL over other APIs such as Polars, pandas or dplyr.
DuckDB is an open source in-process SQL engine that is optimised for analytics queries.
'In-process' means it's similar to SQLite in that it runs within your application. You don't need to start a separate service such as Postgres to run it.
'Optimised for analytics queries' means that it's designed for operations like joins and aggregations involving large numbers of rows, as opposed to the many small, atomic transactions that transactional databases are built for.
The performance difference between analytics-optimised (OLAP) engines and transaction-optimised (OLTP) engines should not be underestimated. For analytical workloads, a query running in DuckDB can be 100 or even 1,000 times faster than exactly the same query running in (say) SQLite or Postgres.
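Because it runs in-process, getting started from Python is just a package install away, with no server to configure. Here's a minimal sketch, assuming the `duckdb` package is installed from PyPI; the table and column names are purely illustrative:

```python
# Minimal sketch of using DuckDB in-process from Python.
# Assumes `pip install duckdb`; no separate server needs to be started.
import duckdb

con = duckdb.connect()  # in-memory database by default

# Create a table of a million rows and run a simple aggregation,
# all inside the current Python process.
con.sql("CREATE TABLE numbers AS SELECT range AS n FROM range(1000000)")
print(con.sql("SELECT count(*) AS n_rows, sum(n) AS total FROM numbers").fetchall())
```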
A core use-case of DuckDB is where you have one or more large datasets on disk in formats like csv, parquet or json which you want to batch process. You may want to perform cleaning, joins, aggregation, derivation of new columns - that sort of thing, as sketched below.
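For instance, DuckDB can query files on disk directly in SQL and write the results back out. The file paths and column names below are hypothetical; this is just a sketch of the pattern:

```python
# Sketch of batch-processing parquet files on disk with DuckDB.
# The file paths and column names are hypothetical.
import duckdb

duckdb.sql("""
    COPY (
        SELECT
            customer_id,
            count(*)         AS n_orders,
            sum(order_value) AS total_spend
        FROM 'orders-*.parquet'       -- glob over many parquet files on disk
        GROUP BY customer_id
    ) TO 'customer_totals.parquet' (FORMAT parquet)
""")
```

The same FROM-a-filename pattern works for csv and json files too, so a cleaning or aggregation job is often just a single SQL statement pointed at the raw files.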