The problem
We want to find the top 5 American airports with the largest average (mean) delay on domestic flights.
Data
We will use the Data Expo 2009: Airline on time data dataset from the Harvard Dataverse. The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is around 120 million records, divided into 22 CSV files, one per year, plus 4 auxiliary CSV files that we will not use here. The total size of the dataset on disk is around 13 GB. The original data comes compressed, but decompression is not considered part of the pipeline here.
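To make the goal concrete, here is a minimal sketch of the computation in plain Python, streaming rows so the whole dataset never has to sit in memory. It is only an illustration, not the pipeline used in this post. The column names "Origin" and "DepDelay", the choice of departure delay at the origin airport, the "NA" marker for missing values, and the helper names top_mean_delays and iter_csv_rows are all assumptions made for the example.

```python
import csv
import heapq
import io
from collections import defaultdict


def top_mean_delays(rows, n=5, airport_col="Origin", delay_col="DepDelay"):
    """Return the n airports with the largest mean delay as (airport, mean) pairs.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column names are assumptions; adjust them to the real files.
    """
    totals = defaultdict(lambda: [0.0, 0])  # airport -> [sum of delays, count]
    for row in rows:
        try:
            delay = float(row.get(delay_col, ""))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric delays (assumed "NA" or empty)
        entry = totals[row[airport_col]]
        entry[0] += delay
        entry[1] += 1
    means = {airport: s / c for airport, (s, c) in totals.items()}
    return heapq.nlargest(n, means.items(), key=lambda kv: kv[1])


def iter_csv_rows(paths, encoding="latin-1"):
    """Yield rows from several CSV files, one file at a time.

    The encoding is an assumption chosen so that decoding never fails.
    """
    for path in paths:
        with open(path, newline="", encoding=encoding) as f:
            yield from csv.DictReader(f)


# Tiny in-memory sample standing in for the real yearly files.
SAMPLE = """Origin,DepDelay
AAA,10
AAA,20
BBB,5
CCC,NA
CCC,40
DDD,-3
"""

if __name__ == "__main__":
    print(top_mean_delays(csv.DictReader(io.StringIO(SAMPLE)), n=2))
```

On the real data, the same function could be fed with iter_csv_rows over the 22 yearly files. Keeping only a running sum and count per airport means memory use grows with the number of airports, not with the number of flights.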
Environment
The available hardware for the job is a single computer with the following specs:
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Memory: LPDDR3 15820512 kB (16 GB) 2133 MT/s, no swap
Disk: KXG50ZNV512G NVMe TOSHIBA 512GB (ext4 non-encrypted partition)