It’s been roughly two years of work on this and I think things are in a good enough state that it’s worth calling this v1.
Features
Typed dataframes
We got there eventually, and I think we got there in a way that still looks nice. There is now a DataFrame.Typed API that tracks the entire schema of the dataframe: wrong column names, misapplied operations, etc. are now compile-time failures, and you can easily move between exploratory and pipeline work. This is in large part thanks to maxigit and mcoady (GitHub usernames) for their feedback.
    $(DT.deriveSchemaFromCsvFile "Housing" "./data/housing.csv")

    main :: IO ()
    main = do
        df <- D.readCsv "./data/housing.csv"
        let df' = either (error . show) id (DT.freezeWithError @Housing df)
        let df'' =
                df'
                    & DT.derive @"rooms_per_household" (DT.col @"total_rooms" / DT.col @"households")
                    & DT.impute @"total_bedrooms" 0
                    & DT.derive @"bedrooms_per_household" (DT.col @"total_bedrooms" / DT.col @"households")
                    & DT.derive @"population_per_household" (DT.col @"population" / DT.col @"households")
        print df''
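For anyone curious how a schema can live in the types at all, here is a toy sketch of the general technique. It is not how DataFrame.Typed is implemented: `Frame`, `HasCol`, `KnownCols`, `freeze`, `col` and `derive` below are all hypothetical stand-ins written for this illustration, and every column is a list of `Double` to keep it short. The idea is that one runtime check ("freezing") attaches a type-level list of column names, after which a lookup of an unknown column is rejected by the compiler.

```haskell
{-# LANGUAGE AllowAmbiguousTypes #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}

module Main (main) where

import Data.Kind (Constraint)
import Data.Maybe (fromMaybe)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (ErrorMessage (..), KnownSymbol, Symbol, TypeError, symbolVal)

-- Untyped frame (hypothetical): column names are only known at runtime.
type Untyped = [(String, [Double])]

-- Typed frame (hypothetical): same data, tagged with the columns it must contain.
newtype Frame (cols :: [Symbol]) = Frame Untyped

-- Compile-time membership check that reports a readable error.
type family HasCol (name :: Symbol) (cols :: [Symbol]) :: Constraint where
  HasCol name (name ': rest) = ()
  HasCol name (other ': rest) = HasCol name rest
  HasCol name '[] = TypeError ('Text "no column named " ':<>: 'ShowType name)

-- Reflect the type-level column list back into runtime strings.
class KnownCols (cols :: [Symbol]) where
  colNames :: [String]

instance KnownCols '[] where
  colNames = []

instance (KnownSymbol c, KnownCols cs) => KnownCols (c ': cs) where
  colNames = symbolVal (Proxy @c) : colNames @cs

-- The one runtime check: does the data really have every promised column?
freeze :: forall cols. KnownCols cols => Untyped -> Either String (Frame cols)
freeze raw = case filter (`notElem` map fst raw) (colNames @cols) of
  [] -> Right (Frame raw)
  missing -> Left ("missing columns: " ++ show missing)

-- Column access only type-checks when the name is in the schema.
col :: forall name cols. (KnownSymbol name, HasCol name cols) => Frame cols -> [Double]
col (Frame raw) = fromMaybe [] (lookup (symbolVal (Proxy @name)) raw)

-- Adding a column extends the schema in the result type.
derive ::
  forall name cols.
  KnownSymbol name =>
  (Frame cols -> [Double]) ->
  Frame cols ->
  Frame (name ': cols)
derive f frame@(Frame raw) = Frame ((symbolVal (Proxy @name), f frame) : raw)

main :: IO ()
main = do
  let raw = [("total_rooms", [6, 8]), ("households", [2, 4])]
  case freeze @'["total_rooms", "households"] raw of
    Left err -> putStrLn err
    Right df -> do
      let df' =
            derive @"rooms_per_household"
              (\f -> zipWith (/) (col @"total_rooms" f) (col @"households" f))
              df
      print (col @"rooms_per_household" df') -- prints [3.0,2.0]
      -- print (col @"rooms" df') -- rejected at compile time: no column named "rooms"
```

Uncommenting the last line turns a typo into a build failure rather than a runtime exception, which is the property the typed API is after. A real implementation also has to track column types, not just names.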
Calling dataframe from Python
There’s an implementation of Apache Arrow’s C Data Interface, along with an example of how to pass dataframes between Polars and Haskell.
Find that here
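To give a rough idea of what sits underneath that: the C Data Interface exchanges plain C structs (`ArrowSchema` and `ArrowArray`) by pointer, so the consumer mostly reads fields at fixed offsets. The sketch below is not the library's code. `SchemaInfo` and `readSchema` are hypothetical names, the offsets assume a 64-bit platform, and it fakes a producer by filling a scratch buffer itself instead of receiving a pointer from Python.

```haskell
module Main (main) where

import Data.Int (Int64)
import Foreign.C.String (CString, peekCString, withCString)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, nullPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Opaque tag for `struct ArrowSchema` from the Arrow C Data Interface.
data ArrowSchema

-- The few fields this sketch reads (hypothetical record, for illustration only).
data SchemaInfo = SchemaInfo
  { fieldFormat :: String -- Arrow format string, e.g. "g" for float64
  , fieldName :: String
  , childCount :: Int64
  }
  deriving (Eq, Show)

-- Field offsets in `struct ArrowSchema`, assuming 8-byte pointers:
--   format 0, name 8, metadata 16, flags 24, n_children 32,
--   children 40, dictionary 48, release 56, private_data 64 (72 bytes total).
readSchema :: Ptr ArrowSchema -> IO SchemaInfo
readSchema p = do
  fmtPtr <- peekByteOff p 0 :: IO CString
  namePtr <- peekByteOff p 8 :: IO CString
  n <- peekByteOff p 32 :: IO Int64
  fmt <- peekCString fmtPtr
  -- The name is optional in the spec, so it may be NULL.
  name <- if namePtr == nullPtr then pure "" else peekCString namePtr
  pure (SchemaInfo fmt name n)

main :: IO ()
main =
  -- Stand-in for a producer such as Polars: fill a scratch struct by hand.
  allocaBytes 72 $ \p ->
    withCString "g" $ \fmt ->
      withCString "households" $ \name -> do
        pokeByteOff p 0 fmt
        pokeByteOff p 8 name
        pokeByteOff p 32 (0 :: Int64)
        info <- readSchema p
        print info
```

A real consumer must also call the struct's `release` callback once it is done with the data, which this sketch leaves out.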
Getting data from Hugging Face
You can explore Hugging Face datasets. Example: