Tech News

Dataframe 1.0.0.0

Why This Matters

Dataframe 1.0.0.0 marks a significant milestone by introducing typed dataframes with schema validation, seamless integration with Python via Apache Arrow, and improved performance for large datasets. These advancements enhance data reliability, interoperability, and efficiency, benefiting both developers and data consumers in the tech industry.

Key Takeaways

It’s been roughly two years of work on this and I think things are in a good enough state that it’s worth calling this v1.

Features

Typed dataframes

We got there eventually, and I think we got there in a way that still looks nice. There is now a DataFrame.Typed API that tracks the entire schema of the dataframe: mistakes such as misspelled column names or misapplied operations are now compile-time failures, and you can easily move between exploratory and pipeline work. This is in large part thanks to maxigit and mcoady (GitHub usernames) for their feedback.

```haskell
$(DT.deriveSchemaFromCsvFile "Housing" "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  let df' = either (error . show) id (DT.freezeWithError @Housing df)
  let df'' = df'
        & DT.derive @"rooms_per_household" (DT.col @"total_rooms" / DT.col @"households")
        & DT.impute @"total_bedrooms" 0
        & DT.derive @"bedrooms_per_household" (DT.col @"total_bedrooms" / DT.col @"households")
        & DT.derive @"population_per_household" (DT.col @"population" / DT.col @"households")
  print df''
```
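To make the compile-time guarantee concrete, here is a minimal sketch continuing from `df'` above. The misspelled column name is deliberate, and the error comment is illustrative rather than the library's exact message:

```haskell
-- "totl_rooms" is a deliberate typo. Because DT.col is checked
-- against the derived Housing schema at the type level, this
-- definition does not compile: the mistake surfaces as a GHC type
-- error at build time rather than a runtime exception on real data.
broken = df'
  & DT.derive @"rooms_per_typo" (DT.col @"totl_rooms" / DT.col @"households")
```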

Calling dataframe from Python

There’s an implementation of Apache Arrow’s C Data interface, along with an example of how to pass dataframes between Polars and Haskell.

Find that here

Getting data from hugging face

You can explore Hugging Face datasets. Example:

... continue reading