Smallpond – A lightweight data processing framework built on DuckDB and 3FS
Published on: 2025-07-06 06:56:35
Features
🚀 High-performance data processing powered by DuckDB
🌍 Scalable to handle PB-scale datasets
🛠️ Easy operations with no long-running services
Installation
Python 3.8 to 3.12 is supported.
pip install smallpond
Quick Start
# Download example data
wget https://duckdb.org/data/prices.parquet
import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")

# Show results
print(df.to_pandas())
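The Quick Start repartitions by hash of "ticker" before running a GROUP BY on each partition. The sketch below illustrates why that ordering matters, using only the Python standard library rather than smallpond itself: when every row for a given ticker lands in the same partition, per-partition aggregates can be combined without any cross-partition merge step. The helper names (hash_partition, min_max_by_ticker), the CRC32 hash, and the sample rows are assumptions made for illustration; smallpond's actual hashing and execution are not shown here.

```python
import zlib


def hash_partition(rows, num_partitions, key):
    """Assign each row to a partition based on a stable hash of row[key].

    Hypothetical helper: CRC32 is used only because it is deterministic
    across runs; it is not necessarily what smallpond uses.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Equivalent of: SELECT ticker, min(price), max(price) ... GROUP BY ticker."""
    result = {}
    for row in rows:
        ticker, price = row["ticker"], row["price"]
        lo, hi = result.get(ticker, (price, price))
        result[ticker] = (min(lo, price), max(hi, price))
    return result


# Made-up sample data standing in for prices.parquet.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 6.0},
    {"ticker": "CCC", "price": 4.5},
]

partitions = hash_partition(rows, 3, "ticker")

# Aggregate each partition independently, then take the union of the results.
# No ticker appears in two partitions, so a plain dict update is sufficient.
merged = {}
for part in partitions:
    merged.update(min_max_by_ticker(part))

print(merged == min_max_by_ticker(rows))
```

If the data were split arbitrarily instead (for example, round-robin), the same ticker could appear in several partitions, and the per-partition minimum and maximum would need a second aggregation pass to be correct.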
Documentation
For detailed guides and API reference:
Performance
We evaluated smallpond using the GraySort benchmark (script) on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5TiB of data in 3