How large should your sample size be?
Published on: 2025-06-08 19:26:54
Aug 4 2015
I read a recent interview with Hadley Wickham. Two things stood out to me.
The first is how down-to-earth he seems, despite how well known he is in the data science community.
The second is this quote:
Big data problems [are] actually small data problems, once you have the right subset/sample/summary. Inventing numbers on the spot, I’d say 90% of big data problems fall into this category.
## The technical side of sampling
Even if you don’t have huge data sets (which I personally define as anything over 10 GB or 5 million rows, whichever comes first), you will usually hit a point where even a fast computer processes the data too slowly in memory, especially if you’re using R. Things get slower still if you’re processing data remotely, as is usually the case when pulling data down from a relational database. (I’m not considering Hadoop or other NoSQL solutions in this post, since they’re a different animal entirely.)
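To make the idea concrete, here is a minimal sketch of drawing the sample inside the database so that only the sampled rows are transferred. It is written in Python, with an in-memory SQLite database standing in for a remote relational database. The `events` table, its columns, and the `fetch_sample` helper are all hypothetical names made up for illustration, and `ORDER BY RANDOM()` is just one simple way to sample, not necessarily the most efficient one on a large table.

```python
import sqlite3


def fetch_sample(conn, table, n):
    """Return n randomly chosen rows from `table`, sampled inside the database."""
    # Table names cannot be bound as query parameters, so this sketch
    # assumes `table` comes from trusted code, not from user input.
    query = f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?"
    return conn.execute(query, (n,)).fetchall()


# Assumption: in-memory SQLite stands in for a remote relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO events (id, value) VALUES (?, ?)",
    ((i, i * 0.5) for i in range(10_000)),
)

# Only 500 of the 10,000 rows ever leave the database.
sample = fetch_sample(conn, "events", 500)
print(len(sample))
```

The database still has to scan the table to shuffle it, so this saves transfer time and client memory rather than server work. Many databases offer cheaper sampling mechanisms, but their syntax varies by vendor.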
In the cases where pulling down data takes longer than runni