Tech News

Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

Why This Matters

This live-updated Hacker News archive gives researchers, developers, and industry analysts near-real-time access to more than 47 million items dating back to 2006. Because it is published as Parquet and refreshed every 5 minutes, it can be queried efficiently to study technology trends, community discussions, and influential topics, which makes it a useful resource for understanding how the tech industry has evolved.

Hacker News - Complete Archive

Every Hacker News item since 2006, live-updated every 5 minutes

What is it?

This dataset contains the complete Hacker News archive: every story, comment, Ask HN, Show HN, job posting, and poll ever submitted to the site. Hacker News is one of the longest-running and most influential technology communities on the internet, operated by Y Combinator since 2007. It has become the de facto gathering place for founders, engineers, researchers, and technologists to share and discuss what matters in technology.

The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed. New items are fetched every 5 minutes and committed directly as individual Parquet files through an automated live pipeline, so the dataset stays current with the site itself.

We believe this is one of the most complete and regularly updated mirrors of Hacker News data available on Hugging Face. The data is stored as monthly Parquet files sorted by item ID, making it straightforward to query with DuckDB, load with the datasets library, or process with any tool that reads Parquet.
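As a rough sketch of the DuckDB route, the snippet below counts rows across one year of monthly files. It assumes the files have already been downloaded under a local root directory; the helper names and that local layout are illustrative rather than part of the dataset's documentation, and `duckdb` has to be installed separately.

```python
def monthly_glob(root: str, year: int) -> str:
    """Glob matching every monthly file of one year under an assumed local root."""
    return f"{root}/{year}/{year}-*.parquet"


def count_items_sql(glob: str) -> str:
    """SQL that counts rows across all Parquet files matching the glob."""
    return f"SELECT count(*) AS n FROM read_parquet('{glob}')"


def count_items(root: str, year: int) -> int:
    """Run the count with DuckDB (imported lazily so the helpers above work without it)."""
    import duckdb

    sql = count_items_sql(monthly_glob(root, year))
    return duckdb.sql(sql).fetchone()[0]
```

Calling `count_items("data", 2007)` would then scan only the 2007 files, which is the main benefit of the one-file-per-month layout: a query touches just the months it needs.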

What is being released?

The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for today's activity. Every 5 minutes, new items are fetched from the source and committed directly as a single Parquet block. At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.
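To make the live-block naming concrete, here is a minimal sketch that maps a timestamp to the 5-minute block it would fall into under the `today/YYYY/MM/DD/HH/MM.parquet` scheme. It assumes each block is named after the start of its 5-minute window, which the description does not state explicitly; the function name is hypothetical.

```python
from datetime import datetime, timezone


def live_block_path(ts: datetime) -> str:
    """Path of the 5-minute block containing ts.

    Assumption: a block is named by the start of its window.
    ts must be timezone-aware; it is converted to UTC first.
    """
    ts = ts.astimezone(timezone.utc)
    minute = ts.minute - ts.minute % 5  # floor to the 5-minute boundary
    return f"today/{ts:%Y/%m/%d/%H}/{minute:02d}.parquet"
```

For example, a timestamp of 2026-03-16 23:57 UTC maps to `today/2026/03/16/23/55.parquet`. After the midnight rollover described above, that block would no longer exist and the same item would be found in the monthly file instead.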

data/
    2006/2006-10.parquet        first month with HN data
    2006/2006-12.parquet
    2007/2007-01.parquet
    ...
    2026/2026-02.parquet        most recent complete month
    2026/2026-03.parquet        current month, ongoing until 2026-03-15
today/
    2026/03/16/00/00.parquet    5-min live blocks (YYYY/MM/DD/HH/MM.parquet)
    2026/03/16/00/05.parquet
    ...
    2026/03/16/23/55.parquet    most recent committed block
stats.csv                       one row per committed month
stats_today.csv                 one row per committed 5-min block
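Given that layout, the monthly files covering a date range can be enumerated directly. The sketch below is illustrative only (the function name is hypothetical) and generates candidate paths of the form `data/YYYY/YYYY-MM.parquet`; since the listing itself skips some early months, callers should check that each file exists before reading it.

```python
from datetime import date


def month_paths(start: date, end: date) -> list[str]:
    """Candidate monthly file paths from start's month through end's month, inclusive."""
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        paths.append(f"data/{year}/{year}-{month:02d}.parquet")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return paths
```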

Along with the Parquet files, we include stats.csv, which records each committed month's item count, ID range, file size, fetch duration, and commit timestamp. This makes it easy to verify completeness and track the pipeline's progress.
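One way such a completeness check might look is sketched below. The column names `month`, `first_id`, and `last_id` are assumptions made for illustration; the actual header of stats.csv is not given here, so they would need to be adapted to the real file.

```python
import csv
import io


def find_id_gaps(stats_csv: str) -> list[tuple[str, int, int]]:
    """Return (month, expected_first_id, actual_first_id) for every month whose
    ID range does not start right after the previous month's last ID.

    Column names are hypothetical: month ("YYYY-MM"), first_id, last_id.
    """
    rows = list(csv.DictReader(io.StringIO(stats_csv)))
    rows.sort(key=lambda r: r["month"])  # "YYYY-MM" strings sort chronologically
    gaps = []
    for prev, cur in zip(rows, rows[1:]):
        expected = int(prev["last_id"]) + 1
        actual = int(cur["first_id"])
        if actual != expected:
            gaps.append((cur["month"], expected, actual))
    return gaps
```

A reported gap is not necessarily a pipeline error, since some IDs may simply have no retrievable item, but it points to the months worth inspecting first.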
