Hacker News vector search dataset
The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated using the SentenceTransformers model all-MiniLM-L6-v2. Each embedding vector has 384 dimensions.
This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on top of user-generated textual data.
The complete dataset with vector embeddings is made available by ClickHouse as a single Parquet file in an S3 bucket.
We recommend first running a sizing exercise to estimate the storage and memory requirements for this dataset, as described in the documentation.
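As a rough starting point for that sizing exercise, the raw size of the embeddings alone can be estimated from the figures above. The sketch below is only a back-of-the-envelope calculation: it assumes each vector component is stored as a 4-byte float (the storage type is not stated here), uses the rounded count of 28.74 million vectors, and ignores the text columns, compression, and any vector index overhead. The function name `raw_vector_bytes` is hypothetical and exists only for illustration.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Return the uncompressed size in bytes of the embedding values alone.

    bytes_per_value defaults to 4, which ASSUMES 32-bit floats; pass 2 to
    model a 16-bit representation instead.
    """
    return num_vectors * dimensions * bytes_per_value


NUM_VECTORS = 28_740_000  # 28.74 million postings (rounded figure from the text)
DIMENSIONS = 384          # output dimension of all-MiniLM-L6-v2

total = raw_vector_bytes(NUM_VECTORS, DIMENSIONS)
print(f"Raw embeddings at 4 bytes per value: {total / 1e9:.2f} GB")

half = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, bytes_per_value=2)
print(f"Raw embeddings at 2 bytes per value: {half / 1e9:.2f} GB")
```

Under these assumptions the embeddings come to roughly 44 GB at 4 bytes per value and half that at 2 bytes per value. Treat the result as a lower bound for the vector data only; the documentation's sizing guidance remains the authoritative source for index and memory requirements.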