28M Hacker News comments as vector embedding search dataset

Hacker News vector search dataset

The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated with the SentenceTransformers model all-MiniLM-L6-v2, and each embedding vector has 384 dimensions.

This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on user-generated textual data.
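
To make the idea of vector search concrete, here is a minimal sketch of an exact (brute-force) nearest-neighbour lookup by cosine distance. It is an illustration only: the random vectors stand in for real embeddings, the corpus size of 1,000 is arbitrary, and the function names `cosine_distance` and `top_k` are hypothetical. At 28.74 million rows, a real application would more likely rely on the database's own distance functions and an approximate index than on a Python loop like this.

```python
import math
import random

DIM = 384  # embedding dimension stated in the article


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Scan every vector and return the k closest as (distance, index) pairs."""
    scored = [(cosine_distance(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]


# Assumption: random Gaussian vectors as placeholders for real embeddings.
rng = random.Random(0)
corpus = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(1000)]

# Use an existing row as the query, so it should be its own nearest neighbour.
query = corpus[42]
results = top_k(query, corpus, k=3)
for distance, index in results:
    print(f"row {index}: cosine distance {distance:.4f}")
```

The scan is linear in the number of rows, which is why the sizing and index design questions matter once the corpus grows to tens of millions of vectors.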

ClickHouse makes the complete dataset, including the vector embeddings, available as a single Parquet file in an S3 bucket.

We recommend that users first run a sizing exercise, following the documentation, to estimate the storage and memory requirements for this dataset.
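
As a rough starting point for such a sizing exercise, the arithmetic below estimates only the raw, uncompressed size of the embedding vectors. The row count and dimension come from the article; the element widths (4, 2, and 1 bytes) are assumed storage choices for illustration, and the estimate ignores the text columns, compression, and any memory used by a vector index, all of which the documentation's own sizing method would need to account for.

```python
NUM_VECTORS = 28_740_000  # 28.74 million postings, from the article
DIMENSIONS = 384          # all-MiniLM-L6-v2 embedding size, from the article

# Assumption: candidate element widths in bytes, chosen for illustration.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "int8": 1}


def raw_vector_bytes(num_vectors, dimensions, bytes_per_element):
    """Uncompressed size of the vectors alone, with no index or metadata."""
    return num_vectors * dimensions * bytes_per_element


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


for name, width in BYTES_PER_ELEMENT.items():
    size = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, width)
    print(f"{name}: {size:,} bytes = {to_gib(size):.2f} GiB")
```

With 4-byte floats, the vectors alone come to 44,144,640,000 bytes, or about 41.1 GiB, which already indicates whether the data can be held in memory on a given machine.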