28M Hacker News comments as vector embedding search dataset

Hacker News vector search dataset

The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated using the SentenceTransformers model all-MiniLM-L6-v2, and each embedding vector has 384 dimensions.

This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on user-generated textual data.

ClickHouse makes the complete dataset, including the vector embeddings, available as a single Parquet file in an S3 bucket.

We recommend that users first run a sizing exercise, referring to the documentation, to estimate the storage and memory requirements for this dataset.
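
As a rough starting point for such a sizing exercise, the raw size of the embeddings alone follows from the row count and the dimension given above. The sketch below is a back-of-the-envelope calculation, not the procedure from the documentation. The element widths (4 bytes for float32, 2 bytes for bfloat16) are assumptions about how the vectors might be stored, and the helper name `raw_vector_bytes` is hypothetical. The result excludes the text columns, compression, and any vector index overhead.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Return the uncompressed size, in bytes, of the embedding values alone."""
    return num_vectors * dimensions * bytes_per_value


# Figures taken from the dataset description.
NUM_VECTORS = 28_740_000  # 28.74 million postings
DIMENSIONS = 384          # all-MiniLM-L6-v2 output dimension

# Assumed storage widths; the actual column type may differ.
for label, width in (("float32", 4), ("bfloat16", 2)):
    size = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, width)
    print(f"{label}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

Under the float32 assumption, the vectors alone come to roughly 44 GB before compression. This is a lower bound to refine against the documentation, since an in-memory vector index adds its own overhead.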