Building a web search engine from scratch in two months with 3 billion neural embeddings
A while back, I decided to undertake a project to challenge myself: build a web search engine from scratch. Aside from the fun deep dive opportunity, there were two motivators:
Search engines seemed to be getting worse, with more SEO spam and less relevant quality content.
Transformer-based text embedding models were taking off and showing amazing natural comprehension of language.
A simple question I had was: why couldn't a search engine always surface top-quality content? Such content may be rare, but the Internet's tail is long, and better-quality results should rank higher than the prolific inorganic content and engagement bait you see today.
Another pain point was that search engines often felt underpowered, closer to keyword matching than human-level intelligence. A reasonably complex or subtle query couldn't be answered by most search engines at all, but the ability to do so would be powerful.
Search engines cover broad areas of computer science, linguistics, ontology, NLP, ML, distributed systems, performance engineering, and so on. I thought it'd be interesting to see how much I could learn and cover in a short period. Plus, it'd be cool to have my own search engine. Given all these points, I dived right in.
In this post, I go over the 2-month journey end-to-end, starting with no infrastructure, no bootstrapped data, and no prior experience building a web search engine. Some highlights:
A cluster of 200 GPUs generated a combined 3 billion SBERT embeddings (see the sketch after this list).
At peak, hundreds of crawlers ingested 50K pages per second, culminating in an index of 280 million pages.
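For context on what the embedding step involves, here is a minimal sketch of producing SBERT embeddings with the sentence-transformers library; the model name, passages, and batch size are illustrative assumptions, not the setup the post actually used.

```python
# Minimal sketch: turn page text into dense SBERT vectors.
# The model choice below is a hypothetical example, not the one from the post.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Transformer-based embeddings capture semantic similarity beyond keyword overlap.",
    "A crawler fetches pages, extracts text, and queues outgoing links for later visits.",
]

# encode() returns one dense vector per passage; these vectors are what a
# semantic index stores and compares against an embedded query at search time.
embeddings = model.encode(passages, batch_size=32, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this particular model
```

At the scale described above, this encoding step would presumably be batched and sharded across the GPU cluster, but the per-passage operation is the same.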