Tech News
← Back to articles

Hamming Distance for Hybrid Search in SQLite

read original related products more articles

Hamming Distance for Hybrid Search in SQLite

This article shows how I implemented semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid search without external vector databases.

SQLite's FTS5 extension provides excellent text search with BM25 ranking. However, it doesn't support semantic search, which means you can't combine keyword matching with meaning-based retrieval, a technique known as hybrid search.

Binary embeddings and Hamming distance

Semantic search works by converting text into numerical vectors (embeddings) that capture meaning (in my case, I do that via an API). Similar texts (i.e., talking about the same things) produce similar vectors. Embeddings generally use float32 values, that is, a 1024-dimensional embedding requires 4KiB per document.

Binary embeddings quantize each dimension to a single bit (0 or 1). This reduces storage dramatically, because 1024 dimensions become only 128 bytes. The similarity metric changes from cosine distance to Hamming distance, so it also allows using fast simple bit operations vs more expensive floating-point arithmetic.

The tradeoff is accuracy: binary quantization loses information. For many applications (including mine), this is acceptable given the storage and speed benefits, especially if it's combined with another classic text search algorithm like BM25.

Hamming Distance

Hamming distance counts the number of bit positions where two binary vectors differ.

Example with 8-bit vectors:

... continue reading