Blaze
A high-performance full-text search engine in Go with inverted indexing, boolean queries, phrase search, proximity queries, and BM25 ranking—powered by a flexible query engine, roaring bitmaps, and skip lists.
Overview
Blaze is a Go engine that provides fast, full-text search capabilities through an inverted index implementation. It's designed for applications that need to search through text documents efficiently without relying on external search engines.
Key Highlights:

- Inverted Index: Maps terms to document positions for instant lookups
- Skip Lists: Probabilistic data structure providing O(log n) operations
- Query Builder: Type-safe, fluent API for boolean queries with roaring bitmaps
- Advanced Search: Phrase search, BM25 ranking, proximity ranking, and boolean queries
- BM25 Algorithm: Industry-standard relevance scoring with IDF and length normalization
- Text Analysis: Tokenization, stemming, stopword filtering, and case normalization
- Thread-Safe: Concurrent indexing with mutex protection
- Serialization: Efficient binary format for persistence
Features
Search Capabilities
- Term Search: Find documents containing specific terms
- Phrase Search: Exact multi-word matching ("quick brown fox")
- Boolean Queries: Type-safe AND, OR, NOT operations with query builder
- BM25 Ranking: Industry-standard relevance scoring (used by Elasticsearch, Solr)
- Proximity Ranking: Score results by term proximity
- Position Tracking: Track exact word positions within documents
- Roaring Bitmaps: Compressed bitmap operations for fast boolean queries
Text Processing
- Tokenization: Unicode-aware text splitting
- Stemming: Snowball (Porter2) stemmer for English
- Stopword Filtering: Remove common words (the, a, is, etc.)
- Case Normalization: Case-insensitive search
- Configurable Pipeline: Customize analysis behavior
Data Structures
- Skip Lists: O(log n) search, insert, and delete operations
- Inverted Index: Efficient term-to-position mapping
- Binary Serialization: Compact storage format
Installation
```bash
go get github.com/wizenheimer/blaze
```
Quick Start
```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	// Create a new inverted index
	idx := blaze.NewInvertedIndex()

	// Index some documents
	idx.Index(1, "The quick brown fox jumps over the lazy dog")
	idx.Index(2, "A quick brown dog runs fast")
	idx.Index(3, "The lazy cat sleeps all day")

	// Search for documents containing "quick" and "brown"
	matches := idx.RankProximity("quick brown", 10)

	// Print results
	for _, match := range matches {
		fmt.Printf("Document %d (score: %.2f)\n", int(match.Offsets[0].DocumentID), match.Score)
	}
}
```
Output:
```
Document 2 (score: 1.00)
Document 1 (score: 0.50)
```
Core Concepts
Inverted Index
An inverted index is like the index at the back of a book. Instead of scanning every document to find a word, the index tells you exactly where each word appears.
Example:
Given these documents:
```
Doc 1: "the quick brown fox"   Pos: 0 1 2 3
Doc 2: "the lazy dog"          Pos: 0 1 2
Doc 3: "quick brown dogs"      Pos: 0 1 2
```
The inverted index looks like:
```
┌─────────┬────────────────────────────────────┐
│ Token   │ Posting List                       │
├─────────┼────────────────────────────────────┤
│ "quick" │ → [Doc1:Pos1] → [Doc3:Pos0]        │
│ "brown" │ → [Doc1:Pos2] → [Doc3:Pos1]        │
│ "fox"   │ → [Doc1:Pos3]                      │
│ "lazy"  │ → [Doc2:Pos1]                      │
│ "dog"   │ → [Doc2:Pos2]                      │
│ "dogs"  │ → [Doc3:Pos2]                      │
└─────────┴────────────────────────────────────┘
```
Visual Representation:
```
                 Inverted Index
                  ┌──────────┐
                  │   Map    │
                  │ [string] │
                  │ SkipList │
                  └────┬─────┘
                       │
      ┌────────────────┼────────────────┐
      │                │                │
      ▼                ▼                ▼
   "quick"          "brown"           "fox"
   SkipList         SkipList         SkipList
   ┌──────┐         ┌──────┐         ┌──────┐
   │ HEAD │         │ HEAD │         │ HEAD │
   └──┬───┘         └──┬───┘         └──┬───┘
      │                │                │
      ▼                ▼                ▼
   ┌──────┐         ┌──────┐         ┌──────┐
   │Doc1:1│         │Doc1:2│         │Doc1:3│
   └──┬───┘         └──┬───┘         └──────┘
      │                │
      ▼                ▼
   ┌──────┐         ┌──────┐
   │Doc3:0│         │Doc3:1│
   └──────┘         └──────┘
```
Benefits:
- Instant term lookups (no document scanning)
- Phrase search via position checking
- Proximity ranking by measuring distances
- Efficient boolean queries (AND, OR, NOT)
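Conceptually, the whole structure is a map from token to posting list. Here is a minimal sketch with illustrative types only (Blaze actually keeps a skip list per token, as described next, and its real Position type differs):

```go
// Illustrative types only; Blaze's real Position type and storage differ.
type Position struct {
	DocumentID int
	Offset     int
}

// ConceptualIndex maps each token to its posting list.
type ConceptualIndex map[string][]Position

var index = ConceptualIndex{
	"quick": {{DocumentID: 1, Offset: 1}, {DocumentID: 3, Offset: 0}},
	"brown": {{DocumentID: 1, Offset: 2}, {DocumentID: 3, Offset: 1}},
	"fox":   {{DocumentID: 1, Offset: 3}},
}
```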
Skip Lists
A skip list is a probabilistic data structure that maintains sorted data with O(log n) average time complexity for search, insertion, and deletion.
Visual Representation:
```
Skip List with Multiple Levels (Express Lanes)
═══════════════════════════════════════════════════════════════

Level 3: HEAD ─────────────────────────────────────> [30] ──> NULL
Level 2: HEAD ──────────────────> [15] ────────────> [30] ──> NULL
Level 1: HEAD ─────────> [10] ──> [15] ──> [20] ───> [30] ──> NULL
Level 0: HEAD ──> [5] ──> [10] ──> [15] ──> [20] ──> [25] ──> [30] ──> [35] ──> NULL
                          (ALL NODES AT LEVEL 0)

┌───────┐
│ Node  │   Each node has a "tower" of forward pointers
├───────┤
│ Key   │   Example: Node [15]
├───────┤
│ Lvl 3 │ ──> [30]  (skip far ahead)
│ Lvl 2 │ ──> [30]  (skip ahead)
│ Lvl 1 │ ──> [20]  (skip a little)
│ Lvl 0 │ ──> [20]  (next node)
└───────┘
```
How Heights are Assigned (Probabilistic):
```
Coin Flip Algorithm:

┌─────────┬─────────────┬─────────────┐
│ Height  │ Probability │ Visual      │
├─────────┼─────────────┼─────────────┤
│ 1       │ 50%         │ ▓▓▓▓▓       │
│ 2       │ 25%         │ ▓▓▓         │
│ 3       │ 12.5%       │ ▓▓          │
│ 4       │ 6.25%       │ ▓           │
└─────────┴─────────────┴─────────────┘

For 1000 nodes, expected distribution:
Level 0: ~1000 nodes (all)  ████████████████████████████████████████
Level 1:  ~500 nodes        ████████████████████
Level 2:  ~250 nodes        ██████████
Level 3:  ~125 nodes        █████
Level 4:   ~62 nodes        ██
```
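A minimal sketch of such a coin-flip height generator (a hypothetical `randomHeight` helper; Blaze's internal random source may differ):

```go
package main

import (
	"fmt"
	"math/rand"
)

const maxHeight = 32 // mirrors the MaxHeight constant noted under Configuration

// randomHeight grants each extra level with probability 1/2,
// producing the geometric distribution shown above.
func randomHeight() int {
	height := 1
	for height < maxHeight && rand.Intn(2) == 0 {
		height++
	}
	return height
}

func main() {
	counts := make(map[int]int)
	for i := 0; i < 1000; i++ {
		counts[randomHeight()]++
	}
	for h := 1; h <= 5; h++ {
		fmt.Printf("height %d: %d nodes\n", h, counts[h])
	}
}
```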
Search Algorithm (finding 20):
```
Step-by-Step Search for Key = 20:

Level 3: [HEAD] ────────────────────────────> [30]   (30 > 20, drop down)
Level 2: [HEAD] ──────────────> [15] ───────> [30]   (15 < 20, advance)
Level 2: [15] ─────────> [30]                        (30 > 20, drop down)
Level 1: [15] ──> [20]                               (20 = 20, FOUND!)
                  ^^^^

Journey Recorded:
┌───────────┬──────┐
│ Level 3   │ HEAD │   Predecessor at each level
│ Level 2   │ [15] │   Used for insertions/deletions
│ Level 1   │ [15] │
│ Level 0   │ [15] │
└───────────┴──────┘
```
1. Start at HEAD, Level 3
2. Level 3: Move to 30? No (30 > 20), drop to Level 2
3. Level 2: Move to 15? Yes (15 < 20), advance to 15
4. Level 2: Move to 30? No (30 > 20), drop to Level 1
5. Level 1: Move to 20? Yes! Found it!
Time Complexity: O(log n) on average
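The same walk can be expressed compactly over a simplified integer-keyed skip list (illustrative types; Blaze's nodes are keyed by Position):

```go
// node is a simplified skip-list node keyed by int.
type node struct {
	key     int
	forward []*node // forward[i] is the next node at level i
}

// search starts at the top level, advances while the next key is still
// smaller than the target, then drops a level; level 0 holds every node.
func search(head *node, target int) *node {
	cur := head
	for level := len(head.forward) - 1; level >= 0; level-- {
		for cur.forward[level] != nil && cur.forward[level].key < target {
			cur = cur.forward[level] // ride the express lane
		}
	}
	if next := cur.forward[0]; next != nil && next.key == target {
		return next
	}
	return nil // not found
}
```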
Why Skip Lists?
- O(log n) operations without complex balancing
- Simpler than AVL or Red-Black trees
- Better cache locality than trees
- Easier to make lock-free for concurrency
- Used in Redis, LevelDB, and other databases
Text Analysis Pipeline
Blaze transforms raw text into searchable tokens through a multi-stage pipeline:
Pipeline Stages:
```
┌─────────────────────────────────────────┐
│          Text Analysis Pipeline         │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ 1. Tokenization                         │
│    Split on non-alphanumeric chars      │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ 2. Lowercasing                          │
│    Normalize case ("Quick" → "quick")   │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ 3. Stopword Filtering                   │
│    Remove common words (the, a, is)     │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ 4. Length Filtering                     │
│    Remove tokens < 2 chars              │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ 5. Stemming (Snowball/Porter2)          │
│    Reduce to root ("running" → "run")   │
└────────────────────┬────────────────────┘
                     ▼
                Final Tokens
```
Example Transformation:
```
Input: "The Quick Brown Fox Jumps!"
  │
  ├─ Step 1: Tokenization
  │  └─> ["The", "Quick", "Brown", "Fox", "Jumps"]
  │
  ├─ Step 2: Lowercasing
  │  └─> ["the", "quick", "brown", "fox", "jumps"]
  │
  ├─ Step 3: Stopword Filtering (remove "the")
  │  └─> ["quick", "brown", "fox", "jumps"]
  │
  ├─ Step 4: Length Filtering (all pass >= 2 chars)
  │  └─> ["quick", "brown", "fox", "jumps"]
  │
  └─ Step 5: Stemming ("jumps" → "jump")
     └─> ["quick", "brown", "fox", "jump"]
```
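A rough sketch of these five stages using only the standard library (assumes `strings` and `unicode` imports). The real pipeline uses a Snowball (Porter2) stemmer; stemming is reduced here to a toy suffix rule purely for illustration:

```go
// analyzeSketch mirrors the pipeline stages above, with a toy stemmer.
func analyzeSketch(text string) []string {
	stopwords := map[string]bool{"the": true, "a": true, "is": true}
	var out []string
	tokens := strings.FieldsFunc(text, func(r rune) bool { // 1. tokenize
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	for _, tok := range tokens {
		tok = strings.ToLower(tok) // 2. lowercase
		if stopwords[tok] {        // 3. stopword filtering
			continue
		}
		if len(tok) < 2 { // 4. length filtering
			continue
		}
		out = append(out, strings.TrimSuffix(tok, "s")) // 5. toy "stemming"
	}
	return out
}
```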
Configuration:
```go
// Use default configuration
tokens := blaze.Analyze("The quick brown fox")

// Custom configuration
config := blaze.AnalyzerConfig{
	MinTokenLength:  3,     // Only keep tokens >= 3 chars
	EnableStemming:  false, // Disable stemming
	EnableStopwords: true,  // Keep stopword filtering
}
tokens = blaze.AnalyzeWithConfig("The quick brown fox", config)
```
Search Operations
1. Basic Term Search
Find all occurrences of a single term:
```go
idx := blaze.NewInvertedIndex()
idx.Index(1, "the quick brown fox")
idx.Index(2, "quick brown dogs")

// Find first occurrence of "quick"
pos, err := idx.First("quick")
if err == nil {
	fmt.Printf("Found at Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))
}

// Find next occurrence
nextPos, _ := idx.Next("quick", pos)
```
2. Phrase Search
Find exact sequences of words:
```go
// Find documents containing "quick brown fox" as a phrase
matches := idx.FindAllPhrases("quick brown fox", blaze.BOFDocument)

for _, match := range matches {
	start, end := match[0], match[1]
	fmt.Printf("Found in Doc %d from Pos %d to %d\n",
		int(start.DocumentID), int(start.Offset), int(end.Offset))
}
```
Algorithm:
```
Searching for phrase: "brown fox"

Document:  "the quick brown dog jumped over the brown fox"
Positions:   0    1     2    3     4     5    6    7    8

Phase 1: Find END (last word "fox")
┌─────────────────────────────────────────────────────────┐
│ Find "brown"           → Doc:Pos2                       │
│ Find "fox" after Pos2  → Doc:Pos8  ← END position       │
└─────────────────────────────────────────────────────────┘

Phase 2: Walk BACKWARDS from END to find START
┌─────────────────────────────────────────────────────────┐
│ From Pos9, find previous "brown" → Doc:Pos7 ← START     │
└─────────────────────────────────────────────────────────┘

Phase 3: Validate
┌─────────────────────────────────────────────────────────┐
│ Start: Pos7, End: Pos8                                  │
│ Distance: 8 - 7 = 1                                     │
│ Expected: 2 words - 1 = 1  → MATCH!                     │
│                                                         │
│   "brown"  "fox"                                        │
│     ▲        ▲                                          │
│    Pos7     Pos8  (consecutive positions)               │
└─────────────────────────────────────────────────────────┘
```
1. Find END: Locate the last word of the phrase
2. Walk BACKWARDS: Find previous occurrences of earlier words
3. Validate: Check if positions are consecutive
4. Recurse: Continue searching for more matches
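In terms of the Next and Previous iterators documented in the API reference, the core of the algorithm looks roughly like this (a sketch: error handling, EOF checks, and the recursive continuation are elided):

```go
// nextPhraseSketch finds one candidate match for the analyzed phrase
// tokens after pos, using the END-then-backwards strategy.
func nextPhraseSketch(idx *blaze.InvertedIndex, terms []string, pos blaze.Position) (start, end blaze.Position, ok bool) {
	// Phase 1: chain Next calls so cur lands on the phrase's last word.
	cur := pos
	for _, t := range terms {
		cur, _ = idx.Next(t, cur)
	}
	end = cur

	// Phase 2: walk backwards to the earliest word before end.
	for i := len(terms) - 2; i >= 0; i-- {
		cur, _ = idx.Previous(terms[i], cur)
	}
	start = cur

	// Phase 3: consecutive positions in the same document = a match.
	ok = start.GetDocumentID() == end.GetDocumentID() &&
		end.GetOffset()-start.GetOffset() == len(terms)-1
	return start, end, ok
}
```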
3. Proximity Search
Find documents containing all terms (not necessarily consecutive):
```go
// Find documents with both "quick" and "fox"
cover := idx.NextCover([]string{"quick", "fox"}, blaze.BOFDocument)
start, end := cover[0], cover[1]

// Calculate proximity score
distance := end.Offset - start.Offset
score := 1.0 / distance // Closer terms = higher score
```
Cover Algorithm:
```
Searching for: ["quick", "fox"]  (any order, not necessarily consecutive)

Document:  "the quick brown dog jumped over the lazy fox"
Positions:   0    1     2    3     4     5    6   7    8

Phase 1: Find COVER END (furthest term)
┌──────────────────────────────────────────────────────────────┐
│ Find "quick" after BOF → Doc:Pos1                            │
│ Find "fox"   after BOF → Doc:Pos8  ← FURTHEST (cover end)    │
└──────────────────────────────────────────────────────────────┘

Phase 2: Find COVER START (earliest term before end)
┌──────────────────────────────────────────────────────────────┐
│ Find "quick" before Pos9 → Doc:Pos1 ← EARLIEST (cover start) │
│ Find "fox"   before Pos9 → Doc:Pos8                          │
└──────────────────────────────────────────────────────────────┘

Phase 3: Validate & Return
┌──────────────────────────────────────────────────────────────┐
│ Cover: [Pos1, Pos8]                                          │
│ Same document?     Yes                                       │
│ All terms present? Yes                                       │
│                                                              │
│ "quick" ... ... ... ... ... ... ... "fox"                    │
│    ▲                                  ▲                      │
│   Pos1                               Pos8                    │
│    └────────── Cover Range ───────────┘                      │
│                                                              │
│ Proximity Score: 1 / (8 - 1 + 1) = 1/8 = 0.125               │
└──────────────────────────────────────────────────────────────┘
```
1. Find FURTHEST occurrence of any term (cover end)
2. Find EARLIEST occurrence of each term before end (cover start)
3. Validate all terms are in the same document
4. Return [start, end] positions
4. BM25 Ranking
BM25 (Best Matching 25) is a probabilistic ranking function used by search engines to estimate the relevance of documents to a given search query. It's the industry standard used by Elasticsearch, Solr, and Lucene.
```go
// Search and rank using BM25
results := idx.RankBM25("machine learning", 10)

for _, match := range results {
	fmt.Printf("Doc %d: Score %.2f\n", match.DocID, match.Score)
}
```
What BM25 Considers:
```
+------------------+-------------------------------------------------+
| Factor           | Description                                     |
+------------------+-------------------------------------------------+
| Term Frequency   | How often does the term appear?                 |
|                  | More occurrences = higher relevance             |
+------------------+-------------------------------------------------+
| TF Saturation    | Diminishing returns:                            |
|                  | 3->10 occurrences matters less than 0->3        |
+------------------+-------------------------------------------------+
| Document Length  | Normalize by document size;                     |
|                  | prevents long docs from dominating results      |
+------------------+-------------------------------------------------+
| Term Rarity      | Rare terms are more important than common ones: |
|                  | "quantum" > "the" in importance                 |
+------------------+-------------------------------------------------+
```
Complete BM25 Formula:
```
                           IDF(q_i) × TF(q_i, D) × (k1 + 1)
BM25(D, Q) =  SUM   ─────────────────────────────────────────────
             q_i∈Q   TF(q_i, D) + k1 × (1 - b + b × |D| / avgdl)

Where:
  D    = Document being scored
  Q    = Query (set of terms q_1, q_2, ..., q_n)
  q_i  = Individual query term
```
Component Breakdown:
```
+-------------------+-----------------------------------------------------+
| Component         | Definition                                          |
+-------------------+-----------------------------------------------------+
| IDF(q_i)          | Inverse Document Frequency                          |
|                   |                                                     |
|                   |         N - df(q_i) + 0.5                           |
|                   |  log( ───────────────────── + 1 )                   |
|                   |           df(q_i) + 0.5                             |
|                   |                                                     |
|                   | N  = Total documents in corpus                      |
|                   | df = Documents containing term q_i                  |
|                   |                                                     |
|                   | Effect: Rare terms get higher weights               |
+-------------------+-----------------------------------------------------+
| TF(q_i, D)        | Term Frequency                                      |
|                   | = Number of times q_i appears in document D         |
|                   |                                                     |
|                   | Effect: More occurrences = higher relevance         |
+-------------------+-----------------------------------------------------+
| k1                | Term Frequency Saturation Parameter                 |
|                   | = 1.5 (default), range [1.2, 2.0]                   |
|                   |                                                     |
|                   | Effect: Controls diminishing returns;               |
|                   | higher k1 = less saturation                         |
+-------------------+-----------------------------------------------------+
| b                 | Length Normalization Parameter                      |
|                   | = 0.75 (default), range [0, 1]                      |
|                   |                                                     |
|                   | Effect: Controls length penalty;                    |
|                   | b=1 → full normalization, b=0 → none                |
+-------------------+-----------------------------------------------------+
| |D|               | Document Length                                     |
|                   | = Number of terms in document D                     |
+-------------------+-----------------------------------------------------+
| avgdl             | Average Document Length                             |
|                   | = Total terms / Total documents                     |
+-------------------+-----------------------------------------------------+
```
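The formula is compact enough to check by hand. A self-contained sketch of one term's contribution (illustrative code, not Blaze's internal scorer):

```go
package main

import (
	"fmt"
	"math"
)

// bm25Term computes a single query term's contribution to a document's
// score, following the formula above.
func bm25Term(tf, df, totalDocs, docLen, avgdl, k1, b float64) float64 {
	idf := math.Log((totalDocs-df+0.5)/(df+0.5) + 1)
	num := tf * (k1 + 1)
	den := tf + k1*(1-b+b*docLen/avgdl)
	return idf * num / den
}

func main() {
	// Numbers from the worked example below: 1000 docs, avgdl = 150,
	// a 200-word document; "machine" tf=3 df=100, "learning" tf=2 df=50.
	machine := bm25Term(3, 100, 1000, 200, 150, 1.5, 0.75)
	learning := bm25Term(2, 50, 1000, 200, 150, 1.5, 0.75)
	fmt.Printf("machine: %.2f  learning: %.2f  total: %.2f\n",
		machine, learning, machine+learning) // ≈ 3.54, 3.86, 7.40
}
```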
Visual Example - Term Frequency Saturation:
```
Score Contribution (with k1=1.5, b=0.75)

  ^
  |                    /---------------  (saturation)
3 |               /
  |            /
2 |         /
  |       /
1 |     /
  |   /
0 |_/
  +---+---+---+---+---+---+---+---+---+---+--> Term Frequency
  0   1   2   3   4   5   6   7   8   9   10

Key Insight: Going from 0->3 occurrences adds more to the score
than going from 7->10 occurrences (diminishing returns)
```
Visual Example - Document Length Normalization:
```
Scenario: Same term frequency, different document lengths

Document A: 100 words,  "learning" appears 3 times
Document B: 1000 words, "learning" appears 3 times

Raw TF:  Both have TF = 3
Density: Doc A = 3/100  = 3.0%  <- Higher density
         Doc B = 3/1000 = 0.3%  <- Lower density

BM25 adjusts:
  Doc A gets HIGHER score (term is more prominent)
  Doc B gets LOWER score  (term is less prominent)

Length Penalty Formula:
  Penalty = k1 × (1 - b + b × docLen/avgDocLen)

  If docLen > avgDocLen: Penalty increases (score decreases)
  If docLen < avgDocLen: Penalty decreases (score increases)
```
Step-by-Step Scoring Example:
```
SETUP:
------
Query:  "machine learning"
Corpus: 1000 documents, average length 150 words
Target: Document 1 (200 words)
        - "machine"  appears 3 times (df=100 docs have "machine")
        - "learning" appears 2 times (df=50 docs have "learning")
Parameters: k1=1.5, b=0.75

STEP 1: Calculate IDF for each term
----------------------------------------
IDF(machine):
  N = 1000, df = 100
  IDF = log((1000 - 100 + 0.5) / (100 + 0.5) + 1)
      = log(900.5 / 100.5 + 1)
      = log(8.96 + 1)
      = log(9.96)
      ≈ 2.30

IDF(learning):
  N = 1000, df = 50
  IDF = log((1000 - 50 + 0.5) / (50 + 0.5) + 1)
      = log(950.5 / 50.5 + 1)
      = log(18.82 + 1)
      = log(19.82)
      ≈ 2.99

Note: "learning" is rarer (df=50) than "machine" (df=100),
      so it gets a higher IDF weight

STEP 2: Calculate normalized TF for "machine"
----------------------------------------------
TF = 3 (appears 3 times)
docLen = 200, avgdl = 150

Numerator   = TF × (k1 + 1) = 3 × (1.5 + 1) = 3 × 2.5 = 7.5
Denominator = TF + k1 × (1 - b + b × (docLen / avgdl))
            = 3 + 1.5 × (1 - 0.75 + 0.75 × (200/150))
            = 3 + 1.5 × (0.25 + 0.75 × 1.333)
            = 3 + 1.5 × (0.25 + 1.0)
            = 3 + 1.5 × 1.25
            = 3 + 1.875
            = 4.875

Normalized TF = 7.5 / 4.875 ≈ 1.54
Contribution  = IDF × Normalized TF = 2.30 × 1.54 ≈ 3.54

STEP 3: Calculate normalized TF for "learning"
-----------------------------------------------
TF = 2 (appears 2 times)
docLen = 200, avgdl = 150

Numerator   = 2 × 2.5 = 5.0
Denominator = 2 + 1.5 × (1 - 0.75 + 0.75 × (200/150))
            = 2 + 1.875
            = 3.875

Normalized TF = 5.0 / 3.875 ≈ 1.29
Contribution  = IDF × Normalized TF = 2.99 × 1.29 ≈ 3.86

STEP 4: Calculate final BM25 score
-----------------------------------
BM25(Document 1, "machine learning") = 3.54 + 3.86 = 7.40

+----------+----------+
| Term     | Score    |
+----------+----------+
| machine  | 3.54     |
| learning | 3.86     |
+----------+----------+
| TOTAL    | 7.40     |
+----------+----------+
```
Why BM25 Works:
```
+----------------------+--------------------------------------------------+
| Advantage            | Explanation                                      |
+----------------------+--------------------------------------------------+
| Industry Standard    | Used by Elasticsearch, Solr, Lucene;             |
|                      | battle-tested in production systems              |
+----------------------+--------------------------------------------------+
| Probabilistic        | Based on the probability ranking principle;      |
|                      | solid theoretical foundation                     |
+----------------------+--------------------------------------------------+
| Term Rarity (IDF)    | Rare terms contribute more to the score:         |
|                      | "quantum" > "the" in importance                  |
+----------------------+--------------------------------------------------+
| Saturation           | Diminishing returns for repeated terms:          |
|                      | 0->3 occurrences: HIGH impact                    |
|                      | 7->10 occurrences: LOW impact                    |
+----------------------+--------------------------------------------------+
| Length Normalization | Prevents long documents from dominating;         |
|                      | adjusts for document size bias                   |
+----------------------+--------------------------------------------------+
| Tunable              | Adjust k1 and b for domain-specific needs;       |
|                      | customize behavior without changing algorithm    |
+----------------------+--------------------------------------------------+
```
Comparison with Simple TF-IDF:
```
Simple TF-IDF:  Score = TF × IDF

Problem: Linear relationship with TF;
         10 occurrences = 10x the score of 1 occurrence

TF-IDF Score
   ^
   |                             /
10 |                        /
   |                   /
 5 |              /
   |         /
 0 |____/
   +---+---+---+---+---+---+---+---+---> Term Frequency
   0   2   4   6   8   10  12  14  16

BM25:  Score = IDF × (TF × (k1 + 1)) / (TF + k1 × length_norm)

Benefit: Sublinear relationship with TF;
         saturation prevents spam

BM25 Score
   ^
   |            /----------------  (plateau)
 4 |        /
   |      /
 2 |    /
   |  /
 0 |_/
   +---+---+---+---+---+---+---+---+---> Term Frequency
   0   2   4   6   8   10  12  14  16

Key: BM25 saturates, preventing keyword-stuffing exploits
```
5. Proximity Ranking
Score and rank documents by term proximity:
```go
// Search and rank results
matches := idx.RankProximity("machine learning", 10)

for _, match := range matches {
	fmt.Printf("Doc %d: Score %.2f\n", int(match.Offsets[0].DocumentID), match.Score)
}
```
Scoring Formula:
```
For each cover in a document:
    score += 1 / (coverEnd - coverStart + 1)

┌────────────────────────────────────────────────────────────────┐
│ Proximity Scoring Examples                                     │
├────────────────────────────────────────────────────────────────┤
│ Doc 1: "machine learning is machine learning"                  │
│   Pos:     0        1     2     3        4                     │
│                                                                │
│ Cover 1: [Pos 0-1] → score += 1/(1-0+1) = 1/2 = 0.500          │
│ Cover 2: [Pos 3-4] → score += 1/(4-3+1) = 1/2 = 0.500          │
│                                 ─────────────────────          │
│                                 Total Score: 1.000             │
├────────────────────────────────────────────────────────────────┤
│ Doc 2: "learning about machine and learning"                   │
│   Pos:     0       1      2      3     4                       │
│                                                                │
│ Cover 1: [Pos 0-2] → score += 1/(2-0+1) = 1/3 = 0.333          │
│ Cover 2: [Pos 2-4] → score += 1/(4-2+1) = 1/3 = 0.333          │
│                                 ─────────────────────          │
│                                 Total Score: 0.666             │
├────────────────────────────────────────────────────────────────┤
│ Doc 3: "machine ... ... ... ... learning"                      │
│   Pos:     0     1   2   3   4     5                           │
│                                                                │
│ Cover 1: [Pos 0-5] → score += 1/(5-0+1) = 1/6 = 0.167          │
│                                 ─────────────────────          │
│                                 Total Score: 0.167             │
└────────────────────────────────────────────────────────────────┘

Ranking: Doc 1 (1.000) > Doc 2 (0.666) > Doc 3 (0.167)
            ▲               ▲               ▲
       Terms closest   Terms medium   Terms far apart
```
Why This Works:
- Smaller distances → larger scores (inverse relationship)
- Multiple occurrences → higher scores (additive)
- Documents with terms close together rank higher
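As code, the accumulation loop looks roughly like this (a sketch built on the NextCover iterator above; it assumes the float64-backed Position fields implied by the int(...) casts used throughout this README):

```go
// rankByProximitySketch accumulates 1/(cover length) per cover, per document.
func rankByProximitySketch(idx *blaze.InvertedIndex, terms []string) map[int]float64 {
	scores := make(map[int]float64)
	cover := idx.NextCover(terms, blaze.BOFDocument)
	for !cover[0].IsEnd() {
		start, end := cover[0], cover[1]
		scores[int(start.DocumentID)] += 1.0 / (end.Offset - start.Offset + 1)
		cover = idx.NextCover(terms, start) // resume past this cover's start
	}
	return scores
}
```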
Query Builder API
The Query Builder provides a type-safe, fluent API for constructing complex boolean queries with roaring bitmaps. No string parsing, no syntax errors - just clean, composable code.
Why Use Builder Pattern
String Parsing Approach:
```go
// Error-prone, runtime failures
results, err := index.ExecuteQuery("(machine AND learning) OR python")
if err != nil {
	// Handle parsing errors
}
```
Builder Pattern Approach:
```go
// Type-safe, compile-time checks, IDE autocomplete!
results := blaze.NewQueryBuilder(index).
	Group(func(q *blaze.QueryBuilder) {
		q.Term("machine").And().Term("learning")
	}).
	Or().
	Term("python").
	Execute()
```
Query Builder Quick Start
Single Term Query
```go
// Find all documents containing "machine"
results := blaze.NewQueryBuilder(idx).
	Term("machine").
	Execute()

fmt.Printf("Found %d documents\n", results.GetCardinality())
```
AND Query
```go
// Find documents with BOTH "machine" AND "learning"
results := blaze.NewQueryBuilder(idx).
	Term("machine").
	And().
	Term("learning").
	Execute()
```
OR Query
```go
// Find documents with "python" OR "javascript"
results := blaze.NewQueryBuilder(idx).
	Term("python").
	Or().
	Term("javascript").
	Execute()
```
NOT Query
```go
// Find documents with "python" but NOT "snake"
results := blaze.NewQueryBuilder(idx).
	Term("python").
	And().
	Not().
	Term("snake").
	Execute()
```
Complex Grouped Query
```go
// (machine OR deep) AND learning
results := blaze.NewQueryBuilder(idx).
	Group(func(q *blaze.QueryBuilder) {
		q.Term("machine").Or().Term("deep")
	}).
	And().
	Term("learning").
	Execute()
```
Phrase Query
```go
// Find exact phrase "machine learning"
results := blaze.NewQueryBuilder(idx).
	Phrase("machine learning").
	Execute()
```
BM25 Ranked Results
```go
// Get top 10 results ranked by relevance
matches := blaze.NewQueryBuilder(idx).
	Term("machine").
	And().
	Term("learning").
	ExecuteWithBM25(10)

for _, match := range matches {
	fmt.Printf("Doc %d: score=%.2f\n", match.DocID, match.Score)
}
```
Query Builder Core Methods
NewQueryBuilder

Creates a new query builder instance.

```go
qb := blaze.NewQueryBuilder(idx)
```
Term

Adds a single term to the query. Uses roaring bitmaps for O(1) document lookup.

```go
qb.Term("machine")
```
Phrase

Adds an exact phrase match. Combines bitmap efficiency with skip list position checking.

```go
qb.Phrase("machine learning")
```
And

Combines results with intersection (both must match). Uses the bitmap AND operation.

```go
qb.Term("machine").And().Term("learning")
```
Or

Combines results with union (either can match). Uses the bitmap OR operation.

```go
qb.Term("cat").Or().Term("dog")
```
Not

Negates the next term (exclude from results). Uses the bitmap difference operation.

```go
qb.Term("python").And().Not().Term("snake")
```
Group

Creates a sub-query with its own scope for precedence control.

```go
qb.Group(func(q *blaze.QueryBuilder) {
	q.Term("machine").Or().Term("deep")
}).And().Term("learning")
```
Execute

Executes the query and returns a bitmap of matching document IDs.

```go
results := qb.Execute()
docCount := results.GetCardinality()
```
ExecuteWithBM25

Executes the query with BM25 ranking and returns the top results.

```go
matches := qb.ExecuteWithBM25(10) // Top 10 results
```
Boolean Operations
The Query Builder provides convenient shorthand functions for common boolean operations:
AllOf(index *InvertedIndex, terms ...string) *roaring.Bitmap
Shorthand for documents containing ALL terms (AND operation).
```go
// Find documents with "machine" AND "learning" AND "python"
results := blaze.AllOf(idx, "machine", "learning", "python")

// Equivalent to:
results = blaze.NewQueryBuilder(idx).
	Term("machine").
	And().
	Term("learning").
	And().
	Term("python").
	Execute()
```
AnyOf(index *InvertedIndex, terms ...string) *roaring.Bitmap
Shorthand for documents containing ANY term (OR operation).
```go
// Find documents with "cat" OR "dog" OR "bird"
results := blaze.AnyOf(idx, "cat", "dog", "bird")

// Equivalent to:
results = blaze.NewQueryBuilder(idx).
	Term("cat").
	Or().
	Term("dog").
	Or().
	Term("bird").
	Execute()
```
TermExcluding(index *InvertedIndex, include string, exclude string) *roaring.Bitmap
Shorthand for term with exclusion (AND NOT operation).
```go
// Find documents with "python" but NOT "snake"
results := blaze.TermExcluding(idx, "python", "snake")

// Equivalent to:
results = blaze.NewQueryBuilder(idx).
	Term("python").
	And().
	Not().
	Term("snake").
	Execute()
```
Query Patterns
Pattern 1: Broad to Narrow
Start with a broad category, then filter down with specific criteria.
```go
// Find programming content about Python or JavaScript, excluding beginner material
results := blaze.NewQueryBuilder(idx).
	Term("programming").
	And().
	Group(func(q *blaze.QueryBuilder) {
		q.Term("python").Or().Term("javascript")
	}).
	And().
	Not().
	Term("beginner").
	ExecuteWithBM25(10)
```
Pattern 2: Multi-Criteria Matching
Match documents that satisfy multiple independent criteria.
```go
// Find documents about (machine learning OR deep learning) AND (python OR tensorflow)
results := blaze.NewQueryBuilder(idx).
	Group(func(q *blaze.QueryBuilder) {
		q.Phrase("machine learning").Or().Phrase("deep learning")
	}).
	And().
	Group(func(q *blaze.QueryBuilder) {
		q.Term("python").Or().Term("tensorflow")
	}).
	ExecuteWithBM25(20)
```
Pattern 3: Exclusion Filtering
Find relevant content while filtering out noise or unwanted categories.
```go
// Find "apple" content but exclude fruit/food related content
results := blaze.NewQueryBuilder(idx).
	Term("apple").
	And().
	Not().
	Group(func(q *blaze.QueryBuilder) {
		q.Term("fruit").Or().Term("food").Or().Term("cooking")
	}).
	Execute()
// Finds "Apple Inc." not the fruit
```
Pattern 4: Category-Based Search
Search within specific categories or tags.
```go
func SearchWithCategory(idx *blaze.InvertedIndex, query string, categories []string) []blaze.Match {
	qb := blaze.NewQueryBuilder(idx)

	// Add main query
	qb.Term(query)

	// Add category filter if provided
	if len(categories) > 0 {
		qb.And().Group(func(q *blaze.QueryBuilder) {
			q.Term(categories[0])
			for i := 1; i < len(categories); i++ {
				q.Or().Term(categories[i])
			}
		})
	}

	return qb.ExecuteWithBM25(20)
}
```
Query Builder Performance
The Query Builder leverages roaring bitmaps for exceptional performance on boolean operations.
Benchmarks (Apple M2)
```
BenchmarkQueryBuilder_Simple-8     440,616 ops/sec    2,511 ns/op      896 B/op    39 allocs/op
BenchmarkQueryBuilder_Complex-8    222,024 ops/sec    5,333 ns/op    2,240 B/op    98 allocs/op
BenchmarkQueryBuilder_WithBM25-8   411,124 ops/sec    2,955 ns/op    1,416 B/op    46 allocs/op
```
Performance Benefits
| Operation | Complexity | Why It's Fast |
| --- | --- | --- |
| AND | O(1) per chunk | Roaring bitmap intersection |
| OR | O(1) per chunk | Roaring bitmap union |
| NOT | O(1) per chunk | Roaring bitmap difference |
| Term Lookup | O(1) | Direct hash map access |
Compression Benefits
For a term appearing in 500,000 documents:
- Skip list positions: ~24 MB (500k nodes × 48 bytes)
- Roaring bitmap: ~60 KB (400x compression!)
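The effect is easy to reproduce with the roaring library itself (github.com/RoaringBitmap/roaring): a dense run of 500,000 document IDs serializes to roughly the size quoted above.

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// Simulate a term appearing in 500,000 documents with dense IDs.
	bm := roaring.New()
	for docID := uint32(0); docID < 500_000; docID++ {
		bm.Add(docID)
	}
	fmt.Printf("documents: %d\n", bm.GetCardinality())
	fmt.Printf("serialized: ~%d KB\n", bm.GetSerializedSizeInBytes()/1024)
}
```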
Query Builder Best Practices
1. Use Groups for Complex Logic
```go
// Good: Clear precedence with groups
qb.Group(func(q *blaze.QueryBuilder) {
	q.Term("a").Or().Term("b")
}).And().Term("c")

// Bad: Ambiguous without groups
qb.Term("a").Or().Term("b").And().Term("c")
// Is this (a OR b) AND c, or a OR (b AND c)?
```
2. Leverage Convenience Functions for Simple Cases
```go
// Good: Clean and readable
results := blaze.AllOf(idx, "python", "django", "web")

// Bad: Verbose for a simple case
results = blaze.NewQueryBuilder(idx).
	Term("python").
	And().
	Term("django").
	And().
	Term("web").
	Execute()
```
3. Use BM25 for User-Facing Searches
```go
// Good: Ranked results for users
matches := qb.ExecuteWithBM25(10)

// Bad: Unranked - harder for users to find relevant docs
bitmap := qb.Execute()
```
4. Combine Phrases and Terms Strategically
```go
// Good: Exact phrase + related term
qb.Phrase("machine learning").And().Term("python")

// Bad: Overly restrictive
qb.Phrase("machine learning python") // Requires exact phrase
```
5. Build Queries Programmatically
```go
func BuildDynamicQuery(idx *blaze.InvertedIndex, required []string, optional []string, excluded []string) *roaring.Bitmap {
	qb := blaze.NewQueryBuilder(idx)

	// Add required terms (AND)
	if len(required) > 0 {
		qb.Term(required[0])
		for i := 1; i < len(required); i++ {
			qb.And().Term(required[i])
		}
	}

	// Add optional terms (OR)
	if len(optional) > 0 {
		if len(required) > 0 {
			qb.And()
		}
		qb.Group(func(q *blaze.QueryBuilder) {
			q.Term(optional[0])
			for i := 1; i < len(optional); i++ {
				q.Or().Term(optional[i])
			}
		})
	}

	// Exclude terms (NOT)
	for _, term := range excluded {
		qb.And().Not().Term(term)
	}

	return qb.Execute()
}
```
Real-World Query Builder Examples
Example 1: E-commerce Search with Filters
```go
func SearchProducts(idx *blaze.InvertedIndex, searchTerm string, category string, excludeOutOfStock bool) []blaze.Match {
	qb := blaze.NewQueryBuilder(idx).Term(searchTerm)

	// Add category filter
	if category != "" {
		qb.And().Term(category)
	}

	// Exclude out-of-stock items
	if excludeOutOfStock {
		qb.And().Not().Term("outofstock")
	}

	return qb.ExecuteWithBM25(20)
}
```
Example 2: Multi-Category Search
```go
func SearchInCategories(idx *blaze.InvertedIndex, query string, categories []string) []blaze.Match {
	qb := blaze.NewQueryBuilder(idx).Term(query)

	if len(categories) > 0 {
		qb.And().Group(func(q *blaze.QueryBuilder) {
			q.Term(categories[0])
			for i := 1; i < len(categories); i++ {
				q.Or().Term(categories[i])
			}
		})
	}

	return qb.ExecuteWithBM25(50)
}
```
Example 3: Content Filtering with Blocklist
```go
func FilterContent(idx *blaze.InvertedIndex, searchTerm string, blocklist []string) *roaring.Bitmap {
	qb := blaze.NewQueryBuilder(idx).Term(searchTerm)

	for _, blocked := range blocklist {
		qb.And().Not().Term(blocked)
	}

	return qb.Execute()
}
```
Example 4: Advanced Search with Multiple Phrases
```go
func AdvancedSearch(idx *blaze.InvertedIndex, phrases []string, requiredTerms []string) []blaze.Match {
	qb := blaze.NewQueryBuilder(idx)

	// Match any of the phrases (OR)
	qb.Group(func(q *blaze.QueryBuilder) {
		q.Phrase(phrases[0])
		for i := 1; i < len(phrases); i++ {
			q.Or().Phrase(phrases[i])
		}
	})

	// AND with required terms
	for _, term := range requiredTerms {
		qb.And().Term(term)
	}

	return qb.ExecuteWithBM25(10)
}

// Usage:
results := AdvancedSearch(idx,
	[]string{"machine learning", "deep learning"},
	[]string{"python", "tensorflow"})
```
Example 5: HTTP API Integration
```go
func SearchHandler(w http.ResponseWriter, r *http.Request) {
	query := r.URL.Query().Get("q")
	category := r.URL.Query().Get("category")
	exclude := r.URL.Query().Get("exclude")

	qb := blaze.NewQueryBuilder(index).Term(query)

	if category != "" {
		qb.And().Term(category)
	}
	if exclude != "" {
		qb.And().Not().Term(exclude)
	}

	results := qb.ExecuteWithBM25(20)
	json.NewEncoder(w).Encode(results)
}
```
Example 6: Semantic-Style Search
```go
func SemanticSearch(idx *blaze.InvertedIndex, concept string, relatedTerms []string) []blaze.Match {
	qb := blaze.NewQueryBuilder(idx)

	// Main concept OR any related terms
	qb.Term(concept)
	for _, related := range relatedTerms {
		qb.Or().Term(related)
	}

	return qb.ExecuteWithBM25(50)
}

// Usage:
results := SemanticSearch(idx, "automobile",
	[]string{"car", "vehicle", "transportation", "automotive"})
```
API Reference
InvertedIndex
NewInvertedIndex
```go
func NewInvertedIndex() *InvertedIndex
```
Creates a new empty inverted index.
Example:
```go
idx := blaze.NewInvertedIndex()
```
Index
```go
func (idx *InvertedIndex) Index(docID int, document string)
```
Adds a document to the inverted index. Thread-safe.
Parameters:
- docID: Unique document identifier
- document: Text content to index
Example:
```go
idx.Index(1, "The quick brown fox jumps over the lazy dog")
idx.Index(2, "A fast brown dog")
```
What Happens:
1. Text is analyzed (tokenized, stemmed, etc.)
2. Each token is recorded with its position
3. Positions are stored in skip lists for fast lookup
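Conceptually the flow looks like this sketch (the real method also updates BM25 statistics and takes a write lock; the float64 conversions reflect the int(...) casts applied to Position fields elsewhere in this README):

```go
// indexSketch shows the analyze → record-positions flow for one document.
func indexSketch(postings map[string]*blaze.SkipList, docID int, document string) {
	tokens := blaze.Analyze(document) // 1. analyze into tokens
	for offset, token := range tokens {
		sl, ok := postings[token]
		if !ok {
			sl = blaze.NewSkipList() // 2. one posting skip list per token
			postings[token] = sl
		}
		// 3. record (docID, offset) in the token's posting list
		sl.Insert(blaze.Position{DocumentID: float64(docID), Offset: float64(offset)})
	}
}
```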
First
```go
func (idx *InvertedIndex) First(token string) (Position, error)
```
Returns the first occurrence of a token in the index.
Example:
```go
pos, err := idx.First("quick")
if err != nil {
	// Token not found
}
fmt.Printf("Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))
```
Returns:
- Position: Location of first occurrence
- error: ErrNoPostingList if the token doesn't exist
Last
```go
func (idx *InvertedIndex) Last(token string) (Position, error)
```
Returns the last occurrence of a token in the index.
Example:
```go
pos, err := idx.Last("quick")
```
Next
```go
func (idx *InvertedIndex) Next(token string, currentPos Position) (Position, error)
```
Finds the next occurrence of a token after the given position.
Example:
```go
// Iterate through all occurrences
pos := blaze.BOFDocument
var err error
for {
	pos, err = idx.Next("quick", pos)
	if err != nil || pos.IsEnd() {
		break
	}
	fmt.Printf("Found at Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))
}
```
Previous
```go
func (idx *InvertedIndex) Previous(token string, currentPos Position) (Position, error)
```
Finds the previous occurrence of a token before the given position.
NextPhrase
```go
func (idx *InvertedIndex) NextPhrase(query string, startPos Position) []Position
```
Finds the next occurrence of a phrase (exact word sequence).
Parameters:
- query: Space-separated phrase (e.g., "quick brown fox")
- startPos: Position to start searching from
Returns:
- []Position: Array with two elements, [phraseStart, phraseEnd]
- Returns [EOFDocument, EOFDocument] if no match is found
Example:
```go
matches := idx.NextPhrase("quick brown fox", blaze.BOFDocument)
if !matches[0].IsEnd() {
	fmt.Printf("Phrase found in Doc %d from Pos %d to %d\n",
		int(matches[0].DocumentID), int(matches[0].Offset), int(matches[1].Offset))
}
```
FindAllPhrases
```go
func (idx *InvertedIndex) FindAllPhrases(query string, startPos Position) [][]Position
```
Finds all occurrences of a phrase in the entire index.
Example:
```go
allMatches := idx.FindAllPhrases("brown fox", blaze.BOFDocument)
for _, match := range allMatches {
	fmt.Printf("Doc %d: Pos %d-%d\n",
		int(match[0].DocumentID), int(match[0].Offset), int(match[1].Offset))
}
```
NextCover
```go
func (idx *InvertedIndex) NextCover(tokens []string, startPos Position) []Position
```
Finds the next "cover" - a range containing all given tokens.
Parameters:
- tokens: Array of search terms
- startPos: Position to start searching from
Returns:
- []Position: Array with [coverStart, coverEnd]
Example:
```go
cover := idx.NextCover([]string{"quick", "fox", "brown"}, blaze.BOFDocument)
fmt.Printf("Cover: Doc %d, Pos %d-%d\n",
	int(cover[0].DocumentID), int(cover[0].Offset), int(cover[1].Offset))
```
RankBM25
```go
func (idx *InvertedIndex) RankBM25(query string, maxResults int) []Match
```
Performs BM25 ranking of search results. This is the recommended search function for most use cases.
Parameters:
- query: Search query (e.g., "machine learning")
- maxResults: Maximum number of results to return
Returns:
- []Match: Sorted array of matches with BM25 scores
Example:
```go
results := idx.RankBM25("machine learning", 10)
for i, match := range results {
	fmt.Printf("%d. Doc %d (score: %.2f)\n", i+1, match.DocID, match.Score)
}
```
Match Structure:
```go
type Match struct {
	DocID   int        // Document identifier
	Offsets []Position // Where terms appear in the document
	Score   float64    // BM25 relevance score
}
```
RankProximity
```go
func (idx *InvertedIndex) RankProximity(query string, maxResults int) []Match
```
Performs proximity-based ranking of search results. Alternative to BM25, ranks by term proximity.
Parameters:
- query: Search query (e.g., "machine learning")
- maxResults: Maximum number of results to return
Returns:
- []Match: Sorted array of matches with proximity scores
Example:
```go
results := idx.RankProximity("quick brown", 5)
for i, match := range results {
	fmt.Printf("%d. Doc %d (score: %.2f)\n", i+1, int(match.Offsets[0].DocumentID), match.Score)
}
```
BM25 vs Proximity Ranking:
| Feature | BM25 | Proximity |
| --- | --- | --- |
| Term Rarity | Yes (IDF) | No (all terms equal) |
| Length Normalization | Yes (built-in) | No |
| Term Frequency | Yes (with saturation) | No |
| Term Distance | No | Yes (main factor) |
| Use Case | General search | Finding close co-occurrences |
| Industry Standard | Yes (Elasticsearch, Solr) | No (custom algorithm) |
Encode
```go
func (idx *InvertedIndex) Encode() ([]byte, error)
```
Serializes the inverted index to binary format.
Example:
```go
data, err := idx.Encode()
if err != nil {
	log.Fatal(err)
}

// Save to file
err = os.WriteFile("index.bin", data, 0644)
```
Decode
```go
func (idx *InvertedIndex) Decode(data []byte) error
```
Deserializes binary data back into an inverted index.
Example:
```go
data, err := os.ReadFile("index.bin")
if err != nil {
	log.Fatal(err)
}

idx := blaze.NewInvertedIndex()
err = idx.Decode(data)
```
Text Analysis
Analyze
```go
func Analyze(text string) []string
```
Transforms raw text into searchable tokens using the default pipeline.
Example:
```go
tokens := blaze.Analyze("The Quick Brown Fox Jumps!")
// Returns: ["quick", "brown", "fox", "jump"]
```
AnalyzeWithConfig
```go
func AnalyzeWithConfig(text string, config AnalyzerConfig) []string
```
Transforms text using a custom configuration.
Example:
```go
config := blaze.AnalyzerConfig{
	MinTokenLength:  3,
	EnableStemming:  false,
	EnableStopwords: true,
}
tokens := blaze.AnalyzeWithConfig("The quick brown fox", config)
```
Position
Position Methods
```go
func (p *Position) GetDocumentID() int
func (p *Position) GetOffset() int
func (p *Position) IsBeginning() bool
func (p *Position) IsEnd() bool
func (p *Position) IsBefore(other Position) bool
func (p *Position) IsAfter(other Position) bool
func (p *Position) Equals(other Position) bool
```
Example:
```go
pos1 := blaze.Position{DocumentID: 1, Offset: 5}
pos2 := blaze.Position{DocumentID: 1, Offset: 10}

if pos1.IsBefore(pos2) {
	fmt.Println("pos1 comes before pos2")
}
```
Skip List
NewSkipList
```go
func NewSkipList() *SkipList
```
Creates a new empty skip list.
Insert
```go
func (sl *SkipList) Insert(key Position)
```
Adds or updates a position in the skip list. Average O(log n).
Find
```go
func (sl *SkipList) Find(key Position) (Position, error)
```
Searches for an exact position. Average O(log n).
Delete
```go
func (sl *SkipList) Delete(key Position) bool
```
Removes a position from the skip list. Average O(log n).
FindLessThan
```go
func (sl *SkipList) FindLessThan(key Position) (Position, error)
```
Finds the largest position less than the given position.
FindGreaterThan
```go
func (sl *SkipList) FindGreaterThan(key Position) (Position, error)
```
Finds the smallest position greater than the given position.
Examples
Example 1: Basic Document Search with BM25
```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	// Create index
	idx := blaze.NewInvertedIndex()

	// Index documents
	idx.Index(1, "Go is a programming language designed at Google")
	idx.Index(2, "Python is a high-level programming language")
	idx.Index(3, "Go is fast and efficient for system programming")

	// Search for "programming language" using BM25
	results := idx.RankBM25("programming language", 10)

	fmt.Println("Search results for 'programming language':")
	for i, match := range results {
		fmt.Printf("%d. Document %d (score: %.3f)\n", i+1, match.DocID, match.Score)
	}
}
```
Output:
```
Search results for 'programming language':
1. Document 1 (score: 4.521)
2. Document 2 (score: 4.521)
3. Document 3 (score: 2.156)
```
Note: BM25 scores are absolute values (not normalized to 0-1), reflecting relevance based on term frequency, document length, and term rarity.
Example 2: Phrase Search
```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	idx := blaze.NewInvertedIndex()
	idx.Index(1, "the quick brown fox jumps over the lazy dog")
	idx.Index(2, "a quick brown dog runs fast")
	idx.Index(3, "the lazy brown fox sleeps")

	// Find exact phrase "brown fox"
	matches := idx.FindAllPhrases("brown fox", blaze.BOFDocument)

	fmt.Println("Documents containing 'brown fox' as a phrase:")
	for _, match := range matches {
		docID := int(match[0].DocumentID)
		start := int(match[0].Offset)
		end := int(match[1].Offset)
		fmt.Printf("Document %d: positions %d-%d\n", docID, start, end)
	}
}
```
Output:
```
Documents containing 'brown fox' as a phrase:
Document 1: positions 1-2
Document 3: positions 2-3
```
Example 3: Iterating Through Positions
```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	idx := blaze.NewInvertedIndex()
	idx.Index(1, "quick test quick test quick")
	idx.Index(2, "another quick test here")

	// Find all occurrences of "quick"
	fmt.Println("All occurrences of 'quick':")
	pos := blaze.BOFDocument
	var err error
	for {
		pos, err = idx.Next("quick", pos)
		if err != nil || pos.IsEnd() {
			break
		}
		fmt.Printf("  Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))
	}
}
```
Output:
```
All occurrences of 'quick':
  Doc 1, Pos 0
  Doc 1, Pos 2
  Doc 1, Pos 4
  Doc 2, Pos 1
```
Example 4: Persistence with Serialization
```go
package main

import (
	"fmt"
	"os"

	"github.com/wizenheimer/blaze"
)

func main() {
	// Build and save index
	idx := blaze.NewInvertedIndex()
	idx.Index(1, "machine learning algorithms")
	idx.Index(2, "deep learning neural networks")
	idx.Index(3, "natural language processing")

	// Serialize to binary
	data, err := idx.Encode()
	if err != nil {
		panic(err)
	}

	// Save to file
	err = os.WriteFile("search_index.bin", data, 0644)
	if err != nil {
		panic(err)
	}
	fmt.Println("Index saved to search_index.bin")

	// Load index from file
	loadedData, err := os.ReadFile("search_index.bin")
	if err != nil {
		panic(err)
	}

	loadedIdx := blaze.NewInvertedIndex()
	err = loadedIdx.Decode(loadedData)
	if err != nil {
		panic(err)
	}

	// Use loaded index
	results := loadedIdx.RankProximity("learning", 5)
	fmt.Printf("Found %d documents\n", len(results))
}
```
Example 5: Custom Analyzer Configuration
```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	// Create custom analyzer config (no stemming, longer min length)
	config := blaze.AnalyzerConfig{
		MinTokenLength:  3,     // Minimum 3 characters
		EnableStemming:  false, // Keep original word forms
		EnableStopwords: true,  // Still remove stopwords
	}

	text := "The running dogs are running fast"

	// Compare default vs custom analysis
	defaultTokens := blaze.Analyze(text)
	customTokens := blaze.AnalyzeWithConfig(text, config)

	fmt.Println("Default tokens:", defaultTokens)
	fmt.Println("Custom tokens:", customTokens)
}
```
Output:
```
Default tokens: [run dog run fast]
Custom tokens: [running dogs running fast]
```
Example 6: Comparing BM25 and Proximity Ranking
```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	idx := blaze.NewInvertedIndex()

	// Index documents
	idx.Index(1, "machine learning algorithms")
	idx.Index(2, "machine learning machine learning")   // High term frequency
	idx.Index(3, "machine and algorithms and learning") // Terms far apart

	query := "machine learning"

	// BM25 Ranking
	fmt.Println("BM25 Rankings:")
	bm25Results := idx.RankBM25(query, 10)
	for i, match := range bm25Results {
		fmt.Printf("%d. Doc %d (score: %.3f)\n", i+1, match.DocID, match.Score)
	}

	// Proximity Ranking
	fmt.Println("\nProximity Rankings:")
	proxResults := idx.RankProximity(query, 10)
	for i, match := range proxResults {
		docID := int(match.Offsets[0].DocumentID)
		fmt.Printf("%d. Doc %d (score: %.3f)\n", i+1, docID, match.Score)
	}
}
```
Output:
```
BM25 Rankings:
1. Doc 2 (score: 5.234)   ← High term frequency
2. Doc 1 (score: 3.156)
3. Doc 3 (score: 2.891)

Proximity Rankings:
1. Doc 1 (score: 1.000)   ← Terms adjacent
2. Doc 2 (score: 1.000)
3. Doc 3 (score: 0.200)   ← Terms far apart
```
Key Differences:
- BM25 favors Doc 2 (repeated terms = high relevance)
- Proximity favors Doc 1 and Doc 2 equally (both have adjacent terms)
- Doc 3 ranks low in both (terms spread out)
Example 7: Building a Simple Search Engine
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"

	"github.com/wizenheimer/blaze"
)

func main() {
	// Create index
	idx := blaze.NewInvertedIndex()

	// Index some documents
	docs := map[int]string{
		1: "Go is an open source programming language that makes it easy to build simple, reliable, and efficient software",
		2: "Python is a programming language that lets you work quickly and integrate systems more effectively",
		3: "JavaScript is a programming language that conforms to the ECMAScript specification",
		4: "Rust is a multi-paradigm programming language focused on performance and safety",
		5: "Java is a class-based, object-oriented programming language designed for portability",
	}
	for id, doc := range docs {
		idx.Index(id, doc)
	}

	// Interactive search
	scanner := bufio.NewScanner(os.Stdin)
	for {
		fmt.Print("\nSearch query (or 'quit' to exit): ")
		if !scanner.Scan() {
			break
		}
		query := strings.TrimSpace(scanner.Text())
		if query == "quit" {
			break
		}
		if query == "" {
			continue
		}

		// Perform search using BM25
		results := idx.RankBM25(query, 5)
		if len(results) == 0 {
			fmt.Println("No results found")
			continue
		}

		// Display results
		fmt.Printf("\nFound %d result(s):\n", len(results))
		for i, match := range results {
			fmt.Printf("\n%d. Document %d (Score: %.3f)\n", i+1, match.DocID, match.Score)
			fmt.Printf("   %s\n", docs[match.DocID])
		}
	}
}
```
Performance Characteristics
Time Complexity
| Operation | Average | Worst Case | Notes |
| --- | --- | --- | --- |
| Index (per document) | O(n × log m) | O(n × m) | n = tokens, m = total positions |
| Term lookup | O(log m) | O(m) | m = positions for term |
| Phrase search | O(k × log m) | O(k × m) | k = phrase length |
| BM25 ranking | O(t × d) | O(t × d) | t = query terms, d = candidates |
| Proximity ranking | O(t × m) | O(t × m) | t = query terms |
| Skip list insert | O(log n) | O(n) | n = elements in list |
| Skip list search | O(log n) | O(n) | Worst case probabilistically rare |
Space Complexity
| Component | Space | Notes |
| --- | --- | --- |
| Inverted index | O(n) | n = total unique positions |
| Skip list nodes | O(n × log n) | Average 2 pointers per node |
| Analyzer | O(1) | In-place processing |
| Serialized index | O(n) | Compact binary format |
Benchmarks
Performance on Apple M2 (8 cores), Go 1.24:
```
BenchmarkIndex-8                   50000     35421 ns/op    18234 B/op    245 allocs/op
BenchmarkTermSearch-8             300000      4123 ns/op      128 B/op      3 allocs/op
BenchmarkPhraseSearch-8           100000     12456 ns/op      512 B/op     12 allocs/op
BenchmarkRankBM25-8                60000     24567 ns/op     1856 B/op     38 allocs/op
BenchmarkProximityRanking-8        50000     28934 ns/op     2048 B/op     45 allocs/op
BenchmarkCalculateIDF-8          5000000       234 ns/op       16 B/op      1 allocs/op
BenchmarkCalculateBM25Score-8    2000000       567 ns/op       64 B/op      2 allocs/op
BenchmarkSkipListInsert-8        3000000       413 ns/op      255 B/op      6 allocs/op
BenchmarkSkipListSearch-8        5000000       203 ns/op       23 B/op      1 allocs/op
BenchmarkAnalyze-8               1000000      1234 ns/op      512 B/op      8 allocs/op
BenchmarkEncode-8                  10000    156789 ns/op    65536 B/op    234 allocs/op
BenchmarkDecode-8                  15000    123456 ns/op    49152 B/op    189 allocs/op
```
Scalability
Index Size vs Performance:
| Documents | Terms | Index Time | Search Time | Memory |
| --- | --- | --- | --- | --- |
| 1K | 10K | 50ms | 0.5ms | 2 MB |
| 10K | 100K | 500ms | 1ms | 20 MB |
| 100K | 1M | 5s | 2ms | 200 MB |
| 1M | 10M | 50s | 5ms | 2 GB |
Notes:
- Search time remains relatively constant due to O(log n) operations
- Memory scales linearly with unique positions
- Serialization reduces storage by ~40% compared to in-memory size
Configuration
BM25 Parameters
Customize BM25 ranking behavior:
```go
type BM25Parameters struct {
	K1 float64 // Term frequency saturation (default: 1.5)
	B  float64 // Length normalization (default: 0.75)
}
```
Tuning BM25:
```go
idx := blaze.NewInvertedIndex()

// Adjust BM25 parameters before indexing
idx.BM25Params.K1 = 2.0 // Higher = less saturation (more weight to TF)
idx.BM25Params.B = 0.5  // Lower = less length penalty

// Now index and search
idx.Index(1, "document content")
results := idx.RankBM25("query", 10)
```
Parameter Effects:
| Parameter | Range | Effect | When to Adjust |
| --- | --- | --- | --- |
| K1 | 1.2 - 2.0 | Controls TF saturation | Higher for domains where term frequency matters more |
| B | 0 - 1 | Controls length penalty | Lower for domains with naturally longer docs |
Examples:
```go
// Academic papers (long documents, repeated terms important)
idx.BM25Params.K1 = 2.0
idx.BM25Params.B = 0.5

// Short messages (length less important)
idx.BM25Params.K1 = 1.2
idx.BM25Params.B = 0.3

// Default (works well for most cases)
idx.BM25Params.K1 = 1.5
idx.BM25Params.B = 0.75
```
BM25 Statistics:
During indexing, Blaze automatically tracks:
```go
type DocumentStats struct {
	DocID     int            // Document identifier
	Length    int            // Number of terms
	TermFreqs map[string]int // Term frequencies
}

// Corpus-level statistics
idx.TotalDocs  // Total documents indexed
idx.TotalTerms // Total terms across all documents
idx.DocStats   // Per-document statistics
```
These statistics are:
- Automatically computed during indexing
- Serialized with the index
- Used for BM25 score calculation
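For example, the average document length that feeds the BM25 denominator falls straight out of the corpus-level fields (a sketch):

```go
// avgdl = total terms / total documents, the input to the |D|/avgdl term
avgdl := float64(idx.TotalTerms) / float64(idx.TotalDocs)
```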
Analyzer Configuration
Customize the text analysis pipeline:
```go
type AnalyzerConfig struct {
	MinTokenLength  int  // Minimum token length (default: 2)
	EnableStemming  bool // Apply stemming (default: true)
	EnableStopwords bool // Remove stopwords (default: true)
}
```
Configuration Examples:
```go
// Exact matching (no stemming, keep all words)
exactConfig := blaze.AnalyzerConfig{
	MinTokenLength:  1,
	EnableStemming:  false,
	EnableStopwords: false,
}

// Fuzzy matching (aggressive stemming)
fuzzyConfig := blaze.AnalyzerConfig{
	MinTokenLength:  2,
	EnableStemming:  true,
	EnableStopwords: true,
}

// Code search (no stemming, no stopwords, longer tokens)
codeConfig := blaze.AnalyzerConfig{
	MinTokenLength:  3,
	EnableStemming:  false,
	EnableStopwords: false,
}
```
Tuning Recommendations
MinTokenLength:

- 1: Very permissive, large index
- 2: Balanced (default), filters single chars
- 3: Strict, smaller index, misses short words

EnableStemming:

- true: Better recall, finds related words ("run" matches "running")
- false: Exact matching, preserves original word forms

EnableStopwords:

- true: Smaller index, faster search, standard behavior
- false: Complete indexing, useful for phrase search
Skip List Parameters
```go
const MaxHeight = 32 // Maximum tower height
```
Tower Height Probability:
- Height 1: 50%
- Height 2: 25%
- Height 3: 12.5%
- Height 4: 6.25%
This geometric distribution ensures O(log n) average performance.
Use Cases
1. Document Search Systems
Build a search engine for documents:
```go
type Document struct {
	ID      int
	Title   string
	Content string
}

func IndexDocuments(docs []Document) *blaze.InvertedIndex {
	idx := blaze.NewInvertedIndex()
	for _, doc := range docs {
		// Combine title and content
		text := doc.Title + " " + doc.Content
		idx.Index(doc.ID, text)
	}
	return idx
}

func SearchDocuments(idx *blaze.InvertedIndex, query string) []int {
	// Use BM25 for general relevance ranking (recommended)
	matches := idx.RankBM25(query, 20)
	docIDs := make([]int, len(matches))
	for i, match := range matches {
		docIDs[i] = match.DocID
	}
	return docIDs
}

// Alternative: Use proximity ranking to find documents with close term matches
func SearchDocumentsByProximity(idx *blaze.InvertedIndex, query string) []int {
	matches := idx.RankProximity(query, 20)
	docIDs := make([]int, len(matches))
	for i, match := range matches {
		docIDs[i] = int(match.Offsets[0].DocumentID)
	}
	return docIDs
}
```
2. Log Analysis
Search through log files:
```go
func IndexLogs(logFile string) (*blaze.InvertedIndex, error) {
	idx := blaze.NewInvertedIndex()

	file, err := os.Open(logFile)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	lineNumber := 1
	for scanner.Scan() {
		idx.Index(lineNumber, scanner.Text())
		lineNumber++
	}
	return idx, scanner.Err()
}

// Find all ERROR log lines using BM25 (considers frequency and rarity)
errorLogs := idx.RankBM25("ERROR", 100)

// Alternative: Use proximity for finding error patterns,
// e.g., "connection timeout" appearing close together
patternMatches := idx.RankProximity("connection timeout", 50)
```
3. Code Search
Search through source code:
```go
func IndexCodebase(rootDir string) (*blaze.InvertedIndex, error) {
	idx := blaze.NewInvertedIndex()
	fileID := 1

	// Custom config for code (no stemming, keep all words)
	config := blaze.AnalyzerConfig{
		MinTokenLength:  2,
		EnableStemming:  false,
		EnableStopwords: false,
	}

	err := filepath.Walk(rootDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		// Only index Go files
		if !strings.HasSuffix(path, ".go") {
			return nil
		}
		content, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		// Use custom analyzer
		tokens := blaze.AnalyzeWithConfig(string(content), config)
		_ = tokens // ... index tokens ...
		fileID++
		return nil
	})
	return idx, err
}

// BM25: Find files with frequent mentions of a function/variable
bm25Results := idx.RankBM25("http.Handler", 20)

// Proximity: Find exact API patterns (e.g., function calls with parameters).
// Better for finding "http.HandleFunc" as a specific pattern
proximityResults := idx.RankProximity("http HandleFunc", 20)
```
4. E-commerce Product Search
Search product catalog:
type Product struct {
    ID          int
    Name        string
    Description string
    Category    string
    Tags        []string
}

func IndexProducts(products []Product) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()
    for _, product := range products {
        // Combine all searchable fields
        searchText := fmt.Sprintf("%s %s %s %s",
            product.Name,
            product.Description,
            product.Category,
            strings.Join(product.Tags, " "))
        idx.Index(product.ID, searchText)
    }
    return idx
}

// BM25: best for general product search (considers all factors)
results := idx.RankBM25("wireless headphones", 10)

// Proximity: good for finding exact product name matches
// (e.g., "Sony WH-1000XM4" as an exact phrase)
exactMatches := idx.RankProximity("wireless headphones", 10)
5. Email Search
Index and search email messages:
type Email struct {
    ID      int
    From    string
    Subject string
    Body    string
}

func IndexEmails(emails []Email) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()
    for _, email := range emails {
        searchText := fmt.Sprintf("%s %s %s", email.From, email.Subject, email.Body)
        idx.Index(email.ID, searchText)
    }
    return idx
}

// BM25: find emails where terms appear frequently (general search)
matches := idx.RankBM25("project deadline", 50)

// Proximity: find emails where "project" and "deadline" appear close together
// (more precise, better for finding specific mentions)
closeMatches := idx.RankProximity("project deadline", 50)
Testing
Running Tests
# Run all tests
make test

# Run tests with coverage
make test-coverage

# Run benchmarks
make bench

# Run all checks (format, vet, lint, test)
make check
Test Coverage
The library has comprehensive test coverage:
$ make test-coverage
Running tests...
ok      github.com/wizenheimer/blaze    2.456s  coverage: 98.5% of statements
Generating coverage report...
Coverage report: coverage.html
Coverage by Component:
Inverted Index: 100%
Skip Lists: 100%
Text Analysis: 100%
Search Operations: 100%
BM25 Ranking: 100%
Serialization: 100%
Writing Tests
Example test:
func TestSearchFunctionality(t *testing.T) {
    idx := blaze.NewInvertedIndex()

    // Index test documents
    idx.Index(1, "the quick brown fox")
    idx.Index(2, "the lazy brown dog")

    // Test phrase search
    matches := idx.FindAllPhrases("brown fox", blaze.BOFDocument)
    if len(matches) != 1 {
        t.Errorf("Expected 1 match, got %d", len(matches))
    }
    if int(matches[0][0].DocumentID) != 1 {
        t.Errorf("Expected document 1, got %d", int(matches[0][0].DocumentID))
    }
}
Architecture
Component Overview
blaze/
├── index.go          # Inverted index implementation with hybrid storage
├── query.go          # Query builder with roaring bitmaps
├── skiplist.go       # Skip list data structure for positions
├── search.go         # Search algorithms (phrase, proximity, BM25)
├── analyzer.go       # Text analysis pipeline
├── serialization.go  # Binary encoding/decoding (skip lists + bitmaps)
├── *_test.go         # Comprehensive test suite
├── Makefile          # Development commands
└── public/           # Documentation website
    └── index.html
Query Processor Architecture
The query processor uses a hybrid storage approach combining roaring bitmaps for document-level operations and skip lists for position-level operations.
User query: "machine AND learning"
    │
    ▼
Text Analyzer (tokenize, stem, etc.)
    │   ["machine", "learning"]
    ▼
Query Builder (constructs query tree)
    │   Query tree: AND(machine, learning)
    ▼
┌─ Bitmap Ops   (fast AND/OR)
├─ Skip List    (positions)
└─ BM25 Scorer  (ranking)
    │
    ▼
Results (ranked docs)
Hybrid Storage Architecture
Blaze uses a sophisticated hybrid storage model:
For each term, e.g. "machine":

Document level (roaring bitmap):
    DocBitmaps["machine"] = {1, 2, 4, 5, 100, 500, 1000, ...}
    Compressed representation of ALL documents containing "machine".
    Use:  fast boolean operations (AND, OR, NOT)
    Size: ~60 KB for 500k documents (400x compression!)
        │
        │ links to
        ▼
Position level (skip list):
    PostingsList["machine"] = SkipList:
    Level 2: [Doc1:Pos5] ────────────────────────────────> [Doc100:Pos12]
    Level 1: [Doc1:Pos5] ──> [Doc2:Pos3] ────────────────> [Doc100:Pos12]
    Level 0: [Doc1:Pos5] -> [Doc2:Pos3] -> [Doc4:Pos1] -> [Doc5:Pos7]
             -> [Doc100:Pos12] -> [Doc500:Pos2] -> ...
    Detailed position information for EVERY occurrence.
    Use:  phrase search, proximity ranking, snippets
    Size: ~24 MB for 500k positions

Why hybrid?
1. Bitmaps: lightning-fast document filtering (AND, OR, NOT in microseconds)
2. Skip lists: precise position tracking for phrases and proximity
3. Best of both worlds: speed + precision
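The split is easy to mimic in miniature. The sketch below assumes nothing about Blaze's internals beyond the two-level idea: it pairs a roaring bitmap (document filter) with a plain positions map standing in for the skip lists, using the github.com/RoaringBitmap/roaring package.

package main

import (
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Document level: which docs contain each term (compressed bitmaps)
    docBitmaps := map[string]*roaring.Bitmap{
        "machine":  roaring.BitmapOf(1, 2, 4, 5, 100),
        "learning": roaring.BitmapOf(1, 2, 5, 100, 500),
    }

    // Position level: where each term occurs inside each doc
    // (a plain map here, standing in for Blaze's skip lists)
    positions := map[string]map[uint32][]int{
        "machine":  {1: {5}, 2: {3}, 4: {1}, 5: {7}, 100: {12}},
        "learning": {1: {6}, 2: {9}, 5: {8}, 100: {13}, 500: {2}},
    }

    // Phase 1: bitmap AND narrows the search to candidate documents
    candidates := roaring.And(docBitmaps["machine"], docBitmaps["learning"])

    // Phase 2: consult position data only for the survivors
    it := candidates.Iterator()
    for it.HasNext() {
        doc := it.Next()
        fmt.Printf("doc %d: machine@%v learning@%v\n",
            doc, positions["machine"][doc], positions["learning"][doc])
    }
}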
Query Execution Flow
Here's how a complex query executes step-by-step:
QUERY: (machine OR deep) AND learning AND NOT neural

STEP 1: BITMAP PHASE (fast document filtering)

Term lookups (O(1) hash map):
    DocBitmaps["machine"]  = {1, 2, 4, 5, 7, 8, 9, 10}
    DocBitmaps["deep"]     = {2, 3, 5, 6, 8, 9}
    DocBitmaps["learning"] = {1, 2, 4, 5, 6, 7, 8, 9, 10}
    DocBitmaps["neural"]   = {3, 6, 8, 9}

Boolean operations (O(1) per chunk):
    Step 1: machine OR deep
        {1, 2, 4, 5, 7, 8, 9, 10} ∪ {2, 3, 5, 6, 8, 9}
        = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
    Step 2: (machine OR deep) AND learning
        {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ∩ {1, 2, 4, 5, 6, 7, 8, 9, 10}
        = {1, 2, 4, 5, 6, 7, 8, 9, 10}
    Step 3: result AND NOT neural
        {1, 2, 4, 5, 6, 7, 8, 9, 10} \ {3, 6, 8, 9}
        = {1, 2, 4, 5, 7, 10}   ← candidate documents

Time: ~10 microseconds for 1M documents!

STEP 2: POSITION PHASE (optional, for phrases/proximity)

If phrase search is needed, for each candidate doc {1, 2, 4, 5, 7, 10}:
    - Use skip lists to verify exact positions
    - Check consecutive positions for phrases
    - Extract position data for snippets

Time: O(log n) per position lookup

STEP 3: RANKING PHASE (BM25 scoring)

For each candidate document:
1. Calculate IDF (inverse document frequency):
    - Uses bitmap cardinality for instant document counts
    - IDF("machine") = log((N - df + 0.5) / (df + 0.5))
    - df = DocBitmaps["machine"].GetCardinality()
2. Calculate TF (term frequency):
    - Retrieved from pre-computed DocStats
    - TF("machine", Doc1) = termFreqs["machine"]
3. Apply the BM25 formula:
    - Combines IDF, TF, and length normalization
    - Score = IDF × (TF × (k1 + 1)) / (TF + k1 × length_norm)
4. Sum scores for all query terms

Results sorted by score:
    Doc 5: 8.45
    Doc 2: 7.23
    Doc 1: 6.91
    ...

Time: O(candidates × terms)
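To make the ranking phase concrete, here is a minimal, self-contained sketch of the per-term BM25 score described above. It uses k1 = 1.5 and b = 0.75 (the defaults listed in the memory layout below) and expands length_norm to the standard BM25 form 1 - b + b × docLen/avgDocLen; this is an illustration, not Blaze's actual scorer.

package main

import (
    "fmt"
    "math"
)

// bm25Term scores one term in one document, following the formula above.
// n = total docs, df = docs containing the term, tf = occurrences in this doc.
func bm25Term(tf, df, n int, docLen, avgDocLen, k1, b float64) float64 {
    idf := math.Log((float64(n) - float64(df) + 0.5) / (float64(df) + 0.5))
    lengthNorm := 1 - b + b*docLen/avgDocLen
    return idf * (float64(tf) * (k1 + 1)) / (float64(tf) + k1*lengthNorm)
}

func main() {
    // "machine" appears 3 times in a 150-word doc of average length;
    // 1,000 of 1,000,000 documents contain the term.
    score := bm25Term(3, 1000, 1_000_000, 150, 150, 1.5, 0.75)
    fmt.Printf("BM25 contribution: %.2f\n", score)
}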
Data Structure Memory Layout
InvertedIndex {

    DocBitmaps: map[string]*roaring.Bitmap
        "machine"  → [compressed bitmap: 512 bytes]
        "learning" → [compressed bitmap: 448 bytes]
        "deep"     → [compressed bitmap: 256 bytes]
        ...
        Memory: ~100 bytes per term (compressed)

    PostingsList: map[string]SkipList          (parallel storage)
        "machine"  → skip list with 10,000 position nodes
        "learning" → skip list with 8,000 position nodes
        "deep"     → skip list with 5,000 position nodes
        ...
        Memory: ~48 bytes per position (node overhead)

    DocStats: map[int]DocumentStats            (statistics)
        Doc1 → {Length: 150, TermFreqs: {"machine": 3, ...}}
        Doc2 → {Length: 200, TermFreqs: {"learning": 5, ...}}
        ...
        Memory: ~16 bytes per term per document

    Global statistics                          (metadata)
        TotalDocs:  1,000,000
        TotalTerms: 150,000,000
        AvgDocLen:  150.0
        BM25Params: {K1: 1.5, B: 0.75}

    Mutex for thread safety (sync.RWMutex)
}

MEMORY BREAKDOWN (for 1M documents, 10M unique positions):
    DocBitmaps:    ~10 MB    (compressed bitmaps)
    PostingsList:  ~480 MB   (skip list nodes)
    DocStats:      ~500 MB   (per-doc statistics)
    Overhead:      ~10 MB    (maps, pointers, etc.)
    TOTAL:         ~1 GB
Roaring Bitmap Internals
Document IDs: {1, 2, 3, 100, 101, 102, 500000, 500001, 999999}

Traditional bitmap (naive):
    [1,1,1,0,0...0,1,1,1,0...0,1,1,0...0,1]
    Size: 1,000,000 bits = 125 KB (wasteful for sparse data)

Roaring bitmap (smart):
    Split into 65,536-wide chunks (high 16 bits = chunk ID):
    Chunk 0  (docs 0-65535):        [1, 2, 3, 100, 101, 102]
    Chunk 7  (docs 458752-524287):  [500000, 500001]
    Chunk 15 (docs 983040-1048575): [999999]

Storage per chunk (adaptive):
    If cardinality < 4096:
        → Array container: sorted uint16 values stored directly
        → Size: 2 bytes × cardinality
    If cardinality > 4096:
        → Bitmap container: 65536-bit bitmap
        → Size: 8 KB fixed
    If cardinality = 65536 (all docs):
        → Run container: store [0-65535]
        → Size: 4 bytes

Total size: ~60 bytes (vs 125 KB!)

Operations:
    AND: container-by-container intersection;
         skip non-matching chunks (O(1)), intersect matching chunks (O(min(n,m)))
    OR:  container-by-container union; merge all chunks (O(n+m))
    NOT: complement within the document space; flip all bits in each chunk
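The adaptive container behavior is observable directly, presumably via the github.com/RoaringBitmap/roaring package that the *roaring.Bitmap type above comes from. A small sketch (the exact byte counts will differ from the illustrative figures above):

package main

import (
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Sparse set → array containers (a few bytes per value)
    sparse := roaring.BitmapOf(1, 2, 3, 100, 101, 102, 500000, 500001, 999999)

    // Dense run → run container after optimization (near-constant size)
    dense := roaring.New()
    dense.AddRange(0, 65536) // every doc in chunk 0
    dense.RunOptimize()

    fmt.Println("sparse bytes:", sparse.GetSerializedSizeInBytes())
    fmt.Println("dense bytes: ", dense.GetSerializedSizeInBytes())
    fmt.Println("dense cardinality:", dense.GetCardinality())
}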
Query Builder Execution Model
Query:
    NewQueryBuilder(idx).
        Group(func(q) { q.Term("machine").Or().Term("deep") }).
        And().
        Term("learning").
        Execute()

Internal representation:
    QueryBuilder {
        stack: []*roaring.Bitmap  // operand stack
        ops:   []QueryOp          // operator stack
        terms: []string           // tracked for BM25
    }

Execution trace:
    Step 1: Group(func(q) { ... })
        Create sub-builder, execute sub-query,
        push result bitmap to parent stack.

        Sub-query execution:
        1.1: Term("machine") → lookup DocBitmaps["machine"]
             → push {1,2,4,5,7,8,9,10}
        1.2: Or() → push operator OR
        1.3: Term("deep") → lookup DocBitmaps["deep"]
             → push {2,3,5,6,8,9}
        1.4: Apply OR
             → pop {2,3,5,6,8,9}, pop {1,2,4,5,7,8,9,10}
             → union {1,2,3,4,5,6,7,8,9,10}, push result

        Result: {1,2,3,4,5,6,7,8,9,10}

    Step 2: And() → push operator AND
    Step 3: Term("learning") → lookup DocBitmaps["learning"]
            → push {1,2,4,5,6,7,8,9,10}
    Step 4: Execute()
            → pop {1,2,4,5,6,7,8,9,10}, pop {1,2,3,4,5,6,7,8,9,10}
            → intersect {1,2,4,5,6,7,8,9,10}
            → return final bitmap

Operation costs:
    Bitmap lookup:      O(1)         ~100 ns
    Bitmap union:       O(n+m)       ~1 µs for 10k docs
    Bitmap intersect:   O(min(n,m))  ~800 ns for 10k docs
    Bitmap difference:  O(n)         ~900 ns for 10k docs

    Total query time: ~10 µs for a typical query!
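Put together as real code, the traced query might look like the following. The callback parameter type and the Execute return type are assumptions inferred from the trace above, not a verified API listing:

// Sketch of the traced query; types inferred from the execution trace.
func findMachineOrDeepLearning(idx *blaze.InvertedIndex) *roaring.Bitmap {
    return blaze.NewQueryBuilder(idx).
        Group(func(q *blaze.QueryBuilder) {
            q.Term("machine").Or().Term("deep")
        }).
        And().
        Term("learning").
        Execute() // → bitmap of candidate documents, e.g. {1,2,4,5,6,7,8,9,10}
}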
Data Flow
User input: "The Quick Brown Fox!"
    │
    ▼
Text analysis pipeline:
    1. Tokenization        → ["The", "Quick", "Brown", "Fox"]
    2. Lowercasing         → ["the", "quick", "brown", "fox"]
    3. Stopword filtering  → ["quick", "brown", "fox"]
    4. Length filtering    → ["quick", "brown", "fox"]
    5. Stemming            → ["quick", "brown", "fox"]
    │
    ▼
Inverted index (indexing):
    "quick" → SkipList → [Doc1:Pos0]
    "brown" → SkipList → [Doc1:Pos1]
    "fox"   → SkipList → [Doc1:Pos2]
    │
    ▼
Search operations (term search, phrase search)
    │
    ▼
Proximity ranking
    │
    ▼
Ranked results:
    Doc 1: Score 1.0
    Doc 2: Score 0.5
    Doc 3: Score 0.3
Key Design Decisions
1. Skip Lists over Balanced Trees
Rationale:
Simpler implementation (no rotation logic)
Better cache locality
Easier to make concurrent
Comparable performance (O(log n))
Used in production systems (Redis, LevelDB)
2. Position-Based Indexing
Instead of just tracking document IDs, Blaze tracks exact word positions:
Traditional index (document IDs only):
    "quick" → [Doc1, Doc3]
    → Cannot do phrase search, cannot rank by proximity.

Position-based index (document + offset):
    "quick" → [Doc1:Pos1, Doc3:Pos0]    → enables phrase search
    "brown" → [Doc1:Pos2, Doc3:Pos2]    → enables proximity ranking
    "fox"   → [Doc1:Pos3]               → enables snippet generation
    → Enables precise results.

Can verify: "quick brown" is a phrase in Doc1 (Pos1→Pos2)
but NOT in Doc3 (Pos0 and Pos2 are not consecutive).
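A minimal, self-contained sketch of the consecutive-position check this buys, using plain position slices rather than Blaze's skip lists:

package main

import "fmt"

// hasPhrase reports whether some position p of the first word is
// immediately followed (at p+1) by a position of the second word.
func hasPhrase(first, second []int) bool {
    next := make(map[int]bool, len(second))
    for _, p := range second {
        next[p] = true
    }
    for _, p := range first {
        if next[p+1] {
            return true
        }
    }
    return false
}

func main() {
    // Doc1: quick@1, brown@2 → consecutive → phrase match
    fmt.Println(hasPhrase([]int{1}, []int{2})) // true
    // Doc3: quick@0, brown@2 → gap → no phrase match
    fmt.Println(hasPhrase([]int{0}, []int{2})) // false
}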
Benefits:
Enables phrase search (check consecutive positions)
Enables proximity ranking (measure distances)
Enables snippet generation (extract relevant parts)
More precise search results
Trade-offs:
Larger index size (~2-3x more data)
More complex algorithms (but still O(log n))
3. Binary Serialization
Custom binary format instead of JSON:
Advantages:
60% smaller file size
3x faster parsing
Preserves skip list structure
Suitable for large indexes
4. Configurable Text Analysis
Pluggable analyzer configuration:
Benefits:
Adapt to different use cases
Trade-off precision vs recall
Support multiple languages (future)
Domain-specific customization
Best Practices
1. Choose Appropriate Document IDs
Use stable, unique identifiers:
// Good: use database primary keys
idx.Index(dbRecord.ID, dbRecord.Content)

// Bad: use array indices (they change when documents are reordered)
for i, doc := range docs {
    idx.Index(i, doc.Content) // Don't do this
}
2. Batch Indexing for Large Datasets
func IndexLargeDataset(docs []Document) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()

    // Process in batches
    batchSize := 1000
    for i := 0; i < len(docs); i += batchSize {
        end := min(i+batchSize, len(docs))
        batch := docs[i:end]
        for _, doc := range batch {
            idx.Index(doc.ID, doc.Content)
        }
        // Optional: periodic serialization for checkpoints
        // (skip i == 0 so we don't checkpoint an empty index)
        if i > 0 && i%10000 == 0 {
            data, _ := idx.Encode()
            os.WriteFile(fmt.Sprintf("checkpoint_%d.bin", i), data, 0644)
        }
    }
    return idx
}
3. Use Appropriate Analyzer Config
Match configuration to your use case:
// Natural language text (books, articles)
naturalLanguageConfig := blaze.AnalyzerConfig{
    MinTokenLength:  2,
    EnableStemming:  true, // Find related words
    EnableStopwords: true, // Remove common words
}

// Technical documentation (code, APIs)
technicalConfig := blaze.AnalyzerConfig{
    MinTokenLength:  2,
    EnableStemming:  false, // Keep exact terms
    EnableStopwords: false, // Keep all words
}

// Product names (e-commerce)
productConfig := blaze.AnalyzerConfig{
    MinTokenLength:  1,     // Include single chars (e.g., "X")
    EnableStemming:  false, // Exact product names
    EnableStopwords: false, // Keep all words
}
4. Persist Index for Large Datasets
Don't rebuild the index every time:
const indexFile = "search_index.bin"

func LoadOrBuildIndex(docs []Document) (*blaze.InvertedIndex, error) {
    // Try to load an existing index
    if data, err := os.ReadFile(indexFile); err == nil {
        idx := blaze.NewInvertedIndex()
        if err := idx.Decode(data); err == nil {
            return idx, nil
        }
    }

    // Build a new index
    idx := blaze.NewInvertedIndex()
    for _, doc := range docs {
        idx.Index(doc.ID, doc.Content)
    }

    // Save for next time
    if data, err := idx.Encode(); err == nil {
        os.WriteFile(indexFile, data, 0644)
    }
    return idx, nil
}
5. Handle Concurrent Access
The Index method is already thread-safe; for read-heavy workloads, a service-level sync.RWMutex lets many searches proceed concurrently while writes stay exclusive:
type SearchService struct {
    idx *blaze.InvertedIndex
    mu  sync.RWMutex
}

func (s *SearchService) Index(docID int, content string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.idx.Index(docID, content)
}

func (s *SearchService) Search(query string) []blaze.Match {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.idx.RankProximity(query, 20)
}
6. Monitor Index Size
Track index growth:
// A free function is used here because methods cannot be added to
// blaze.InvertedIndex from outside its package.
func IndexStats(idx *blaze.InvertedIndex) map[string]interface{} {
    stats := make(map[string]interface{})
    stats["unique_terms"] = len(idx.PostingsList)

    totalPositions := 0
    for _, skipList := range idx.PostingsList {
        // Count positions in this skip list
        iter := skipList.Iterator()
        for iter.HasNext() {
            iter.Next()
            totalPositions++
        }
    }
    stats["total_positions"] = totalPositions
    stats["avg_positions_per_term"] = float64(totalPositions) / float64(len(idx.PostingsList))
    return stats
}
7. Choose the Right Ranking Algorithm
Use BM25 when:
You need industry-standard relevance ranking
Term frequency matters (documents with more occurrences rank higher)
You want automatic length normalization
Rare terms should be weighted more heavily
Recommended for most use cases
Use Proximity when:
You want to find terms close together
Term distance is more important than frequency
You're searching for specific co-occurrences
You need snippet generation (using position data)
Practical Examples:
// E-commerce: general product search;
// BM25 considers term frequency and rarity
bm25Results := idx.RankBM25("wireless bluetooth headphones", 20)
// Returns products with any/all terms, ranked by relevance

// E-commerce: exact product name;
// proximity finds terms appearing together
proxResults := idx.RankProximity("Sony WH-1000XM4", 20)
// Returns products where terms appear close together

// Document search: research papers;
// BM25 for broad topic search
papers := idx.RankBM25("neural networks deep learning", 50)

// Document search: finding specific phrase mentions;
// proximity for finding "neural networks" as a concept
mentions := idx.RankProximity("neural networks", 50)

// Best practice: use both for different purposes!
generalResults := idx.RankBM25(query, 100)     // Cast a wide net
preciseResults := idx.RankProximity(query, 20) // Refine results
8. Limit Result Set Size
Always specify a reasonable maximum number of results:
// Good: limit results
results := idx.RankBM25("search query", 100)

// Bad: could return millions of results
results := idx.RankBM25("search query", math.MaxInt32)
9. Pre-process Queries
Normalize queries before searching:
// Compile the pattern once, not on every call
var nonWord = regexp.MustCompile(`[^\w\s]`)

func NormalizeQuery(query string) string {
    // Collapse extra whitespace
    query = strings.TrimSpace(query)
    query = strings.Join(strings.Fields(query), " ")

    // Convert to lowercase
    query = strings.ToLower(query)

    // Remove special characters (optional)
    query = nonWord.ReplaceAllString(query, "")

    return query
}

// Use the normalized query
normalizedQuery := NormalizeQuery(userInput)
results := idx.RankBM25(normalizedQuery, 20)
10. Monitor BM25 Statistics
Track corpus statistics for insights:
// After indexing
fmt.Printf("Total documents: %d\n", idx.TotalDocs)
fmt.Printf("Total terms: %d\n", idx.TotalTerms)
fmt.Printf("Average doc length: %.2f\n", float64(idx.TotalTerms)/float64(idx.TotalDocs))

// Per-document analysis
for docID, stats := range idx.DocStats {
    fmt.Printf("Doc %d: %d terms\n", docID, stats.Length)
    // Find most frequent terms
    for term, freq := range stats.TermFreqs {
        if freq > 5 {
            fmt.Printf("  %s: %d occurrences\n", term, freq)
        }
    }
}
Contributing
Contributions are welcome! Please follow these guidelines:
Development Setup
# Clone repository
git clone https://github.com/wizenheimer/blaze.git
cd blaze

# Install dependencies
make deps

# Run tests
make test

# Run linter
make lint
Code Style
Follow Go conventions (gofmt, golint)
Write comprehensive comments
Include examples in documentation
Add tests for new features
Keep functions focused and small
Commit Messages
Use descriptive commit messages:
Good:
- "feat: Add proximity ranking algorithm"
- "feat: Handle empty query in RankProximity"
- "fix: Reduce allocations in skip list insert"

Bad:
- "Update code"
- "Fix bug uwu"
- "WIP"
Pull Request Process
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Make your changes
4. Add tests
5. Run make check to verify
6. Commit your changes
7. Push to your fork
8. Open a Pull Request
License
MIT License
Acknowledgments