The first version of my search engine was slower after I added an index.
That sounds backwards, but it is a real failure mode. A bad index can turn a simple sequential scan into a pile of cache misses, hash lookups, tiny heap allocations, branchy score calculations, and random memory walks. The CPU stops doing search and starts waiting for memory.
The target I wanted was intentionally unreasonable: a local search engine for a few hundred thousand short technical records that could answer ranked queries inside an interactive UI budget, while keeping the hot path small enough to stay friendly to an L3 cache.
Not “web search”. Not “vector database”. Not “throw Elasticsearch at it”.
Something closer to this:
$ tinysearch "tls timeout shard migration"
score doc title
12.91 inc_042 TLS handshakes stall during shard move
10.44 note_118 Connection pool timeout under deploy load
8.02 pr_077 Retry budget for migration workers
The interesting part is not that this can be built. The interesting part is where the naive version falls apart.
The naive implementation Link to heading
The first implementation usually looks like this:
fn search(query: &str, docs: &[Doc]) -> Vec<ScoredDoc> {
let query_terms = tokenize(query);
let mut hits = Vec::new();
for doc in docs {
let text = doc.search_text.to_lowercase();
let mut score = 0.0;
for term in &query_terms {
if text.contains(term) {
score += 1.0;
}
}
if score > 0.0 {
hits.push(ScoredDoc { id: doc.id, score });
}
}
hits.sort_by(|a, b| b.score.total_cmp(&a.score));
hits
}
This is a great prototype. It is also a trap.
It fails in four different ways:
| Failure | Why it hurts |
|---|---|
| Every query scans every document | Latency grows with corpus size, not result set size |
contains is not retrieval | It has no document frequency, no term saturation, no field weighting |
| Strings dominate runtime | Lowercasing, allocation, and UTF-8 scans repeat per query |
| Ranking is fake | A document with one rare exact term can lose to a long noisy document |
The bad version gets worse when you try to fix it casually. A HashMap<String, Vec<DocId>> inverted index is faster than scanning, but still allocates heavily and jumps around memory. A fuzzy layer adds more candidate terms, which adds more postings lists, which adds more random reads. A reranker improves quality, but if it receives 5,000 bad candidates, it just makes the slow path more expensive.
The lesson is uncomfortable: search quality and memory layout are the same problem earlier than you expect.
The better shape Link to heading
The design I like for this size of engine has five layers:
flowchart LR
Q["query text"] --> N["normalize + tokenize"]
N --> D["FST term dictionary"]
D --> P["compact postings"]
P --> B["BM25 top-k"]
B --> R["feature rerank"]
R --> F["rank fusion + explanation"]
The split matters because every stage has a different job:
| Stage | Job | Output |
|---|---|---|
| Normalizer | Make equivalent strings comparable | query terms |
| FST dictionary | Map term bytes to term ids and support prefix/fuzzy lookup | candidate terms |
| Postings store | Return doc ids and term frequencies cheaply | candidate docs |
| BM25 scorer | Do fast sparse ranking | top-k docs |
| Reranker | Apply slower contextual signals to a small set | final docs |
| RRF/fusion | Blend independent rankers without trusting raw score scales | final order |
The rule is simple: broad operations must be cheap, and expensive operations must be narrow.
Build the dictionary first Link to heading
A finite-state transducer is a compact way to represent a sorted set or map of strings. In Rust, the fst crate is a good mental model: it can store ordered byte keys and associate each key with a value. For a search engine, the value is usually a term id.
The dictionary answers:
- Does this term exist?
- What is its internal term id?
- Which terms match this prefix?
- Which terms are close enough for typo tolerance?
The important property is that the dictionary is immutable and compact. That makes it boring in the best way: build it once, memory-map it, and keep query-time work predictable.
use fst::Map;
struct TermDict {
map: Map<Vec<u8>>,
}
impl TermDict {
fn term_id(&self, term: &str) -> Option<u32> {
self.map.get(term).map(|id| id as u32)
}
}
The input has to be sorted:
let mut terms: Vec<(String, u64)> = vocabulary
.into_iter()
.enumerate()
.map(|(id, term)| (term, id as u64))
.collect();
terms.sort_by(|a, b| a.0.cmp(&b.0));
let mut builder = fst::MapBuilder::memory();
for (term, id) in terms {
builder.insert(term, id)?;
}
let bytes = builder.into_inner()?;
This is not the whole index. It is the front door. The dictionary should be tiny compared to the postings.
Postings are where performance is won Link to heading
An inverted index is just:
term -> [(doc_id, term_frequency), ...]
The layout decides whether queries are fast.
The naive representation is a map of vectors:
HashMap<TermId, Vec<Posting>>
That is fine while building. It is not ideal for query-time. The query path wants contiguous memory:
postings.bin
[doc_delta, tf, doc_delta, tf, doc_delta, tf, ...]
terms.meta
term_id -> { offset, len, doc_freq }
Instead of storing absolute doc ids, store deltas:
docs: 101, 108, 130, 131
deltas: 101, 7, 22, 1
Then encode small integers with a variable-length format. For small corpora, even a plain u32 array can be faster because decoding is branchless and simple. Compression is not automatically a win. The question is whether you are limited by memory bandwidth, cache capacity, or integer decode overhead.
For the fake corpus I used while tuning, the layouts looked like this:
| Layout | Index size | Median query | Notes |
|---|---|---|---|
HashMap<String, Vec<Posting>> | 74 MB | 18.4 ms | too many pointer jumps |
sorted term ids + Vec<Posting> | 41 MB | 7.9 ms | less allocation, still bulky |
delta encoded u32 blocks | 26 MB | 4.6 ms | best simple version |
| varint blocks | 17 MB | 5.1 ms | smaller, slightly more CPU |
Those numbers are fictional but the tradeoff is real: “smaller” and “faster” overlap until decoding work becomes the bottleneck.
BM25 is a good first scorer Link to heading
Term overlap is not enough. A rare term should matter more than a common term. A term repeated 40 times should help, but not linearly. Long documents should not win just because they contain more words.
BM25 handles those cases well enough to be the first real scorer:
score(q, d) = sum over query terms:
idf(term) *
(tf * (k1 + 1)) /
(tf + k1 * (1 - b + b * doc_len / avg_doc_len))
The useful parts:
idfrewards rare terms.tfsaturates, so repeated terms stop helping after a point.doc_len / avg_doc_lennormalizes long documents.
The hot path becomes:
fn score_term(
acc: &mut [f32],
postings: &[Posting],
idf: f32,
norms: &[f32],
) {
const K1: f32 = 1.2;
for posting in postings {
let tf = posting.tf as f32;
let norm = norms[posting.doc_id as usize];
let score = idf * (tf * (K1 + 1.0)) / (tf + K1 * norm);
acc[posting.doc_id as usize] += score;
}
}
That array write is the first serious design fork.
If the corpus has 200,000 docs, a dense Vec<f32> accumulator is tiny and fast. If the corpus has 200 million docs, it is absurd. For the local-search shape, a dense accumulator plus a touched-doc list is excellent:
if acc[doc] == 0.0 {
touched.push(doc);
}
acc[doc] += score;
After scoring, only sort the touched docs, or keep a bounded heap if the touched set is large.
The query planner matters Link to heading
The cheapest query is not always “score every term”.
For:
"timeout during shard migration"
during is probably useless. timeout is broad. shard and migration are more selective. Score the selective terms first.
The planner can use document frequency:
terms.sort_by_key(|term| term.doc_freq);
Then apply gates:
| Gate | Example |
|---|---|
| Drop stopwords | the, during, with |
| Cap very broad terms | ignore error if it hits 80% of docs |
| Require at least one selective term | shard or migration must match |
| Expand only rare prefixes | migr* is okay, e* is not |
This is where FST prefix lookup can hurt you. Prefix search is seductive:
mig -> migration, migrate, migrated, migrator
But a short prefix can explode into thousands of terms. The dictionary lookup is cheap. The postings are not.
I like explicit expansion budgets:
struct ExpansionBudget {
max_terms: usize,
max_total_doc_freq: u32,
}
When the expansion crosses the budget, degrade gracefully:
- use only exact terms
- use top-N rare expansions
- ask the reranker to handle fuzzy similarity later
Reranking should be boring and narrow Link to heading
BM25 is lexical. It does not understand every signal you care about.
For a local technical corpus, I usually want extra signals:
| Signal | Why it helps |
|---|---|
| title match | titles are dense summaries |
| path or namespace match | auth/cache beats random mentions of cache |
| recency | recent notes may be more actionable |
| exact phrase | “token refresh” is stronger than separate words |
| proximity | words near each other usually mean more |
| source type | incident, PR, doc, note, log |
The mistake is applying those signals to the full corpus. Rerank the top 100 or 200 BM25 candidates.
fn rerank(query: &Query, mut docs: Vec<ScoredDoc>) -> Vec<ScoredDoc> {
for doc in &mut docs {
doc.score += 1.7 * title_overlap(query, doc);
doc.score += 1.2 * exact_phrase(query, doc);
doc.score += 0.8 * path_overlap(query, doc);
doc.score += 0.4 * recency_boost(doc);
}
docs.sort_by(|a, b| b.score.total_cmp(&a.score));
docs
}
This looks unsophisticated because it is. That is the point. A small deterministic reranker is debuggable. You can print the score breakdown:
{
"doc": "inc_042",
"score": 12.91,
"features": {
"bm25": 8.14,
"title_overlap": 1.70,
"phrase": 1.20,
"path": 0.80,
"recency": 0.32
},
"matched_terms": ["tls", "timeout", "shard", "migration"]
}
This kind of explanation is more useful than a mysterious score with four decimal places.
RRF is a better blender than score normalization Link to heading
At some point you will have more than one retriever:
- BM25 over body text
- exact title matcher
- prefix/fuzzy term matcher
- vector search
- recent-document retriever
The obvious move is to normalize scores and add them. That gets messy because every retriever has a different score distribution. A BM25 score of 12 and a vector similarity of 0.82 do not mean comparable things.
Reciprocal Rank Fusion avoids that by using ranks:
rrf_score(doc) = sum over rankers 1 / (k + rank(doc))
A simple implementation:
fn rrf(lists: &[Vec<DocId>], k: f32) -> HashMap<DocId, f32> {
let mut scores = HashMap::new();
for list in lists {
for (rank, doc) in list.iter().enumerate() {
let contribution = 1.0 / (k + rank as f32 + 1.0);
*scores.entry(*doc).or_insert(0.0) += contribution;
}
}
scores
}
RRF has a nice failure mode: one weird retriever can suggest candidates without dominating the result set. But it is not magic. If every retriever shares the same blind spot, fusion just agrees with itself confidently.
I like using RRF when the retrievers are genuinely different. BM25 plus title exact plus vector search is useful. BM25 body plus BM25 body with a different boost is less useful.
Cache-aware scoring Link to heading
The phrase “fits in L3 cache” is a forcing function, not a universal requirement. Modern CPUs vary wildly. The point is to budget the hot query path like it has to live in a small memory neighborhood.
The hot data is:
| Data | Access pattern |
|---|---|
| term metadata | tiny random reads |
| postings blocks | sequential reads per term |
| doc length norms | random reads by doc id |
| score accumulator | random writes by doc id |
| top-k heap | tiny |
Three tactics helped more than clever scoring:
- Keep doc ids dense.
- Store postings contiguously.
- Keep the accumulator reusable.
Do not allocate a new score vector for every query:
struct QueryScratch {
scores: Vec<f32>,
touched: Vec<u32>,
}
impl QueryScratch {
fn reset(&mut self) {
for doc in self.touched.drain(..) {
self.scores[doc as usize] = 0.0;
}
}
}
This clears only the docs touched by the previous query. It is a small trick, but it removes a full-corpus memset from every request.
Testing search without lying to yourself Link to heading
Search tests are easy to make useless.
This test is weak:
assert!(search("timeout").len() > 0);
It proves the engine returns something. It says nothing about ranking.
Better tests use named queries with expected top results:
{
"query": "tls timeout shard migration",
"must_include_top_3": ["inc_042"],
"must_not_include_top_5": ["note_991"],
"why": "rare terms should beat generic timeout notes"
}
I like keeping a small eval file with adversarial cases:
| Case | What it catches |
|---|---|
| rare term beats common term | broken IDF |
| long noisy doc loses | missing length normalization |
| prefix expansion budget | runaway fuzzy lookup |
| exact phrase boost | bag-of-words weakness |
| title beats body mention | field weighting |
| stale note loses tie | recency tie-break |
For performance, keep benchmark queries separate from quality queries. A query like error is a stress test, not a quality test. A query like oauth token refresh race is closer to what a real user asks.
What surprised me Link to heading
The most counterintuitive part was that the “smart” pieces were rarely the bottleneck.
BM25 was not expensive. Bad candidate generation was expensive.
Fuzzy search was not expensive. Unbounded fuzzy expansion was expensive.
Reranking was not expensive. Reranking thousands of weak candidates was expensive.
The dictionary was not expensive. Turning dictionary matches into scattered postings reads was expensive.
Most of the work was making each stage return a small, useful frontier to the next stage.
Tradeoffs Link to heading
I would not build this for every search problem.
Use a real search engine when you need distributed indexing, complex query syntax, operational tooling, per-tenant isolation, analyzers for many languages, or petabyte-scale retention.
Use a small embedded search engine when:
- the corpus fits on one machine
- rebuilds are acceptable
- query latency matters more than write latency
- explainability matters
- you want deterministic behavior in tests
- the index can be memory-mapped or loaded at startup
The tiny version is not less serious. It just has fewer places to hide mistakes.
Conclusion Link to heading
Building a small search engine is mostly an exercise in refusing to waste work.
Do not scan documents if you can scan postings. Do not expand prefixes without a budget. Do not rerank the whole world. Do not normalize unrelated scores when rank fusion is enough. Do not allocate on the query path if scratch space will do.
The final design is simple:
FST dictionary
-> compact postings
-> BM25 candidate generation
-> narrow deterministic rerank
-> optional RRF blend
-> score explanation
That is enough to make a local search box feel instant, and still keep the ranking behavior understandable when it is wrong.
References Link to heading
- Rust
fstcrate documentation for finite-state transducer backed sets and maps. - SQLite FTS5 documentation for full-text indexing and the built-in
bm25()ranking function. - Tantivy documentation for a Rust full-text search engine and its indexing/query model.
- Introduction to Information Retrieval: Okapi BM25 for the scoring formula, term frequency scaling, and document length normalization.
- Reciprocal Rank Fusion paper by Cormack, Clarke, and Buettcher for rank-based result fusion.