The first version of my search engine was slower after I added an index.

That sounds backwards, but it is a real failure mode. A bad index can turn a simple sequential scan into a pile of cache misses, hash lookups, tiny heap allocations, branchy score calculations, and random memory walks. The CPU stops doing search and starts waiting for memory.

The target I wanted was intentionally unreasonable: a local search engine for a few hundred thousand short technical records that could answer ranked queries inside an interactive UI budget, while keeping the hot path small enough to stay friendly to an L3 cache.

Not “web search”. Not “vector database”. Not “throw Elasticsearch at it”.

Something closer to this:

$ tinysearch "tls timeout shard migration"

score   doc       title
12.91   inc_042   TLS handshakes stall during shard move
10.44   note_118  Connection pool timeout under deploy load
 8.02   pr_077    Retry budget for migration workers

The interesting part is not that this can be built. The interesting part is where the naive version falls apart.

The naive implementation Link to heading

The first implementation usually looks like this:

fn search(query: &str, docs: &[Doc]) -> Vec<ScoredDoc> {
    let query_terms = tokenize(query);
    let mut hits = Vec::new();

    for doc in docs {
        let text = doc.search_text.to_lowercase();
        let mut score = 0.0;

        for term in &query_terms {
            if text.contains(term) {
                score += 1.0;
            }
        }

        if score > 0.0 {
            hits.push(ScoredDoc { id: doc.id, score });
        }
    }

    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits
}

This is a great prototype. It is also a trap.

It fails in four different ways:

FailureWhy it hurts
Every query scans every documentLatency grows with corpus size, not result set size
contains is not retrievalIt has no document frequency, no term saturation, no field weighting
Strings dominate runtimeLowercasing, allocation, and UTF-8 scans repeat per query
Ranking is fakeA document with one rare exact term can lose to a long noisy document

The bad version gets worse when you try to fix it casually. A HashMap<String, Vec<DocId>> inverted index is faster than scanning, but still allocates heavily and jumps around memory. A fuzzy layer adds more candidate terms, which adds more postings lists, which adds more random reads. A reranker improves quality, but if it receives 5,000 bad candidates, it just makes the slow path more expensive.

The lesson is uncomfortable: search quality and memory layout are the same problem earlier than you expect.

The better shape Link to heading

The design I like for this size of engine has five layers:

flowchart LR
    Q["query text"] --> N["normalize + tokenize"]
    N --> D["FST term dictionary"]
    D --> P["compact postings"]
    P --> B["BM25 top-k"]
    B --> R["feature rerank"]
    R --> F["rank fusion + explanation"]

The split matters because every stage has a different job:

StageJobOutput
NormalizerMake equivalent strings comparablequery terms
FST dictionaryMap term bytes to term ids and support prefix/fuzzy lookupcandidate terms
Postings storeReturn doc ids and term frequencies cheaplycandidate docs
BM25 scorerDo fast sparse rankingtop-k docs
RerankerApply slower contextual signals to a small setfinal docs
RRF/fusionBlend independent rankers without trusting raw score scalesfinal order

The rule is simple: broad operations must be cheap, and expensive operations must be narrow.

Build the dictionary first Link to heading

A finite-state transducer is a compact way to represent a sorted set or map of strings. In Rust, the fst crate is a good mental model: it can store ordered byte keys and associate each key with a value. For a search engine, the value is usually a term id.

The dictionary answers:

  • Does this term exist?
  • What is its internal term id?
  • Which terms match this prefix?
  • Which terms are close enough for typo tolerance?

The important property is that the dictionary is immutable and compact. That makes it boring in the best way: build it once, memory-map it, and keep query-time work predictable.

use fst::Map;

struct TermDict {
    map: Map<Vec<u8>>,
}

impl TermDict {
    fn term_id(&self, term: &str) -> Option<u32> {
        self.map.get(term).map(|id| id as u32)
    }
}

The input has to be sorted:

let mut terms: Vec<(String, u64)> = vocabulary
    .into_iter()
    .enumerate()
    .map(|(id, term)| (term, id as u64))
    .collect();

terms.sort_by(|a, b| a.0.cmp(&b.0));

let mut builder = fst::MapBuilder::memory();
for (term, id) in terms {
    builder.insert(term, id)?;
}
let bytes = builder.into_inner()?;

This is not the whole index. It is the front door. The dictionary should be tiny compared to the postings.

Postings are where performance is won Link to heading

An inverted index is just:

term -> [(doc_id, term_frequency), ...]

The layout decides whether queries are fast.

The naive representation is a map of vectors:

HashMap<TermId, Vec<Posting>>

That is fine while building. It is not ideal for query-time. The query path wants contiguous memory:

postings.bin
  [doc_delta, tf, doc_delta, tf, doc_delta, tf, ...]

terms.meta
  term_id -> { offset, len, doc_freq }

Instead of storing absolute doc ids, store deltas:

docs:   101, 108, 130, 131
deltas: 101,   7,  22,   1

Then encode small integers with a variable-length format. For small corpora, even a plain u32 array can be faster because decoding is branchless and simple. Compression is not automatically a win. The question is whether you are limited by memory bandwidth, cache capacity, or integer decode overhead.

For the fake corpus I used while tuning, the layouts looked like this:

LayoutIndex sizeMedian queryNotes
HashMap<String, Vec<Posting>>74 MB18.4 mstoo many pointer jumps
sorted term ids + Vec<Posting>41 MB7.9 msless allocation, still bulky
delta encoded u32 blocks26 MB4.6 msbest simple version
varint blocks17 MB5.1 mssmaller, slightly more CPU

Those numbers are fictional but the tradeoff is real: “smaller” and “faster” overlap until decoding work becomes the bottleneck.

BM25 is a good first scorer Link to heading

Term overlap is not enough. A rare term should matter more than a common term. A term repeated 40 times should help, but not linearly. Long documents should not win just because they contain more words.

BM25 handles those cases well enough to be the first real scorer:

score(q, d) = sum over query terms:

  idf(term) *
  (tf * (k1 + 1)) /
  (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

The useful parts:

  • idf rewards rare terms.
  • tf saturates, so repeated terms stop helping after a point.
  • doc_len / avg_doc_len normalizes long documents.

The hot path becomes:

fn score_term(
    acc: &mut [f32],
    postings: &[Posting],
    idf: f32,
    norms: &[f32],
) {
    const K1: f32 = 1.2;

    for posting in postings {
        let tf = posting.tf as f32;
        let norm = norms[posting.doc_id as usize];
        let score = idf * (tf * (K1 + 1.0)) / (tf + K1 * norm);
        acc[posting.doc_id as usize] += score;
    }
}

That array write is the first serious design fork.

If the corpus has 200,000 docs, a dense Vec<f32> accumulator is tiny and fast. If the corpus has 200 million docs, it is absurd. For the local-search shape, a dense accumulator plus a touched-doc list is excellent:

if acc[doc] == 0.0 {
    touched.push(doc);
}
acc[doc] += score;

After scoring, only sort the touched docs, or keep a bounded heap if the touched set is large.

The query planner matters Link to heading

The cheapest query is not always “score every term”.

For:

"timeout during shard migration"

during is probably useless. timeout is broad. shard and migration are more selective. Score the selective terms first.

The planner can use document frequency:

terms.sort_by_key(|term| term.doc_freq);

Then apply gates:

GateExample
Drop stopwordsthe, during, with
Cap very broad termsignore error if it hits 80% of docs
Require at least one selective termshard or migration must match
Expand only rare prefixesmigr* is okay, e* is not

This is where FST prefix lookup can hurt you. Prefix search is seductive:

mig -> migration, migrate, migrated, migrator

But a short prefix can explode into thousands of terms. The dictionary lookup is cheap. The postings are not.

I like explicit expansion budgets:

struct ExpansionBudget {
    max_terms: usize,
    max_total_doc_freq: u32,
}

When the expansion crosses the budget, degrade gracefully:

  • use only exact terms
  • use top-N rare expansions
  • ask the reranker to handle fuzzy similarity later

Reranking should be boring and narrow Link to heading

BM25 is lexical. It does not understand every signal you care about.

For a local technical corpus, I usually want extra signals:

SignalWhy it helps
title matchtitles are dense summaries
path or namespace matchauth/cache beats random mentions of cache
recencyrecent notes may be more actionable
exact phrase“token refresh” is stronger than separate words
proximitywords near each other usually mean more
source typeincident, PR, doc, note, log

The mistake is applying those signals to the full corpus. Rerank the top 100 or 200 BM25 candidates.

fn rerank(query: &Query, mut docs: Vec<ScoredDoc>) -> Vec<ScoredDoc> {
    for doc in &mut docs {
        doc.score += 1.7 * title_overlap(query, doc);
        doc.score += 1.2 * exact_phrase(query, doc);
        doc.score += 0.8 * path_overlap(query, doc);
        doc.score += 0.4 * recency_boost(doc);
    }

    docs.sort_by(|a, b| b.score.total_cmp(&a.score));
    docs
}

This looks unsophisticated because it is. That is the point. A small deterministic reranker is debuggable. You can print the score breakdown:

{
  "doc": "inc_042",
  "score": 12.91,
  "features": {
    "bm25": 8.14,
    "title_overlap": 1.70,
    "phrase": 1.20,
    "path": 0.80,
    "recency": 0.32
  },
  "matched_terms": ["tls", "timeout", "shard", "migration"]
}

This kind of explanation is more useful than a mysterious score with four decimal places.

RRF is a better blender than score normalization Link to heading

At some point you will have more than one retriever:

  • BM25 over body text
  • exact title matcher
  • prefix/fuzzy term matcher
  • vector search
  • recent-document retriever

The obvious move is to normalize scores and add them. That gets messy because every retriever has a different score distribution. A BM25 score of 12 and a vector similarity of 0.82 do not mean comparable things.

Reciprocal Rank Fusion avoids that by using ranks:

rrf_score(doc) = sum over rankers 1 / (k + rank(doc))

A simple implementation:

fn rrf(lists: &[Vec<DocId>], k: f32) -> HashMap<DocId, f32> {
    let mut scores = HashMap::new();

    for list in lists {
        for (rank, doc) in list.iter().enumerate() {
            let contribution = 1.0 / (k + rank as f32 + 1.0);
            *scores.entry(*doc).or_insert(0.0) += contribution;
        }
    }

    scores
}

RRF has a nice failure mode: one weird retriever can suggest candidates without dominating the result set. But it is not magic. If every retriever shares the same blind spot, fusion just agrees with itself confidently.

I like using RRF when the retrievers are genuinely different. BM25 plus title exact plus vector search is useful. BM25 body plus BM25 body with a different boost is less useful.

Cache-aware scoring Link to heading

The phrase “fits in L3 cache” is a forcing function, not a universal requirement. Modern CPUs vary wildly. The point is to budget the hot query path like it has to live in a small memory neighborhood.

The hot data is:

DataAccess pattern
term metadatatiny random reads
postings blockssequential reads per term
doc length normsrandom reads by doc id
score accumulatorrandom writes by doc id
top-k heaptiny

Three tactics helped more than clever scoring:

  1. Keep doc ids dense.
  2. Store postings contiguously.
  3. Keep the accumulator reusable.

Do not allocate a new score vector for every query:

struct QueryScratch {
    scores: Vec<f32>,
    touched: Vec<u32>,
}

impl QueryScratch {
    fn reset(&mut self) {
        for doc in self.touched.drain(..) {
            self.scores[doc as usize] = 0.0;
        }
    }
}

This clears only the docs touched by the previous query. It is a small trick, but it removes a full-corpus memset from every request.

Testing search without lying to yourself Link to heading

Search tests are easy to make useless.

This test is weak:

assert!(search("timeout").len() > 0);

It proves the engine returns something. It says nothing about ranking.

Better tests use named queries with expected top results:

{
  "query": "tls timeout shard migration",
  "must_include_top_3": ["inc_042"],
  "must_not_include_top_5": ["note_991"],
  "why": "rare terms should beat generic timeout notes"
}

I like keeping a small eval file with adversarial cases:

CaseWhat it catches
rare term beats common termbroken IDF
long noisy doc losesmissing length normalization
prefix expansion budgetrunaway fuzzy lookup
exact phrase boostbag-of-words weakness
title beats body mentionfield weighting
stale note loses tierecency tie-break

For performance, keep benchmark queries separate from quality queries. A query like error is a stress test, not a quality test. A query like oauth token refresh race is closer to what a real user asks.

What surprised me Link to heading

The most counterintuitive part was that the “smart” pieces were rarely the bottleneck.

BM25 was not expensive. Bad candidate generation was expensive.

Fuzzy search was not expensive. Unbounded fuzzy expansion was expensive.

Reranking was not expensive. Reranking thousands of weak candidates was expensive.

The dictionary was not expensive. Turning dictionary matches into scattered postings reads was expensive.

Most of the work was making each stage return a small, useful frontier to the next stage.

Tradeoffs Link to heading

I would not build this for every search problem.

Use a real search engine when you need distributed indexing, complex query syntax, operational tooling, per-tenant isolation, analyzers for many languages, or petabyte-scale retention.

Use a small embedded search engine when:

  • the corpus fits on one machine
  • rebuilds are acceptable
  • query latency matters more than write latency
  • explainability matters
  • you want deterministic behavior in tests
  • the index can be memory-mapped or loaded at startup

The tiny version is not less serious. It just has fewer places to hide mistakes.

Conclusion Link to heading

Building a small search engine is mostly an exercise in refusing to waste work.

Do not scan documents if you can scan postings. Do not expand prefixes without a budget. Do not rerank the whole world. Do not normalize unrelated scores when rank fusion is enough. Do not allocate on the query path if scratch space will do.

The final design is simple:

FST dictionary
  -> compact postings
  -> BM25 candidate generation
  -> narrow deterministic rerank
  -> optional RRF blend
  -> score explanation

That is enough to make a local search box feel instant, and still keep the ranking behavior understandable when it is wrong.

References Link to heading