Fuzzy Search: Typos, Tries, and the Algorithms Behind Instant Results

A deep dive into how fuzzy search works — from edit distance and tries to stemming, BK-trees, and how Lucene/Elasticsearch find typo-tolerant matches at scale.

11 min read

You mistype "serch" and Google still returns results for "search". Elasticsearch finds "recieve" when you meant "receive". Your IDE's fuzzy file finder opens the right file even when you drop a letter. How?

The answer is fuzzy search — the ability to find strings that are close to a query, not just identical. Under the hood it combines a handful of elegant data structures and algorithms. This post walks through all of them interactively.

What Does "Close" Mean? Edit Distance

The standard measure of string similarity is Levenshtein distance: the minimum number of single-character operations — insertions, deletions, or substitutions — needed to transform one string into another.

d(\text{"kitten"}, \text{"sitting"}) = 3

Because: k→s (substitute), e→i (substitute), insert g at end.

The classic dynamic-programming solution fills an (m+1) × (n+1) matrix:

d[i][j] = \begin{cases} j & \text{if } i = 0 \\ i & \text{if } j = 0 \\ d[i-1][j-1] & \text{if } s[i] = t[j] \\ 1 + \min(d[i-1][j],\; d[i][j-1],\; d[i-1][j-1]) & \text{otherwise} \end{cases}

This runs in O(mn) time and O(min(m, n)) space (only two rows needed at once):

levenshtein.js
function levenshtein(a, b) {
  const m = a.length, n = b.length
  // single-row DP — O(n) space
  let row = Array.from({ length: n + 1 }, (_, j) => j)
  for (let i = 1; i <= m; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= n; j++) {
      const tmp = row[j]
      row[j] = a[i-1] === b[j-1] ? prev : 1 + Math.min(prev, row[j], row[j-1])
      prev = tmp
    }
  }
  return row[n]
}
💡 Damerau–Levenshtein adds one more operation: transposition (swapping two adjacent characters). "teh" → "the" costs 1 instead of 2. Most user-facing search uses this variant, since transpositions are among the most common typing errors.
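
A minimal sketch of the restricted variant (optimal string alignment), extending the full-matrix DP with one extra case for adjacent transpositions:

```javascript
// Optimal string alignment: Levenshtein plus adjacent transpositions.
// (The restricted Damerau variant — a substring may be transposed only once.)
function osaDistance(a, b) {
  const m = a.length, n = b.length
  const d = Array.from({ length: m + 1 }, (_, i) =>
    Array.from({ length: n + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)))
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match)
      )
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1])
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1) // transposition
    }
  }
  return d[m][n]
}
```

osaDistance("teh", "the") is 1, where plain Levenshtein gives 2.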

The Trie: Prefix Search in O(k) Time

A linear scan through a dictionary — computing Levenshtein against every word — is O(N · m · n) where N is vocabulary size. For millions of words that's unacceptably slow.

A trie (prefix tree) solves prefix search. Every path from root to a node spells out a prefix; marked nodes indicate complete words. Insertions and lookups are O(k) where k is key length, independent of dictionary size.

trie.js
class TrieNode {
  constructor() {
    this.children = {}   // char → TrieNode
    this.isEnd = false
  }
}
 
class Trie {
  constructor() { this.root = new TrieNode() }
 
  insert(word) {
    let node = this.root
    for (const ch of word.toLowerCase()) {
      node.children[ch] ??= new TrieNode()
      node = node.children[ch]
    }
    node.isEnd = true
  }
 
  // Returns true if word exists exactly
  search(word) {
    let node = this.root
    for (const ch of word.toLowerCase()) {
      if (!node.children[ch]) return false
      node = node.children[ch]
    }
    return node.isEnd
  }
}

Memory efficiency: a trie stores each unique prefix only once. The words "search", "searcher", "searching" share a 6-node chain for "search" rather than storing those 6 characters three times.
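
To make that concrete, here's a quick sketch that counts nodes in the trie (the Trie class is repeated from above so the snippet runs standalone; countNodes is an illustrative helper, not part of the class):

```javascript
class TrieNode {
  constructor() {
    this.children = {}   // char → TrieNode
    this.isEnd = false
  }
}

class Trie {
  constructor() { this.root = new TrieNode() }
  insert(word) {
    let node = this.root
    for (const ch of word.toLowerCase()) {
      node.children[ch] ??= new TrieNode()
      node = node.children[ch]
    }
    node.isEnd = true
  }
}

// Count every node except the root — one node per stored character
function countNodes(node) {
  let total = 0
  for (const child of Object.values(node.children))
    total += 1 + countNodes(child)
  return total
}

const trie = new Trie()
for (const w of ["search", "searcher", "searching"]) trie.insert(w)
// Naively: 6 + 8 + 9 = 23 characters. With prefix sharing: 11 nodes.
console.log(countNodes(trie.root)) // → 11
```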

Interactive Playground

Build a trie from your own words, trace prefixes through the structure, and see exactly how many characters are saved by prefix sharing:

① Trie Playground

Add words and watch the trie build. Type in the second box to trace a path through the trie (highlighted in indigo).


② Fuzzy Search Playground

Search 30 words from a search-engine vocabulary and try your own typos, adjusting the maximum edit distance k.

③ Memory Cost vs. Fuzziness

Trie size if all strings within edit distance k were pre-indexed for the 30-word demo vocabulary. Bars use a log scale.


This exponential blowup is why search engines never pre-expand the index. Instead, Lucene constructs a Levenshtein automaton at query time and intersects it with a compressed trie (FST) in O(result count × query length).

Stemming: Normalise Before You Index

Imagine a user searches for "running". Should it match documents about "run", "runs", "runner"? Almost certainly yes.

Stemming reduces a word to its root form before indexing and before querying, so all morphological variants map to the same token:

OriginalStemmed
searchingsearch
searchessearch
searchedsearch
runningrun
runsrun
runnerrun

The Porter Stemmer (1980, still widely used) applies a cascade of string rules:

porter-stemmer-excerpt.js
// Step 1a — plurals (apply only the first rule that matches)
if (/sses$/.test(word))      word = word.replace(/sses$/, 'ss') // caresses → caress
else if (/ies$/.test(word))  word = word.replace(/ies$/, 'i')   // ponies   → poni
else if (!/ss$/.test(word))  word = word.replace(/s$/, '')      // cats     → cat
 
// Step 1b — -eed / -ed / -ing (measure, hasVowel, stem are Porter's helpers)
if (/eed$/.test(word) && measure(word) > 0)
  word = word.replace(/eed$/, 'ee')    // agreed    → agree
else if (/(ed|ing)$/.test(word) && hasVowel(stem(word)))
  word = word.replace(/(ed|ing)$/, '') // plastered → plaster

Stemming is lossy and language-specific. It works well for English inflections but can over-stem ("universe" → "univers", "university" → "univers") or under-stem. For many languages, lemmatisation (dictionary lookup of the canonical form) is more accurate but slower.

In practice you apply stemming on top of fuzzy matching: first stem the query and all indexed terms, then do fuzzy matching on the stems. This way you catch both morphological variants and typos.
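
A sketch of that layering, using a deliberately naive suffix stripper in place of a real stemmer (toyStem and fuzzyStemMatch are illustrative names; production code would use Porter or a lemmatiser):

```javascript
// Row-based Levenshtein, as earlier in the post
function levenshtein(a, b) {
  let row = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]
      row[j] = a[i - 1] === b[j - 1] ? prev : 1 + Math.min(prev, row[j], row[j - 1])
      prev = tmp
    }
  }
  return row[b.length]
}

// Deliberately naive stemmer: strip one common English suffix
function toyStem(word) {
  return word.toLowerCase().replace(/(?:ing|ed|es|s)$/, '')
}

// Stem both sides first, then fuzzy-match the stems
function fuzzyStemMatch(query, term, maxDist) {
  return levenshtein(toyStem(query), toyStem(term)) <= maxDist
}

// "serching" stems to "serch", "searched" stems to "search" — distance 1,
// so a morphological variant plus a typo still matches:
console.log(fuzzyStemMatch("serching", "searched", 1)) // → true
```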

Fuzzy Trie Traversal

Now that we have a trie, how do we search it fuzzily?

The naive approach — enumerate all strings within edit distance k and look each up in the trie — is too slow and memory-hungry (see the Memory section below).

A better approach traverses the trie with a running Levenshtein row. At each trie node we maintain the current DP row for the path from root to that node. If the minimum value in the row ever exceeds k, we prune the entire subtree:

fuzzy-trie-search.js
function fuzzySearch(trie, query, maxDist) {
  const results = []
  const initial = Array.from({ length: query.length + 1 }, (_, i) => i)
 
  function dfs(node, ch, prefix, prevRow) {
    const curr = [prevRow[0] + 1]
    for (let j = 1; j <= query.length; j++) {
      curr[j] = query[j-1] === ch
        ? prevRow[j-1]
        : 1 + Math.min(prevRow[j], curr[j-1], prevRow[j-1])
    }
 
    // Prune: if the whole row exceeds maxDist, no descendant can match
    if (Math.min(...curr) > maxDist) return
 
    if (node.isEnd && curr[query.length] <= maxDist)
      results.push({ word: prefix, dist: curr[query.length] })
 
    for (const [c, child] of Object.entries(node.children))
      dfs(child, c, prefix + c, curr)
  }
 
  for (const [c, child] of Object.entries(trie.root.children))
    dfs(child, c, c, initial)
 
  return results.sort((a, b) => a.dist - b.dist)
}

This prunes huge chunks of the search space and runs in O(result count × |query|²) in practice — much better than exhaustive enumeration.
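
Putting the pieces together, a usage sketch (the Trie class and fuzzySearch are repeated from above so the snippet runs standalone):

```javascript
class TrieNode {
  constructor() { this.children = {}; this.isEnd = false }
}

class Trie {
  constructor() { this.root = new TrieNode() }
  insert(word) {
    let node = this.root
    for (const ch of word.toLowerCase()) {
      node.children[ch] ??= new TrieNode()
      node = node.children[ch]
    }
    node.isEnd = true
  }
}

function fuzzySearch(trie, query, maxDist) {
  const results = []
  const initial = Array.from({ length: query.length + 1 }, (_, i) => i)
  function dfs(node, ch, prefix, prevRow) {
    const curr = [prevRow[0] + 1]
    for (let j = 1; j <= query.length; j++) {
      curr[j] = query[j - 1] === ch
        ? prevRow[j - 1]
        : 1 + Math.min(prevRow[j], curr[j - 1], prevRow[j - 1])
    }
    if (Math.min(...curr) > maxDist) return // prune the whole subtree
    if (node.isEnd && curr[query.length] <= maxDist)
      results.push({ word: prefix, dist: curr[query.length] })
    for (const [c, child] of Object.entries(node.children))
      dfs(child, c, prefix + c, curr)
  }
  for (const [c, child] of Object.entries(trie.root.children))
    dfs(child, c, c, initial)
  return results.sort((a, b) => a.dist - b.dist)
}

const dict = new Trie()
for (const w of ["search", "searching", "fuzzy", "trie"]) dict.insert(w)
console.log(fuzzySearch(dict, "serch", 1)) // → [{ word: "search", dist: 1 }]
```

Note how "searching" never reaches its word-end node: one character past "search" the row minimum climbs to 2 and the branch is pruned.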

BK-Trees: Fuzzy Search as a Metric Space Problem

A BK-tree (Burkhard–Keller, 1973) is a tree built for metric spaces — any space whose distance function is a true metric and, in particular, satisfies the triangle inequality. Levenshtein distance qualifies.

Building: insert the first word as the root. Each subsequent word w goes under the edge labelled d(node, w): if that edge is empty, attach w there; otherwise recurse into the existing child.

Searching: to find all words within distance k of a query q, compute d = d(q, n) at each visited node n. By the triangle inequality, a match can only sit under child edges whose weights lie in [d − k, d + k]; prune every child outside that range.

bk-tree.js
class BKTree {
  constructor() { this.root = null }
 
  insert(word) {
    if (!this.root) { this.root = { word, children: {} }; return }
    let node = this.root
    while (true) {
      const d = levenshtein(word, node.word)
      if (d === 0) return // duplicate
      if (!node.children[d]) { node.children[d] = { word, children: {} }; return }
      node = node.children[d]
    }
  }
 
  search(query, maxDist) {
    const results = []
    const stack = [this.root]
    while (stack.length) {
      const node = stack.pop()
      const d = levenshtein(query, node.word)
      if (d <= maxDist) results.push({ word: node.word, dist: d })
      for (const [edge, child] of Object.entries(node.children)) {
        if (Math.abs(d - Number(edge)) <= maxDist)
          stack.push(child)
      }
    }
    return results
  }
}
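
A quick usage sketch of the class (the levenshtein function and BKTree are repeated from above so the snippet runs standalone):

```javascript
// Row-based Levenshtein, as earlier in the post
function levenshtein(a, b) {
  let row = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]
      row[j] = a[i - 1] === b[j - 1] ? prev : 1 + Math.min(prev, row[j], row[j - 1])
      prev = tmp
    }
  }
  return row[b.length]
}

class BKTree {
  constructor() { this.root = null }
  insert(word) {
    if (!this.root) { this.root = { word, children: {} }; return }
    let node = this.root
    while (true) {
      const d = levenshtein(word, node.word)
      if (d === 0) return // duplicate
      if (!node.children[d]) { node.children[d] = { word, children: {} }; return }
      node = node.children[d]
    }
  }
  search(query, maxDist) {
    const results = []
    const stack = [this.root]
    while (stack.length) {
      const node = stack.pop()
      const d = levenshtein(query, node.word)
      if (d <= maxDist) results.push({ word: node.word, dist: d })
      for (const [edge, child] of Object.entries(node.children))
        if (Math.abs(d - Number(edge)) <= maxDist) stack.push(child)
    }
    return results
  }
}

const tree = new BKTree()
for (const w of ["search", "serve", "fuzzy", "trie"]) tree.insert(w)
// Only "search" survives the triangle-inequality pruning:
console.log(tree.search("serch", 1)) // → [{ word: "search", dist: 1 }]
```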

BK-trees work well for spell checkers — a static dictionary that never changes. For a dynamic index with millions of documents, inverted indexes (see below) are preferred.

The Memory Problem

What if you tried to pre-expand the index — storing every word and all of its typo variants up to distance kk?

For a word of length L over a 26-letter alphabet, the number of single-edit variants (counting operations, without deduplication) is:

\underbrace{L \cdot 25}_{\text{substitutions}} + \underbrace{L}_{\text{deletions}} + \underbrace{(L+1) \cdot 26}_{\text{insertions}} = 52L + 26

For L = 5 that is 286 variants per word. At k = 2 the count grows roughly quadratically; at k = 3 it explodes into the millions. The playground above measured this exactly for a 30-word vocabulary:

The bars show why no production system pre-expands — the exponential memory blowup is the entire reason Lucene uses on-the-fly Levenshtein automata instead.
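
An enumeration sketch that confirms the 52L + 26 count (it counts edit operations, so variants that several operations happen to produce appear more than once):

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

// All strings one edit operation away from `word` (duplicates included)
function edits1(word) {
  const out = []
  for (let i = 0; i <= word.length; i++) {
    for (const c of ALPHABET)
      out.push(word.slice(0, i) + c + word.slice(i))           // (L+1)·26 insertions
    if (i < word.length) {
      out.push(word.slice(0, i) + word.slice(i + 1))           // L deletions
      for (const c of ALPHABET)
        if (c !== word[i])
          out.push(word.slice(0, i) + c + word.slice(i + 1))   // L·25 substitutions
    }
  }
  return out
}

console.log(edits1("fuzzy").length) // 52·5 + 26 = 286
```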

How Lucene and Elasticsearch Do It

The Inverted Index

Lucene's foundation is the inverted index: a map from each term to the list of documents containing it.

"search"   → [doc1, doc5, doc12, …]
"searcher" → [doc3, doc9, …]
"fuzzy"    → [doc2, doc7, …]

During indexing, text is analysed (tokenised → lowercased → stemmed → stop-words removed) and each resulting token is added to the inverted index. The index is stored as an FST (Finite State Transducer) — essentially a compressed, immutable trie that maps terms to their posting lists.
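
A toy version of that analysis-and-index step — a plain Map standing in for the FST, and a bare whitespace/lowercase split standing in for the full analysis chain (no stemming or stop words here):

```javascript
// term → sorted list of document ids (a toy posting list)
function buildInvertedIndex(docs) {
  const index = new Map()
  docs.forEach((text, docId) => {
    // toy analysis: split on non-word characters, lowercase
    for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(token)) index.set(token, [])
      const postings = index.get(token)
      // docs arrive in order, so appending keeps postings sorted
      if (postings[postings.length - 1] !== docId) postings.push(docId)
    }
  })
  return index
}

const index = buildInvertedIndex([
  "Fuzzy search with tries",   // doc 0
  "Search engines at scale",   // doc 1
])
console.log(index.get("search")) // → [0, 1]
```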

Levenshtein Automata (Lucene's Fuzzy Secret)

When you issue a fuzzy query like fuzzy:"serch"~1, Lucene does not compute edit distance against every term. Instead it builds a Levenshtein automaton — a DFA that accepts exactly the set of strings within edit distance k of your query.

Constructing this DFA is a one-time O(|q|² · k) operation. Lucene then intersects the DFA with the FST term dictionary. The intersection enumerates only the matching terms, without touching the rest. This is O(matches × |q|) — fantastically fast.

Query "serch", k=1
↓ Build Levenshtein DFA  ←─── O(|q|² · k)
↓ Intersect with FST term dict  ←─── O(matches · |q|)
↓ Retrieve posting lists for matched terms
↓ Score and rank documents

Lucene's FuzzyQuery caps maxEdits at 2 — for good reason. At k ≥ 3 the automaton intersects too many terms and performance degrades rapidly. Elasticsearch exposes this as fuzziness: "AUTO", which applies k = 0 for words of length ≤ 2, k = 1 for length 3–5, and k = 2 for longer words.
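
The AUTO rule is simple enough to state in code (a sketch of the mapping, not Elasticsearch's actual implementation):

```javascript
// fuzziness: "AUTO" → max edit distance, chosen by term length
function autoFuzziness(term) {
  if (term.length <= 2) return 0 // very short terms: exact match only
  if (term.length <= 5) return 1
  return 2
}

console.log(autoFuzziness("to"))      // → 0
console.log(autoFuzziness("serch"))   // → 1
console.log(autoFuzziness("recieve")) // → 2
```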

Elasticsearch Fuzzy Queries

{
  "query": {
    "fuzzy": {
      "title": {
        "value": "serch",
        "fuzziness": "AUTO",
        "prefix_length": 1,
        "max_expansions": 50
      }
    }
  }
}

Key parameters:

ParameterEffect
fuzzinessAUTO, 0, 1, or 2 — max edit distance
prefix_lengthCharacters that must match exactly at the start. Higher = faster, less recall
max_expansionsCap on how many terms the automaton expands to
transpositionsWhether to count swaps as 1 edit (Damerau–Levenshtein)

prefix_length: 1 is a common optimisation: require the first character to match exactly, which prunes ~96% of the term dictionary before the automaton even runs.

Inside Elasticsearch at Scale

Elasticsearch distributes the inverted index across shards. A fuzzy query fans out to every shard, each independently intersecting the Levenshtein automaton with its local FST. Results are then merged and ranked by relevance score (BM25) on the coordinating node.

Client
  │ fuzzy query

Coordinating node
  ├── Shard 0  ── FST intersection ── partial hits
  ├── Shard 1  ── FST intersection ── partial hits
  └── Shard N  ── FST intersection ── partial hits
  │ merge + rank (BM25)

Top-k results

Because each shard's FST is immutable once written (Lucene segments are write-once), the whole index can be memory-mapped — the OS page cache does the caching automatically.

Putting It All Together

A production fuzzy search pipeline looks like this:

  1. Indexing: tokenise → lowercase → stem → insert into trie / inverted index (FST)
  2. Query time: tokenise query → lowercase → (optionally) stem → build Levenshtein automaton → intersect with term FST → fetch posting lists → rank
  3. Tune fuzziness: prefix_length for speed, fuzziness: AUTO for recall, max_expansions to cap cost

The layering is key — stemming catches morphological variants for free (no edit distance needed), and fuzzy matching only has to handle genuine typos and OCR errors. Combining both gives you robust, user-forgiving search without blowing up your index size.

Further Reading

  • Lucene source: org.apache.lucene.search.FuzzyQuery and org.apache.lucene.util.automaton.LevenshteinAutomata
  • Schulz & Mihov (2002) — Fast String Correction with Levenshtein Automata (the paper behind Lucene's implementation)
  • Elasticsearch documentation — Fuzzy query
  • Norvig (2007) — How to Write a Spelling Corrector — a beautiful candidate-generation approach (enumerating all edits of the query) in about 20 lines of Python
  • Porter (1980) — An algorithm for suffix stripping — the original stemmer paper
