A deep dive into how fuzzy search works — from edit distance and tries to stemming, BK-trees, and how Lucene/Elasticsearch find typo-tolerant matches at scale.
You mistype "serch" and Google still returns results for "search". Elasticsearch finds "recieve" when you meant "receive". Your IDE's fuzzy file finder opens the right file even when you drop a letter. How?
The answer is fuzzy search — the ability to find strings that are close to a query, not just identical. Under the hood it combines a handful of elegant data structures and algorithms. This post walks through all of them interactively.
What Does "Close" Mean? Edit Distance
The standard measure of string similarity is Levenshtein distance: the minimum number of single-character operations — insertions, deletions, or substitutions — needed to transform one string into another.
For example, levenshtein("kitten", "sitting") = 3. Because: k→s (substitute), e→i (substitute), insert g at end.
The classic dynamic-programming solution fills an (m + 1) × (n + 1) matrix, where m and n are the lengths of the two strings.
This runs in O(mn) time and O(n) space (only two rows are needed at once):
```javascript
function levenshtein(a, b) {
  const m = a.length, n = b.length
  // single-row DP — O(n) space
  let row = Array.from({ length: n + 1 }, (_, j) => j)
  for (let i = 1; i <= m; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= n; j++) {
      const tmp = row[j]
      row[j] = a[i-1] === b[j-1] ? prev : 1 + Math.min(prev, row[j], row[j-1])
      prev = tmp
    }
  }
  return row[n]
}
```

Damerau–Levenshtein adds one more operation: transposition (swapping two adjacent characters). "teh" → "the" costs 1 instead of 2. Most user-facing search uses this variant, since transpositions are among the most common typing errors.
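For illustration, here is a minimal sketch of the restricted Damerau variant ("optimal string alignment"): the same rolling-row DP, keeping one extra row so a transposition can look back two characters. The name `osaDistance` is ours, not a standard API:

```javascript
// Restricted Damerau–Levenshtein (optimal string alignment):
// Levenshtein plus adjacent-character transpositions.
function osaDistance(a, b) {
  const n = b.length
  let prev2 = null                                       // row i-2
  let prev = Array.from({ length: n + 1 }, (_, j) => j)  // row i-1
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1,        // deletion
                         curr[j - 1] + 1,    // insertion
                         prev[j - 1] + cost) // substitution / match
      // Transposition: "teh" vs "the" — a swap costs 1, not 2
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1])
        curr[j] = Math.min(curr[j], prev2[j - 2] + 1)
    }
    prev2 = prev
    prev = curr
  }
  return prev[n]
}
```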
The Trie: Prefix Search in O(k) Time
A linear scan through a dictionary — computing Levenshtein against every word — costs O(N · |q| · |w|) time, where N is the vocabulary size and |q|, |w| are the query and word lengths. For millions of words that's unacceptably slow.
A trie (prefix tree) solves prefix search. Every path from the root to a node spells out a prefix; marked nodes indicate complete words. Insertions and lookups are O(k), where k is the key length, independent of dictionary size.
```javascript
class TrieNode {
  constructor() {
    this.children = {} // char → TrieNode
    this.isEnd = false
  }
}

class Trie {
  constructor() { this.root = new TrieNode() }

  insert(word) {
    let node = this.root
    for (const ch of word.toLowerCase()) {
      node.children[ch] ??= new TrieNode()
      node = node.children[ch]
    }
    node.isEnd = true
  }

  // Returns true if word exists exactly
  search(word) {
    let node = this.root
    for (const ch of word.toLowerCase()) {
      if (!node.children[ch]) return false
      node = node.children[ch]
    }
    return node.isEnd
  }
}
```

Memory efficiency: a trie stores each unique prefix only once. The words "search", "searcher", and "searching" share a 6-node chain for "search" rather than storing those 6 characters three times.
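Prefix completion falls out of the same structure almost for free. Here is a standalone sketch using plain nested objects as nodes — `makeTrie` and `complete` are illustrative names, and `$` is assumed absent from the input alphabet:

```javascript
// Standalone nested-object trie with prefix completion.
// `$` marks end-of-word and is assumed not to appear in input.
function makeTrie(words) {
  const root = {}
  for (const w of words) {
    let node = root
    for (const ch of w.toLowerCase()) node = node[ch] ??= {}
    node.$ = true
  }
  return root
}

// All words in the trie that start with `prefix`.
function complete(root, prefix) {
  let node = root
  for (const ch of prefix.toLowerCase()) {
    node = node[ch]
    if (!node) return []
  }
  const out = []
  const walk = (n, suffix) => {
    if (n.$) out.push(prefix + suffix)
    for (const ch of Object.keys(n))
      if (ch !== '$') walk(n[ch], suffix + ch)
  }
  walk(node, '')
  return out
}
```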
Interactive Playground
Build a trie from your own words, trace prefixes through the structure, and see exactly how many characters are saved by prefix sharing:
① Trie Playground
Add words and watch the trie build. Type in the second box to trace a path through the trie (highlighted in indigo).
② Fuzzy Search Playground
Search 30 words from a search-engine vocabulary and try a few typos of your own.
③ Memory Cost vs. Fuzziness
Trie size if all strings within edit distance k were pre-indexed for the 30-word demo vocabulary. Bars use a log scale.
This exponential blowup is why search engines never pre-expand the index. Instead, Lucene constructs a Levenshtein automaton at query time and intersects it with a compressed trie (FST) in O(result count × query length).
Stemming: Normalise Before You Index
Imagine a user searches for "running". Should it match documents about "run", "runs", "runner"? Almost certainly yes.
Stemming reduces a word to its root form before indexing and before querying, so all morphological variants map to the same token:
| Original | Stemmed |
|---|---|
| searching | search |
| searches | search |
| searched | search |
| running | run |
| runs | run |
| runner | run |
The Porter Stemmer (1980, still widely used) applies a cascade of string rules:
```javascript
// Step 1a — plurals (longest matching suffix wins, so the rules
// are tried in order rather than applied as a cascade)
if (/sses$/.test(word))     word = word.replace(/sses$/, 'ss') // caresses → caress
else if (/ies$/.test(word)) word = word.replace(/ies$/, 'i')   // ponies → poni
else if (!/ss$/.test(word)) word = word.replace(/s$/, '')      // cats → cat

// Step 1b — -ing / -ed
// (measure(), hasVowel(), stem() as defined in the Porter paper)
if (/eed$/.test(word) && measure(word) > 0)
  word = word.replace(/eed$/, 'ee')     // agreed → agree
else if (/(ed|ing)$/.test(word) && hasVowel(stem(word)))
  word = word.replace(/(ed|ing)$/, '')  // plastered → plaster
```

Stemming is lossy and language-specific. It works well for English inflections but can over-stem ("universe" → "univers", "university" → "univers") or under-stem. For many languages, lemmatisation (dictionary lookup of the canonical form) is more accurate but slower.
In practice you apply stemming on top of fuzzy matching: first stem the query and all indexed terms, then do fuzzy matching on the stems. This way you catch both morphological variants and typos.
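A minimal sketch of that layering — `crudeStem` is a deliberately crude stand-in for a real Porter stemmer, and `fuzzyStemMatch` is an illustrative name (the `levenshtein` helper is repeated here so the sketch is self-contained):

```javascript
// Crude stand-in for a real stemmer: strip a few common suffixes.
function crudeStem(word) {
  return word.toLowerCase().replace(/(ing|ed|es|s)$/, '')
}

// Single-row Levenshtein DP (as earlier in the post).
function levenshtein(a, b) {
  let row = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let prev = row[0]
    row[0] = i
    for (let j = 1; j <= b.length; j++) {
      const tmp = row[j]
      row[j] = a[i-1] === b[j-1] ? prev : 1 + Math.min(prev, row[j], row[j-1])
      prev = tmp
    }
  }
  return row[b.length]
}

// Stem both sides first, then fuzzy-match the stems:
// "serching" → "serch", "searches" → "search", distance 1.
function fuzzyStemMatch(query, term, maxDist = 1) {
  return levenshtein(crudeStem(query), crudeStem(term)) <= maxDist
}
```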
Fuzzy Trie Traversal
Now that we have a trie, how do we search it fuzzily?
The naive approach — enumerate all strings within edit distance k and look each one up in the trie — is too slow and memory-hungry (see the Memory section below).
A better approach traverses the trie with a running Levenshtein row. At each trie node we maintain the current DP row for the path from the root to that node. If the minimum value in the row ever exceeds the threshold k, we prune the entire subtree:
```javascript
function fuzzySearch(trie, query, maxDist) {
  const results = []
  const initial = Array.from({ length: query.length + 1 }, (_, i) => i)

  function dfs(node, ch, prefix, prevRow) {
    const curr = [prevRow[0] + 1]
    for (let j = 1; j <= query.length; j++) {
      curr[j] = query[j-1] === ch
        ? prevRow[j-1]
        : 1 + Math.min(prevRow[j], curr[j-1], prevRow[j-1])
    }
    // Prune: if the whole row exceeds maxDist, no descendant can match
    if (Math.min(...curr) > maxDist) return
    if (node.isEnd && curr[query.length] <= maxDist)
      results.push({ word: prefix, dist: curr[query.length] })
    for (const [c, child] of Object.entries(node.children))
      dfs(child, c, prefix + c, curr)
  }

  for (const [c, child] of Object.entries(trie.root.children))
    dfs(child, c, c, initial)
  return results.sort((a, b) => a.dist - b.dist)
}
```

This prunes huge chunks of the search space; in practice only a small fraction of the trie is visited — much better than exhaustive enumeration.
BK-Trees: Fuzzy Search as a Metric Space Problem
A BK-tree (Burkhard–Keller, 1973) is a tree built for metric spaces — any space where a distance function satisfies the triangle inequality. Levenshtein distance qualifies.
Building: insert the first word as the root. Every subsequent word w is placed under the child whose edge weight equals d(w, node.word), creating that child if it doesn't exist. Recurse.
Searching: to find all words within distance k of a query q, compute d = d(q, node.word) at each node. By the triangle inequality, any matching word must lie under edges with weights in [d − k, d + k]. Prune all children outside that range.
```javascript
class BKTree {
  constructor() { this.root = null }

  insert(word) {
    if (!this.root) { this.root = { word, children: {} }; return }
    let node = this.root
    while (true) {
      const d = levenshtein(word, node.word)
      if (d === 0) return // duplicate
      if (!node.children[d]) { node.children[d] = { word, children: {} }; return }
      node = node.children[d]
    }
  }

  search(query, maxDist) {
    const results = []
    const stack = this.root ? [this.root] : []
    while (stack.length) {
      const node = stack.pop()
      const d = levenshtein(query, node.word)
      if (d <= maxDist) results.push({ word: node.word, dist: d })
      for (const [edge, child] of Object.entries(node.children)) {
        if (Math.abs(d - Number(edge)) <= maxDist)
          stack.push(child)
      }
    }
    return results
  }
}
```

BK-trees work well for spell checkers — a static dictionary that never changes. For a dynamic index with millions of documents, inverted indexes (see below) are preferred.
The Memory Problem
What if you tried to pre-expand the index — storing every word together with all of its typo variants up to distance k?
For a word of length n over a 26-letter alphabet, the number of single-edit operations is:
n deletions + 25n substitutions + 26(n + 1) insertions = 52n + 26
For n = 5 that is 286 variants per word (a handful of the resulting strings coincide). At k = 2 the count grows roughly quadratically; at k = 3 it explodes into the millions. The playground above measured this exactly for a 30-word vocabulary:
The bars show why no production system pre-expands — the exponential memory blowup is the entire reason Lucene uses on-the-fly Levenshtein automata instead.
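As a sanity check, we can enumerate every single-edit operation directly — n deletions, 25n substitutions, and 26(n + 1) insertions give 286 operations for a 5-letter word (a few of the resulting strings coincide, so the set of distinct variants is slightly smaller). The helper name `edits1` echoes Norvig's spelling corrector:

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

// Every single-edit operation applied to `word`:
// n deletions + 25n substitutions + 26(n+1) insertions = 52n + 26.
function edits1(word) {
  const out = []
  for (let i = 0; i < word.length; i++)
    out.push(word.slice(0, i) + word.slice(i + 1))          // delete word[i]
  for (let i = 0; i < word.length; i++)
    for (const c of ALPHABET)
      if (c !== word[i])
        out.push(word.slice(0, i) + c + word.slice(i + 1))  // substitute
  for (let i = 0; i <= word.length; i++)
    for (const c of ALPHABET)
      out.push(word.slice(0, i) + c + word.slice(i))        // insert
  return out
}
```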
How Lucene and Elasticsearch Do It
The Inverted Index
Lucene's foundation is the inverted index: a map from each term to the list of documents containing it.
```
"search"   → [doc1, doc5, doc12, …]
"searcher" → [doc3, doc9, …]
"fuzzy"    → [doc2, doc7, …]
```

During indexing, text is analysed (tokenised → lowercased → stemmed → stop words removed) and each resulting token is added to the inverted index. The index is stored as an FST (Finite State Transducer) — essentially a compressed, immutable trie that maps terms to their posting lists.
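A toy version of that pipeline fits in a few lines — the analyser, stop-word list, and suffix-stripping "stemmer" here are deliberately crude stand-ins for Lucene's configurable analysis chain:

```javascript
// Toy analysis chain + inverted index (illustrative only).
const STOP_WORDS = new Set(['the', 'a', 'and', 'is', 'of'])

function analyse(text) {
  return text.toLowerCase()
    .split(/[^a-z]+/)
    .filter(t => t && !STOP_WORDS.has(t))
    .map(t => t.replace(/(ing|ed|s)$/, '')) // crude stand-in for a stemmer
}

function buildIndex(docs) {
  const index = new Map() // term → Set of doc ids
  docs.forEach((text, id) => {
    for (const term of analyse(text)) {
      if (!index.has(term)) index.set(term, new Set())
      index.get(term).add(id)
    }
  })
  return index
}
```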
Levenshtein Automata (Lucene's Fuzzy Secret)
When you issue a fuzzy query like fuzzy:"serch"~1, Lucene does not compute edit distance against every term. Instead it builds a Levenshtein automaton — a DFA that accepts exactly the set of strings within edit distance k of your query.
Constructing this DFA is a one-time, query-side operation. Lucene then intersects the DFA with the FST term dictionary. The intersection enumerates only the matching terms, without touching the rest. This is O(matches · |q|) — fantastically fast.
```
Query "serch", k=1
  ↓ Build Levenshtein DFA          ←─── O(|q|² · k)
  ↓ Intersect with FST term dict   ←─── O(matches · |q|)
  ↓ Retrieve posting lists for matched terms
  ↓ Score and rank documents
```

Lucene's FuzzyQuery caps maxEdits at 2 — for good reason. At k ≥ 3 the automaton intersects too many terms and performance degrades rapidly. Elasticsearch exposes this as fuzziness: "AUTO", which applies k = 0 for words of length ≤ 2, k = 1 for length 3–5, and k = 2 for longer words.
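AUTO's length thresholds are easy to state in code — a sketch, with `autoFuzziness` as an illustrative name rather than anything Elasticsearch exposes:

```javascript
// Elasticsearch's AUTO fuzziness as a function of term length.
function autoFuzziness(term) {
  if (term.length <= 2) return 0 // too short to tolerate edits
  if (term.length <= 5) return 1
  return 2
}
```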
Elasticsearch Fuzzy Queries
```json
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "serch",
        "fuzziness": "AUTO",
        "prefix_length": 1,
        "max_expansions": 50
      }
    }
  }
}
```

Key parameters:
| Parameter | Effect |
|---|---|
| `fuzziness` | `AUTO`, `0`, `1`, or `2` — max edit distance |
| `prefix_length` | Characters that must match exactly at the start. Higher = faster, less recall |
| `max_expansions` | Cap on how many terms the automaton expands to |
| `transpositions` | Whether to count swaps as 1 edit (Damerau–Levenshtein) |
`prefix_length: 1` is a common optimisation: require the first character to match exactly, which prunes ~96% of the term dictionary before the automaton even runs.
Inside Elasticsearch at Scale
Elasticsearch distributes the inverted index across shards. A fuzzy query fans out to every shard, each independently intersecting the Levenshtein automaton with its local FST. Results are then merged and ranked by relevance score (BM25) on the coordinating node.
```
Client
  │ fuzzy query
  ▼
Coordinating node
  ├── Shard 0 ── FST intersection ── partial hits
  ├── Shard 1 ── FST intersection ── partial hits
  └── Shard N ── FST intersection ── partial hits
  │ merge + rank (BM25)
  ▼
Top-k results
```

Because each shard's FST is immutable once written (Lucene segments are write-once), the whole index can be memory-mapped — the OS page cache does the caching automatically.
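The coordinating node's merge step is conceptually tiny — a sketch with stand-in scores rather than real BM25 values, and `mergeTopK` as a hypothetical name:

```javascript
// Scatter-gather merge: each shard returns its own scored hits;
// the coordinating node flattens, sorts, and keeps the global top-k.
function mergeTopK(shardResults, k) {
  return shardResults
    .flat()
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}
```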
Putting It All Together
A production fuzzy search pipeline looks like this:
- Indexing: tokenise → lowercase → stem → insert into trie / inverted index (FST)
- Query time: tokenise query → lowercase → (optionally) stem → build Levenshtein automaton → intersect with term FST → fetch posting lists → rank
- Tune fuzziness: `prefix_length` for speed, `fuzziness: AUTO` for recall, `max_expansions` to cap cost
The layering is key — stemming catches morphological variants for free (no edit distance needed), and fuzzy matching only has to handle genuine typos and OCR errors. Combining both gives you robust, user-forgiving search without blowing up your index size.
Further Reading
- Lucene source: `org.apache.lucene.search.FuzzyQuery` and `org.apache.lucene.util.automaton.LevenshteinAutomata`
- Schulz & Mihov (2002) — Fast String Correction with Levenshtein Automata (the paper behind Lucene's implementation)
- Elasticsearch documentation — Fuzzy query
- Norvig (2007) — How to Write a Spelling Corrector — a beautiful edit-enumeration approach in ~20 lines of Python
- Porter (1980) — An algorithm for suffix stripping — the original stemmer paper