A deep dive into perfect hash functions — what they are, why they matter, how to find one efficiently with parallel search, and how they compare to JavaScript's built-in Map.
Every time you look something up in a hash map you are implicitly betting that the hash function will not place two keys in the same bucket. When it does — a collision — the map has to do extra work: it walks a linked list, probes the next slot, or rebuilds the bucket. For a general-purpose map that has to handle arbitrary keys, some collisions are inevitable.
But what if you know your key set in advance?
A perfect hash function maps a fixed set of keys to an array with zero collisions. Every lookup is a single computation followed by a single array access. No branching, no chaining, no amortised cost.
What Makes a Hash Function "Perfect"?
Given n distinct keys and a table of size m:
- A perfect hash function h satisfies h(k_i) ≠ h(k_j) for all i ≠ j. Collisions are structurally impossible.
- A minimal perfect hash function (MPHF) additionally requires m = n — the table has exactly as many slots as keys. Not a single slot is wasted.
| Type | Table size | Load factor | Lookup |
|---|---|---|---|
| Ordinary hash map | any | ≤ 100 % | Collision resolution needed |
| Perfect hash | m ≥ n | ≤ 100 % | O(1) worst-case |
| Minimal perfect hash | m = n | 100 % | O(1) + zero waste |
Perfect hashing only applies to static key sets. If keys are added or removed after the hash function is built, you need to rebuild from scratch. This is the main practical limitation.
The Birthday Problem and Load Factor
Before worrying about algorithms, intuition helps. Suppose you pick a random hash function and ask: what is the probability that it produces no collisions for our n keys in a table of size m?

This is the Birthday Problem in disguise. The probability of a collision-free assignment is approximately:

P ≈ (1 − 1/m)(1 − 2/m) ⋯ (1 − (n−1)/m) ≈ e^(−n(n−1)/2m)

So the expected number of random seeds you have to try before finding a perfect one is:

E[seeds] = 1/P ≈ e^(n(n−1)/2m)

For n = 10 words this works out to (roughly):
| Table size | Load | Expected seeds |
|---|---|---|
| 10 (minimal) | 100 % | ~100 |
| 13 | 77 % | ~28 |
| 16 | 63 % | ~11 |
| 20 | 50 % | ~5 |
| 32 | 31 % | ~2 |
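These numbers can be sketched directly from the birthday-problem formulas above. The snippet below contrasts the exact product with the exponential approximation — the approximation is optimistic near 100 % load, so the minimal case is in reality even harder than the table suggests:

```typescript
// Probability that a random hash maps n keys into m slots with no collision
// (exact birthday-problem product), and the expected number of seeds to try.
function collisionFreeProbability(n: number, m: number): number {
  let p = 1
  for (let i = 1; i < n; i++) p *= 1 - i / m
  return p
}

function expectedSeeds(n: number, m: number): number {
  return 1 / collisionFreeProbability(n, m)
}

// Exponential approximation e^(n(n−1)/2m): good at low load, optimistic near 100 %.
function expectedSeedsApprox(n: number, m: number): number {
  return Math.exp((n * (n - 1)) / (2 * m))
}

for (const m of [10, 13, 16, 20, 32]) {
  console.log(m, expectedSeeds(10, m).toFixed(1), expectedSeedsApprox(10, m).toFixed(1))
}
```

Note that doubling m takes the square root of the approximate expected effort — which is why a slightly larger table pays off so dramatically.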
The key insight: a slightly larger table dramatically reduces the search effort. This is the trade-off the playground below lets you explore.
A Simple Parameterised Hash
For a fixed key set the cheapest approach is trial and error over a family of hash functions. We pick a family parameterised by a single integer seed and test seeds one by one until we find a collision-free one.
```typescript
function hash(word: string, seed: number, tableSize: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % tableSize
}

function findPerfectHash(words: string[], tableSize: number): number | null {
  for (let seed = 0; seed < 200_000; seed++) {
    const slots = new Set<number>()
    let ok = true
    for (const w of words) {
      const slot = hash(w, seed, tableSize)
      if (slots.has(slot)) { ok = false; break }
      slots.add(slot)
    }
    if (ok) return seed
  }
  return null // no seed found in range
}
```

`Math.imul` gives us true 32-bit integer multiplication without losing precision to floating point (and without reaching for BigInt). The `& 0x7fffffff` mask keeps `h` positive. The seed is the starting accumulator value, so every seed traces a completely different trajectory through the key space.
Parallelising the Search with Web Workers
The seed search is embarrassingly parallel: each seed is independent. We can divide the search space across multiple Web Workers and race them — whichever worker finds a valid seed first wins, and the rest are terminated.
```typescript
const NUM_WORKERS = 4
const SEEDS_PER_WORKER = 50_000

function searchParallel(words: string[], tableSize: number) {
  let found = false
  const workers = Array.from({ length: NUM_WORKERS }, (_, i) => {
    const worker = new Worker('/workers/perfect-hash-worker.js')
    worker.onmessage = (e) => {
      if (e.data.type === 'found' && !found) {
        found = true
        workers.forEach(w => w.terminate()) // stop every worker, winner included
        console.log('seed =', e.data.seed, 'found in', e.data.seedsTried, 'tries')
      }
    }
    worker.postMessage({
      words,
      tableSize,
      startSeed: i * SEEDS_PER_WORKER,
      endSeed: i * SEEDS_PER_WORKER + SEEDS_PER_WORKER - 1,
      workerId: i,
    })
    return worker
  })
}
```

Each worker independently searches its slice of the seed space (0–49 999, 50 000–99 999, …). For word sets with a generous table size the first valid seed is typically in the low hundreds, so Worker 1 wins immediately and the others barely start. With a tight table (load ≈ 100 %) the search spreads across workers and the parallel speed-up becomes visible.
Watch the progress bars in the playground. When the table is large, Worker 1 fires almost instantly and cancels the rest. Shrink the table to minimum size and you will see all four workers grind through seeds in unison.
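The worker script itself (`perfect-hash-worker.js` above) is not shown in the article; here is a minimal sketch of what it might contain. The `searchRange` helper is an assumption of mine, mirroring the message fields posted from the main thread:

```typescript
// Hypothetical worker-side search: scan [startSeed, endSeed] and report the
// first seed that hashes every word to a distinct slot.
function hash(word: string, seed: number, tableSize: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % tableSize
}

function searchRange(
  words: string[], tableSize: number, startSeed: number, endSeed: number,
): { seed: number; seedsTried: number } | null {
  for (let seed = startSeed; seed <= endSeed; seed++) {
    const slots = new Set<number>()
    let ok = true
    for (const w of words) {
      const slot = hash(w, seed, tableSize)
      if (slots.has(slot)) { ok = false; break }
      slots.add(slot)
    }
    if (ok) return { seed, seedsTried: seed - startSeed + 1 }
  }
  return null
}

// Wiring inside the worker (runs only in a Worker context):
// self.onmessage = (e) => {
//   const { words, tableSize, startSeed, endSeed } = e.data
//   const result = searchRange(words, tableSize, startSeed, endSeed)
//   if (result) self.postMessage({ type: 'found', ...result })
// }
```

Keeping the search in a pure function and the `self.onmessage` wiring separate makes the hot loop testable outside a Worker context.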
Interactive: Build Your Own Perfect Hash
Enter your own word list, drag the table size, and watch the parallel workers race.
Using the Hash at Runtime
Once the seed is known, a lookup is a single expression:

```typescript
// Build the table once (at init time)
const TABLE: (string | null)[] = new Array(TABLE_SIZE).fill(null)
for (const word of WORDS) {
  TABLE[hash(word, SEED, TABLE_SIZE)] = word
}

// Lookup — O(1), zero branches, no collision path
function has(word: string): boolean {
  return TABLE[hash(word, SEED, TABLE_SIZE)] === word
}
```

The `has` function always executes in constant time regardless of table occupancy.
Real-World Use Cases
Compilers and interpreters are the canonical home for perfect hashing. Every language has a fixed set of reserved keywords (if, while, return, …). The lexer must classify each identifier token as a keyword or not. A perfect hash produces a slot index in a single pass over the characters — no equality tests after the first one.
GCC uses a minimal perfect hash generated by `gperf` for its C/C++ keyword table. So does CPython for Python keywords.
Network packet processing uses perfect hashing for protocol field lookup. The set of valid field names in a binary protocol is fixed at compile time; a perfect hash lets the parser dispatch without a chain of if/else.
Embedded systems with severe RAM constraints prefer MPHFs. A collision-free table with no empty slots wastes no memory — important on microcontrollers with 8 KB of RAM.
Static configuration tables — country codes, MIME types, HTTP header names, Unicode category tables — are all fixed sets where a perfect hash can eliminate runtime overhead entirely.
Retrieval Speed: Perfect Hash vs JavaScript Map
V8's Map is impressively fast for general use — it uses inline caches and type specialisation to approach O(1) amortised lookups. But a perfect hash is structurally simpler: one hash computation, one array read. No bucket chains, no rehashing, no load factor checks.
The benchmark in the playground runs one million lookups of each type and reports nanoseconds per operation. Results vary by browser and hardware, but the story is consistent:
- For small key sets (< 20 keys) V8's Map is surprisingly competitive — its JIT has specialised the `Map.get` call so thoroughly that the overhead is near zero.
- For medium key sets (20–200 keys) the perfect hash typically wins by 1.5–3 ×.
- For large static tables (thousands of keys) the perfect hash advantage is most consistent, because Map's cache locality suffers as the bucket array grows.
Micro-benchmarks in JavaScript are noisy. JIT warm-up, GC pauses, and CPU branch-prediction state all affect results. Run the benchmark several times and treat the ratio as an order-of-magnitude guide, not a precise measurement.
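A minimal sketch of such a micro-benchmark, reusing the hash and table-building code from earlier (the keyword list, table size, and iteration count are illustrative, and the timings it prints are only indicative):

```typescript
function hash(word: string, seed: number, tableSize: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % tableSize
}

const WORDS = ['if', 'else', 'for', 'while', 'return', 'break', 'continue', 'const']
const TABLE_SIZE = 16

// Find a seed by brute force (expected to succeed quickly for this small set).
let SEED = 0
outer: for (; ; SEED++) {
  const seen = new Set<number>()
  for (const w of WORDS) {
    const s = hash(w, SEED, TABLE_SIZE)
    if (seen.has(s)) continue outer
    seen.add(s)
  }
  break
}

const TABLE: (string | null)[] = new Array(TABLE_SIZE).fill(null)
for (const w of WORDS) TABLE[hash(w, SEED, TABLE_SIZE)] = w

const MAP = new Map(WORDS.map(w => [w, true]))
const probes = [...WORDS, 'function', 'class', 'let'] // mix of hits and misses

function bench(label: string, lookup: (w: string) => boolean) {
  const t0 = performance.now()
  let hits = 0
  for (let i = 0; i < 1_000_000; i++) {
    if (lookup(probes[i % probes.length])) hits++
  }
  console.log(label, (performance.now() - t0).toFixed(1), 'ms for 1M ops,', hits, 'hits')
}

bench('perfect hash', w => TABLE[hash(w, SEED, TABLE_SIZE)] === w)
bench('Map', w => MAP.has(w))
```

Both lookups return identical answers; only the time per operation differs, and by how much depends on the engine and hardware as noted above.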
Limitations
Static keys only. Any change to the key set requires rebuilding the hash function. If your dictionary grows at runtime, a perfect hash is the wrong tool.
Build time cost. Finding the seed takes between microseconds (generous table) and milliseconds (tight table). This is amortised over all lookups, so it is worthwhile for hot paths called millions of times.
Seed range. The naïve search can fail to find a seed if the table is too small or the search range is too narrow. Production tools like gperf and cmph use more sophisticated constructions (CHD, BDZ, FCH) that guarantee a solution for any key set.
Further Reading
- Botelho, Pagh & Ziviani — "Simple and Space-Efficient Minimal Perfect Hash Functions" (2007) — the BDZ construction used in many modern MPHF libraries
- GNU `gperf` documentation — battle-tested tool for generating C/C++ keyword tables
- Knuth, The Art of Computer Programming Vol. 3, §6.4 — the theoretical foundations of hashing
- `@typespec/compiler` source — example of a TypeScript project that uses pre-computed keyword tables