
Perfect Hash Functions: Collision-Free by Design

A deep dive into perfect hash functions — what they are, why they matter, how to find one efficiently with parallel search, and how they compare to JavaScript's built-in Map.


Every time you look something up in a hash map you are implicitly betting that the hash function will not place two keys in the same bucket. When it does — a collision — the map has to do extra work: it walks a linked list, probes the next slot, or rebuilds the bucket. For a general-purpose map that has to handle arbitrary keys, some collisions are inevitable.

But what if you know your key set in advance?

A perfect hash function maps a fixed set of keys to an array with zero collisions. Every lookup is a single computation followed by a single array access. No branching, no chaining, no amortised cost.

What Makes a Hash Function "Perfect"?

Given n distinct keys K = {k₁, k₂, …, kₙ} and a table of size m:

  • A perfect hash function satisfies h(kᵢ) ≠ h(kⱼ) for all i ≠ j. Collisions are structurally impossible.
  • A minimal perfect hash function (MPHF) additionally requires m = n — the table has exactly as many slots as keys. Not a single slot is wasted.

| Type | Table size | Load factor | Search effort |
| --- | --- | --- | --- |
| Ordinary hash map | m > n | < 100 % | Collision resolution needed |
| Perfect hash | m ≥ n | ≤ 100 % | O(1) worst-case |
| Minimal perfect hash | m = n | 100 % | O(1) + zero waste |

Perfect hashing only applies to static key sets. If keys are added or removed after the hash function is built, you need to rebuild from scratch. This is the main practical limitation.
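The two definitions translate directly into a checker. A minimal sketch (the `isPerfect` and `isMinimal` helpers are illustrative names, not from any library):

```typescript
// Check whether `h` is perfect (injective) over `keys` for a table of size `m`
function isPerfect(keys: string[], h: (k: string) => number, m: number): boolean {
  const slots = new Set(keys.map(h))
  // Perfect: no two keys share a slot, and every slot is a valid index
  return slots.size === keys.length && Array.from(slots).every(s => s >= 0 && s < m)
}

// Minimal: additionally the table has exactly as many slots as keys
function isMinimal(keys: string[], h: (k: string) => number, m: number): boolean {
  return isPerfect(keys, h, m) && m === keys.length
}

// Mapping 'a'..'d' to 0..3 by char code is trivially a minimal perfect hash
const keys = ['a', 'b', 'c', 'd']
const h = (k: string) => k.charCodeAt(0) - 97
console.log(isPerfect(keys, h, 4), isMinimal(keys, h, 4))
```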

The Birthday Problem and Load Factor

Before worrying about algorithms, intuition helps. Suppose you pick a random hash function and ask: what is the probability it produces no collisions for our n keys in a table of size m?

This is the Birthday Problem in disguise. The probability of a collision-free assignment is approximately:

P(no collision) ≈ e^(-n(n-1)/(2m))

So the expected number of random seeds you have to try before finding a perfect one is:

E[seeds] ≈ e^(n(n-1)/(2m))

For n = 10 words this works out to:

| Table size m | Load n/m | Expected seeds |
| --- | --- | --- |
| 10 (minimal) | 100 % | ~90 |
| 13 | 77 % | ~32 |
| 16 | 63 % | ~17 |
| 20 | 50 % | ~9 |
| 32 | 31 % | ~4 |
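Evaluating the approximation directly shows how steeply the expected search cost falls as the table grows (a throwaway helper, not part of the playground):

```typescript
// Expected random seeds before a collision-free one: e^(n(n-1)/(2m))
function expectedSeeds(n: number, m: number): number {
  return Math.exp((n * (n - 1)) / (2 * m))
}

// n = 10 keys at various table sizes
for (const m of [10, 13, 16, 20, 32]) {
  console.log(`m=${m}: ~${Math.round(expectedSeeds(10, m))} seeds`)
}
```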

The key insight: a slightly larger table dramatically reduces the search effort. This is the trade-off the playground below lets you explore.

A Simple Parameterised Hash

For a fixed key set the cheapest approach is trial and error over a family of hash functions. We pick a family parameterised by a single integer seed and test seeds one by one until we find a collision-free one.

perfect-hash.ts
function hash(word: string, seed: number, tableSize: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % tableSize
}
 
function findPerfectHash(words: string[], tableSize: number): number | null {
  for (let seed = 0; seed < 200_000; seed++) {
    const slots = new Set<number>()
    let ok = true
    for (const w of words) {
      const slot = hash(w, seed, tableSize)
      if (slots.has(slot)) { ok = false; break }
      slots.add(slot)
    }
    if (ok) return seed
  }
  return null   // no seed found in range
}

Math.imul performs true 32-bit integer multiplication, avoiding the precision loss a plain * would suffer once intermediate products exceed 2^53. The & 0x7fffffff mask keeps h non-negative, so the final % tableSize always yields a valid index. The seed is the starting accumulator value, so every seed traces a completely different trajectory through the key space.
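Putting the two functions to work on a concrete word list (both functions are repeated from above so the snippet is self-contained; the word list is just an example):

```typescript
function hash(word: string, seed: number, tableSize: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % tableSize
}

function findPerfectHash(words: string[], tableSize: number): number | null {
  for (let seed = 0; seed < 200_000; seed++) {
    // The seed is perfect when every word maps to a distinct slot
    const slots = new Set(words.map(w => hash(w, seed, tableSize)))
    if (slots.size === words.length) return seed
  }
  return null
}

const words = ['if', 'else', 'while', 'for', 'return', 'break', 'continue', 'function']
const seed = findPerfectHash(words, 16)  // 8 keys, 16 slots: load 50 %
if (seed !== null) {
  console.log('seed', seed, '→ slots', words.map(w => hash(w, seed, 16)))
}
```

With 8 keys in 16 slots the expected number of seeds is only e^(8·7/32) ≈ 6, so the loop returns almost immediately.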

Parallelising the Search with Web Workers

The seed search is embarrassingly parallel: each seed is independent. We can divide the search space across multiple Web Workers and race them — whichever worker finds a valid seed first wins, and the rest are terminated.

use-perfect-hash.ts
const NUM_WORKERS = 4
const SEEDS_PER_WORKER = 50_000
 
function searchParallel(words: string[], tableSize: number) {
  let found = false
 
  const workers = Array.from({ length: NUM_WORKERS }, (_, i) => {
    const worker = new Worker('/workers/perfect-hash-worker.js')
    worker.postMessage({
      words,
      tableSize,
      startSeed: i * SEEDS_PER_WORKER,
      endSeed:   i * SEEDS_PER_WORKER + SEEDS_PER_WORKER - 1,
      workerId:  i,
    })
    worker.onmessage = (e) => {
      if (e.data.type === 'found' && !found) {
        found = true
        workers.forEach(w => w.terminate())  // stop all other workers
        console.log('seed =', e.data.seed, 'found in', e.data.seedsTried, 'tries')
      }
    }
    return worker
  })
}

Each worker independently searches its slice of the seed space (0–49 999, 50 000–99 999, …). For word sets with a generous table size the first valid seed is typically in the low hundreds, so Worker 1 wins immediately and the others barely start. With a tight table (load ≈ 100 %) the search spreads across workers and the parallel speed-up becomes visible.
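The article only shows the main thread; the worker side is symmetric. A sketch of what /workers/perfect-hash-worker.js might contain (the message shape mirrors the postMessage call above; searchRange is an illustrative name, and incremental progress reporting is omitted):

```typescript
// Pure search over one slice of the seed space: this is the worker's entire job
function searchRange(
  words: string[], tableSize: number, startSeed: number, endSeed: number,
): { seed: number; seedsTried: number } | null {
  for (let seed = startSeed; seed <= endSeed; seed++) {
    const slots = new Set<number>()
    let ok = true
    for (const w of words) {
      let h = seed
      for (let i = 0; i < w.length; i++) {
        h = (Math.imul(h, 31) + w.charCodeAt(i)) & 0x7fffffff
      }
      const slot = h % tableSize
      if (slots.has(slot)) { ok = false; break }
      slots.add(slot)
    }
    if (ok) return { seed, seedsTried: seed - startSeed + 1 }
  }
  return null
}

// Inside an actual Web Worker, wire the search to the message protocol
declare const self: any  // defined in a worker context, absent under Node
if (typeof self !== 'undefined' && typeof self.postMessage === 'function') {
  self.onmessage = (e: { data: { words: string[]; tableSize: number; startSeed: number; endSeed: number } }) => {
    const { words, tableSize, startSeed, endSeed } = e.data
    const result = searchRange(words, tableSize, startSeed, endSeed)
    self.postMessage(result ? { type: 'found', ...result } : { type: 'exhausted' })
  }
}
```

Note the 'exhausted' message: the main-thread handler above only listens for 'found', so a production version would also count exhausted workers to detect the no-seed case.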

💡 Watch the progress bars in the playground. When the table is large, Worker 1 fires almost instantly and cancels the rest. Shrink the table to minimum size and you will see all four workers grind through seeds in unison.

Interactive: Build Your Own Perfect Hash

Enter your own word list, drag the table size, and watch the parallel workers race.

Perfect Hash Playground

4 web workers search in parallel — each covers a different seed range

[Interactive widget: a slider sets the array size from 10 (minimal) to 80 slots (default 16 → load 63 %, ~17 expected seeds), with live progress bars per worker. Hash shown: h = seed; h = (h×31 + c) & 0x7fffffff]

Using the Hash at Runtime

Once the seed is known, lookups are two lines:

lookup.ts
// Build the table once (at init time)
const TABLE: (string | null)[] = new Array(TABLE_SIZE).fill(null)
for (const word of WORDS) {
  TABLE[hash(word, SEED, TABLE_SIZE)] = word
}
 
// Lookup — O(1), zero branches, no collision path
function has(word: string): boolean {
  return TABLE[hash(word, SEED, TABLE_SIZE)] === word
}

The has function always executes in constant time regardless of table occupancy.
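The same layout extends from membership testing to a key → value map: keep a parallel array of payloads and index both with the same hash. A self-contained sketch (the keyword → token-id mapping is invented for illustration):

```typescript
function hash(word: string, seed: number, tableSize: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % tableSize
}

const KEYWORDS = ['if', 'else', 'while', 'return']
const TOKEN_IDS = [1, 2, 3, 4]  // invented payloads, parallel to KEYWORDS
const SIZE = 8

// Find a seed inline; in practice this happens once, at build time
let SEED = 0
while (new Set(KEYWORDS.map(w => hash(w, SEED, SIZE))).size < KEYWORDS.length) SEED++

// Two parallel tables: the key for verification, the value as payload
const KEY_TABLE: (string | null)[] = new Array(SIZE).fill(null)
const VAL_TABLE: number[] = new Array(SIZE).fill(-1)
KEYWORDS.forEach((w, i) => {
  const slot = hash(w, SEED, SIZE)
  KEY_TABLE[slot] = w
  VAL_TABLE[slot] = TOKEN_IDS[i]
})

// O(1) lookup: returns the token id, or -1 for non-keywords
function tokenId(word: string): number {
  const slot = hash(word, SEED, SIZE)
  return KEY_TABLE[slot] === word ? VAL_TABLE[slot] : -1
}
```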

Real-World Use Cases

Compilers and interpreters are the canonical home for perfect hashing. Every language has a fixed set of reserved keywords (if, while, return, …). The lexer must classify each identifier token as a keyword or not. A perfect hash produces a slot index in a single pass over the characters — no equality tests after the first one.

GCC uses a minimal perfect hash generated by gperf for its C/C++ keyword table, and gperf-generated tables show up in many other lexers and command-line tools.

Network packet processing uses perfect hashing for protocol field lookup. The set of valid field names in a binary protocol is fixed at compile time; a perfect hash lets the parser dispatch without a chain of if/else.

Embedded systems with severe RAM constraints prefer MPHFs. A collision-free table with no empty slots wastes no memory — important on microcontrollers with 8 KB of RAM.

Static configuration tables — country codes, MIME types, HTTP header names, Unicode category tables — are all fixed sets where a perfect hash can eliminate runtime overhead entirely.

Retrieval Speed: Perfect Hash vs JavaScript Map

V8's Map is impressively fast for general use — it uses inline caches and type specialisation to approach O(1) amortised. But a perfect hash is structurally simpler: one hash computation, one array read. No bucket chains, no rehashing, no load factor checks.

The benchmark in the playground runs one million lookups of each type and reports nanoseconds per operation. Results vary by browser and hardware, but the story is consistent:

  • For small key sets (< 20 keys) V8's Map is surprisingly competitive — its JIT has specialised the Map.get call so thoroughly that the overhead is near zero.
  • For medium key sets (20–200 keys) the perfect hash typically wins by 1.5–3 ×.
  • For large static tables (thousands of keys) the perfect hash advantage is most consistent because Map's cache locality suffers as the bucket array grows.

Micro-benchmarks in JavaScript are noisy. JIT warm-up, GC pauses, and CPU branch-prediction state all affect results. Run the benchmark several times and treat the ratio as an order-of-magnitude guide, not a precise measurement.
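A stripped-down version of such a benchmark can run in Node or the browser (a sketch; performance.now is assumed to be available globally, and absolute numbers are environment-dependent):

```typescript
function hash(word: string, seed: number, size: number): number {
  let h = seed
  for (let i = 0; i < word.length; i++) {
    h = (Math.imul(h, 31) + word.charCodeAt(i)) & 0x7fffffff
  }
  return h % size
}

const WORDS = ['if', 'else', 'while', 'for', 'return', 'break', 'continue', 'switch', 'case', 'default']
const SIZE = 32

let SEED = 0
while (new Set(WORDS.map(w => hash(w, SEED, SIZE))).size < WORDS.length) SEED++

const TABLE: (string | null)[] = new Array(SIZE).fill(null)
for (const w of WORDS) TABLE[hash(w, SEED, SIZE)] = w
const MAP = new Map<string, boolean>()
for (const w of WORDS) MAP.set(w, true)

// Time `iters` lookups and report nanoseconds per operation
function bench(lookup: (w: string) => boolean, iters: number): number {
  let hits = 0
  const t0 = performance.now()
  for (let i = 0; i < iters; i++) hits += lookup(WORDS[i % WORDS.length]) ? 1 : 0
  const elapsedMs = performance.now() - t0
  if (hits !== iters) throw new Error('a lookup failed')  // keeps the loop from being optimised away
  return (elapsedMs * 1e6) / iters
}

console.log('perfect hash:', bench(w => TABLE[hash(w, SEED, SIZE)] === w, 1_000_000).toFixed(1), 'ns/op')
console.log('Map.get:     ', bench(w => MAP.get(w) === true, 1_000_000).toFixed(1), 'ns/op')
```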

Limitations

Static keys only. Any change to the key set requires rebuilding the hash function. If your dictionary grows at runtime, a perfect hash is the wrong tool.

Build time cost. Finding the seed takes between microseconds (generous table) and milliseconds (tight table). This is amortised over all lookups, so it is worthwhile for hot paths called millions of times.

Seed range. The naïve search can fail to find a seed if the table is too small or the search range is too narrow. Production tools like gperf and cmph use more sophisticated constructions (CHD, BDZ, FCH) that guarantee a solution for any key set.

Further Reading

  • Botelho, Pagh & Ziviani — "Simple and Space-Efficient Minimal Perfect Hash Functions" (2007) — the BDZ construction implemented in modern MPHF libraries
  • GNU gperf documentation — battle-tested tool for generating C/C++ keyword tables
  • Knuth, The Art of Computer Programming Vol. 3, §6.4 — the theoretical foundations of hashing
  • @typespec/compiler source — example of a TypeScript project that uses pre-computed keyword tables
