How regular expressions compile to deterministic finite automata — with an interactive visualizer, a DFA-powered string generator, and a note on closing the loop with property-based testing.
Every time you call /[0-9]+/.test(s) something interesting happens before the first character is compared. The engine parses the pattern into a syntax tree, converts it to a state machine, and runs that machine forward over the input. The matching step is almost the boring part.
This post traces the full pipeline — regex → NFA → DFA — builds an interactive playground for it, then turns the machine around to generate strings that are guaranteed to satisfy the pattern.
What Is a Regular Expression, Formally?
A regular expression over an alphabet is defined inductively:
- matches nothing; matches the empty string
- Any matches the single character
- If and are regex: (concatenation), (alternation), (Kleene star) are also regex
Every regular expression generates a regular language — a set of strings. The whole zoo of extended syntax (+, ?, [a-z], \d, …) are just syntactic sugar for these three operations.
Regular expressions in the formal sense are strictly less expressive than the "regex" found in most programming languages. Features like backreferences (\1) or lookahead ((?=...)) can match non-regular languages. Those features can't be compiled to a DFA. This post covers the formal, DFA-compilable subset.
Step 1: Regex → NFA (Thompson's Construction)
The first step is Thompson's construction, named after Ken Thompson's 1968 paper that introduced it for use in ed. It converts a regex AST into a nondeterministic finite automaton (NFA).
An NFA is a tuple where — unlike a DFA, it can have multiple outgoing transitions on the same symbol, and epsilon () transitions that consume no input.
The construction is beautifully compositional. Each sub-expression becomes a fragment with a single start state and a single accept state:
Literal :
→ (s) ──a──▶ ((a))Concatenation :
→ (s_r) ──···──▶ (a_r) ──ε──▶ (s_s) ──···──▶ ((a_s))Alternation :
┌──ε──▶ (s_r) ──···──▶ (a_r) ──ε──┐
→ (s) ───┤ ├──▶ ((a))
└──ε──▶ (s_s) ──···──▶ (a_s) ──ε──┘Kleene star :
┌──────────────ε──────────────┐
→ (s) ───┼──ε──▶ (s_r) ──···──▶ (a_r)─┼──ε──▶ ((a))
└──────────────ε──────────────┘The resulting NFA has at most states and transitions where is the length of the regex.
// Each regex fragment has exactly one start and one accept state
interface NFAFrag { start: number; accept: number; transitions: [from, symbol, to][] }
function concat(l: NFAFrag, r: NFAFrag): NFAFrag {
return {
start: l.start,
accept: r.accept,
transitions: [...l.transitions, [l.accept, 'ε', r.start], ...r.transitions],
}
}
function union(l: NFAFrag, r: NFAFrag): NFAFrag {
const [s, a] = [newState(), newState()]
return {
start: s, accept: a,
transitions: [
[s, 'ε', l.start], [s, 'ε', r.start],
...l.transitions, [l.accept, 'ε', a],
...r.transitions, [r.accept, 'ε', a],
],
}
}
function star(c: NFAFrag): NFAFrag {
const [s, a] = [newState(), newState()]
return {
start: s, accept: a,
transitions: [
[s, 'ε', c.start], [s, 'ε', a],
...c.transitions,
[c.accept, 'ε', c.start], [c.accept, 'ε', a],
],
}
}Step 2: NFA → DFA (Subset Construction)
An NFA is elegant to build but awkward to run — at each step you might be in multiple states simultaneously. The subset construction (also called the powerset construction) eliminates the nondeterminism.
The key idea: each DFA state is a set of NFA states — specifically the set of all NFA states the machine could be in after reading the input so far.
The algorithm:
- Start DFA state: — all NFA states reachable via from the start
- For each DFA state and each symbol , compute:
- A DFA state is accepting if it contains at least one NFA accepting state
function epsilonClosure(nfa: NFA, states: Set<number>): Set<number> {
const closure = new Set(states)
const stack = [...states]
while (stack.length) {
const s = stack.pop()!
for (const t of nfa.epsilonTransitions(s)) {
if (!closure.has(t)) { closure.add(t); stack.push(t) }
}
}
return closure
}
function subsetConstruct(nfa: NFA): DFA {
const start = epsilonClosure(nfa, new Set([nfa.start]))
const worklist = [start]
const dfaStates = new Map<string, DFAState>()
while (worklist.length) {
const curr = worklist.pop()!
const key = [...curr].sort().join(',')
if (dfaStates.has(key)) continue
const transitions = new Map<string, Set<number>>()
for (const nfaState of curr) {
for (const [sym, targets] of nfa.transitionsFrom(nfaState)) {
if (sym === 'ε') continue
const t = transitions.get(sym) ?? new Set()
targets.forEach(x => t.add(x))
transitions.set(sym, t)
}
}
const dfaTrans = new Map<string, string>()
for (const [sym, raw] of transitions) {
const next = epsilonClosure(nfa, raw)
const nextKey = [...next].sort().join(',')
dfaTrans.set(sym, nextKey)
worklist.push(next)
}
dfaStates.set(key, {
nfaStates: curr,
accepting: [...curr].some(s => s === nfa.accept),
transitions: dfaTrans,
})
}
return new DFA(dfaStates, key(start))
}In the worst case the subset construction produces DFA states from an -state NFA. In practice, for typical regex patterns the DFA is small. The NFA for (a|b)*abb has 11 states; the DFA has just 5. DFA minimisation (Hopcroft's algorithm) can further reduce the state count.
Why bother converting to a DFA? Because a DFA can be run in time for an input of length with no backtracking. Backtracking NFA engines (as used in most language runtimes) can exhibit exponential worst-case behaviour — the infamous catastrophic backtracking vulnerability. V8, for example, switched to a DFA-based engine for certain patterns in response to ReDoS attacks.
Interactive Playground
Try it yourself. The playground parses your regex, runs Thompson's construction, applies subset construction, then renders the resulting DFA. Type an input string to step through the automaton one character at a time.
Regex → DFA Playground
Type a regex to see its deterministic automaton
5 states · 10 transitions · alphabet: {a, b}
Double ring = accepting state · Arrow in = start state · q0 is always the start
Trace an input string
A few things to notice:
(a|b)*abb— the classic textbook example. Five states; any string ending withabbis accepted.[0-9]+— two states. The DFA immediately collapses all ten digit transitions because they all lead to the same NFA state set.a*b*— three states. The interesting edge is the self-loop onbcombined with no way to go back to readingaonce you start readingb.
The double-ring marks an accepting state: if the machine finishes in that state the input is accepted. The incoming arrow marks the start state. States are labelled in the order they are discovered by subset construction — so is always the start.
Running Backwards: DFA as a Generator
We've used the DFA to check whether strings match the pattern. But the same machine can also produce strings that match it.
A DFA is just a directed graph with labelled edges. To generate a matching string:
- Start at
- At each state, randomly pick an outgoing transition and follow it, appending the edge's symbol to the output
- If the current state is accepting, randomly decide whether to stop or continue
- If there are no outgoing transitions, stop (and backtrack if necessary)
function generate(dfa: DFA, maxLen = 20): string | null {
for (let attempt = 0; attempt < 50; attempt++) {
let state = dfa.start
let result = ''
for (let step = 0; step <= maxLen; step++) {
if (dfa.isAccepting(state) && (Math.random() < 0.5 || step === maxLen)) {
return result
}
const transitions = [...dfa.transitionsFrom(state)]
if (transitions.length === 0) break
const [symbol, next] = transitions[Math.floor(Math.random() * transitions.length)]
result += symbol
state = next
}
if (dfa.isAccepting(state)) return result
}
return null // retry budget exhausted
}Regex → DFA Playground
Type a regex to see its deterministic automaton
5 states · 10 transitions · alphabet: {a, b}
Double ring = accepting state · Arrow in = start state · q0 is always the start
Generate matching strings
Click "Generate ×12" to produce sample strings and see the states each one visits
The "Generate ×12" button uses exactly this approach. Click it, then click any generated string to trace it through the automaton.
This is more useful than it might first appear:
- Fuzzing: generate random inputs that do match a format, then test the downstream parser
- Test data factories:
generate(/ISO-8601 date/)gives you valid dates without hardcoding them - Negative test data: generate strings then modify one character; the result is likely to not match
- Grammar exploration: generate from a complex regex to see what it actually accepts
Closing the Loop: Property-Based Testing
Here's a neat sanity check you can apply to any regex/generator pair.
The property: every string produced by generate(regex) must be accepted by regex.test(s).
This is trivially true if the DFA is correct — but it's a useful regression test when you're building or modifying the DFA construction code. The full pipeline is:
import fc from 'fast-check'
import { compile, generate } from './regex-dfa'
fc.assert(
fc.property(
fc.constantFrom('(a|b)*abb', '[0-9]+', 'a?b+c', '(hi|hello)'),
(pattern) => {
const dfa = compile(pattern)
const re = new RegExp(`^(?:${pattern})$`)
for (let i = 0; i < 100; i++) {
const s = generate(dfa)
if (s !== null && !re.test(s)) return false
}
return true
}
)
)And the dual direction — strings the DFA rejects should be rejected by regex.test:
fc.assert(
fc.property(
fc.constantFrom('(a|b)*abb', '[0-9]+'),
fc.string(), // arbitrary strings
(pattern, s) => {
const dfa = compile(pattern)
const re = new RegExp(`^(?:${pattern})$`)
return dfaAccepts(dfa, s) === re.test(s) // must agree
}
)
)Running these two checks against a large sample of patterns and inputs gives you very high confidence that your DFA construction is correct. Libraries like fast-check (TypeScript), Hypothesis (Python), or QuickCheck (Haskell) make writing these shrinkable, reproducible property tests straightforward.
Property-based testing shines here because there are infinitely many (pattern, input) combinations to test, and any hand-written test suite will miss edge cases. PBT explores the space systematically and, when it finds a failure, automatically shrinks it to the minimal failing example.
Summary
The pipeline from regex to DFA is three clean steps:
| Step | Algorithm | Complexity |
|---|---|---|
| Regex → NFA | Thompson's construction | states, transitions |
| NFA → DFA | Subset construction | worst case, typical |
| DFA → match | Table-driven simulation | for input of length |
And the DFA is a two-way tool: run it forward to match, run it as a random walk to generate. The round-trip — generate → re-match — is a free correctness check that pairs naturally with property-based testing.
The next time you write a regex, you now know what the engine builds before it reads the first character.
Related Articles
Fuzzy Search: Typos, Tries, and the Algorithms Behind Instant Results
A deep dive into how fuzzy search works — from edit distance and tries to stemming, BK-trees, and how Lucene/Elasticsearch find typo-tolerant matches at scale.
Perfect Hash Functions: Collision-Free by Design
A deep dive into perfect hash functions — what they are, why they matter, how to find one efficiently with parallel search, and how they compare to JavaScript's built-in Map.
Buckaroo: Conflict-Driven Dependency Resolution for C++
How Buckaroo's dependency resolver borrows ideas from SAT solvers — using conflict-driven clause learning to prune the search space and resolve C++ packages blazingly fast.