How regular expressions compile to deterministic finite automata — with an interactive visualizer, a DFA-powered string generator, and a note on closing the loop with property-based testing.

·9 min read

Every time you call /[0-9]+/.test(s) something interesting happens before the first character is compared. The engine parses the pattern into a syntax tree, converts it to a state machine, and runs that machine forward over the input. The matching step is almost the boring part.

This post traces the full pipeline — regex → NFA → DFA — builds an interactive playground for it, then turns the machine around to generate strings that are guaranteed to satisfy the pattern.


What Is a Regular Expression, Formally?

A regular expression over an alphabet Σ\Sigma is defined inductively:

  • \emptyset matches nothing; ε\varepsilon matches the empty string
  • Any aΣa \in \Sigma matches the single character aa
  • If rr and ss are regex: rsrs (concatenation), rsr|s (alternation), rr^* (Kleene star) are also regex

Every regular expression generates a regular language — a set of strings. The whole zoo of extended syntax (+, ?, [a-z], \d, …) are just syntactic sugar for these three operations.

Regular expressions in the formal sense are strictly less expressive than the "regex" found in most programming languages. Features like backreferences (\1) or lookahead ((?=...)) can match non-regular languages. Those features can't be compiled to a DFA. This post covers the formal, DFA-compilable subset.


Step 1: Regex → NFA (Thompson's Construction)

The first step is Thompson's construction, named after Ken Thompson's 1968 paper that introduced it for use in ed. It converts a regex AST into a nondeterministic finite automaton (NFA).

An NFA is a tuple M=(Q,Σ,δ,q0,F)M = (Q, \Sigma, \delta, q_0, F) where δ:Q×(Σ{ε})2Q\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to 2^Q — unlike a DFA, it can have multiple outgoing transitions on the same symbol, and epsilon (ε\varepsilon) transitions that consume no input.

The construction is beautifully compositional. Each sub-expression becomes a fragment with a single start state and a single accept state:

Literal aa:

→ (s) ──a──▶ ((a))

Concatenation rsrs:

→ (s_r) ──···──▶ (a_r) ──ε──▶ (s_s) ──···──▶ ((a_s))

Alternation rsr|s:

         ┌──ε──▶ (s_r) ──···──▶ (a_r) ──ε──┐
→ (s) ───┤                                  ├──▶ ((a))
         └──ε──▶ (s_s) ──···──▶ (a_s) ──ε──┘

Kleene star rr^*:

         ┌──────────────ε──────────────┐
→ (s) ───┼──ε──▶ (s_r) ──···──▶ (a_r)─┼──ε──▶ ((a))
         └──────────────ε──────────────┘

The resulting NFA has at most O(n)O(n) states and O(n)O(n) transitions where nn is the length of the regex.

thompson.ts
// Each regex fragment has exactly one start and one accept state
interface NFAFrag { start: number; accept: number; transitions: [from, symbol, to][] }
 
function concat(l: NFAFrag, r: NFAFrag): NFAFrag {
  return {
    start: l.start,
    accept: r.accept,
    transitions: [...l.transitions, [l.accept, 'ε', r.start], ...r.transitions],
  }
}
 
function union(l: NFAFrag, r: NFAFrag): NFAFrag {
  const [s, a] = [newState(), newState()]
  return {
    start: s, accept: a,
    transitions: [
      [s, 'ε', l.start], [s, 'ε', r.start],
      ...l.transitions, [l.accept, 'ε', a],
      ...r.transitions, [r.accept, 'ε', a],
    ],
  }
}
 
function star(c: NFAFrag): NFAFrag {
  const [s, a] = [newState(), newState()]
  return {
    start: s, accept: a,
    transitions: [
      [s, 'ε', c.start], [s, 'ε', a],
      ...c.transitions,
      [c.accept, 'ε', c.start], [c.accept, 'ε', a],
    ],
  }
}

Step 2: NFA → DFA (Subset Construction)

An NFA is elegant to build but awkward to run — at each step you might be in multiple states simultaneously. The subset construction (also called the powerset construction) eliminates the nondeterminism.

The key idea: each DFA state is a set of NFA states — specifically the set of all NFA states the machine could be in after reading the input so far.

QDFA2QNFAQ_{\text{DFA}} \subseteq 2^{Q_{\text{NFA}}}

The algorithm:

  1. Start DFA state: ε-closure({q0NFA})\varepsilon\text{-closure}(\{q_0^{\text{NFA}}\}) — all NFA states reachable via ε\varepsilon from the start
  2. For each DFA state DD and each symbol aΣa \in \Sigma, compute:
δDFA(D,a)=ε-closure ⁣(qDδNFA(q,a))\delta_{\text{DFA}}(D, a) = \varepsilon\text{-closure}\!\left(\bigcup_{q \in D} \delta_{\text{NFA}}(q, a)\right)
  1. A DFA state is accepting if it contains at least one NFA accepting state
subset.ts
function epsilonClosure(nfa: NFA, states: Set<number>): Set<number> {
  const closure = new Set(states)
  const stack = [...states]
  while (stack.length) {
    const s = stack.pop()!
    for (const t of nfa.epsilonTransitions(s)) {
      if (!closure.has(t)) { closure.add(t); stack.push(t) }
    }
  }
  return closure
}
 
function subsetConstruct(nfa: NFA): DFA {
  const start = epsilonClosure(nfa, new Set([nfa.start]))
  const worklist = [start]
  const dfaStates = new Map<string, DFAState>()
 
  while (worklist.length) {
    const curr = worklist.pop()!
    const key = [...curr].sort().join(',')
    if (dfaStates.has(key)) continue
 
    const transitions = new Map<string, Set<number>>()
    for (const nfaState of curr) {
      for (const [sym, targets] of nfa.transitionsFrom(nfaState)) {
        if (sym === 'ε') continue
        const t = transitions.get(sym) ?? new Set()
        targets.forEach(x => t.add(x))
        transitions.set(sym, t)
      }
    }
 
    const dfaTrans = new Map<string, string>()
    for (const [sym, raw] of transitions) {
      const next = epsilonClosure(nfa, raw)
      const nextKey = [...next].sort().join(',')
      dfaTrans.set(sym, nextKey)
      worklist.push(next)
    }
 
    dfaStates.set(key, {
      nfaStates: curr,
      accepting: [...curr].some(s => s === nfa.accept),
      transitions: dfaTrans,
    })
  }
 
  return new DFA(dfaStates, key(start))
}
💡

In the worst case the subset construction produces 2n2^n DFA states from an nn-state NFA. In practice, for typical regex patterns the DFA is small. The NFA for (a|b)*abb has 11 states; the DFA has just 5. DFA minimisation (Hopcroft's algorithm) can further reduce the state count.

Why bother converting to a DFA? Because a DFA can be run in O(n)O(n) time for an input of length nn with no backtracking. Backtracking NFA engines (as used in most language runtimes) can exhibit exponential worst-case behaviour — the infamous catastrophic backtracking vulnerability. V8, for example, switched to a DFA-based engine for certain patterns in response to ReDoS attacks.


Interactive Playground

Try it yourself. The playground parses your regex, runs Thompson's construction, applies subset construction, then renders the resulting DFA. Type an input string to step through the automaton one character at a time.

Regex → DFA Playground

Type a regex to see its deterministic automaton

//

5 states · 10 transitions · alphabet: {a, b}

abababababq0q1q2q3q4

Double ring = accepting state · Arrow in = start state · q0 is always the start

Trace an input string

step 5 / 5
aabb
q4·✓ Accepted — all input consumed in accepting state

A few things to notice:

  • (a|b)*abb — the classic textbook example. Five states; any string ending with abb is accepted.
  • [0-9]+ — two states. The DFA immediately collapses all ten digit transitions because they all lead to the same NFA state set.
  • a*b* — three states. The interesting edge is the self-loop on b combined with no way to go back to reading a once you start reading b.

The double-ring marks an accepting state: if the machine finishes in that state the input is accepted. The incoming arrow marks the start state. States are labelled q0,q1,q_0, q_1, \ldots in the order they are discovered by subset construction — so q0q_0 is always the start.


Running Backwards: DFA as a Generator

We've used the DFA to check whether strings match the pattern. But the same machine can also produce strings that match it.

A DFA is just a directed graph with labelled edges. To generate a matching string:

  1. Start at q0q_0
  2. At each state, randomly pick an outgoing transition and follow it, appending the edge's symbol to the output
  3. If the current state is accepting, randomly decide whether to stop or continue
  4. If there are no outgoing transitions, stop (and backtrack if necessary)
generate.ts
function generate(dfa: DFA, maxLen = 20): string | null {
  for (let attempt = 0; attempt < 50; attempt++) {
    let state = dfa.start
    let result = ''
 
    for (let step = 0; step <= maxLen; step++) {
      if (dfa.isAccepting(state) && (Math.random() < 0.5 || step === maxLen)) {
        return result
      }
      const transitions = [...dfa.transitionsFrom(state)]
      if (transitions.length === 0) break
      const [symbol, next] = transitions[Math.floor(Math.random() * transitions.length)]
      result += symbol
      state = next
    }
 
    if (dfa.isAccepting(state)) return result
  }
  return null // retry budget exhausted
}

Regex → DFA Playground

Type a regex to see its deterministic automaton

//

5 states · 10 transitions · alphabet: {a, b}

abababababq0q1q2q3q4

Double ring = accepting state · Arrow in = start state · q0 is always the start

Generate matching strings

Click "Generate ×12" to produce sample strings and see the states each one visits

The "Generate ×12" button uses exactly this approach. Click it, then click any generated string to trace it through the automaton.

This is more useful than it might first appear:

  • Fuzzing: generate random inputs that do match a format, then test the downstream parser
  • Test data factories: generate(/ISO-8601 date/) gives you valid dates without hardcoding them
  • Negative test data: generate strings then modify one character; the result is likely to not match
  • Grammar exploration: generate from a complex regex to see what it actually accepts

Closing the Loop: Property-Based Testing

Here's a neat sanity check you can apply to any regex/generator pair.

The property: every string produced by generate(regex) must be accepted by regex.test(s).

sgenerate(r):r.test(s)=true\forall s \in \text{generate}(r) : r.\text{test}(s) = \texttt{true}

This is trivially true if the DFA is correct — but it's a useful regression test when you're building or modifying the DFA construction code. The full pipeline is:

property-test.ts
import fc from 'fast-check'
import { compile, generate } from './regex-dfa'
 
fc.assert(
  fc.property(
    fc.constantFrom('(a|b)*abb', '[0-9]+', 'a?b+c', '(hi|hello)'),
    (pattern) => {
      const dfa = compile(pattern)
      const re = new RegExp(`^(?:${pattern})$`)
 
      for (let i = 0; i < 100; i++) {
        const s = generate(dfa)
        if (s !== null && !re.test(s)) return false
      }
      return true
    }
  )
)

And the dual direction — strings the DFA rejects should be rejected by regex.test:

property-test-reject.ts
fc.assert(
  fc.property(
    fc.constantFrom('(a|b)*abb', '[0-9]+'),
    fc.string(),               // arbitrary strings
    (pattern, s) => {
      const dfa = compile(pattern)
      const re = new RegExp(`^(?:${pattern})$`)
      return dfaAccepts(dfa, s) === re.test(s)  // must agree
    }
  )
)

Running these two checks against a large sample of patterns and inputs gives you very high confidence that your DFA construction is correct. Libraries like fast-check (TypeScript), Hypothesis (Python), or QuickCheck (Haskell) make writing these shrinkable, reproducible property tests straightforward.

💡

Property-based testing shines here because there are infinitely many (pattern, input) combinations to test, and any hand-written test suite will miss edge cases. PBT explores the space systematically and, when it finds a failure, automatically shrinks it to the minimal failing example.


Summary

The pipeline from regex to DFA is three clean steps:

StepAlgorithmComplexity
Regex → NFAThompson's constructionO(n)O(n) states, O(n)O(n) transitions
NFA → DFASubset constructionO(2n)O(2^n) worst case, O(n)O(n) typical
DFA → matchTable-driven simulationO(m)O(m) for input of length mm

And the DFA is a two-way tool: run it forward to match, run it as a random walk to generate. The round-trip — generate → re-match — is a free correctness check that pairs naturally with property-based testing.

The next time you write a regex, you now know what the engine builds before it reads the first character.

Related Articles