From Regex to Automata to Generators

How regular expressions compile to deterministic finite automata — with an interactive visualizer, a DFA-powered string generator, and a note on closing the loop with property-based testing.

February 28, 2026·9 min read

Every time you call /[0-9]+/.test(s) something interesting happens before the first character is compared. The engine parses the pattern into a syntax tree, converts it to a state machine, and runs that machine forward over the input. The matching step is almost the boring part.

This post traces the full pipeline — regex → NFA → DFA — builds an interactive playground for it, then turns the machine around to generate strings that are guaranteed to satisfy the pattern.

What Is a Regular Expression, Formally?

A regular expression over an alphabet $\Sigma$ is defined inductively:

$\emptyset$ matches nothing; $\varepsilon$ matches the empty string
Any $a \in \Sigma$ matches the single character $a$
If $r$ and $s$ are regex: $rs$ (concatenation), $r|s$ (alternation), $r^*$ (Kleene star) are also regex

Every regular expression generates a regular language — a set of strings. The whole zoo of extended syntax (+, ?, [a-z], \d, …) are just syntactic sugar for these three operations.

ℹ

Regular expressions in the formal sense are strictly less expressive than the "regex" found in most programming languages. Features like backreferences (\1) or lookahead ((?=...)) can match non-regular languages. Those features can't be compiled to a DFA. This post covers the formal, DFA-compilable subset.

Step 1: Regex → NFA (Thompson's Construction)

The first step is Thompson's construction, named after Ken Thompson's 1968 paper that introduced it for use in ed. It converts a regex AST into a nondeterministic finite automaton (NFA).

An NFA is a tuple $M = (Q, \Sigma, \delta, q_0, F)$ where $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to 2^Q$ — unlike a DFA, it can have multiple outgoing transitions on the same symbol, and epsilon ( $\varepsilon$ ) transitions that consume no input.

The construction is beautifully compositional. Each sub-expression becomes a fragment with a single start state and a single accept state:

Literal $a$ :

→ (s) ──a──▶ ((a))

Concatenation $rs$ :

→ (s_r) ──···──▶ (a_r) ──ε──▶ (s_s) ──···──▶ ((a_s))

Alternation $r|s$ :

         ┌──ε──▶ (s_r) ──···──▶ (a_r) ──ε──┐
→ (s) ───┤                                  ├──▶ ((a))
         └──ε──▶ (s_s) ──···──▶ (a_s) ──ε──┘

Kleene star $r^*$ :

         ┌──────────────ε──────────────┐
→ (s) ───┼──ε──▶ (s_r) ──···──▶ (a_r)─┼──ε──▶ ((a))
         └──────────────ε──────────────┘

The resulting NFA has at most $O(n)$ states and $O(n)$ transitions where $n$ is the length of the regex.

thompson.ts

// Each regex fragment has exactly one start and one accept state
interface NFAFrag { start: number; accept: number; transitions: [from, symbol, to][] }
 
function concat(l: NFAFrag, r: NFAFrag): NFAFrag {
  return {
    start: l.start,
    accept: r.accept,
    transitions: [...l.transitions, [l.accept, 'ε', r.start], ...r.transitions],
  }
}
 
function union(l: NFAFrag, r: NFAFrag): NFAFrag {
  const [s, a] = [newState(), newState()]
  return {
    start: s, accept: a,
    transitions: [
      [s, 'ε', l.start], [s, 'ε', r.start],
      ...l.transitions, [l.accept, 'ε', a],
      ...r.transitions, [r.accept, 'ε', a],
    ],
  }
}
 
function star(c: NFAFrag): NFAFrag {
  const [s, a] = [newState(), newState()]
  return {
    start: s, accept: a,
    transitions: [
      [s, 'ε', c.start], [s, 'ε', a],
      ...c.transitions,
      [c.accept, 'ε', c.start], [c.accept, 'ε', a],
    ],
  }
}

Step 2: NFA → DFA (Subset Construction)

An NFA is elegant to build but awkward to run — at each step you might be in multiple states simultaneously. The subset construction (also called the powerset construction) eliminates the nondeterminism.

The key idea: each DFA state is a set of NFA states — specifically the set of all NFA states the machine could be in after reading the input so far.

Q_{\text{DFA}} \subseteq 2^{Q_{\text{NFA}}}

The algorithm:

Start DFA state: $\varepsilon\text{-closure}(\{q_0^{\text{NFA}}\})$ — all NFA states reachable via $\varepsilon$ from the start
For each DFA state $D$ and each symbol $a \in \Sigma$ , compute:

\delta_{\text{DFA}}(D, a) = \varepsilon\text{-closure}\!\left(\bigcup_{q \in D} \delta_{\text{NFA}}(q, a)\right)

A DFA state is accepting if it contains at least one NFA accepting state

subset.ts

function epsilonClosure(nfa: NFA, states: Set<number>): Set<number> {
  const closure = new Set(states)
  const stack = [...states]
  while (stack.length) {
    const s = stack.pop()!
    for (const t of nfa.epsilonTransitions(s)) {
      if (!closure.has(t)) { closure.add(t); stack.push(t) }
    }
  }
  return closure
}
 
function subsetConstruct(nfa: NFA): DFA {
  const start = epsilonClosure(nfa, new Set([nfa.start]))
  const worklist = [start]
  const dfaStates = new Map<string, DFAState>()
 
  while (worklist.length) {
    const curr = worklist.pop()!
    const key = [...curr].sort().join(',')
    if (dfaStates.has(key)) continue
 
    const transitions = new Map<string, Set<number>>()
    for (const nfaState of curr) {
      for (const [sym, targets] of nfa.transitionsFrom(nfaState)) {
        if (sym === 'ε') continue
        const t = transitions.get(sym) ?? new Set()
        targets.forEach(x => t.add(x))
        transitions.set(sym, t)
      }
    }
 
    const dfaTrans = new Map<string, string>()
    for (const [sym, raw] of transitions) {
      const next = epsilonClosure(nfa, raw)
      const nextKey = [...next].sort().join(',')
      dfaTrans.set(sym, nextKey)
      worklist.push(next)
    }
 
    dfaStates.set(key, {
      nfaStates: curr,
      accepting: [...curr].some(s => s === nfa.accept),
      transitions: dfaTrans,
    })
  }
 
  return new DFA(dfaStates, key(start))
}

💡

In the worst case the subset construction produces $2^n$ DFA states from an $n$ -state NFA. In practice, for typical regex patterns the DFA is small. The NFA for (a|b)*abb has 11 states; the DFA has just 5. DFA minimisation (Hopcroft's algorithm) can further reduce the state count.

Why bother converting to a DFA? Because a DFA can be run in $O(n)$ time for an input of length $n$ with no backtracking. Backtracking NFA engines (as used in most language runtimes) can exhibit exponential worst-case behaviour — the infamous catastrophic backtracking vulnerability. V8, for example, switched to a DFA-based engine for certain patterns in response to ReDoS attacks.

Interactive Playground

Try it yourself. The playground parses your regex, runs Thompson's construction, applies subset construction, then renders the resulting DFA. Type an input string to step through the automaton one character at a time.

Regex → DFA Playground

Type a regex to see its deterministic automaton

Browser-only

5 states · 10 transitions · alphabet: {a, b}

Double ring = accepting state · Arrow in = start state · q0 is always the start

Trace an input string

step 5 / 5

aabb

q4·✓ Accepted — all input consumed in accepting state

A few things to notice:

(a|b)*abb — the classic textbook example. Five states; any string ending with abb is accepted.
[0-9]+ — two states. The DFA immediately collapses all ten digit transitions because they all lead to the same NFA state set.
a*b* — three states. The interesting edge is the self-loop on b combined with no way to go back to reading a once you start reading b.

The double-ring marks an accepting state: if the machine finishes in that state the input is accepted. The incoming arrow marks the start state. States are labelled $q_0, q_1, \ldots$ in the order they are discovered by subset construction — so $q_0$ is always the start.

Running Backwards: DFA as a Generator

We've used the DFA to check whether strings match the pattern. But the same machine can also produce strings that match it.

A DFA is just a directed graph with labelled edges. To generate a matching string:

Start at $q_0$
At each state, randomly pick an outgoing transition and follow it, appending the edge's symbol to the output
If the current state is accepting, randomly decide whether to stop or continue
If there are no outgoing transitions, stop (and backtrack if necessary)

generate.ts

function generate(dfa: DFA, maxLen = 20): string | null {
  for (let attempt = 0; attempt < 50; attempt++) {
    let state = dfa.start
    let result = ''
 
    for (let step = 0; step <= maxLen; step++) {
      if (dfa.isAccepting(state) && (Math.random() < 0.5 || step === maxLen)) {
        return result
      }
      const transitions = [...dfa.transitionsFrom(state)]
      if (transitions.length === 0) break
      const [symbol, next] = transitions[Math.floor(Math.random() * transitions.length)]
      result += symbol
      state = next
    }
 
    if (dfa.isAccepting(state)) return result
  }
  return null // retry budget exhausted
}

Regex → DFA Playground

Type a regex to see its deterministic automaton

Browser-only

5 states · 10 transitions · alphabet: {a, b}

Double ring = accepting state · Arrow in = start state · q0 is always the start

Generate matching strings

Click "Generate ×12" to produce sample strings and see the states each one visits

The "Generate ×12" button uses exactly this approach. Click it, then click any generated string to trace it through the automaton.

This is more useful than it might first appear:

Fuzzing: generate random inputs that do match a format, then test the downstream parser
Test data factories: generate(/ISO-8601 date/) gives you valid dates without hardcoding them
Negative test data: generate strings then modify one character; the result is likely to not match
Grammar exploration: generate from a complex regex to see what it actually accepts

Closing the Loop: Property-Based Testing

Here's a neat sanity check you can apply to any regex/generator pair.

The property: every string produced by generate(regex) must be accepted by regex.test(s).

\forall s \in \text{generate}(r) : r.\text{test}(s) = \texttt{true}

This is trivially true if the DFA is correct — but it's a useful regression test when you're building or modifying the DFA construction code. The full pipeline is:

property-test.ts

import fc from 'fast-check'
import { compile, generate } from './regex-dfa'
 
fc.assert(
  fc.property(
    fc.constantFrom('(a|b)*abb', '[0-9]+', 'a?b+c', '(hi|hello)'),
    (pattern) => {
      const dfa = compile(pattern)
      const re = new RegExp(`^(?:${pattern})$`)
 
      for (let i = 0; i < 100; i++) {
        const s = generate(dfa)
        if (s !== null && !re.test(s)) return false
      }
      return true
    }
  )
)

And the dual direction — strings the DFA rejects should be rejected by regex.test:

property-test-reject.ts

fc.assert(
  fc.property(
    fc.constantFrom('(a|b)*abb', '[0-9]+'),
    fc.string(),               // arbitrary strings
    (pattern, s) => {
      const dfa = compile(pattern)
      const re = new RegExp(`^(?:${pattern})$`)
      return dfaAccepts(dfa, s) === re.test(s)  // must agree
    }
  )
)

Running these two checks against a large sample of patterns and inputs gives you very high confidence that your DFA construction is correct. Libraries like fast-check (TypeScript), Hypothesis (Python), or QuickCheck (Haskell) make writing these shrinkable, reproducible property tests straightforward.

💡

Property-based testing shines here because there are infinitely many (pattern, input) combinations to test, and any hand-written test suite will miss edge cases. PBT explores the space systematically and, when it finds a failure, automatically shrinks it to the minimal failing example.

Summary

The pipeline from regex to DFA is three clean steps:

Step	Algorithm	Complexity
Regex → NFA	Thompson's construction	$O(n)$ states, $O(n)$ transitions
NFA → DFA	Subset construction	$O(2^n)$ worst case, $O(n)$ typical
DFA → match	Table-driven simulation	$O(m)$ for input of length $m$

And the DFA is a two-way tool: run it forward to match, run it as a random walk to generate. The round-trip — generate → re-match — is a free correctness check that pairs naturally with property-based testing.

The next time you write a regex, you now know what the engine builds before it reads the first character.

◈

Feb 28, 2026·11 min read

Fuzzy Search: Typos, Tries, and the Algorithms Behind Instant Results

A deep dive into how fuzzy search works — from edit distance and tries to stemming, BK-trees, and how Lucene/Elasticsearch find typo-tolerant matches at scale.

algorithms data-structures javascript

Feb 28, 2026·8 min read

Perfect Hash Functions: Collision-Free by Design

A deep dive into perfect hash functions — what they are, why they matter, how to find one efficiently with parallel search, and how they compare to JavaScript's built-in Map.

algorithms data-structures typescript

◆

Mar 1, 2019·6 min read

Buckaroo: Conflict-Driven Dependency Resolution for C++

How Buckaroo's dependency resolver borrows ideas from SAT solvers — using conflict-driven clause learning to prune the search space and resolve C++ packages blazingly fast.

c++package-management algorithms