A deep-dive into automatic differentiation: symbolic vs. numerical vs. AD, AST transformation, dual numbers, tapes, hybrid methods, and a generic C++ implementation that computes gradients and solves optimisation problems.


Derivatives are everywhere in software: gradient descent for machine learning, Newton's method for root-finding, sensitivity analysis, physics engines. Yet computing them accurately and efficiently is surprisingly tricky. This post explores automatic differentiation (AD) — a technique that produces exact derivatives of arbitrary programs with almost no extra work from the programmer.

Three Ways to Differentiate a Program

Before explaining AD, it helps to see what it is not.

Numerical Differentiation (Finite Differences)

The blunt instrument: just perturb the input by a tiny ε and measure the change in output.

f'(x) ≈ (f(x + ε) - f(x)) / ε
  • ✅ Trivial to implement.
  • ❌ Truncation error — if ε is too large, the approximation is inaccurate.
  • ❌ Cancellation error — if ε is too small, floating-point subtraction destroys precision.
  • ❌ Scales badly: computing the full gradient of an n-dimensional function requires n evaluations.

Symbolic Differentiation

Apply the rules of calculus to the expression itself, producing a new symbolic expression for the derivative.

  • ✅ Produces exact results.
  • ❌ Expression swell — the derivative of a moderate-sized formula can be astronomically large after expansion and simplification.
  • ❌ Requires the function to be expressed in a computer-algebra system, not arbitrary imperative code.

Automatic Differentiation

AD threads derivative information through the actual execution of the program. It achieves the exactness of symbolic differentiation without the expression swell, and the generality of numerical differentiation without the precision loss.


Three Flavours of Automatic Differentiation

1 — Source Transformation (AST-based)

A compiler pass or preprocessor rewrites the source code, augmenting every operation with its derivative counterpart. The output is a new program that computes both the function value and the derivative simultaneously.

Original:  z = x * y + sin(x)
 
Augmented: z  = x * y + sin(x)
           z' = x' * y + x * y' + cos(x) * x'

This is exactly what tools like Tapenade (Fortran/C) and Zygote.jl (Julia) do. The technique can also be applied in reverse mode, where a backward pass accumulates gradients from outputs to inputs — the foundation of backpropagation in neural networks.

  • ✅ Can be highly optimised by the compiler.
  • ✅ Reverse mode computes gradients in O(1) evaluations regardless of input dimensionality.
  • ❌ Requires a specialised compiler or build-system integration.

2 — Operator Overloading with Dual Numbers

Instead of rewriting source code, we replace the scalar type with a richer type that carries the derivative alongside the value. No compiler magic required — just the type system.

This is the approach we will implement in C++.

3 — Tapes (Dynamic Computation Graphs)

A tape (also called a Wengert list or trace) is a log of every arithmetic operation executed during the forward pass. Each entry records the operation and its inputs' indices so that a backward sweep can propagate adjoints from outputs back to inputs.

Forward pass: record operations
   node 0: x (leaf, value 3.0)
   node 1: y (leaf, value 2.0)
   node 2: t2 = node0 * node1   (val = 6.0)
   node 3: t3 = sin(node0)      (val = sin(3.0))
   node 4: t4 = t2 + t3         (val = 6.0 + sin(3.0))
 
Backward pass (reverse accumulation from node 4):
   ∂L/∂t4 = 1
   ∂L/∂t2 += ∂L/∂t4 * 1              → 1
   ∂L/∂t3 += ∂L/∂t4 * 1              → 1
   ∂L/∂x  += ∂L/∂t2 * y              → 2.0
   ∂L/∂x  += ∂L/∂t3 * cos(3.0)       → 2.0 + cos(3.0)
   ∂L/∂y  += ∂L/∂t2 * x              → 3.0

The key insight is that a single backward pass through the tape computes the full gradient with respect to all inputs simultaneously, regardless of how many there are. This is reverse-mode AD, and it is the basis of backpropagation in frameworks like PyTorch and TensorFlow.

Tape-based AD via operator overloading is straightforward: overload every arithmetic operator on a Variable type to append an entry to a global (or thread-local) tape, then call a backward() method to replay it in reverse.


Hybrid Methods

In practice, production AD systems mix all three strategies:

  • Forward mode for scalar functions and small Jacobians — inexpensive when the number of inputs is small.
  • Reverse mode (tape) for large-input, scalar-output functions — the standard choice for training neural networks.
  • Source transformation for hot inner loops — a compiler pass can fuse the primal and adjoint into a single optimised loop, eliminating tape overhead entirely.

For example, JAX uses XLA compilation (source transformation) to JIT-compile both the forward and backward passes, while also supporting dynamic Python graphs. Enzyme is an LLVM pass that differentiates compiled LLVM IR — it works on any language that targets LLVM, including C++ and Fortran, and can outperform hand-written gradients because it differentiates after the compiler's optimisation passes have run.


Dual Numbers

A dual number extends the reals with an element ε satisfying ε² = 0 (but ε ≠ 0). Every dual number has the form:

a + bε

Arithmetic works exactly like the reals, except we discard any term containing ε²:

(a + bε)(c + dε) = ac + (ad + bc)ε + bdε²
                 = ac + (ad + bc)ε

The magic: if we seed the computation with x = x₀ + 1·ε, then for any elementary function f:

f(x₀ + ε) = f(x₀) + f'(x₀)·ε

The val component gives f(x₀), and the grad component gives f'(x₀) — for free, as a by-product of ordinary arithmetic.


A Generic C++ Implementation

We represent a dual number as a struct templated on the underlying scalar type T. Templating on T is crucial: it allows us to nest duals (Dual<Dual<double>>) to obtain higher-order derivatives without writing any new code.

template<typename T>
struct Dual {
    using value_type = T;
    T val;   // f(x)
    T grad;  // f'(x)
 
    Dual(T val = T{}, T grad = T{}) : val(val), grad(grad) {}
};
 
// Convenience: wrap a scalar constant (derivative is zero)
template<typename T>
Dual<T> constant(double c) { return Dual<T>{ T(c), T(0) }; }

Arithmetic Operators

The rules for dual arithmetic translate directly to overloaded operators:

template<typename T>
Dual<T> operator+(Dual<T> a, Dual<T> b) {
    return { a.val + b.val, a.grad + b.grad };
}
 
template<typename T>
Dual<T> operator-(Dual<T> a, Dual<T> b) {
    return { a.val - b.val, a.grad - b.grad };
}
 
template<typename T>
Dual<T> operator*(Dual<T> a, Dual<T> b) {
    // product rule: (uv)' = u'v + uv'
    return { a.val * b.val, a.grad * b.val + a.val * b.grad };
}
 
template<typename T>
Dual<T> operator/(Dual<T> a, Dual<T> b) {
    // quotient rule: (u/v)' = (u'v - uv') / v²
    return { a.val / b.val,
             (a.grad * b.val - a.val * b.grad) / (b.val * b.val) };
}

Lifting Elementary Functions

Each standard function must be "lifted" to dual numbers using its derivative. Note that the inner calls are unqualified — combined with the local using-declarations, overload resolution picks std::sin when T is double, and argument-dependent lookup (ADL) finds our own sin(Dual<T>) when T is itself a Dual, enabling nesting.

template<typename T>
Dual<T> sin(Dual<T> x) {
    using std::sin; using std::cos;
    return { sin(x.val), x.grad * cos(x.val) };
}
 
template<typename T>
Dual<T> cos(Dual<T> x) {
    using std::sin; using std::cos;
    return { cos(x.val), x.grad * (-sin(x.val)) };
}
 
template<typename T>
Dual<T> exp(Dual<T> x) {
    using std::exp;
    T e = exp(x.val);
    return { e, x.grad * e };
}
 
template<typename T>
Dual<T> log(Dual<T> x) {
    using std::log;
    return { log(x.val), x.grad / x.val };
}
 
template<typename T>
Dual<T> pow(Dual<T> x, double n) {
    using std::pow;
    return { pow(x.val, n),
             x.grad * T(n) * pow(x.val, n - 1.0) };
}

A One-Line Differentiator

With the infrastructure above, differentiating any templated function reduces to a single helper:

template<typename F, typename T>
T diff(F f, T x) {
    return f(Dual<T>{ x, T(1) }).grad;
}

We seed the dual with val = x and grad = 1 (representing the "direction" dx = 1), then simply read off the gradient component of the result.


Newton's Method with Automatic Differentiation

Newton's method finds a root of f by iterating:

x_{n+1} = x_n - f(x_n) / f'(x_n)

With dual numbers, we get f(x) and f'(x) from a single evaluation:

template<typename F, typename T>
T newton(F f, T x, int maxIter = 50, T tol = T(1e-12)) {
    for (int i = 0; i < maxIter; ++i) {
        auto d = f(Dual<T>{ x, T(1) });
        T step = d.val / d.grad;   // f(x) / f'(x)
        x -= step;
        if (std::abs(step) < tol) break;
    }
    return x;
}

Example: Finding a Root

Let's find the real root of g(x) = x³ − 2x − 5, classically known to be ≈ 2.0945514815423265:

auto g = [](auto x) {
    using T = typename decltype(x)::value_type;
    return pow(x, 3.0) - constant<T>(2.0) * x - constant<T>(5.0);
};
 
double root = newton(g, 2.0);
// root = 2.09455148154233

Optimisation via Second-Order Newton

To minimise a function h, we look for a stationary point — a root of h'. Newton's method on h' converges quadratically, but it needs h''. With nested duals, we get both h' and h'' from one evaluation.

Nesting Duals for Second Derivatives

Instantiate Dual<Dual<double>> and seed carefully:

using Inner = Dual<double>;
using Outer = Dual<Inner>;
 
// x = x₀ + 1·ε₁ + 1·ε₂ + 0·ε₁ε₂
Outer xd{ Inner{x, 1.0}, Inner{1.0, 0.0} };
auto r = h(xd);
 
double hp  = r.val.grad;   // h'(x)  — inner ε component of val
double hpp = r.grad.grad;  // h''(x) — inner ε component of grad

This works because our lifted functions use unqualified calls. When T = Dual<double>, name lookup finds sin(Dual<double>) rather than std::sin, so the chain rule is applied a second time automatically.

Example: Minimising h(x) = (x − 3)² + sin(x)

auto h = [](auto x) {
    using T = typename decltype(x)::value_type;
    auto dx = x - constant<T>(3.0);
    return dx * dx + sin(x);
};
 
double x = 2.5;
for (int i = 0; i < 50; ++i) {
    using Inner = Dual<double>;
    using Outer = Dual<Inner>;
 
    Outer xd{ Inner{x, 1.0}, Inner{1.0, 0.0} };
    auto r      = h(xd);
    double hp   = r.val.grad;   // h'(x)
    double hpp  = r.grad.grad;  // h''(x)
    double step = hp / hpp;
    x -= step;
    if (std::abs(step) < 1e-12) break;
}
// x ≈ 3.47282  (the minimum near x = 3)
// h(x_min) ≈ −0.10165

The same lambda h that computes function values also drives the second-order Newton loop — no manual differentiation anywhere.


Second-Degree Dual Numbers

The nested-dual trick is elegant but somewhat opaque — the first and second derivatives are hidden inside a two-level struct. An alternative is to extend the dual algebra directly to second degree: carry three components instead of two.

The Algebra

A second-order dual number has the form a + b·ε + c·ε², where we now keep ε² terms but discard ε³ (and higher). Truncated multiplication gives:

(a + bε + cε²)(p + qε + rε²)
  = ap  +  (aq + bp)ε  +  (ar + bq + cp)ε²

If we seed a variable with x = x₀ + 1·ε + 0·ε², then propagating through any smooth function f produces the Taylor expansion:

f(x₀ + ε) = f(x₀) + f'(x₀)·ε + ½·f''(x₀)·ε²

Note the ½: the ε² coefficient carries half the second derivative. Rather than storing Taylor coefficients, we store the derivatives themselves — val = f(x₀), d1 = f'(x₀), d2 = f''(x₀) — all three from a single evaluation. The ½ then resurfaces as a factor of 2 on the cross term of a product, matching the product rule (uv)'' = u''v + 2u'v' + uv''.

C++ Implementation

template<typename T>
struct Dual2 {
    T val, d1, d2;  // f(x), f'(x), f''(x)
    Dual2(T val = T{}, T d1 = T{}, T d2 = T{}) : val(val), d1(d1), d2(d2) {}
};
 
template<typename T>
Dual2<T> operator*(Dual2<T> a, Dual2<T> b) {
    return { a.val*b.val,
             a.d1*b.val  + a.val*b.d1,
             a.d2*b.val  + T(2)*a.d1*b.d1 + a.val*b.d2 };  // (uv)'' = u''v + 2u'v' + uv''
}

The lifted functions apply the chain rule twice. For sin:

template<typename T>
Dual2<T> sin(Dual2<T> x) {
    using std::sin, std::cos;
    T s = sin(x.val), c = cos(x.val);
    // (sin∘u)''  = u''·cos(u) - (u')²·sin(u)
    return { s,  x.d1*c,  x.d2*c - x.d1*x.d1*s };
}

Helpers to seed a scalar variable x₀ (d1 = 1) and to wrap a constant (both derivative components zero):

template<typename T>
Dual2<T> variable(T x) { return { x, T(1), T(0) }; }

template<typename T>
Dual2<T> constant2(double c) { return { T(c), T(0), T(0) }; }

Verifying Derivatives

// f(x) = x³ − 2x + 1
auto f = [](Dual2<double> x) {
    return pow(x, 3.0) - constant2<double>(2.0)*x + constant2<double>(1.0);
};
 
auto r = f(variable(2.0));
// r.val  = 5    (f(2)   = 8 - 4 + 1)
// r.d1   = 10   (f'(2)  = 3·4 - 2)
// r.d2   = 12   (f''(2) = 6·2)

Newton's Method in One Shot

With Dual2, Newton's method for minimising h — which needs both h' and h'' — becomes a clean single-pass loop:

template<typename F, typename T>
T newton2(F f, T x, int maxIter=50, T tol=T(1e-12)) {
    for (int i = 0; i < maxIter; ++i) {
        auto r  = f(variable(x));
        T step  = r.d1 / r.d2;   // h'(x) / h''(x)
        x -= step;
        if (std::abs(step) < tol) break;
    }
    return x;
}
 
// Minimise h(x) = (x−3)² + sin(x)
auto h = [](Dual2<double> x) {
    auto dx = x - constant2<double>(3.0);
    return dx*dx + sin(x);
};
 
double xmin = newton2(h, 2.5);
// xmin ≈ 3.47282

Dual vs Nested Dual

                     Dual<Dual<double>>            Dual2<double>
Components           4 (two levels of val/grad)    3 (val, d1, d2)
Function templates   Unchanged — nesting is free   Need Dual2-specific lifts
Clarity              Seeding is subtle             Seeding is explicit
Performance          9 scalar multiplies per *     6 scalar multiplies per *

Both are correct. Dual2 is lighter and more readable; nested duals are more compositional (no new code needed).


Putting It All Together

Here is the complete, self-contained program that verifies the derivative, finds a root, and minimises a function:

#include <iostream>
#include <cmath>
 
template<typename T>
struct Dual {
    using value_type = T;
    T val, grad;
    Dual(T val = T{}, T grad = T{}) : val(val), grad(grad) {}
};
 
template<typename T>
Dual<T> constant(double c) { return { T(c), T(0) }; }
 
template<typename T> Dual<T> operator+(Dual<T> a, Dual<T> b) { return { a.val+b.val, a.grad+b.grad }; }
template<typename T> Dual<T> operator-(Dual<T> a, Dual<T> b) { return { a.val-b.val, a.grad-b.grad }; }
template<typename T> Dual<T> operator*(Dual<T> a, Dual<T> b) { return { a.val*b.val, a.grad*b.val + a.val*b.grad }; }
template<typename T> Dual<T> operator/(Dual<T> a, Dual<T> b) { return { a.val/b.val, (a.grad*b.val - a.val*b.grad)/(b.val*b.val) }; }
 
template<typename T> Dual<T> sin(Dual<T> x) { using std::sin,std::cos; return { sin(x.val), x.grad*cos(x.val) }; }
template<typename T> Dual<T> cos(Dual<T> x) { using std::sin,std::cos; return { cos(x.val), x.grad*(-sin(x.val)) }; }
template<typename T> Dual<T> exp(Dual<T> x) { using std::exp; T e=exp(x.val); return { e, x.grad*e }; }
template<typename T> Dual<T> log(Dual<T> x) { using std::log; return { log(x.val), x.grad/x.val }; }
template<typename T> Dual<T> pow(Dual<T> x, double n) { using std::pow; return { pow(x.val,n), x.grad*T(n)*pow(x.val,n-1.0) }; }
 
template<typename F, typename T>
T diff(F f, T x) { return f(Dual<T>{x, T(1)}).grad; }
 
template<typename F, typename T>
T newton(F f, T x, int maxIter=50, T tol=T(1e-12)) {
    for (int i = 0; i < maxIter; ++i) {
        auto d = f(Dual<T>{x, T(1)});
        T step = d.val / d.grad;
        x -= step;
        if (std::abs(step) < tol) break;
    }
    return x;
}
 
int main() {
    // 1. Verify derivative: f(x) = x^3 - 2x + 1,  f'(x) = 3x^2 - 2
    auto f = [](auto x) {
        using T = typename decltype(x)::value_type;
        return pow(x, 3.0) - constant<T>(2.0)*x + constant<T>(1.0);
    };
    std::cout << "f'(2) via AD     = " << diff(f, 2.0) << "\n"; // 10
    std::cout << "f'(2) analytical = " << 3*4.0 - 2    << "\n"; // 10
 
    // 2. Root of x^3 - 2x - 5 = 0
    auto g = [](auto x) {
        using T = typename decltype(x)::value_type;
        return pow(x, 3.0) - constant<T>(2.0)*x - constant<T>(5.0);
    };
    std::cout << "Root: " << newton(g, 2.0) << "\n"; // 2.09455148154233
 
    // 3. Minimise h(x) = (x-3)^2 + sin(x) using second-order Newton
    auto h = [](auto x) {
        using T = typename decltype(x)::value_type;
        auto dx = x - constant<T>(3.0);
        return dx*dx + sin(x);
    };
    double x = 2.5;
    for (int i = 0; i < 50; ++i) {
        using Inner = Dual<double>;
        Dual<Inner> xd{ {x, 1.0}, {1.0, 0.0} };
        auto r      = h(xd);
        double step = r.val.grad / r.grad.grad;  // h'(x) / h''(x)
        x -= step;
        if (std::abs(step) < 1e-12) break;
    }
    std::cout << "Minimiser: " << x << "\n"; // 3.47282
}

Output:

f'(2) via AD     = 10
f'(2) analytical = 10
 
Root: 2.09455
 
Minimiser: 3.47282

Extending to Multiple Dimensions

For a function f : ℝⁿ → ℝ, the gradient is a vector of n partial derivatives. Forward-mode AD computes one directional derivative per pass. To get the full gradient we run n passes, each time seeding a different input with grad = 1.

The Gradient Function

// F takes std::array<Dual<double>, N> and returns Dual<double>.
// Pass i seeds xs[i].grad = 1; all other grads are 0.
template<std::size_t N, typename F>
std::array<double, N> gradient(F f, std::array<double, N> x) {
    std::array<double, N> grad{};
    for (std::size_t i = 0; i < N; ++i) {
        std::array<Dual<double>, N> xd;
        for (std::size_t j = 0; j < N; ++j)
            xd[j] = { x[j], j == i ? 1.0 : 0.0 };
        grad[i] = f(xd).grad;
    }
    return grad;
}

Example: Gradient of a 2-D Function

using D = Dual<double>;
 
// f(x,y) = (x-1)^2 + (y-2)^2
auto f = [](std::array<D,2> xs) {
    auto dx = xs[0] - D{1.0, 0.0};
    auto dy = xs[1] - D{2.0, 0.0};
    return dx*dx + dy*dy;
};
 
// ∇f(3,5) = (2*(3-1), 2*(5-2)) = (4, 6)
auto g = gradient<2>(f, {3.0, 5.0});
// g[0] = 4, g[1] = 6

Gradient Descent

With gradient<N> in hand, a fully generic gradient-descent loop follows immediately:

template<std::size_t N, typename F>
std::array<double, N> gradientDescent(F f, std::array<double, N> x,
                                      double lr=0.1, int maxIter=200, double tol=1e-9) {
    for (int iter = 0; iter < maxIter; ++iter) {
        auto g = gradient<N>(f, x);
        double norm2 = 0;
        for (std::size_t i = 0; i < N; ++i) { x[i] -= lr*g[i]; norm2 += g[i]*g[i]; }
        if (norm2 < tol*tol) break;
    }
    return x;
}
 
// Minimise f(x,y) = (x-1)^2 + (y-2)^2 starting at (3, 5)
auto [mx, my] = gradientDescent<2>(f, {3.0, 5.0});
// mx ≈ 1.0, my ≈ 2.0

The same gradient helper works for any dimension and any differentiable lambda — including those that mix sin, exp, pow, and arbitrary control flow.

Complexity Note

Forward-mode AD needs n passes to compute an n-dimensional gradient, making it O(n) evaluations. Tape-based reverse mode computes the same gradient in a single forward + backward pass — an O(1) cost in terms of function evaluations, at the cost of storing the tape. For large n (e.g. millions of neural network parameters), reverse mode is indispensable.


Summary

Technique                     Accuracy        Cost (gradient)      Handles arbitrary code
Finite differences            ❌ approximate   O(n) evaluations     ✅
Symbolic differentiation      ✅ exact         O(1) (often)         ❌ (needs a CAS)
AD — source transformation    ✅ exact         O(1) reverse mode    ✅ (with compiler support)
AD — dual numbers             ✅ exact         O(n) forward mode    ✅

The dual-number approach is particularly elegant in C++ because the type system does all the heavy lifting. A single templated Dual<T> struct, a handful of operator overloads, and any existing function template becomes automatically differentiable — including to arbitrary order by nesting.
