A deep-dive into automatic differentiation: symbolic vs. numerical vs. AD, AST transformation, dual numbers, tapes, hybrid methods, and a generic C++ implementation that computes gradients and solves optimisation problems.
Derivatives are everywhere in software: gradient descent for machine learning, Newton's method for root-finding, sensitivity analysis, physics engines. Yet computing them accurately and efficiently is surprisingly tricky. This post explores automatic differentiation (AD) — a technique that produces exact derivatives of arbitrary programs with almost no extra work from the programmer.
Three Ways to Differentiate a Program
Before explaining AD, it helps to see what it is not.
Numerical Differentiation (Finite Differences)
The blunt instrument: just perturb the input by a tiny ε and measure the change in output.
f'(x) ≈ (f(x + ε) - f(x)) / ε
- ✅ Trivial to implement.
- ❌ Truncation error — if ε is too large, the approximation is inaccurate.
- ❌ Cancellation error — if ε is too small, floating-point subtraction destroys precision (the sketch after this list shows both effects).
- ❌ Scales badly: computing the full gradient of an n-dimensional function requires n evaluations.
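A minimal sketch of the ε trade-off (the test function and step sizes are arbitrary choices for illustration):

#include <cmath>
#include <cstdio>

// f(x) = sin(x); the exact derivative at x = 1 is cos(1) ≈ 0.5403023.
double f(double x) { return std::sin(x); }

// Forward-difference approximation of f'(x).
double fd(double x, double eps) { return (f(x + eps) - f(x)) / eps; }

int main() {
    // eps too large -> truncation error dominates; eps too small -> cancellation dominates.
    const double steps[] = { 1e-2, 1e-8, 1e-14 };
    for (double eps : steps)
        std::printf("eps = %.0e -> f'(1) ~ %.10f\n", eps, fd(1.0, eps));
}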
Symbolic Differentiation
Apply the rules of calculus to the expression itself, producing a new symbolic expression for the derivative.
- ✅ Produces exact results.
- ❌ Expression swell — the derivative of a moderate-sized formula can be astronomically large after expansion and simplification.
- ❌ Requires the function to be expressed in a computer-algebra system, not arbitrary imperative code.
Automatic Differentiation
AD threads derivative information through the actual execution of the program. It achieves the exactness of symbolic differentiation without the expression swell, and the generality of numerical differentiation without the precision loss.
Three Flavours of Automatic Differentiation
1 — Source Transformation (AST-based)
A compiler pass or preprocessor rewrites the source code, augmenting every operation with its derivative counterpart. The output is a new program that computes both the function value and the derivative simultaneously.
Original: z = x * y + sin(x)
Augmented: z = x * y + sin(x)
z' = x' * y + x * y' + cos(x) * x'
This is exactly what tools like Tapenade (Fortran/C) and Zygote.jl (Julia) do. The technique can also be applied in reverse mode, where a backward pass accumulates gradients from outputs to inputs — the foundation of backpropagation in neural networks.
- ✅ Can be highly optimised by the compiler.
- ✅ Reverse mode computes gradients in O(1) evaluations regardless of input dimensionality.
- ❌ Requires a specialised compiler or build-system integration.
2 — Operator Overloading with Dual Numbers
Instead of rewriting source code, we replace the scalar type with a richer type that carries the derivative alongside the value. No compiler magic required — just the type system.
This is the approach we will implement in C++.
3 — Tapes (Dynamic Computation Graphs)
A tape (also called a Wengert list or trace) is a log of every arithmetic operation executed during the forward pass. Each entry records the operation and its inputs' indices so that a backward sweep can propagate adjoints from outputs back to inputs.
Forward pass: record operations
node 0: x (leaf, value 3.0)
node 1: y (leaf, value 2.0)
node 2: t2 = node0 * node1 (val = 6.0)
node 3: t3 = sin(node0) (val = sin(3.0))
node 4: t4 = t2 + t3 (val = 6.0 + sin(3.0))
Backward pass (reverse accumulation from node 4):
∂L/∂t4 = 1
∂L/∂t2 += ∂L/∂t4 * 1 → 1
∂L/∂t3 += ∂L/∂t4 * 1 → 1
∂L/∂x += ∂L/∂t2 * y → 2.0
∂L/∂x += ∂L/∂t3 * cos(3.0) → 2.0 + cos(3.0)
∂L/∂y += ∂L/∂t2 * x → 3.0
The key insight is that a single backward pass through the tape computes the full gradient with respect to all inputs simultaneously, regardless of how many there are. This is reverse-mode AD, and it is the basis of backpropagation in frameworks like PyTorch and TensorFlow.
Tape-based AD via operator overloading is straightforward: override every arithmetic operator on a Variable type to append an entry to a global (or thread-local) tape, then call a backward() method to replay it in reverse.
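Here is a minimal sketch of that design (the Node/Var layout, the leaf/record helpers, and the global tape are illustrative choices, not a production implementation):

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// One tape entry: up to two parents, plus the local partial derivative
// of this node with respect to each parent.
struct Node { std::size_t parent[2]; double partial[2]; };

std::vector<Node> tape;        // global tape (thread-local in real code)
std::vector<double> adjoint;   // ∂output/∂node_k, filled by backward()

struct Var { double value; std::size_t index; };

// Push a leaf (input) onto the tape.
Var leaf(double v) {
    tape.push_back(Node{{0, 0}, {0.0, 0.0}});
    return { v, tape.size() - 1 };
}

// Record an operation: its value plus the local partials w.r.t. its parents.
Var record(double value, std::size_t i, double di, std::size_t j, double dj) {
    tape.push_back(Node{{i, j}, {di, dj}});
    return { value, tape.size() - 1 };
}

Var operator*(Var a, Var b) { return record(a.value * b.value, a.index, b.value, b.index, a.value); }
Var operator+(Var a, Var b) { return record(a.value + b.value, a.index, 1.0, b.index, 1.0); }
Var sin(Var a)              { return record(std::sin(a.value), a.index, std::cos(a.value), a.index, 0.0); }

// Reverse sweep: seed the output's adjoint with 1, then propagate to parents.
void backward(Var output) {
    adjoint.assign(tape.size(), 0.0);
    adjoint[output.index] = 1.0;
    for (std::size_t k = tape.size(); k-- > 0; ) {
        adjoint[tape[k].parent[0]] += tape[k].partial[0] * adjoint[k];
        adjoint[tape[k].parent[1]] += tape[k].partial[1] * adjoint[k];
    }
}

int main() {
    Var x = leaf(3.0), y = leaf(2.0);
    Var z = x * y + sin(x);                   // forward pass records the tape
    backward(z);                              // single reverse sweep
    std::cout << adjoint[x.index] << "\n";    // ∂z/∂x = y + cos(x) = 2 + cos(3)
    std::cout << adjoint[y.index] << "\n";    // ∂z/∂y = x = 3
}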
Hybrid Methods
In practice, production AD systems mix all three strategies:
- Forward mode for scalar functions and small Jacobians — inexpensive when the number of inputs is small.
- Reverse mode (tape) for large-input, scalar-output functions — the standard choice for training neural networks.
- Source transformation for hot inner loops — a compiler pass can fuse the primal and adjoint into a single optimised loop, eliminating tape overhead entirely.
For example, JAX uses XLA compilation (source transformation) to JIT-compile both the forward and backward passes, while also supporting dynamic Python graphs. Enzyme is an LLVM pass that differentiates compiled LLVM IR — it works on any language that targets LLVM, including C++ and Fortran, and can outperform hand-written gradients because it sees optimised machine-level code.
Dual Numbers
A dual number extends the reals with an element ε satisfying ε² = 0 (but ε ≠ 0). Every dual number has the form:
a + bε
Arithmetic works exactly like the reals, except we discard any term containing ε²:
(a + bε)(c + dε) = ac + (ad + bc)ε + bdε²
= ac + (ad + bc)ε
The magic: if we seed the computation with x = x₀ + 1·ε, then for any elementary function f:
f(x₀ + ε) = f(x₀) + f'(x₀)·ε
The val component gives f(x₀), and the grad component gives f'(x₀) — for free, as a by-product of ordinary arithmetic.
A Generic C++ Implementation
We represent a dual number as a struct templated on the underlying scalar type T. Templating on T is crucial: it allows us to nest duals (Dual<Dual<double>>) to obtain higher-order derivatives without writing any new code.
template<typename T>
struct Dual {
using value_type = T;
T val; // f(x)
T grad; // f'(x)
Dual(T val = T{}, T grad = T{}) : val(val), grad(grad) {}
};
// Convenience: wrap a scalar constant (derivative is zero)
template<typename T>
Dual<T> constant(double c) { return Dual<T>{ T(c), T(0) }; }
Arithmetic Operators
The rules for dual arithmetic translate directly to overloaded operators:
template<typename T>
Dual<T> operator+(Dual<T> a, Dual<T> b) {
return { a.val + b.val, a.grad + b.grad };
}
template<typename T>
Dual<T> operator-(Dual<T> a, Dual<T> b) {
return { a.val - b.val, a.grad - b.grad };
}
template<typename T>
Dual<T> operator*(Dual<T> a, Dual<T> b) {
// product rule: (uv)' = u'v + uv'
return { a.val * b.val, a.grad * b.val + a.val * b.grad };
}
template<typename T>
Dual<T> operator/(Dual<T> a, Dual<T> b) {
// quotient rule: (u/v)' = (u'v - uv') / v²
return { a.val / b.val,
(a.grad * b.val - a.val * b.grad) / (b.val * b.val) };
}
Lifting Elementary Functions
Each standard function must be "lifted" to dual numbers using its derivative. Note that the inner calls are unqualified — this lets C++ name lookup (including ADL) recursively find the overloaded versions when T is itself a Dual, enabling nesting.
template<typename T>
Dual<T> sin(Dual<T> x) {
using std::sin; using std::cos;
return { sin(x.val), x.grad * cos(x.val) };
}
template<typename T>
Dual<T> cos(Dual<T> x) {
using std::sin; using std::cos;
return { cos(x.val), x.grad * (-sin(x.val)) };
}
template<typename T>
Dual<T> exp(Dual<T> x) {
using std::exp;
T e = exp(x.val);
return { e, x.grad * e };
}
template<typename T>
Dual<T> log(Dual<T> x) {
using std::log;
return { log(x.val), x.grad / x.val };
}
template<typename T>
Dual<T> pow(Dual<T> x, double n) {
using std::pow;
return { pow(x.val, n),
x.grad * T(n) * pow(x.val, n - 1.0) };
}
A One-Line Differentiator
With the infrastructure above, differentiating any templated function reduces to a single helper:
template<typename F, typename T>
T diff(F f, T x) {
return f(Dual<T>{ x, T(1) }).grad;
}
We seed the dual with val = x and grad = 1 (representing the "direction" dx = 1), then simply read off the gradient component of the result.
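A quick usage check (the lambda is an arbitrary example; the exact answer is 2x + cos x at x = 1.5, about 3.0707):

auto f = [](auto x) { return x * x + sin(x); };   // f(x) = x² + sin x
double d = diff(f, 1.5);                          // f'(1.5) = 2·1.5 + cos(1.5) ≈ 3.0707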
Newton's Method with Automatic Differentiation
Newton's method finds a root of f by iterating:
x_{n+1} = x_n - f(x_n) / f'(x_n)
With dual numbers, we get f(x) and f'(x) from a single evaluation:
template<typename F, typename T>
T newton(F f, T x, int maxIter = 50, T tol = T(1e-12)) {
for (int i = 0; i < maxIter; ++i) {
auto d = f(Dual<T>{ x, T(1) });
T step = d.val / d.grad; // f(x) / f'(x)
x -= step;
if (std::abs(step) < tol) break;
}
return x;
}
Example: Finding a Root
Let's find the real root of g(x) = x³ − 2x − 5, classically known to be ≈ 2.0945514815423265:
auto g = [](auto x) {
using T = typename decltype(x)::value_type;
return pow(x, 3.0) - constant<T>(2.0) * x - constant<T>(5.0);
};
double root = newton(g, 2.0);
// root = 2.09455148154233
Optimisation via Second-Order Newton
To minimise a function h, we look for a stationary point — a root of h'. Newton's method on h' converges quadratically, but it needs h''. With nested duals, we get both h' and h'' from one evaluation.
Nesting Duals for Second Derivatives
Instantiate Dual<Dual<double>> and seed carefully:
using Inner = Dual<double>;
using Outer = Dual<Inner>;
// x = x₀ + 1·ε₁ + 1·ε₂ + 0·ε₁ε₂
Outer xd{ Inner{x, 1.0}, Inner{1.0, 0.0} };
auto r = h(xd);
double hp = r.val.grad; // h'(x) — inner ε component of val
double hpp = r.grad.grad; // h''(x) — inner ε component of grad
This works because our lifted functions use unqualified calls. When T = Dual<double>, name lookup finds sin(Dual<double>) rather than std::sin, so the chain rule is applied a second time automatically.
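If the seeding feels error-prone, it can be wrapped once in a small convenience helper. The name diff2 below is ours, not part of the code above:

// Returns { h'(x), h''(x) } from a single evaluation of h on a nested dual.
// Needs <utility> for std::pair.
template<typename F>
std::pair<double, double> diff2(F h, double x) {
    using Inner = Dual<double>;
    auto r = h(Dual<Inner>{ Inner{x, 1.0}, Inner{1.0, 0.0} });
    return { r.val.grad, r.grad.grad };
}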
Example: Minimising h(x) = (x − 3)² + sin(x)
auto h = [](auto x) {
using T = typename decltype(x)::value_type;
auto dx = x - constant<T>(3.0);
return dx * dx + sin(x);
};
double x = 2.5;
for (int i = 0; i < 50; ++i) {
using Inner = Dual<double>;
using Outer = Dual<Inner>;
Outer xd{ Inner{x, 1.0}, Inner{1.0, 0.0} };
auto r = h(xd);
double hp = r.val.grad; // h'(x)
double hpp = r.grad.grad; // h''(x)
double step = hp / hpp;
x -= step;
if (std::abs(step) < 1e-12) break;
}
// x ≈ 3.47282 (the minimiser, pulled away from x = 3 by the sin term)
// h(x_min) ≈ −0.10165
The same lambda h that computes function values also drives the second-order Newton loop — no manual differentiation anywhere.
Second-Degree Dual Numbers
The nested-dual trick is elegant but somewhat opaque — the first and second derivatives are hidden inside a two-level struct. An alternative is to extend the dual algebra directly to second degree: carry three components instead of two.
The Algebra
A second-order dual number has the form a + b·ε + c·ε² where we now keep ε² terms but discard ε³ (and higher). Multiplication becomes:
(a + bε + cε²)(p + qε + rε²)
= ap + (aq + bp)ε + (ar + bq + cp)ε²
If we seed a variable with x = x₀ + 1·ε + 0·ε², then propagating through any smooth function f gives:
f(x₀ + ε) = f(x₀) + f'(x₀)·ε + ½·f''(x₀)·ε²
Note the ½: the ε² coefficient is half the second derivative. The implementation below stores the second derivative itself rather than the raw Taylor coefficient, so val = f(x₀), d1 = f'(x₀), and d2 = f''(x₀), and it folds the factor of two into the lifted rules. For a product this is just Leibniz: (uv)'' = u''v + 2u'v' + uv''.
C++ Implementation
template<typename T>
struct Dual2 {
T val, d1, d2; // f(x), f'(x), f''(x)
Dual2(T val = T{}, T d1 = T{}, T d2 = T{}) : val(val), d1(d1), d2(d2) {}
};
template<typename T>
Dual2<T> operator*(Dual2<T> a, Dual2<T> b) {
return { a.val*b.val,
a.d1*b.val + a.val*b.d1,
a.d2*b.val + T(2)*a.d1*b.d1 + a.val*b.d2 }; // (uv)'' = u''v + 2u'v' + uv''
}
The lifted functions apply the chain rule twice. For sin:
template<typename T>
Dual2<T> sin(Dual2<T> x) {
using std::sin, std::cos;
T s = sin(x.val), c = cos(x.val);
// (sin∘u)'' = u''·cos(u) - (u')²·sin(u)
return { s, x.d1*c, x.d2*c - x.d1*x.d1*s };
}
A helper to seed a scalar variable x₀:
template<typename T>
Dual2<T> variable(T x) { return { x, T(1), T(0) }; }
Verifying Derivatives
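The examples below also use addition, subtraction, pow, and a constant2 wrapper for Dual2, which the snippets above do not define. A minimal sketch of those missing pieces, assuming they follow the same pattern as the Dual versions:

// Wrap a scalar constant (both derivatives are zero).
template<typename T>
Dual2<T> constant2(double c) { return { T(c), T(0), T(0) }; }
template<typename T>
Dual2<T> operator+(Dual2<T> a, Dual2<T> b) { return { a.val + b.val, a.d1 + b.d1, a.d2 + b.d2 }; }
template<typename T>
Dual2<T> operator-(Dual2<T> a, Dual2<T> b) { return { a.val - b.val, a.d1 - b.d1, a.d2 - b.d2 }; }
template<typename T>
Dual2<T> pow(Dual2<T> x, double n) {
    using std::pow;
    T p  = pow(x.val, n);                             // uⁿ
    T p1 = T(n) * pow(x.val, n - 1.0);                // d/du uⁿ
    T p2 = T(n) * T(n - 1.0) * pow(x.val, n - 2.0);   // d²/du² uⁿ
    // chain rule: (f∘u)' = f'(u)·u',  (f∘u)'' = f''(u)·(u')² + f'(u)·u''
    return { p, p1 * x.d1, p2 * x.d1 * x.d1 + p1 * x.d2 };
}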
// f(x) = x³ − 2x + 1
auto f = [](Dual2<double> x) {
return pow(x, 3.0) - constant2<double>(2.0)*x + constant2<double>(1.0);
};
auto r = f(variable(2.0));
// r.val = 5 (f(2) = 8 - 4 + 1)
// r.d1 = 10 (f'(2) = 3·4 - 2)
// r.d2 = 12 (f''(2) = 6·2)
Newton's Method in One Shot
With Dual2, Newton's method for minimising h — which needs both h' and h'' — becomes a clean single-pass loop:
template<typename F, typename T>
T newton2(F f, T x, int maxIter=50, T tol=T(1e-12)) {
for (int i = 0; i < maxIter; ++i) {
auto r = f(variable(x));
T step = r.d1 / r.d2; // h'(x) / h''(x)
x -= step;
if (std::abs(step) < tol) break;
}
return x;
}
// Minimise h(x) = (x−3)² + sin(x)
auto h = [](Dual2<double> x) {
auto dx = x - constant2<double>(3.0);
return dx*dx + sin(x);
};
double xmin = newton2(h, 2.5);
// xmin ≈ 3.47282
Dual vs Nested Dual
| | Dual<Dual<double>> | Dual2<double> |
|---|---|---|
| Components | 4: f, f′, f′, f″ (the first derivative is stored twice) | 3: val, d1, d2 |
| Function templates | Unchanged — nesting is free | Need Dual2-specific lifts |
| Clarity | Seeding is subtle | Seeding is explicit |
| Performance | Slightly heavier (redundant component, nested multiplies) | Lighter (flat arithmetic on three components) |
Both are correct. Dual2 is lighter and more readable; nested duals are more compositional (no new code needed).
Putting It All Together
Here is the complete, self-contained program that verifies the derivative, finds a root, and minimises a function:
#include <iostream>
#include <cmath>
template<typename T>
struct Dual {
using value_type = T;
T val, grad;
Dual(T val = T{}, T grad = T{}) : val(val), grad(grad) {}
};
template<typename T>
Dual<T> constant(double c) { return { T(c), T(0) }; }
template<typename T> Dual<T> operator+(Dual<T> a, Dual<T> b) { return { a.val+b.val, a.grad+b.grad }; }
template<typename T> Dual<T> operator-(Dual<T> a, Dual<T> b) { return { a.val-b.val, a.grad-b.grad }; }
template<typename T> Dual<T> operator*(Dual<T> a, Dual<T> b) { return { a.val*b.val, a.grad*b.val + a.val*b.grad }; }
template<typename T> Dual<T> operator/(Dual<T> a, Dual<T> b) { return { a.val/b.val, (a.grad*b.val - a.val*b.grad)/(b.val*b.val) }; }
template<typename T> Dual<T> sin(Dual<T> x) { using std::sin,std::cos; return { sin(x.val), x.grad*cos(x.val) }; }
template<typename T> Dual<T> cos(Dual<T> x) { using std::sin,std::cos; return { cos(x.val), x.grad*(-sin(x.val)) }; }
template<typename T> Dual<T> exp(Dual<T> x) { using std::exp; T e=exp(x.val); return { e, x.grad*e }; }
template<typename T> Dual<T> log(Dual<T> x) { using std::log; return { log(x.val), x.grad/x.val }; }
template<typename T> Dual<T> pow(Dual<T> x, double n) { using std::pow; return { pow(x.val,n), x.grad*T(n)*pow(x.val,n-1.0) }; }
template<typename F, typename T>
T diff(F f, T x) { return f(Dual<T>{x, T(1)}).grad; }
template<typename F, typename T>
T newton(F f, T x, int maxIter=50, T tol=T(1e-12)) {
for (int i = 0; i < maxIter; ++i) {
auto d = f(Dual<T>{x, T(1)});
T step = d.val / d.grad;
x -= step;
if (std::abs(step) < tol) break;
}
return x;
}
int main() {
// 1. Verify derivative: f(x) = x^3 - 2x + 1, f'(x) = 3x^2 - 2
auto f = [](auto x) {
using T = typename decltype(x)::value_type;
return pow(x, 3.0) - constant<T>(2.0)*x + constant<T>(1.0);
};
std::cout << "f'(2) via AD = " << diff(f, 2.0) << "\n"; // 10
std::cout << "f'(2) analytical = " << 3*4.0 - 2 << "\n"; // 10
// 2. Root of x^3 - 2x - 5 = 0
auto g = [](auto x) {
using T = typename decltype(x)::value_type;
return pow(x, 3.0) - constant<T>(2.0)*x - constant<T>(5.0);
};
std::cout << "Root: " << newton(g, 2.0) << "\n"; // 2.09455148154233
// 3. Minimise h(x) = (x-3)^2 + sin(x) using second-order Newton
auto h = [](auto x) {
using T = typename decltype(x)::value_type;
auto dx = x - constant<T>(3.0);
return dx*dx + sin(x);
};
double x = 2.5;
for (int i = 0; i < 50; ++i) {
using Inner = Dual<double>;
Dual<Inner> xd{ {x, 1.0}, {1.0, 0.0} };
auto r = h(xd);
double step = r.val.grad / r.grad.grad; // h'(x) / h''(x)
x -= step;
if (std::abs(step) < 1e-12) break;
}
std::cout << "Minimiser: " << x << "\n"; // 3.47282
}
Output:
f'(2) via AD = 10
f'(2) analytical = 10
Root: 2.09455
Minimiser: 3.47282
Extending to Multiple Dimensions
For a function f : ℝⁿ → ℝ, the gradient is a vector of n partial derivatives. Forward-mode AD computes one directional derivative per pass. To get the full gradient we run n passes, each time seeding a different input with grad = 1.
The Gradient Function
// F takes std::array<Dual<double>, N> and returns Dual<double>.
// Pass i seeds xs[i].grad = 1; all other grads are 0.
template<std::size_t N, typename F>
std::array<double, N> gradient(F f, std::array<double, N> x) {
std::array<double, N> grad{};
for (std::size_t i = 0; i < N; ++i) {
std::array<Dual<double>, N> xd;
for (std::size_t j = 0; j < N; ++j)
xd[j] = { x[j], j == i ? 1.0 : 0.0 };
grad[i] = f(xd).grad;
}
return grad;
}
Example: Gradient of a 2-D Function
using D = Dual<double>;
// f(x,y) = (x-1)^2 + (y-2)^2
auto f = [](std::array<D,2> xs) {
auto dx = xs[0] - D{1.0, 0.0};
auto dy = xs[1] - D{2.0, 0.0};
return dx*dx + dy*dy;
};
// ∇f(3,5) = (2*(3-1), 2*(5-2)) = (4, 6)
auto g = gradient<2>(f, {3.0, 5.0});
// g[0] = 4, g[1] = 6
Gradient Descent
With gradient<N> in hand, a fully generic gradient-descent loop follows immediately:
template<std::size_t N, typename F>
std::array<double, N> gradientDescent(F f, std::array<double, N> x,
double lr=0.1, int maxIter=200, double tol=1e-9) {
for (int iter = 0; iter < maxIter; ++iter) {
auto g = gradient<N>(f, x);
double norm2 = 0;
for (std::size_t i = 0; i < N; ++i) { x[i] -= lr*g[i]; norm2 += g[i]*g[i]; }
if (norm2 < tol*tol) break;
}
return x;
}
// Minimise f(x,y) = (x-1)^2 + (y-2)^2 starting at (3, 5)
auto [mx, my] = gradientDescent<2>(f, {3.0, 5.0});
// mx ≈ 1.0, my ≈ 2.0
The same gradient helper works for any dimension and any differentiable lambda — including those that mix sin, exp, pow, and arbitrary control flow.
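The control-flow point deserves a quick illustration: AD differentiates whatever path the program actually executes, so a data-dependent branch needs no special handling. A small sketch using the diff helper from earlier (the piecewise function is made up):

auto piecewise = [](auto x) {
    using T = typename decltype(x)::value_type;
    // Data-dependent branch: each call differentiates the path that actually runs.
    if (x.val < T(0))
        return exp(x);                    // derivative on this branch: exp(x)
    return x * x + constant<T>(1.0);      // derivative on this branch: 2x
};
double dNeg = diff(piecewise, -1.0);      // ≈ exp(−1) ≈ 0.3679
double dPos = diff(piecewise, 2.0);       // = 4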
Complexity Note
Forward-mode AD needs n passes to compute an n-dimensional gradient, making it O(n) evaluations. Tape-based reverse mode computes the same gradient in a single forward + backward pass — an O(1) cost in terms of function evaluations, at the cost of storing the tape. For large n (e.g. millions of neural network parameters), reverse mode is indispensable.
Summary
| Technique | Accuracy | Cost (gradient) | Handles arbitrary code |
|---|---|---|---|
| Finite differences | ❌ approximate | O(n) evaluations | ✅ |
| Symbolic differentiation | ✅ exact | O(1) (often) | ❌ |
| AD — source transformation | ✅ exact | O(1) reverse mode | ✅ |
| AD — dual numbers | ✅ exact | O(n) forward mode | ✅ |
The dual-number approach is particularly elegant in C++ because the type system does all the heavy lifting. A single templated Dual<T> struct, a handful of operator overloads, and any existing function template becomes automatically differentiable — including to arbitrary order by nesting.
Related
- Experimenting with Small-Buffer Optimization — More template tricks for performance in C++.
- Either vs Error-Codes — Compiler optimisations that eliminate abstraction overhead in C++.