Part I · Foundations Week 1 Published

Linear ODEs as state-space systems

Continuous-time linear systems, the matrix exponential as fundamental solution, the damped oscillator as canonical example, and the eigenvalue classification of phase portraits.

Linear ODEs as state-space systems

1.1 What is a state-space system?

A linear state-space system is the pair of equations

ddth(t)=Ah(t)+Bu(t),y(t)=Ch(t)+Du(t),\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t), \qquad y(t) = \outputmat \statevec(t) + \feedmat u(t),

where h(t)RN\statevec(t) \in \R^N is the state at time t0t \ge 0, u(t)RDu(t) \in \R^D is the input signal, y(t)RMy(t) \in \R^M is the output, and the four matrices (A,B,C,D)(\statemat, \inputmat, \outputmat, \feedmat) are real-valued with shapes N×NN \times N, N×DN \times D, M×NM \times N, M×DM \times D. The first equation is the state equation — a first-order linear ODE describing how the state evolves. The second is the output equation — a static linear readout. The integer NN is the state dimension; it controls how much history the model can carry forward.

The pair is “linear” because both equations are linear in h\statevec and uu separately. It is “state-space” because h(t)\statevec(t) summarizes everything the system needs from the past in order to predict the future given future inputs — a property called the Markov property of state. Given the present state and the input from now onward, the system’s entire future is determined.

A few conventions before going further. We treat h(t)\statevec(t) as a column vector and apply A\statemat on the left: Ah\statemat \statevec, not hA\statevec \statemat. We absorb the feedthrough term Du(t)\feedmat u(t) into the output equation only — it appears nowhere in the state dynamics — and we will often set D=0\feedmat = 0 in examples to keep notation light. When the input is absent (u0u \equiv 0), the state equation reduces to

ddth(t)=Ah(t),\ddt \statevec(t) = \statemat \statevec(t),

the homogeneous linear ODE, whose solution is the focus of §1.2.

Why this object

Three reasons the state-space form is the right starting point for a book on sequence-model architectures:

  1. It separates dynamics from input. Whatever the input signal looks like, the dynamics matrix A\statemat alone determines the system’s long-run behavior — its stability, oscillation frequencies, and decay rates. Most of the analysis in Chapter 2 looks only at A\statemat.
  2. It generalizes immediately. Replacing the scalar derivative ddt\ddt with a finite difference yields a discrete recurrence (Chapter 4); replacing R\R with C\C admits oscillatory modes (Chapter 6, Chapter 10); replacing the constant A\statemat with an input-dependent A(ut)\statemat(u_t) yields the selective-SSM family (Chapter 9). Each generalization preserves the state-space skeleton.
  3. It’s the form in which continuous-time models are discretized. Every SSM derivation in the literature starts from a continuous state equation and applies a numerical integration scheme to get a discrete recurrence. If you don’t know the continuous object, the discrete one is opaque.

1.2 The matrix exponential and fundamental solutions

The homogeneous ODE ddth=Ah\ddt \statevec = \statemat \statevec with initial condition h(0)=h0\statevec(0) = \statevec_0 has a unique solution for every h0RN\statevec_0 \in \R^N — and that solution is

h(t)=eAth0,\statevec(t) = e^{\statemat t} \statevec_0,

where eAte^{\statemat t} is the matrix exponential of At\statemat t, defined by the same series as the scalar exponential:

.

For any square matrix MRN×NM \in \R^{N \times N}, the matrix exponential is

eM:=k=0Mkk!=I+M+12M2+16M3+e^M := \sum_{k=0}^{\infty} \frac{M^k}{k!} = I + M + \tfrac{1}{2} M^2 + \tfrac{1}{6} M^3 + \cdots

The series converges absolutely for every MM (the entrywise 1\ell^1-norm of the partial sums is bounded by eMe^{\norm{M}}), so eMe^M is always well-defined.

The fundamental solution property — that teAth0t \mapsto e^{\statemat t} \statevec_0 solves the homogeneous ODE — is verified by differentiating the series term-by-term and using ddt(At)kk!=Aktk1(k1)!\frac{d}{dt} \frac{(\statemat t)^k}{k!} = \frac{\statemat^k t^{k-1}}{(k-1)!}. The resulting derivative is AeAth0\statemat \cdot e^{\statemat t} \statevec_0, which matches Ah(t)\statemat \statevec(t). Existence and uniqueness for the inhomogeneous case (u≢0u \not\equiv 0) follow the variation of parameters formula

h(t)=eAth0+0teA(ts)Bu(s)ds,\statevec(t) = e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat (t-s)} \inputmat u(s) \, ds,

an integral that recurs (in discretized form) as the convolution kernel of every LTI SSM in Chapter 8.

Eigenvalue structure determines everything

The matrix exponential’s behavior is governed entirely by the eigenvalues of A\statemat. If A\statemat is diagonalizable, A=VΛV1\statemat = V \Lambda V^{-1} with Λ=diag(λ1,,λN)\Lambda = \diag(\lambda_1, \ldots, \lambda_N), then

eAt=VeΛtV1=Vdiag(eλ1t,,eλNt)V1.e^{\statemat t} = V \, e^{\Lambda t} \, V^{-1} = V \, \diag(e^{\lambda_1 t}, \ldots, e^{\lambda_N t}) \, V^{-1}.

So eAte^{\statemat t} acts on the eigenbasis as independent scalar exponentials. The real parts of λi\lambda_i control decay or growth: (λi)<0\Re(\lambda_i) < 0 means eλit0e^{\lambda_i t} \to 0; (λi)>0\Re(\lambda_i) > 0 means blow-up. The imaginary parts control oscillation frequency. Three regimes recur throughout the book:

  1. All eigenvalues with negative real part — the system is asymptotically stable: every trajectory h(t)0\statevec(t) \to 0 as tt \to \infty. This is the regime in which a recurrence “forgets” its initial state, and it is the design target for the A\statemat matrices of S4, Mamba, and friends.
  2. At least one eigenvalue with positive real part — the system is unstable: some initial conditions blow up. A trained SSM whose A\statemat drifts into this regime is the dominant numerical failure mode (Chapter 5).
  3. Purely imaginary eigenvalues — the system is marginally stable and oscillates without decay. These are the energy-preserving modes of Hamiltonian systems and are the natural home of symplectic discretization (Chapter 6).

When A\statemat is not diagonalizable — when it has repeated eigenvalues with deficient eigenspaces — the matrix exponential picks up polynomial-in-tt factors on top of the eλite^{\lambda_i t} terms. The Jordan normal form (Chapter 3, §3.1) gives the precise statement. In practice almost every A\statemat arising from random initialization or from physical systems is diagonalizable, and the polynomial-factor case shows up only at codimension-one boundaries in parameter space.

1.3 The damped harmonic oscillator

The canonical small example. A unit mass on a spring with stiffness k>0k > 0 and damping coefficient c>0c > 0, displacement q(t)q(t), evolves according to Newton’s second law,

q¨(t)+cq˙(t)+kq(t)=0.\ddot q(t) + c \, \dot q(t) + k \, q(t) = 0.

This is a second-order scalar ODE, not first-order, so it does not yet fit the state-space template. We lift it by treating velocity as an extra state coordinate: let h(t)=(q(t),q˙(t))R2\statevec(t) = (q(t), \dot q(t))^\top \in \R^2. Then ddth=Ah\ddt \statevec = \statemat \statevec with

A=(01kc).\statemat = \begin{pmatrix} 0 & 1 \\ -k & -c \end{pmatrix}.

This lifting trick — introduce auxiliary states to reduce ODE order to one — is the same procedure by which every higher-order ODE becomes a first-order state-space system, and it foreshadows how implicit higher-order schemes (Chapter 6) work internally.

The eigenvalues of A\statemat are the roots of the characteristic polynomial λ2+cλ+k=0\lambda^2 + c \lambda + k = 0:

λ±=c2±12c24k.\lambda_\pm = -\frac{c}{2} \pm \frac{1}{2}\sqrt{c^2 - 4k}.

Three sub-cases follow from the sign of the discriminant c24kc^2 - 4k:

  • Overdamped (c2>4kc^2 > 4k): two real negative eigenvalues. Both modes decay monotonically; the displacement approaches zero without oscillation.
  • Critically damped (c2=4kc^2 = 4k): a repeated negative eigenvalue λ=c/2\lambda = -c/2. The matrix A\statemat is in this case not diagonalizable, and the solution contains a teλtt \cdot e^{\lambda t} term.
  • Underdamped (c2<4kc^2 < 4k): a complex-conjugate eigenpair λ±=c/2±iω\lambda_\pm = -c/2 \pm i\omega, with ω=124kc2\omega = \tfrac{1}{2}\sqrt{4k - c^2}. Trajectories spiral toward the origin: oscillation at angular frequency ω\omega with envelope decaying as ect/2e^{-ct/2}.

The total energy E(t)=12(q˙2+kq2)E(t) = \tfrac{1}{2}(\dot q^2 + k q^2) is a useful diagnostic. Its time derivative is E˙=cq˙20\dot E = -c \dot q^2 \le 0, i.e. energy decreases monotonically whenever q˙0\dot q \ne 0. Both stationary and oscillating trajectories satisfy this. The companion damped_oscillator.py simulates the underdamped regime and plots E(t)E(t) alongside the trajectory; the energy curve gives an at-a-glance check that the numerical integration preserves the dissipation structure of the continuous system.

Energy decay of the damped harmonic oscillator over time, plotted on linear and log scales.
Energy E(t) of the underdamped oscillator (k=4, c=0.2) decaying exponentially in time. Left: linear scale, showing the envelope shape. Right: log scale, where the constant slope confirms the exponential decay rate of c/2 in the energy norm. Produced by companions/ch01/jax/damped_oscillator.py.

1.4 Coupled systems and the Jacobian

The damped oscillator is a 2-dimensional toy. Real systems — and trained SSMs — live in much higher dimensions, and they typically arise from coupling many simpler subsystems. The bookkeeping is straightforward once you accept that the state vector h\statevec stacks all subsystem states and A\statemat encodes both intra-subsystem dynamics and inter-subsystem coupling.

A useful larger example is a ring of coupled oscillators: nn identical damped oscillators arranged on a circle, each coupled to its two nearest neighbors by springs of stiffness κ\kappa. The displacements q1,,qnq_1, \ldots, q_n and velocities q˙1,,q˙n\dot q_1, \ldots, \dot q_n stack into a state vector hR2n\statevec \in \R^{2n}. The A\statemat matrix has a 2×22 \times 2 block-diagonal structure for the per-oscillator dynamics, plus an off-diagonal coupling pattern that links each oscillator’s velocity equation to its neighbors’ positions:

q¨i=kqicq˙i+κ(qi12qi+qi+1),i=1,,n,\ddot q_i = -k q_i - c \dot q_i + \kappa (q_{i-1} - 2 q_i + q_{i+1}), \qquad i = 1, \ldots, n,

with indices taken mod nn. The bracketed term qi12qi+qi+1q_{i-1} - 2 q_i + q_{i+1} is the discrete Laplacian of qq around the ring; it’s the same discretization of 2/x2\partial^2/\partial x^2 you would use in a finite-difference method for the wave equation.

The 2n×2n2n \times 2n Jacobian matrix J\jacobian — which for a linear system is just A\statemat — has eigenvalues that are easy to compute analytically because of the ring’s circulant structure. They come in complex-conjugate pairs

λ±(m)=c2±ik+2κ(1cos(2πm/n))c2/4,m=0,1,,n1,\lambda_\pm^{(m)} = -\frac{c}{2} \pm i \sqrt{k + 2\kappa(1 - \cos(2\pi m/n)) - c^2/4}, \qquad m = 0, 1, \ldots, n-1,

(for the underdamped regime). Each mm indexes a standing wave mode on the ring with wavenumber 2πm/n2\pi m / n; the m=0m=0 mode is uniform translation (no spatial structure), m=n/2m = n/2 (when nn is even) is the maximally-oscillating mode where adjacent oscillators move in antiphase. The companion coupled_oscillators.py constructs A\statemat for n=8n = 8 and visualizes the eigenvalues in the complex plane.

Eigenvalues of the coupled-oscillator Jacobian in the complex plane, showing two arcs corresponding to the underdamped modes.
Eigenvalues of the ring-of-oscillators Jacobian for n=8, k=4, c=0.2, κ=1. All 16 eigenvalues sit in the left half-plane (asymptotically stable) and arrange into two arcs corresponding to the two underdamped families. The arc curvature reflects the dispersion relation ω(m) of the spatial Laplacian. Produced by companions/ch01/jax/coupled_oscillators.py.

The key qualitative lesson: once you have the right state-space lift, even a system with rich spatial structure reduces to “compute eigenvalues of A\statemat, classify by real and imaginary parts.” The eigenvalue distribution of a trained SSM’s A\statemat — pre-training, mid-training, and converged — is one of the most informative diagnostics you can compute (Chapter 2, §2.2; Chapter 5).

The Jacobian beyond linear systems

For a nonlinear system ddth=f(h)\ddt \statevec = f(\statevec), the Jacobian J(h)=f/h\jacobian(\statevec) = \partial f / \partial \statevec is the matrix of partial derivatives at h\statevec. The linearization ddtδh=J(h)δh\ddt \delta\statevec = \jacobian(\statevec^*) \delta\statevec around a fixed point h\statevec^* (where f(h)=0f(\statevec^*) = 0) describes how small perturbations δh\delta\statevec evolve. The eigenvalues of J(h)\jacobian(\statevec^*) then determine local stability of h\statevec^* by the same classification as in §1.2.

Trained recurrent networks — including SSMs in inference mode — are formally nonlinear (the state update is state = f(state, input) where f includes the input-dependent matrices). The linearization-around-a-state gives a local state-space system at each time step, and tracking how that local A\statemat varies along a trajectory is the foundation of the Lyapunov-exponent analysis in Chapter 2.

1.5 Phase portraits and the eigenvalue classification

A phase portrait is the geometric picture of all trajectories of an autonomous system in state space, drawn as oriented curves. For 2-dimensional linear systems ddth=Ah\ddt \statevec = \statemat \statevec on R2\R^2, the topology of the portrait is determined entirely by the trace tr(A)\tr(\statemat) and the determinant det(A)\det(\statemat) — equivalently, by the eigenvalues — and the standard classification gives six qualitative behaviors:

Region in (tr,det)(\tr, \det)-planeEigenvaluesPhase portrait
det>0\det > 0, tr24det>0\tr^2 - 4\det > 0, tr<0\tr < 0Real negativeStable node
det>0\det > 0, tr24det>0\tr^2 - 4\det > 0, tr>0\tr > 0Real positiveUnstable node
det>0\det > 0, tr24det<0\tr^2 - 4\det < 0, tr<0\tr < 0Complex conjugate, <0\Re < 0Stable spiral
det>0\det > 0, tr24det<0\tr^2 - 4\det < 0, tr>0\tr > 0Complex conjugate, >0\Re > 0Unstable spiral
det>0\det > 0, tr=0\tr = 0Pure imaginaryCenter (closed orbits)
det<0\det < 0Real, opposite signsSaddle

The damped oscillator’s three sub-cases (over/critical/under-damped) realize three of these: stable node, stable degenerate node, stable spiral. The undamped limit c0c \to 0 slides the eigenvalues onto the imaginary axis, turning the spiral into a center; this is the Hamiltonian limit where energy is conserved exactly.

For higher-dimensional systems the topology is richer (think: 3-D systems can have spirals around line-shaped invariant manifolds), but the building blocks are the same: each pair of complex-conjugate eigenvalues contributes a 2-D spiral subspace, each real eigenvalue contributes a 1-D direction of decay or growth. This decomposition by invariant subspaces — equivalently, by the Jordan blocks of A\statemat — is the geometric content of the matrix exponential’s eigenvalue structure (§1.2).

1.6 A preview of frequency response

The variation-of-parameters formula

h(t)=eAth0+0teA(ts)Bu(s)ds\statevec(t) = e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat (t-s)} \inputmat u(s) \, ds

expresses the state as a convolution of the input with the kernel K(t):=eAtBK(t) := e^{\statemat t} \inputmat. The output y(t)=Ch(t)y(t) = \outputmat \statevec(t) (taking D=0\feedmat = 0) is therefore

y(t)=CeAth0+0tCeA(ts)Bu(s)ds.y(t) = \outputmat e^{\statemat t} \statevec_0 + \int_0^t \outputmat e^{\statemat (t-s)} \inputmat \, u(s) \, ds.

The function h(t):=CeAtBh(t) := \outputmat e^{\statemat t} \inputmat for t0t \ge 0 (and zero otherwise) is the impulse response of the system: the output you would observe if the input were a Dirac delta u(t)=δ(t)u(t) = \delta(t).

Taking the Laplace transform turns convolution into multiplication. With y^(s):=0y(t)estdt\widehat y(s) := \int_0^\infty y(t) e^{-st} dt and similarly for u^\widehat u, you get

y^(s)=H(s)u^(s)+(initial-condition terms),\widehat y(s) = H(s) \, \widehat u(s) + \text{(initial-condition terms)},

where H(s):=C(sIA)1BH(s) := \outputmat (s I - \statemat)^{-1} \inputmat is the transfer function. The eigenvalues of A\statemat — already shown to govern eAte^{\statemat t} — appear as the poles of H(s)H(s), since (sIA)1(sI - \statemat)^{-1} blows up exactly when ss equals an eigenvalue of A\statemat. The poles’ location in the complex ss-plane mirrors the eigenvalue classification of §1.5: left-half-plane poles ⇔ stable system; right-half-plane poles ⇔ unstable; imaginary-axis poles ⇔ marginally stable / oscillatory.

This connection is the bridge to Chapter 8, where you’ll see that the S4 layer’s convolutional view is literally a numerical evaluation of the impulse response h(t)h(t) at uniformly spaced time points, and to Chapter 11, where the spectral structure of long convolution kernels (Hyena, RetNet) is analyzed via the same transfer-function framing.

1.7 What’s next

The next two chapters develop the analytical machinery that this chapter has only sketched. Chapter 2 makes the eigenvalue-stability story rigorous (Lyapunov, A-stability, BIBO) and adds the QR-based computational method for tracking eigenvalue trajectories during training. Chapter 3 introduces the structured-matrix vocabulary (Toeplitz, semiseparable, Vandermonde, Cauchy) needed to discuss the kernel constructions of S4 and friends. Chapters 4–6 then take the continuous state-space system here and discretize it — first crudely (ZOH, bilinear), then with full numerical-analysis machinery (Butcher tableau, A-stability), then via implicit and structure-preserving methods (Gauss–Legendre, symplectic integrators), which is where the active research pilot’s anchor (the C1 pilot) lives.

You can also profitably jump ahead: Chapter 8’s LTI SSM presentation reuses everything in this chapter and is self-contained for readers who want to see the SSM application before working through the full math foundation.

1.8 Exercises

Seven problems mixing computation and theory. Short/numerical exercises (1–4) have inline collapsible solutions; long/proof exercises (5–7) have full worked solutions in §1.9.

Exercise 1.1 (computation)

Compute the matrix exponential of A=(0110)\statemat = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} by hand, term-by-term, and verify it equals the rotation matrix R(t)=(costsintsintcost)R(t) = \begin{pmatrix} \cos t & \sin t \\ -\sin t & \cos t \end{pmatrix}.

Solution

Compute powers: A2=(1001)=I\statemat^2 = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix} = -I, so A3=A\statemat^3 = -\statemat, A4=I\statemat^4 = I, and the pattern repeats with period 4. Splitting the series by parity:

eAt=k(At)kk!=Ik(1)kt2k(2k)!+Ak(1)kt2k+1(2k+1)!=Icost+Asint,e^{\statemat t} = \sum_{k} \frac{(\statemat t)^k}{k!} = I \sum_{k} \frac{(-1)^k t^{2k}}{(2k)!} + \statemat \sum_k \frac{(-1)^k t^{2k+1}}{(2k+1)!} = I \cos t + \statemat \sin t,

which expands to (costsintsintcost)=R(t)\begin{pmatrix} \cos t & \sin t \\ -\sin t & \cos t \end{pmatrix} = R(t). ∎

Exercise 1.2 (computation)

For the damped oscillator q¨+cq˙+kq=0\ddot q + c\dot q + k q = 0 with k=1k = 1 and c=0.5c = 0.5, classify the damping regime and find the eigenvalues of A\statemat.

Solution

Discriminant: c24k=0.254=3.75<0c^2 - 4k = 0.25 - 4 = -3.75 < 0, so the system is underdamped. Eigenvalues: λ±=0.25±i123.750.25±0.968i\lambda_\pm = -0.25 \pm i \tfrac{1}{2}\sqrt{3.75} \approx -0.25 \pm 0.968 i. The trajectory spirals toward the origin with envelope decay rate 0.250.25 and angular frequency 0.968\approx 0.968.

Exercise 1.3 (computation)

Lift the third-order ODE q+2q+q=u(t)q''' + 2 q'' + q = u(t) into a 3-dimensional state-space system ddth=Ah+Bu\ddt \statevec = \statemat \statevec + \inputmat u. Write A\statemat and B\inputmat explicitly.

Solution

Let h=(q,q,q)\statevec = (q, q', q'')^\top. Then h˙=(q,q,q)\dot \statevec = (q', q'', q''')^\top, and from the ODE q=2qq+uq''' = -2 q'' - q + u, so

A=(010001102),B=(001).\statemat = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & 0 & -2 \end{pmatrix}, \qquad \inputmat = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.

The bottom row encodes the ODE’s coefficient signature; the upper-triangular ones implement the bookkeeping qq,qqq' \mapsto q', q'' \mapsto q''. This is the companion form of the ODE.

Exercise 1.4 (computation)

For the ring of n=4n = 4 identical damped oscillators with k=1k = 1, c=0c = 0, κ=0.5\kappa = 0.5 (no damping, symmetric coupling), use the formula in §1.4 to find the eigenvalues of A\statemat analytically. Confirm they lie on the imaginary axis (since c=0c = 0) and verify by running companions/ch01/jax/coupled_oscillators.py with n=4n = 4.

Solution

With c=0c = 0, λ±(m)=±ik+2κ(1cos(2πm/n))\lambda_\pm^{(m)} = \pm i\sqrt{k + 2\kappa(1 - \cos(2\pi m/n))} for m=0,1,2,3m = 0, 1, 2, 3. Numerically:

  • m=0m = 0: λ±=±i1=±i\lambda_\pm = \pm i\sqrt{1} = \pm i
  • m=1m = 1: λ±=±i1+20.51=±i2\lambda_\pm = \pm i\sqrt{1 + 2 \cdot 0.5 \cdot 1} = \pm i\sqrt{2}
  • m=2m = 2: λ±=±i1+20.52=±i3\lambda_\pm = \pm i\sqrt{1 + 2 \cdot 0.5 \cdot 2} = \pm i\sqrt{3}
  • m=3m = 3: same as m=1m = 1 by symmetry ⇒ ±i2\pm i\sqrt{2}

All 8 eigenvalues are pure imaginary (centers), as expected for an undamped conservative system. The numerical verification reproduces these (modulo double-precision noise).

Exercise 1.5 (theory) — solution in §1.9

Prove that the matrix-exponential series kMk/k!\sum_k M^k/k! converges absolutely (entrywise) for every square matrix MM, with bound eMFeMF\norm{e^M}_F \le e^{\norm{M}_F} where F\norm{\cdot}_F is the Frobenius norm.

Exercise 1.6 (theory) — solution in §1.9

Show that if A\statemat is skew-symmetric (A=A\statemat^\top = -\statemat), then eAte^{\statemat t} is orthogonal for every tRt \in \R (that is, (eAt)eAt=I(e^{\statemat t})^\top e^{\statemat t} = I). Use this to give a geometric reason why the rotation matrix of Exercise 1.1 is orthogonal.

Exercise 1.7 (theory) — solution in §1.9

Prove the variation of parameters formula: the unique solution of ddth=Ah+Bu(t)\ddt \statevec = \statemat \statevec + \inputmat u(t) with h(0)=h0\statevec(0) = \statevec_0 is

h(t)=eAth0+0teA(ts)Bu(s)ds.\statevec(t) = e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat (t-s)} \inputmat u(s) \, ds.

1.9 Full solutions to theory exercises

Solution to Exercise 1.5

Let F\norm{\cdot}_F denote the Frobenius norm, which satisfies the sub-multiplicative property ABFAFBF\norm{AB}_F \le \norm{A}_F \norm{B}_F. Then MkFMFk\norm{M^k}_F \le \norm{M}_F^k by induction. Therefore

k=0KMkk!Fk=0KMkFk!k=0KMFkk!eMFas K.\left\| \sum_{k=0}^{K} \frac{M^k}{k!} \right\|_F \le \sum_{k=0}^K \frac{\norm{M^k}_F}{k!} \le \sum_{k=0}^K \frac{\norm{M}_F^k}{k!} \to e^{\norm{M}_F} \quad \text{as } K \to \infty.

The partial sums are entrywise bounded by terms of a convergent scalar series, so each matrix entry is Cauchy in KK and converges. The limit is what we call eMe^M. The bound eMFeMF\norm{e^M}_F \le e^{\norm{M}_F} follows by passing to the limit. ∎

Solution to Exercise 1.6

We have ddteAt=AeAt\ddt e^{\statemat t} = \statemat e^{\statemat t} and ddt(eAt)=(eAt)A\ddt (e^{\statemat t})^\top = (e^{\statemat t})^\top \statemat^\top. Let Q(t):=(eAt)eAtQ(t) := (e^{\statemat t})^\top e^{\statemat t}. Then

ddtQ=(eAt)AeAt+(eAt)AeAt=(eAt)(A+A)eAt=0,\ddt Q = (e^{\statemat t})^\top \statemat^\top e^{\statemat t} + (e^{\statemat t})^\top \statemat e^{\statemat t} = (e^{\statemat t})^\top (\statemat^\top + \statemat) e^{\statemat t} = 0,

since A+A=0\statemat^\top + \statemat = 0 by skew-symmetry. So Q(t)=Q(0)=IQ(t) = Q(0) = I for all tt, i.e. eAte^{\statemat t} is orthogonal.

Geometric interpretation: the rotation matrix R(t)R(t) of Exercise 1.1 was generated by A=(0110)\statemat = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, which is skew-symmetric. So R(t)R(t) is orthogonal — which we already knew, but this derivation pins down why: orthogonality follows structurally from skew-symmetry of the generator, not from any property of the cos/sin functions. ∎

Solution to Exercise 1.7

Define h(t)\statevec(t) by the formula on the right-hand side. We verify (a) the initial condition and (b) the ODE.

Initial condition. At t=0t = 0 the integral vanishes (zero-length interval) and eA0=Ie^{\statemat \cdot 0} = I, so h(0)=h0\statevec(0) = \statevec_0. ✓

ODE. Differentiate using the product rule on the integral (Leibniz’s rule for differentiating under the integral sign):

ddth(t)=ddt[eAth0]+ddt0teA(ts)Bu(s)ds=AeAth0+[eA(tt)Bu(t)+0tteA(ts)Bu(s)ds]=AeAth0+Bu(t)+A0teA(ts)Bu(s)ds=A(eAth0+0teA(ts)Bu(s)ds)+Bu(t)=Ah(t)+Bu(t).\begin{aligned} \ddt \statevec(t) &= \ddt\left[ e^{\statemat t} \statevec_0 \right] + \ddt \int_0^t e^{\statemat (t-s)} \inputmat u(s) ds \\ &= \statemat e^{\statemat t} \statevec_0 + \left[ e^{\statemat(t-t)} \inputmat u(t) + \int_0^t \frac{\partial}{\partial t} e^{\statemat(t-s)} \inputmat u(s) ds \right] \\ &= \statemat e^{\statemat t} \statevec_0 + \inputmat u(t) + \statemat \int_0^t e^{\statemat(t-s)} \inputmat u(s) ds \\ &= \statemat \left( e^{\statemat t} \statevec_0 + \int_0^t e^{\statemat(t-s)} \inputmat u(s) ds \right) + \inputmat u(t) \\ &= \statemat \statevec(t) + \inputmat u(t). \quad \checkmark \end{aligned}

Uniqueness follows from the classical Picard–Lindelöf theorem applied to the right-hand side f(h,t)=Ah+Bu(t)f(\statevec, t) = \statemat \statevec + \inputmat u(t), which is Lipschitz in h\statevec uniformly in tt with constant A\norm{\statemat} (see Hairer–Nørsett–Wanner Hairer et al. (1993) for the standard proof). ∎

1.10 Companion code

Three runnable companions in companions/ch01/jax/:

  • damped_oscillator.py — simulates the underdamped oscillator, plots energy decay (Figure 1.1)
  • coupled_oscillators.py — constructs the ring-Laplacian state matrix, plots eigenvalue spectrum (Figure 1.2)
  • matrix_exponential.py — compares scipy.linalg.expm against truncated series sums; the truncated series catastrophically diverges for matrices with large spectral radius (auxiliary, used by Exercise 1.5)

All companions import their plotting style from companions/_shared/plot_utils.py. To run from the repo root:

PYTHONPATH=. python companions/ch01/jax/damped_oscillator.py
PYTHONPATH=. python companions/ch01/jax/coupled_oscillators.py
PYTHONPATH=. python companions/ch01/jax/matrix_exponential.py

Figures are written to public/figures/ch01/ and referenced from the <Figure> components above.

Part I · Foundations Week 2 Published

Stability theory: Lyapunov, A-stability, BIBO

Three distinct notions of stability — Lyapunov (eigenvalue criterion + QR-based computation), A-stability (regions of absolute stability for ODE integrators), and BIBO (bounded-input bounded-output for LTI systems).

Stability theory: Lyapunov, A-stability, BIBO

2.1 Three notions of stability

For the rest of this chapter, the system under analysis is the LTI state-space model from Chapter 1:

ddth(t)=Ah(t)+Bu(t),y(t)=Ch(t),\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t), \qquad y(t) = \outputmat \statevec(t),

with ARN×N\statemat \in \R^{N \times N}. The three notions:

  1. Lyapunov stability asks: with u0u \equiv 0, do trajectories h(t)=eAth0\statevec(t) = e^{\statemat t} \statevec_0 stay bounded as tt \to \infty? Asymptotically stable if h(t)0\statevec(t) \to 0.
  2. A-stability is a property of a numerical integrator, not of the system. It asks: does the integrator preserve Lyapunov stability for every stable LTI test problem, at every step size Δ>0\stepsize > 0? If yes, the method is A-stable.
  3. BIBO stability asks: for the input–output mapping uyu \mapsto y, does every bounded input (suptu(t)<\sup_t \abs{u(t)} < \infty) produce a bounded output? The system is BIBO-stable if yes.

When A\statemat has all eigenvalues in the open left half-plane (LHP), all three coincide for the LTI case. But:

  • Lyapunov-stable systems can be Lyapunov-stable but not asymptotically stable (centers: pure imaginary eigenvalues, oscillations don’t decay).
  • A-stability is a integrator property; the explicit Euler method is not A-stable, while the implicit Euler method is. The same continuous system therefore may need a stable integrator to preserve its stability.
  • BIBO depends on the transfer function H(s)=C(sIA)1BH(s) = \outputmat (sI - \statemat)^{-1} \inputmat and on the choice of (A,B,C)(\statemat, \inputmat, \outputmat) realization. Two systems with the same A\statemat can have different BIBO properties if their C\outputmat “hides” unstable modes.

The rest of the chapter develops each notion in turn.

2.2 Lyapunov stability: theory

We restrict to the homogeneous case ddth=Ah\ddt \statevec = \statemat \statevec (zero input). The eigenvalue criterion is essentially the content of Chapter 1, §1.2, formalized:

.

Let ARN×N\statemat \in \R^{N \times N} with eigenvalues λ1,,λN\lambda_1, \ldots, \lambda_N. The system ddth=Ah\ddt \statevec = \statemat \statevec is:

  1. Lyapunov-stable (trajectories bounded) iff every λi\lambda_i satisfies (λi)0\Re(\lambda_i) \le 0, and for every λi\lambda_i on the imaginary axis ((λi)=0\Re(\lambda_i) = 0), its algebraic multiplicity equals its geometric multiplicity (no defective Jordan blocks).

  2. Asymptotically stable (trajectories 0\to 0) iff every λi\lambda_i satisfies (λi)<0\Re(\lambda_i) < 0, i.e. all eigenvalues are in the open left half-plane.

  3. Unstable otherwise.

The proof reduces to the matrix-exponential formula. If A=VΛV1\statemat = V \Lambda V^{-1} is diagonalizable, then h(t)=VeΛtV1h0\statevec(t) = V e^{\Lambda t} V^{-1} \statevec_0 and the ii-th eigencomponent decays like e(λi)te^{\Re(\lambda_i) t}. The Jordan-block caveat in case (1) handles the case where a defective imaginary eigenvalue produces polynomial growth in tt even though the real part is zero.

For nonlinear systems, the local-linearization picture from Chapter 1, §1.4 makes the same theorem applicable at each fixed point: the fixed point is asymptotically stable if the Jacobian’s eigenvalues all sit in the LHP. This is the principle that makes “check the eigenvalues” the first move in nearly every SSM stability analysis.

The Lyapunov equation

There’s an alternative characterization of asymptotic stability that doesn’t require computing eigenvalues. The continuous-time Lyapunov equation

AP+PA=Q\statemat^\top P + P \statemat = -Q

(with QQ a given symmetric positive-definite matrix) has a unique symmetric positive-definite solution PP if and only if A\statemat is asymptotically stable. The function V(h)=hPhV(\statevec) = \statevec^\top P \statevec is then a Lyapunov function for the system: V˙=hQh<0\dot V = -\statevec^\top Q \statevec < 0 for all nonzero h\statevec, so VV decreases along trajectories and forces them to the origin.

For SSM analysis, the Lyapunov-equation view is occasionally useful — for example, the controllability and observability Gramians of an LTI realization are Lyapunov solutions Antoulas (2005) — but the eigenvalue criterion is the more direct tool.

2.3 Lyapunov exponents: computation

The eigenvalues of A\statemat work as a stability diagnostic for LTI systems, where A\statemat is constant. For systems where the linearization J(t)\jacobian(t) varies with time (e.g. a trained recurrent network in inference mode, where the per-step Jacobian depends on the input), eigenvalues at any single tt don’t tell the long-run story. The right tool is the Lyapunov exponent, which generalizes the eigenvalue’s real part to time-varying systems.

.

Given a sequence of transition matrices J1,J2,\jacobian_1, \jacobian_2, \ldots (the Jacobians of an iterated map at points along a trajectory), let ΦT:=JTJT1J1\Phi_T := \jacobian_T \jacobian_{T-1} \cdots \jacobian_1. The ii-th Lyapunov exponent is

λi:=limT1Tlogσi(ΦT),\lyapexp_i := \lim_{T \to \infty} \frac{1}{T} \log \sigma_i(\Phi_T),

where σ1(ΦT)σ2(ΦT)\sigma_1(\Phi_T) \ge \sigma_2(\Phi_T) \ge \cdots are the singular values of ΦT\Phi_T. The vector (λ1,,λN)(\lyapexp_1, \ldots, \lyapexp_N) is the Lyapunov spectrum.

Three things make Lyapunov exponents the right object:

  1. They generalize eigenvalues to time-varying systems. For a constant Jacobian JtA\jacobian_t \equiv \statemat, λi=logμi(A)\lyapexp_i = \log\abs{\mu_i(\statemat)} where μi\mu_i are the eigenvalues’ magnitudes (in the discrete case) or (λi)\Re(\lambda_i) (in the continuous case). The connection back to Chapter 1’s eigenvalue classification is exact in the autonomous case.
  2. The maximal Lyapunov exponent λ1\lyapexp_1 measures sensitive dependence. A positive λ1\lyapexp_1 means infinitesimally close initial conditions diverge exponentially (chaos); λ1=0\lyapexp_1 = 0 is the edge-of-chaos regime; λ1<0\lyapexp_1 < 0 means contracting dynamics.
  3. They are robust to coordinate changes. Eigenvalues of the per-step Jacobian depend on the basis you write the state in; Lyapunov exponents do not.

The QR algorithm

Computing σi(ΦT)\sigma_i(\Phi_T) for large TT directly is numerically impossible: even for a well-behaved system with λ1=0.1\lyapexp_1 = 0.1, after T=100T = 100 steps the largest singular value is e1022000e^{10} \approx 22000 and the smallest may be e105×105e^{-10} \approx 5 \times 10^{-5}. The product matrix ΦT\Phi_T becomes either numerical zero or numerical infinity in finite-precision arithmetic, and finite-precision singular values lose all meaning. The fix, due to Benettin et al. Benettin et al. (1980) , is to factor the growth out at every step using QR decomposition.

The QR-based Lyapunov algorithm (the version that has become standard in dynamical-systems software):

Initialize Q_0 = I (identity), running sums L_i = 0 for i = 1..N.
For t = 1, 2, ..., T:
    M_t = J_t @ Q_{t-1}             # one-step forward propagation
    Q_t, R_t = qr(M_t)              # extract orthogonal + upper triangular
    Update L_i += log |R_t[i,i]| for each i
At the end:
    Lyapunov_i = L_i / T

The key insight: by re-orthonormalizing the propagated frame at every step, we factor the exponential growth into the diagonal of the upper-triangular RtR_t rather than letting it accumulate in ΦT\Phi_T. The diagonal entries of RtR_t are well-conditioned numbers near 1, and their logarithms sum nicely to give the Lyapunov exponents.

The companion lyapunov_qr.py implements this algorithm and applies it to the ring of coupled oscillators from Chapter 1, §1.4. The resulting spectrum’s structure — all exponents negative for the damped case, all near zero for the undamped case — matches the eigenvalue analysis exactly, validating the implementation.

Lyapunov spectrum of the ring-of-oscillators system for damped and undamped regimes.
Lyapunov spectrum (sorted exponents λ_1 ≥ λ_2 ≥ ...) computed via the QR algorithm for the ring of n=8 oscillators. Damped case (c=0.2): all 16 exponents are negative, indicating asymptotically stable dynamics — every direction in state space contracts. Undamped case (c=0): all exponents are zero, indicating marginally stable (energy-preserving) dynamics. The agreement with the eigenvalue real-parts of the autonomous Jacobian validates the implementation. Produced by companions/ch02/jax/lyapunov_qr.py.

2.4 A-stability: integrator regions of absolute stability

Switch perspectives. We now have a continuous system ddth=Ah\ddt \statevec = \statemat \statevec and we discretize it with some numerical integrator at step size Δ\stepsize. The integrator approximates the true continuous trajectory by a discrete recurrence

hn+1=R(ΔA)hn,\statevec_{n+1} = R(\stepsize \statemat) \statevec_n,

where R(z)R(z) is the integrator’s stability function, evaluated at the matrix argument ΔA\stepsize \statemat via the matrix-functional-calculus convention. Different integrators give different RR:

  • Forward (explicit) Euler: hn+1=hn+ΔAhn\statevec_{n+1} = \statevec_n + \stepsize \statemat \statevec_n, so RFE(z)=1+zR_{\text{FE}}(z) = 1 + z.
  • Backward (implicit) Euler: hn+1=hn+ΔAhn+1\statevec_{n+1} = \statevec_n + \stepsize \statemat \statevec_{n+1}, giving RBE(z)=1/(1z)R_{\text{BE}}(z) = 1/(1-z).
  • Bilinear (Tustin / trapezoidal): Rbil(z)=(1+z/2)/(1z/2)R_{\text{bil}}(z) = (1 + z/2)/(1 - z/2).
  • Exact (zero-order hold): RZOH(z)=ezR_{\text{ZOH}}(z) = e^z. This is the integrator S4 uses by default.

The discrete recurrence is Lyapunov-stable if and only if every eigenvalue μi\mu_i of R(ΔA)R(\stepsize \statemat) satisfies μi1\abs{\mu_i} \le 1 (the discrete-time analog of the LHP criterion is the unit disk). Since the eigenvalues of R(ΔA)R(\stepsize \statemat) are R(Δλi)R(\stepsize \lambda_i) where λi\lambda_i are the eigenvalues of A\statemat, the question reduces to: for which z=Δλiz = \stepsize \lambda_i in the complex plane is R(z)1\abs{R(z)} \le 1?

.

The region of absolute stability of a numerical integrator with stability function R(z)R(z) is

S:={zC:R(z)1}.S := \{ z \in \C : \abs{R(z)} \le 1 \}.

The integrator is A-stable if SS contains the entire closed left half-plane.

A-stability matters because it means: if the continuous system is asymptotically stable, the integrator preserves that stability at every step size Δ>0\stepsize > 0. Without A-stability, you have to pick Δ\stepsize small enough that Δλi\stepsize \lambda_i lands inside the (bounded) stability region.

The four examples above:

  • Forward Euler: 1+z1\abs{1 + z} \le 1zz in the closed disk of radius 1 around 1-1. Tiny region; not A-stable. For a continuous system with eigenvalue λ=100\lambda = -100 (fast decay), forward Euler requires Δ<2/100\stepsize < 2/100 to avoid blowup.
  • Backward Euler: 1/(1z)1\abs{1/(1-z)} \le 11z1\abs{1-z} \ge 1zz outside the open disk of radius 1 around 11. This contains the entire closed LHP. A-stable.
  • Bilinear (trapezoidal): Rbil(z)1\abs{R_{\text{bil}}(z)} \le 1(z)0\Re(z) \le 0 exactly. The stability region is exactly the closed LHP. A-stable, and the boundary coincides with the imaginary axis — a property called L-stability when combined with R(z)0R(z) \to 0 as (z)\Re(z) \to -\infty, which bilinear does not satisfy (Rbil()=1R_{\text{bil}}(-\infty) = -1).
  • Zero-order hold (matrix exponential): ez=e(z)1\abs{e^z} = e^{\Re(z)} \le 1(z)0\Re(z) \le 0. The stability region is the closed LHP. A-stable and L-stable. This is the gold standard for LTI discretization — and it’s what S4 uses internally.
Stability regions in the complex plane for forward Euler, backward Euler, bilinear, and ZOH integrators.
Stability regions S = { z : |R(z)| ≤ 1 } for four common integrators. Forward Euler's tiny disk around -1 makes it conditionally stable only; backward Euler and bilinear/trapezoidal are A-stable; ZOH (the matrix exponential) is A-stable and L-stable, with stability region exactly the closed left half-plane. The eigenvalue distribution of a learned SSM's A matrix overlaid on these regions tells you immediately which integrators are safe to use. Produced by companions/ch02/jax/stability_regions.py.

Why this matters for SSMs

The S4 family uses ZOH discretization, and is therefore stable at every step size for which the continuous system is stable. Mamba-1 uses ZOH as well. Mamba-3’s exponential-trapezoidal rule (Chapter 10) is a second-order generalization of bilinear that retains A-stability. The C1 research pilot (see Chapter 6) asks whether symplectic integrators — A-stable methods designed to preserve geometric invariants — can do better for complex-state SSMs whose A\statemat has near-imaginary eigenvalues.

When a stability-region analysis fails, it usually fails dramatically: a forward-Euler discretization of a stiff system blows up in a handful of steps. This is exactly the failure mode that careful integrator choice prevents.

For full coverage of A-stability theory and a wider zoo of integrator stability functions, Hairer–Wanner’s stiff-systems volume Hairer & Wanner (1996) is the canonical reference.

2.5 BIBO stability

The third notion looks at the input–output mapping rather than the autonomous dynamics. For the system ddth=Ah+Bu\ddt \statevec = \statemat \statevec + \inputmat u, y=Chy = \outputmat \statevec, the output under a unit-impulse input is the impulse response

h(t)=CeAtBfor t0.h(t) = \outputmat e^{\statemat t} \inputmat \quad \text{for } t \ge 0.

The system is BIBO-stable if its impulse response satisfies 0h(t)dt<\int_0^\infty \norm{h(t)} \, dt < \infty, equivalently if hL1h \in L^1. Under this condition, any bounded input uu produces a bounded output yy — Young’s inequality on convolutions gives the precise bound supty(t)hL1suptu(t)\sup_t \norm{y(t)} \le \norm{h}_{L^1} \cdot \sup_t \abs{u(t)}.

.

A causal LTI system (A,B,C)(\statemat, \inputmat, \outputmat) is BIBO-stable iff all poles of its transfer function H(s):=C(sIA)1BH(s) := \outputmat (sI - \statemat)^{-1} \inputmat lie in the open left half-plane.

Poles of H(s)H(s) are the values of ss where det(sIA)=0\det(sI - \statemat) = 0 — i.e. the eigenvalues of A\statematunless a pole–zero cancellation occurs because C\outputmat doesn’t “see” some eigenmode. This cancellation matters: a system can be internally unstable (some λi\lambda_i in the right half-plane) but BIBO-stable if those unstable modes are unobservable from C\outputmat and unreachable from B\inputmat. The Kalman canonical-decomposition theorem makes this precise; for our purposes, the eigenvalues of A\statemat in the LHP are sufficient for both internal and BIBO stability.

For an SSM viewed as a sequence-to-sequence map, BIBO is the natural input–output stability concept: bounded inputs (typical of token embeddings) should produce bounded outputs (typical of pre-softmax logits). When training an SSM produces an A\statemat whose eigenvalues drift into the right half-plane, both internal (Lyapunov) and input–output (BIBO) stability fail simultaneously, and the model’s outputs blow up. Tracking the eigenvalues of A\statemat during training is therefore a single diagnostic for both failure modes.

2.6 Connection to Lyapunov functions

The Lyapunov-function method gives a non-spectral way to prove stability. For nonlinear systems where eigenvalues don’t directly apply, finding a function V:RNRV: \R^N \to \R with V(0)=0V(0) = 0, V(h)>0V(\statevec) > 0 for h0\statevec \ne 0, and V˙(h):=V(h)f(h)<0\dot V(\statevec) := \nabla V(\statevec) \cdot f(\statevec) < 0 for all h0\statevec \ne 0 proves asymptotic stability of h=0\statevec^* = 0.

For our LTI case f(h)=Ahf(\statevec) = \statemat \statevec, the Lyapunov equation AP+PA=Q\statemat^\top P + P \statemat = -Q (with QQ p.d.) gives V(h)=hPhV(\statevec) = \statevec^\top P \statevec as a quadratic Lyapunov function — and its existence with PP p.d. is equivalent to asymptotic stability (as noted in §2.2). The interest in the function-method picture for our context is mainly conceptual: it explains why “energy” arguments work to prove stability of mechanical systems (the energy is the Lyapunov function), and it generalizes immediately to nonlinear systems where the eigenvalue criterion only gives local information at fixed points.

This connects forward to Chapter 6, where the question of which discretization preserves a Lyapunov function (or, more strongly, an energy invariant) leads naturally into symplectic methods.

2.7 What’s next

Chapter 3 develops the structured-linear-algebra vocabulary (SVD, Jordan form, condition number, Toeplitz/Vandermonde/Cauchy/semiseparable structure) needed for the SSM kernel constructions in Chapters 7–9. Chapter 4 takes the discretization story started in §2.4 and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 6 picks up the A-stability theme and pushes into implicit and structure-preserving integrators — symplectic schemes, geometric integration, and the C1 pilot’s home territory.

2.8 Exercises

Six problems. Inline-collapsible solutions for the shorter ones; full solutions for the longer theory problems in §2.9.

Exercise 2.1 (computation)

Use the eigenvalue criterion (Theorem 2.1) to classify the stability of ddth=Ah\ddt \statevec = \statemat \statevec for A=(1411)\statemat = \begin{pmatrix} -1 & 4 \\ -1 & -1 \end{pmatrix}.

Solution

Characteristic polynomial: det(λIA)=(λ+1)2+4=λ2+2λ+5\det(\lambda I - \statemat) = (\lambda+1)^2 + 4 = \lambda^2 + 2\lambda + 5. Roots: λ=1±2i\lambda = -1 \pm 2i. Both have (λ)=1<0\Re(\lambda) = -1 < 0, so the system is asymptotically stable. The non-zero imaginary part means trajectories spiral toward the origin (stable spiral).

Exercise 2.2 (computation)

For the matrix A=(0110)\statemat = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} of Exercise 1.1, find the eigenvalues and classify the Lyapunov stability. Is the system asymptotically stable?

Solution

Eigenvalues: λ=±i\lambda = \pm i. Both have (λ)=0\Re(\lambda) = 0, so the system is Lyapunov-stable but not asymptotically stable — trajectories are bounded (centers / closed orbits) but do not decay to zero. This is the undamped harmonic oscillator’s marginal case.

Exercise 2.3 (computation)

The forward Euler method has stability function R(z)=1+zR(z) = 1 + z. For the continuous test problem x˙=λx\dot x = \lambda x with λ=10\lambda = -10 (a stiff system with fast time scale), find the maximum step size Δ\stepsize for which forward Euler is stable.

Solution

Stability condition: 1+Δλ1\abs{1 + \stepsize \lambda} \le 1 with λ=10\lambda = -10. Algebra: 110Δ1\abs{1 - 10\stepsize} \le 1010Δ20 \le 10\stepsize \le 2Δ0.2\stepsize \le 0.2. So forward Euler requires Δ0.2\stepsize \le 0.2. Any Δ>0.2\stepsize > 0.2 produces exponentially growing iterates even though the continuous system decays. This is the classic motivation for implicit methods on stiff problems.

Exercise 2.4 (computation)

Show that the bilinear method Rbil(z)=(1+z/2)/(1z/2)R_{\text{bil}}(z) = (1 + z/2)/(1 - z/2) maps the closed left half-plane {(z)0}\{\Re(z) \le 0\} onto the closed unit disk {w1}\{\abs{w} \le 1\}. (This is the Möbius map picture of the bilinear transform.)

Solution

Let z=a+biz = a + bi with a0a \le 0. Compute:

Rbil(z)2=(1+a/2)2+(b/2)2(1a/2)2+(b/2)2.\abs{R_{\text{bil}}(z)}^2 = \frac{(1+a/2)^2 + (b/2)^2}{(1-a/2)^2 + (b/2)^2}.

The denominator minus the numerator equals (1a/2)2(1+a/2)2=2a0(1-a/2)^2 - (1+a/2)^2 = -2a \ge 0 (since a0a \le 0). So numerator \le denominator and Rbil(z)1\abs{R_{\text{bil}}(z)} \le 1. Equality iff a=0a = 0 (purely imaginary zz), so the imaginary axis maps to the unit circle and the open LHP maps to the open unit disk. ∎

Exercise 2.5 (theory) — solution in §2.9

State and prove the Lyapunov equation characterization of asymptotic stability: A\statemat is asymptotically stable iff for every symmetric positive-definite QQ, the equation AP+PA=Q\statemat^\top P + P \statemat = -Q has a unique symmetric positive-definite solution PP.

Exercise 2.6 (theory) — solution in §2.9

Prove the BIBO criterion of Theorem 2.3: a causal LTI system is BIBO-stable iff all poles of its transfer function lie in the open left half-plane. (Hint: Use the Laplace-transform / convolution duality, and exploit the representation of the impulse response as a sum of exponentials.)

2.9 Full solutions to theory exercises

Solution to Exercise 2.5

Forward direction (Lyapunov equation ⇒ asymptotic stability): Suppose PP is symmetric positive-definite and satisfies AP+PA=Q\statemat^\top P + P \statemat = -Q for some symmetric positive-definite QQ. Define V(h)=hPhV(\statevec) = \statevec^\top P \statevec. Then V>0V > 0 for h0\statevec \ne 0 (since P0P \succ 0), and along trajectories ddth=Ah\ddt \statevec = \statemat \statevec:

V˙=h˙Ph+hPh˙=hAPh+hPAh=h(Q)h<0.\dot V = \dot \statevec^\top P \statevec + \statevec^\top P \dot \statevec = \statevec^\top \statemat^\top P \statevec + \statevec^\top P \statemat \statevec = \statevec^\top (-Q) \statevec < 0.

So VV is a strict Lyapunov function and trajectories satisfy V(h(t))V(h(0))eαtV(\statevec(t)) \le V(\statevec(0)) e^{-\alpha t} for α>0\alpha > 0 determined by the smallest eigenvalue of QQ over the largest of PP. Hence h(t)0\statevec(t) \to 0 and the origin is asymptotically stable.

Reverse direction (asymptotic stability ⇒ Lyapunov equation has p.d. solution): Suppose A\statemat is asymptotically stable. Define

P:=0eAtQeAtdt.P := \int_0^\infty e^{\statemat^\top t} Q \, e^{\statemat t} \, dt.

The integral converges absolutely because eAte^{\statemat t} decays exponentially. PP is symmetric (transpose preserves the integrand structure) and positive-definite (since Q0Q \succ 0 and eAte^{\statemat t} is invertible). Direct computation:

AP+PA=0(AeAtQeAt+eAtQeAtA)dt=0ddt(eAtQeAt)dt=[eAtQeAt]0=Q,\statemat^\top P + P \statemat = \int_0^\infty \left( \statemat^\top e^{\statemat^\top t} Q e^{\statemat t} + e^{\statemat^\top t} Q e^{\statemat t} \statemat \right) dt = \int_0^\infty \ddt \left( e^{\statemat^\top t} Q e^{\statemat t} \right) dt = [e^{\statemat^\top t} Q e^{\statemat t}]_0^\infty = -Q,

using the decay of eAte^{\statemat t} at \infty. Uniqueness of PP follows from the linearity of the Lyapunov operator L(P):=AP+PA\mathcal{L}(P) := \statemat^\top P + P \statemat and the fact that its kernel is trivial when A\statemat has no purely imaginary eigenvalues. ∎

Solution to Exercise 2.6

(⇒) BIBO-stable implies poles in open LHP. Suppose the system is BIBO-stable, i.e. hL1h \in L^1. The Laplace transform H(s)=0h(t)estdtH(s) = \int_0^\infty h(t) e^{-st} dt then exists and is analytic for (s)0\Re(s) \ge 0 (since the integrand is bounded by h(t)\abs{h(t)} which is integrable, the integral converges uniformly). HH is rational (the matrix-inverse formula gives H(s)=C(sIA)1BH(s) = \outputmat (sI - \statemat)^{-1} \inputmat, a ratio of polynomials with denominator det(sIA)\det(sI - \statemat)). A rational function analytic on the closed right half-plane has no poles there — so all poles of H(s)H(s) lie in the open LHP.

(⇐) Poles in open LHP implies BIBO-stable. Suppose all poles of H(s)H(s) lie in the open LHP. Partial-fraction decomposition gives H(s)=kck(spk)mkH(s) = \sum_k \frac{c_k}{(s - p_k)^{m_k}} where pkp_k are the poles with multiplicity mkm_k, and (pk)<0\Re(p_k) < 0 for all kk. The inverse Laplace transform of each term is cktmk1(mk1)!epkt\frac{c_k t^{m_k - 1}}{(m_k - 1)!} e^{p_k t}, which is L1L^1 on [0,)[0, \infty) because the exponential decay beats any polynomial growth in tt. Sum of L1L^1 functions is L1L^1, so hL1h \in L^1 and BIBO holds.

(For the multi-input multi-output case, apply the same argument entrywise to H(s)CM×DH(s) \in \C^{M \times D}.) ∎

2.10 Companion code

Two runnable companions in companions/ch02/jax/:

  • lyapunov_qr.py — implements the QR-based Lyapunov-exponent algorithm and applies it to the ring-of-oscillators system from Chapter 1
  • stability_regions.py — plots the regions of absolute stability for forward Euler, backward Euler, bilinear (trapezoidal), and ZOH in the complex plane

To run from the repo root:

PYTHONPATH=. python companions/ch02/jax/lyapunov_qr.py
PYTHONPATH=. python companions/ch02/jax/stability_regions.py

Figures land in public/figures/ch02/ (referenced from §2.3 and §2.4 above).

Part I · Foundations Week 3 Published

Linear algebra for sequence models: structured matrices and conditioning

Eigenvalue and SVD decompositions, condition number, the four structured-matrix families (Toeplitz, Vandermonde, Cauchy, semiseparable), low-rank updates, and a Krylov-subspace primer — the linear-algebra vocabulary later chapters assume.

Linear algebra for sequence models: structured matrices and conditioning

3.1 Eigenvalue decomposition and Jordan normal form

Chapter 1’s matrix exponential built directly on the spectral structure of A\statemat. The fundamental theorem is:

.

For every matrix ACN×N\statemat \in \C^{N \times N}, there exist matrices VV (invertible) and JJ (block-diagonal with Jordan blocks on the diagonal) such that

A=VJV1.\statemat = V J V^{-1}.

JJ is the Jordan normal form of A\statemat; the diagonal entries of JJ are the eigenvalues of A\statemat (counted with algebraic multiplicity); the size of each Jordan block equals the difference between the algebraic and geometric multiplicities. If every eigenvalue has equal algebraic and geometric multiplicity, all Jordan blocks have size 1 and JJ is diagonal — A\statemat is diagonalizable.

A Jordan block of size kk for eigenvalue λ\lambda looks like

Jk(λ)=(λ1λ1λ1λ)Ck×k,J_k(\lambda) = \begin{pmatrix} \lambda & 1 & & \\ & \lambda & 1 & \\ & & \ddots & \ddots \\ & & & \lambda & 1 \\ & & & & \lambda \end{pmatrix} \in \C^{k \times k},

i.e. eigenvalue on the diagonal, ones on the superdiagonal, zeros elsewhere. The matrix exponential of a Jordan block is

eJk(λ)t=eλt(1tt2/2!tk1/(k1)!1ttk2/(k2)!1t1),e^{J_k(\lambda) t} = e^{\lambda t} \begin{pmatrix} 1 & t & t^2/2! & \cdots & t^{k-1}/(k-1)! \\ & 1 & t & \cdots & t^{k-2}/(k-2)! \\ & & \ddots & \ddots & \vdots \\ & & & 1 & t \\ & & & & 1 \end{pmatrix},

which is the source of the polynomial-in-tt factors mentioned in Chapter 1, §1.2 — they appear exactly when there are Jordan blocks of size >1> 1.

For SSM analysis, the diagonalizable case dominates. Almost every A\statemat arising from random initialization or from training is diagonalizable; the non-diagonalizable case is a codimension-one subset of parameter space and shows up only at carefully-tuned boundaries. In practice the assumption ”A\statemat is diagonalizable” is essentially always valid; the Jordan form is the worst-case fallback.

3.2 Singular value decomposition

The eigenvalue decomposition requires A\statemat to be square; for general (rectangular or possibly non-diagonalizable) matrices, the right tool is the singular value decomposition.

.

For every matrix ARm×n\statemat \in \R^{m \times n}, there exist orthogonal matrices URm×mU \in \R^{m \times m}, VRn×nV \in \R^{n \times n}, and a diagonal-with-zeros matrix ΣRm×n\Sigma \in \R^{m \times n} with non-negative entries σ1σ2σmin(m,n)0\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0 on the diagonal, such that

A=UΣV.\statemat = U \Sigma V^\top.

The σi\sigma_i are the singular values of A\statemat; they are uniquely determined. The number of nonzero σi\sigma_i equals rank(A)\rank(\statemat).

The SVD has three properties that make it the workhorse of numerical linear algebra:

  1. It always exists. Unlike the eigenvalue decomposition, no diagonalizability or even squareness assumption is needed.
  2. The singular values give the Frobenius and operator norms. A2=σ1\norm{\statemat}_2 = \sigma_1 (operator norm, equal to the largest singular value); AF=iσi2\norm{\statemat}_F = \sqrt{\sum_i \sigma_i^2}.
  3. It reveals low-rank structure cleanly. Truncating the SVD at rank r<rank(A)r < \rank(\statemat) — keeping only the top rr singular values and corresponding columns of U,VU, V — gives the best rank-rr approximation to A\statemat in both Frobenius and operator norms (the Eckart–Young theorem).

For square A\statemat, the singular values are related to but distinct from the eigenvalues. The relationship σi(A)2=λi(AA)\sigma_i(\statemat)^2 = \lambda_i(\statemat^\top \statemat) (using descending order on both sides) lets you compute singular values via an eigenvalue problem on the Gram matrix AA\statemat^\top \statemat — though in practice the QR-iteration-based SVD algorithm (LAPACK’s gesdd) is more numerically stable.

The SVD shows up explicitly in:

  • Chapter 2’s Lyapunov-exponent computation (singular values of the propagated frame matrix).
  • Chapter 8’s HiPPO matrix analysis (the conditioning of the projection operator is given by its SVD).
  • Chapter 12’s delta-rule lineage (DeltaNet’s state update is a rank-1 SVD correction).

For a textbook treatment of the SVD’s properties and algorithms, Trefethen–Bau Trefethen & Bau (1997) Chapters 4–5 are the standard reference. Golub–Van Loan Golub & Van Loan (2013) covers the algorithmic details exhaustively.

3.3 Condition number

The condition number of a matrix ARN×N\statemat \in \R^{N \times N} (with respect to the operator norm) is

κ(A):=A2A12=σ1(A)σN(A),\kappa(\statemat) := \norm{\statemat}_2 \cdot \norm{\statemat^{-1}}_2 = \frac{\sigma_1(\statemat)}{\sigma_N(\statemat)},

the ratio of the largest to smallest singular value (when A\statemat is invertible; otherwise κ=\kappa = \infty). The condition number measures how much A\statemat amplifies relative input perturbations into relative output perturbations: if Ah=b\statemat \statevec = b and bb has relative error δb/b\delta b / \norm{b}, then the relative error in the computed solution can be as large as κ(A)δb/b\kappa(\statemat) \cdot \delta b / \norm{b}.

The qualitative scale: κ=1\kappa = 1 is the orthogonal case (no amplification); κ10k\kappa \approx 10^k means losing roughly kk decimal digits of precision when solving a linear system. Double-precision floating point has about 16 decimal digits; matrices with κ1016\kappa \gtrsim 10^{16} are numerically singular.

For SSMs, condition number matters in three places:

  1. HiPPO matrix construction. The HiPPO-LegS matrix has structure that keeps its condition number bounded as NN grows — this is why HiPPO is the standard initialization. Random initializations typically give κ\kappa growing as NO(1)N^{O(1)} or worse.
  2. S4 kernel computation. S4’s Vandermonde-Cauchy kernel construction requires inverting a matrix with structured but potentially ill-conditioned columns. The paper carefully handles the conditioning; naive implementations don’t.
  3. Mamba-3’s complex-state design. Chapter 10 discusses how Mamba-3 deliberately places eigenvalues in a region of the complex plane where the discrete-time map remains well-conditioned across the integration step.
Growth of the condition number of various structured matrices as size N increases.
Condition number κ(A) as a function of matrix size N for: a random Gaussian matrix (κ grows roughly linearly in N, modulo log factors); a Hilbert matrix (κ grows exponentially — the textbook example of ill-conditioning); the HiPPO-LegS matrix (κ stays bounded as N grows — its design feature). Produced by companions/ch03/jax/condition_number.py.

3.4 Structured matrix families

Four classes of structured matrices appear throughout the SSM literature. Each is parameterized by O(N)O(N) numbers rather than the O(N2)O(N^2) a general matrix needs, and each admits fast matrix-vector products via specialized algorithms.

Toeplitz matrices

A Toeplitz matrix is constant along each diagonal:

T=(t0t1t2t(n1)t1t0t1t2t1t0t2t1tn1t2t1t0).T = \begin{pmatrix} t_0 & t_{-1} & t_{-2} & \cdots & t_{-(n-1)} \\ t_1 & t_0 & t_{-1} & & \vdots \\ t_2 & t_1 & t_0 & \ddots & t_{-2} \\ \vdots & & \ddots & \ddots & t_{-1} \\ t_{n-1} & \cdots & t_2 & t_1 & t_0 \end{pmatrix}.

Parameterized by 2n12n - 1 values (t(n1),,tn1)(t_{-(n-1)}, \ldots, t_{n-1}). Toeplitz matrices are the matrix form of convolutions: if TT is Toeplitz with first column (t0,t1,,tn1)(t_0, t_1, \ldots, t_{n-1}) and first row (t0,0,,0)(t_0, 0, \ldots, 0) (lower-triangular Toeplitz), then ThT \statevec is the discrete convolution of the kernel (t0,t1,)(t_0, t_1, \ldots) with h\statevec. The FFT-based convolution algorithm computes ThT \statevec in O(nlogn)O(n \log n) time.

LTI SSM convolutional view (Chapter 8) is exactly this: the operator that maps the input sequence uu to the output sequence yy via y(t)=0th(ts)u(s)dsy(t) = \int_0^t h(t-s) u(s) ds is a Toeplitz matrix at the discretization level, with first column equal to the discretized impulse response h(0),h(Δ),h(2Δ),h(0), h(\stepsize), h(2\stepsize), \ldots.

Vandermonde matrices

A Vandermonde matrix has the form

V=(1x1x12x1n11x2x22x2n11xnxn2xnn1),V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & & & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1} \end{pmatrix},

parameterized by nn values (x1,,xn)(x_1, \ldots, x_n). The defining property is that (Vh)i=j=0n1vjxij(V \statevec)_i = \sum_{j=0}^{n-1} v_j x_i^j is a polynomial in xix_i evaluated at the node xix_i. So Vandermonde matrices implement polynomial evaluation at nn points as a linear map.

Vandermonde matrices are notoriously ill-conditioned for nodes on the real line (condition number can grow exponentially), but well-behaved for nodes on the unit circle — which is why the FFT (Vandermonde with xix_i being roots of unity) is numerically stable.

The S4 kernel computation uses Vandermonde structure: evaluating jcjλjk\sum_j c_j \lambda_j^k over k=0,1,,L1k = 0, 1, \ldots, L-1 is exactly the Vandermonde-style polynomial evaluation. The S4 paper Gu et al. (2022) uses Cauchy-matrix tricks (next subsection) to make this evaluation stable.

Cauchy matrices

A Cauchy matrix has entries

Cij=1xiyj,C_{ij} = \frac{1}{x_i - y_j},

parameterized by n+mn + m values (x1,,xn,y1,,ym)(x_1, \ldots, x_n, y_1, \ldots, y_m) with the {xi}{yj}\{x_i\} \cap \{y_j\} empty (to avoid zero denominators). Cauchy matrices are dense — every entry depends on both indices — but the very structured dependence enables fast algorithms: a Cauchy matrix-vector product can be computed in O(nlog2n)O(n \log^2 n) time using the fast multipole method.

Cauchy matrices appear in two places in the SSM literature: the S4 paper uses them as a numerically stable replacement for direct Vandermonde-style kernel evaluation, and the diagonal-plus-low-rank parametrization of A\statemat in S4 has a Cauchy-matrix interpretation when viewed through the partial-fraction decomposition of its transfer function.

Semiseparable matrices

A rank-kk semiseparable matrix has the property that every submatrix lying strictly above the main diagonal (and every submatrix lying strictly below) has rank at most kk. Equivalently, the upper and lower triangular parts each have rank-kk structure.

The 1-semiseparable case — every off-diagonal block has rank at most 1 — is the structure exploited by Mamba-2’s SSD framework Dao & Gu (2024) . The 1-semiseparable lower-triangular matrix corresponding to a scalar-times-identity SSM is, explicitly,

M=(1a11a1a2a21a1a2a3a2a3a31),M = \begin{pmatrix} 1 & & & \\ a_1 & 1 & & \\ a_1 a_2 & a_2 & 1 & \\ a_1 a_2 a_3 & a_2 a_3 & a_3 & 1 \\ \vdots & & & \ddots \end{pmatrix},

where aia_i is the (scalar) recurrence coefficient at step ii. The (i,j)(i, j) entry for i>ji > j is the product aj+1aj+2aia_{j+1} a_{j+2} \cdots a_i. The SSD insight is that this matrix-vector product can be computed two ways: as a scan (the SSM view, O(L)O(L) time) or as a structured matrix multiply (the attention view, O(L2)O(L^2) time but matmul-friendly on GPUs).

When LL is moderate (say 8192\le 8192) and the GPU favors matmul over scan, the matrix view wins; when LL is huge, the scan view wins. Mamba-2’s chunked algorithm interpolates: matrix-multiplies within chunks of size 256\sim 256, scans across chunks. This is the structural reason Mamba-2 is faster than Mamba-1 on long sequences without sacrificing parallelism.

Heatmap visualizations of Toeplitz, Vandermonde, Cauchy, and semiseparable matrices showing their structural patterns.
Structure of the four matrix families for N=8: Toeplitz (constant along diagonals — convolution); Vandermonde (powers of node values in rows); Cauchy (each entry is 1/(x_i - y_j)); 1-semiseparable lower triangular (each lower-triangular entry is the product of a prefix of off-diagonal factors). Color encodes log|entry|. Produced by companions/ch03/jax/structured_matrices.py.

3.5 Low-rank corrections and rank-1 updates

A recurring pattern: take a structured matrix SS (diagonal, banded, or semiseparable) and add a small low-rank correction UVU V^\top where U,VRn×kU, V \in \R^{n \times k} with knk \ll n. The resulting matrix S+UVS + U V^\top retains nearly the storage and matvec efficiency of SS, and admits a closed-form inverse via the Sherman–Morrison–Woodbury identity:

(S+UV)1=S1S1U(I+VS1U)1VS1.(S + U V^\top)^{-1} = S^{-1} - S^{-1} U (I + V^\top S^{-1} U)^{-1} V^\top S^{-1}.

The cost is dominated by inverting the k×kk \times k matrix I+VS1UI + V^\top S^{-1} U, not the n×nn \times n outer matrix.

This pattern appears in S4 explicitly: the HiPPO-LegS matrix isn’t diagonal, but it is “diagonal plus low-rank” — specifically, normal plus low-rank, which is what makes the kernel computation tractable. The S4 paper’s main algorithmic contribution is a specialized Sherman–Morrison-style trick for this exact decomposition.

The pattern recurs even more visibly in DeltaNet (Chapter 12), where the state update is

St+1=St+βt(vtStkt)kt=St(Iβtktkt)+βtvtkt.S_{t+1} = S_t + \beta_t (v_t - S_t k_t) k_t^\top = S_t (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top.

The middle factor is a rank-1 correction (specifically, an identity minus a rank-1 outer product), and the whole expression is one explicit-Euler step of an online gradient-descent update on the per-token association loss. The Longhorn paper Dao & Gu (2024) (well, the linear-attention lineage more generally) frames this as a discretization of an implicit-rank-1-correction online ODE.

The key takeaway: structured + low-rank corrections give you efficient matrix-vector products without giving up expressiveness. This is the design pattern unifying S4, DeltaNet, GLA, and the Mamba-2 SSD framework — each is a different choice of base structure and correction pattern.

3.6 Krylov subspaces: a primer

The Krylov subspace of order kk generated by a matrix A\statemat and a vector bb is

Kk(A,b):=span{b,Ab,A2b,,Ak1b}.\mathcal{K}_k(\statemat, b) := \text{span}\{ b, \statemat b, \statemat^2 b, \ldots, \statemat^{k-1} b \}.

Krylov subspaces are the workhorse of iterative methods for large sparse linear systems: GMRES, conjugate gradient, Arnoldi iteration, Lanczos. All of them construct an orthonormal basis for Kk(A,b)\mathcal{K}_k(\statemat, b) and solve a small (k×kk \times k) projected problem instead of the original n×nn \times n one.

For SSMs, the Krylov picture is conceptual rather than algorithmic. The point is that Kk(A,b)\mathcal{K}_k(\statemat, b) contains all the information the recurrence ht+1=Aht+But\statevec_{t+1} = \statemat \statevec_t + \inputmat u_t can extract from the initial condition bb in kk steps. If the system has nkn - k eigenvalues that are decoupled from the initial direction bb (the so-called “unreachable subspace”), the recurrence cannot recover them, and the effective dimension of the SSM is k<nk < n. This is one rigorous reading of the Mamba copying-limitation results — the recurrence’s expressive ceiling is set by the Krylov dimension of A\statemat relative to the input.

A full treatment of Krylov methods would fill its own chapter; the curriculum revisits the picture in Chapter 8 (where the S4 kernel can be viewed as a structured Krylov-projection problem) and in Chapter 16’s empirical-methodology discussion of why some architectures can copy strings exponentially longer than others.

For a textbook coverage of Krylov-subspace methods, Trefethen–Bau Trefethen & Bau (1997) Chapters 32–40 give the full algorithmic treatment.

3.7 What’s next

You now have the linear-algebra vocabulary the rest of the book assumes. Chapter 4 picks up the discretization thread (started in Chapter 2, §2.4) and develops it systematically: order conditions, accuracy classes, the Butcher tableau, the bilinear and ZOH derivations in detail. Chapter 7 introduces the HiPPO theory that connects orthogonal-basis approximation theory to the A\statemat matrix structure of S4. Chapter 8 then shows how all four structured-matrix families combine in the S4 / S4D / S5 family.

If you’re impatient, Chapter 9’s Mamba-1/2/3 presentation is the payoff for the SSD discussion of §3.4 — the selective scan’s matmul-friendly chunkwise algorithm is exactly the 1-semiseparable matrix product algorithm.

3.8 Exercises

Six problems. Inline solutions for the shorter ones; full proofs for the theory exercises in §3.9.

Exercise 3.1 (computation)

Compute the Jordan normal form of A=(2102)\statemat = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}.

Solution

The matrix is already in Jordan form: a single 2×22 \times 2 Jordan block with eigenvalue λ=2\lambda = 2. There is one eigenvalue (algebraic multiplicity 2) but only one linearly independent eigenvector (geometric multiplicity 1), giving a defective J2(2)J_2(2). The transformation matrix VV is the identity (J=AJ = \statemat directly).

Exercise 3.2 (computation)

Compute the singular values of A=(300400)\statemat = \begin{pmatrix} 3 & 0 \\ 0 & 4 \\ 0 & 0 \end{pmatrix} and its operator norm.

Solution

The matrix is already in SVD form (with U=I3U = I_3 and V=I2V = I_2). Singular values are σ1=4\sigma_1 = 4, σ2=3\sigma_2 = 3. Operator norm: A2=σ1=4\norm{\statemat}_2 = \sigma_1 = 4. (Note that swapping the diagonal entries doesn’t change the SVD; singular values are always sorted descending.)

Exercise 3.3 (computation)

For the Vandermonde matrix with nodes (x1,x2,x3)=(1,2,3)(x_1, x_2, x_3) = (1, 2, 3), write out the matrix and compute its determinant. Compare to the closed-form Vandermonde determinant detV=i<j(xjxi)\det V = \prod_{i < j} (x_j - x_i).

Solution

V=(111124139).V = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}.

Direct computation: detV=1(2943)1(1941)+1(1321)=65+1=2\det V = 1 \cdot (2 \cdot 9 - 4 \cdot 3) - 1 \cdot (1 \cdot 9 - 4 \cdot 1) + 1 \cdot (1 \cdot 3 - 2 \cdot 1) = 6 - 5 + 1 = 2.

Closed form: (21)(31)(32)=121=2(2-1)(3-1)(3-2) = 1 \cdot 2 \cdot 1 = 2. ✓

Exercise 3.4 (computation)

Verify the Sherman–Morrison identity numerically: pick a random invertible 3×33 \times 3 matrix SS, vectors u,vR3u, v \in \R^3, and check that (S+uv)1(S + u v^\top)^{-1} matches the closed-form expression to machine precision.

Solution
import numpy as np
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3))
u = rng.standard_normal(3)
v = rng.standard_normal(3)
direct = np.linalg.inv(S + np.outer(u, v))
S_inv = np.linalg.inv(S)
factor = 1.0 + v @ S_inv @ u
sherman = S_inv - np.outer(S_inv @ u, v @ S_inv) / factor
print(np.allclose(direct, sherman))  # True

The factor 1+vS1u1 + v^\top S^{-1} u must be non-zero (the Sherman–Morrison identity fails when it is, indicating that S+uvS + u v^\top is singular).

Exercise 3.5 (theory) — solution in §3.9

Prove the Eckart–Young theorem: the best rank-kk approximation to a matrix A\statemat in the Frobenius norm is obtained by truncating its SVD at rank kk. That is, if A=UΣV\statemat = U \Sigma V^\top with singular values σ1σn\sigma_1 \ge \cdots \ge \sigma_n, then Ak:=ikσiuivi\statemat_k := \sum_{i \le k} \sigma_i u_i v_i^\top minimizes ABF\norm{\statemat - B}_F over all matrices BB of rank k\le k.

Exercise 3.6 (theory) — solution in §3.9

Prove that a Toeplitz matrix TRn×nT \in \R^{n \times n} with Th2Ch2\norm{T \statevec}_2 \le C \norm{\statevec}_2 for all h\statevec (operator-norm-bounded TT) admits a matrix-vector product in O(nlogn)O(n \log n) time via the FFT. Specifically, show that any n×nn \times n Toeplitz TT can be embedded in a 2n×2n2n \times 2n circulant matrix, whose matvec is exactly the FFT-IFFT pair applied to the kernel and input.

3.9 Full solutions to theory exercises

Solution to Exercise 3.5

The proof uses two ingredients: (a) the SVD’s unitary invariance of the Frobenius norm, and (b) the optimality of truncated diagonal matrices.

Setup. Let A=UΣV\statemat = U \Sigma V^\top be the SVD with Σ=diag(σ1,,σn)\Sigma = \diag(\sigma_1, \ldots, \sigma_n) and σ1σn0\sigma_1 \ge \cdots \ge \sigma_n \ge 0. For any BB of rank k\le k, write B=UB~VB = U \widetilde B V^\top where B~:=UBV\widetilde B := U^\top B V also has rank k\le k (rank is invariant under invertible transformations). The Frobenius norm is invariant under orthogonal transformations:

ABF=U(ΣB~)VF=ΣB~F.\norm{\statemat - B}_F = \norm{U(\Sigma - \widetilde B) V^\top}_F = \norm{\Sigma - \widetilde B}_F.

So the problem reduces to: minimize ΣB~F\norm{\Sigma - \widetilde B}_F over rank-k\le k matrices B~\widetilde B.

Diagonal reduction. Write B~\widetilde B with its own SVD B~=XDY\widetilde B = X D Y^\top where D=diag(d1,,dk,0,,0)D = \diag(d_1, \ldots, d_k, 0, \ldots, 0). Then

ΣB~F2=tr((ΣB~)(ΣB~))=ΣF22tr(ΣB~)+B~F2.\norm{\Sigma - \widetilde B}_F^2 = \tr((\Sigma - \widetilde B)^\top (\Sigma - \widetilde B)) = \norm{\Sigma}_F^2 - 2 \tr(\Sigma \widetilde B^\top) + \norm{\widetilde B}_F^2.

The middle trace term tr(ΣB~)ikσidi\tr(\Sigma \widetilde B^\top) \le \sum_{i \le k} \sigma_i d_i by the von-Neumann trace inequality (with equality when B~\widetilde B‘s singular vectors align with Σ\Sigma‘s — i.e. B~\widetilde B is diagonal in the same basis as Σ\Sigma). Optimizing did_i to maximize this term subject to rank k\le k gives di=σid_i = \sigma_i for iki \le k and di=0d_i = 0 for i>ki > k, i.e. B~=diag(σ1,,σk,0,,0)\widetilde B = \diag(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0).

Substituting back, the minimum value is ΣB~F2=i>kσi2\norm{\Sigma - \widetilde B^*}_F^2 = \sum_{i > k} \sigma_i^2, achieved by the truncated SVD Ak=Udiag(σ1,,σk,0,)V=ikσiuivi\statemat_k = U \diag(\sigma_1, \ldots, \sigma_k, 0, \ldots) V^\top = \sum_{i \le k} \sigma_i u_i v_i^\top. ∎

(The proof for the operator norm follows the same structure but uses Weyl’s interlacing inequality instead of von-Neumann’s trace inequality; see Golub–Van Loan Golub & Van Loan (2013) for the detailed argument.)

Solution to Exercise 3.6

Let TRn×nT \in \R^{n \times n} be Toeplitz with entries Tij=tijT_{ij} = t_{i-j} for some kernel (t(n1),,tn1)(t_{-(n-1)}, \ldots, t_{n-1}).

Step 1 — Circulant embedding. Define a circulant matrix CR2n×2nC \in \R^{2n \times 2n} with first column

c=(t0,t1,t2,,tn1,0,t(n1),t(n2),,t1).c = (t_0, t_1, t_2, \ldots, t_{n-1}, 0, t_{-(n-1)}, t_{-(n-2)}, \ldots, t_{-1})^\top.

The first nn rows and first nn columns of CC exactly reproduce TT (the zero in the middle of cc is the “buffer” that prevents wrap-around contamination).

Step 2 — Padded matvec. Given hRn\statevec \in \R^n, form h~=(h,0)R2n\widetilde \statevec = (\statevec, 0)^\top \in \R^{2n} (zero-padded). Then (Ch~)1:n=Th(C \widetilde \statevec)_{1:n} = T \statevec: the first nn entries of the circulant product are exactly the Toeplitz product, because the zero-padding ensures the wrap-around portion of the convolution doesn’t contaminate the top half.

Step 3 — Circulant matvec via FFT. Every 2n×2n2n \times 2n circulant matrix CC is diagonalized by the discrete Fourier transform: C=F1diag(Fc)FC = F^{-1} \diag(F c) F, where FF is the 2n2n-point DFT matrix. So

Ch~=F1(diag(Fc)Fh~)=F1((Fc)(Fh~)),C \widetilde \statevec = F^{-1} \, (\diag(F c) \cdot F \widetilde \statevec) = F^{-1} \, ((F c) \odot (F \widetilde \statevec)),

where \odot is element-wise product. Both FFTs and the IFFT take O(nlogn)O(n \log n) time; the element-wise product is O(n)O(n).

Total cost. O(nlogn)O(n \log n) for the FFTs + O(n)O(n) for the element-wise multiply + O(n)O(n) to extract the first nn entries = O(nlogn)O(n \log n). ∎

This is the exact algorithm used to compute LTI SSM convolutions in S4-family implementations: precompute the kernel h(0),h(Δ),,h((L1)Δ)h(0), h(\stepsize), \ldots, h((L-1)\stepsize) once, then convolve with any input in O(LlogL)O(L \log L) time.

3.10 Companion code

Two runnable companions in companions/ch03/jax/:

  • condition_number.py — plots κ(A) growth for random Gaussian, Hilbert, and HiPPO-LegS matrices as N grows; produces Figure 3.1
  • structured_matrices.py — constructs Toeplitz / Vandermonde / Cauchy / 1-semiseparable matrices for N=8 and visualizes structural patterns; produces Figure 3.2

To run from the repo root:

PYTHONPATH=. python companions/ch03/jax/condition_number.py
PYTHONPATH=. python companions/ch03/jax/structured_matrices.py

Figures land in public/figures/ch03/.

Part I · Foundations Week 4 Published

Discretization theory: ZOH, bilinear, exponential families

One-step discretizations of linear ODEs — zero-order hold, bilinear (Tustin), and exponential-trapezoidal — with the Lax equivalence theorem as the unifying convergence criterion.

Discretization theory: ZOH, bilinear, exponential families

4.1 The discretization problem

Chapter 1 wrote the linear state-space system as a continuous-time ODE

ddth(t)=Ah(t)+Bu(t),y(t)=Ch(t),\ddt \statevec(t) = \statemat \statevec(t) + \inputmat u(t), \qquad y(t) = \outputmat \statevec(t),

defined for every t0t \ge 0. Computers do not handle “every tt”; they handle a finite grid tk=kΔt_k = k \stepsize for k=0,1,2,k = 0, 1, 2, \ldots with step size Δ>0\stepsize > 0. A discretization is a rule for building a recurrence

hk+1=Aˉhk+Bˉuk\statevec_{k+1} = \discA \statevec_k + \discB u_k

whose iterates hk\statevec_k approximate the continuous trajectory at the grid points, hkh(tk)\statevec_k \approx \statevec(t_k). The two discrete matrices Aˉ,Bˉ\discA, \discB are functions of the continuous matrices A,B\statemat, \inputmat and of the step size Δ\stepsize. Different functions give different discretizations.

There is no single canonical choice. Every discrete recurrence in this book — the S4 layer, the Mamba selective scan, the exp-trapezoidal scheme of Mamba-3 — picks a specific rule for translating (A,B,Δ)(Aˉ,Bˉ)(\statemat, \inputmat, \stepsize) \mapsto (\discA, \discB). The choice matters: a poor discretization can take a perfectly stable continuous system and produce a discrete recurrence that diverges, fails to converge as Δ0\stepsize \to 0, or accumulates phase error on long horizons. Numerical-analysis textbooks Hairer et al. (1993) spend hundreds of pages on these failure modes. We will compress them into three properties and three schemes.

A useful warm-up is forward Euler, the simplest possible scheme:

hk+1=hk+Δ(Ahk+Buk)=(I+ΔA)hk+ΔBuk.\statevec_{k+1} = \statevec_k + \stepsize (\statemat \statevec_k + \inputmat u_k) = (I + \stepsize \statemat) \statevec_k + \stepsize \inputmat u_k.

This is what you get by replacing ddth\ddt \statevec with the forward difference (hk+1hk)/Δ(\statevec_{k+1} - \statevec_k)/\stepsize and freezing the input at uku_k. The discrete matrices are Aˉ=I+ΔA\discA = I + \stepsize \statemat and Bˉ=ΔB\discB = \stepsize \inputmat. Forward Euler is the cleanest illustration of what can go wrong: it is only conditionally stable, meaning that for any fixed A\statemat with Re(λi)<0\operatorname{Re}(\lambda_i) < 0 there is a maximum step size beyond which Aˉ=I+ΔA\discA = I + \stepsize \statemat acquires an eigenvalue of modulus >1> 1 and the recurrence blows up. The three schemes in §4.3–§4.5 fix this defect — each in its own way, with its own trade-off.

4.2 The Lax equivalence theorem

To compare schemes you need a definition of “correct.” The standard one comes from the Lax equivalence theorem Lax & Richtmyer (1956) , which decomposes correctness into two ingredients: consistency and stability.

A one-step scheme hk+1=Φ(Δ)(hk,uk)\statevec_{k+1} = \Phi(\stepsize)(\statevec_k, u_k) is consistent with the ODE if its local truncation error — the residual you obtain by plugging the exact continuous trajectory into the discrete recurrence — vanishes faster than Δ\stepsize as Δ0\stepsize \to 0. Formally,

τ(Δ):=h(t+Δ)Φ(Δ)(h(t),u(t))Δ0as Δ0.\tau(\stepsize) := \frac{\statevec(t + \stepsize) - \Phi(\stepsize)(\statevec(t), u(t))}{\stepsize} \to 0 \quad \text{as } \stepsize \to 0.

If τ(Δ)=O(Δp)\tau(\stepsize) = O(\stepsize^p) the scheme is pp-th order consistent. Higher pp means the per-step error shrinks faster with Δ\stepsize.

A scheme is zero-stable if small perturbations δ\delta injected at one step propagate boundedly to all future steps. For linear schemes this reduces to a uniform bound on the powers of Aˉ\discA: there exists KK independent of Δ\stepsize such that AˉkK\norm{\discA^k} \le K for all kk with kΔTk \stepsize \le T, for any fixed horizon TT. Equivalently, the spectral radius ρ(Aˉ)1+O(Δ)\rho(\discA) \le 1 + O(\stepsize).

The two ingredients combine.

.

(Lax equivalence.) For a linear, well-posed initial-value problem, a one-step scheme is convergent — meaning maxkΔThkh(tk)0\max_{k \stepsize \le T} \norm{\statevec_k - \statevec(t_k)} \to 0 as Δ0\stepsize \to 0 — if and only if it is both consistent and zero-stable.

Convergence is what you actually want: that the discrete iterates approach the true continuous trajectory as you refine the grid. The theorem says you cannot get there by being consistent alone (you also need stability) or by being stable alone (you also need consistency). Both are necessary, and together they are sufficient. This is the convergence criterion every scheme in this chapter must pass.

For a stable LTI test problem — meaning A\statemat has all eigenvalues in the open left half-plane — there is a stronger notion sitting on top of zero-stability: a scheme is A-stable if ρ(Aˉ)1\rho(\discA) \le 1 for every step size Δ>0\stepsize > 0, not just for sufficiently small Δ\stepsize. Forward Euler is not A-stable; ZOH, bilinear, and exp-trapezoidal are. The geometry of A-stability — the set of AΔ\statemat \stepsize values for which the scheme behaves well — is the subject of Chapter 5.

4.3 Zero-order hold (ZOH)

The first scheme assumes the input is piecewise constant between samples: u(t)=uku(t) = u_k for t[tk,tk+1)t \in [t_k, t_{k+1}). Under that assumption the inhomogeneous ODE solves exactly on a single interval. Starting from h(tk)=hk\statevec(t_k) = \statevec_k,

h(tk+1)=eAΔhk+tktk+1eA(tk+1s)Bukds=eAΔhk+A1 ⁣(eAΔI)Buk.\statevec(t_{k+1}) = e^{\statemat \stepsize} \statevec_k + \int_{t_k}^{t_{k+1}} e^{\statemat (t_{k+1} - s)} \inputmat \, u_k \, ds = e^{\statemat \stepsize} \statevec_k + \statemat^{-1}\!\left(e^{\statemat \stepsize} - I\right) \inputmat \, u_k.

Identifying the two matrices,

Aˉ=eAΔ,Bˉ=A1 ⁣(eAΔI)B.\discA = e^{\statemat \stepsize}, \qquad \discB = \statemat^{-1}\!\left(e^{\statemat \stepsize} - I\right) \inputmat.

This is the zero-order hold (ZOH) discretization. The hold refers to the input assumption; “zero-order” refers to the polynomial degree of the held signal (a constant is a zero-degree polynomial).

A practical wrinkle: writing Bˉ\discB as A1(eAΔI)B\statemat^{-1}(e^{\statemat \stepsize} - I) \inputmat requires A\statemat to be invertible. The HiPPO matrix of Chapter 7 is invertible; some structured A\statemat in later chapters are not. The augmented matrix exponential trick handles both cases uniformly. Build the block matrix

M=(AΔBΔ00)R(N+P)×(N+P),M = \begin{pmatrix} \statemat \stepsize & \inputmat \stepsize \\ 0 & 0 \end{pmatrix} \in \R^{(N+P) \times (N+P)},

where PP is the input dimension. Then

exp(M)=(eAΔA1(eAΔI)B0I),\exp(M) = \begin{pmatrix} e^{\statemat \stepsize} & \statemat^{-1}(e^{\statemat \stepsize} - I) \inputmat \\ 0 & I \end{pmatrix},

and a single expm call extracts both Aˉ\discA (top-left block) and Bˉ\discB (top-right block) without ever inverting A\statemat. The S4 reference implementation uses this trick; so does the JAX companion discretization_comparison.py.

.

(ZOH preserves Lyapunov stability.) If every eigenvalue λi\lambda_i of A\statemat has Re(λi)<0\operatorname{Re}(\lambda_i) < 0 and Δ>0\stepsize > 0, then every eigenvalue μi=eλiΔ\mu_i = e^{\lambda_i \stepsize} of Aˉ=eAΔ\discA = e^{\statemat \stepsize} satisfies μi<1\abs{\mu_i} < 1. ZOH is therefore A-stable.

The proof is one line: eλiΔ=eRe(λi)Δ<1\abs{e^{\lambda_i \stepsize}} = e^{\operatorname{Re}(\lambda_i) \stepsize} < 1 when Re(λi)<0\operatorname{Re}(\lambda_i) < 0 and Δ>0\stepsize > 0. Stability is preserved for every positive step size, which is what A-stability means.

Two further properties are worth highlighting. First, for the autonomous ODE (u0u \equiv 0), ZOH is exact: the recurrence hk+1=eAΔhk\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k matches h(tk+1)=eAΔh(tk)\statevec(t_{k+1}) = e^{\statemat \stepsize} \statevec(t_k) without any error at all. This is unusual — most discretizations have nonzero error even on autonomous systems. Second, for forced systems with smooth uu, ZOH is first-order accurate: the per-step error in the forcing integral is O(Δ2)O(\stepsize^2) (because the integrand eA(tk+1s)B(u(s)uk)e^{\statemat (t_{k+1} - s)} \inputmat (u(s) - u_k) is O(Δ)O(\stepsize) over an interval of length Δ\stepsize), and after T/ΔT/\stepsize steps the cumulative error is O(Δ)O(\stepsize). The empirical convergence rate plotted in the companion log-log error figure confirms slope 1\approx 1.

Continuous eigenvalues in the complex plane (left half-plane) map to discrete eigenvalues inside the unit disk after ZOH discretization.
ZOH preserves stability: continuous eigenvalues in the open left half-plane (left panel) map to discrete eigenvalues strictly inside the unit disk (right panel) via $\mu_i = e^{\lambda_i \Delta}$. The mapping is bijective and depends on the step size $\Delta$. Produced by companions/ch04/jax/discretization_comparison.py.

4.4 The bilinear (Tustin) transform

The second scheme drops the piecewise-constant assumption and instead approximates the integral on [tk,tk+1][t_k, t_{k+1}] by the trapezoidal rule: tktk+1f(s)dsΔ2(f(tk)+f(tk+1))\int_{t_k}^{t_{k+1}} f(s) \, ds \approx \tfrac{\stepsize}{2}(f(t_k) + f(t_{k+1})). Applied to the inhomogeneous ODE this gives an implicit recurrence,

hk+1=hk+Δ2(Ahk+Buk)+Δ2(Ahk+1+Buk+1),\statevec_{k+1} = \statevec_k + \tfrac{\stepsize}{2}(\statemat \statevec_k + \inputmat u_k) + \tfrac{\stepsize}{2}(\statemat \statevec_{k+1} + \inputmat u_{k+1}),

which one solves for hk+1\statevec_{k+1}:

hk+1=(IΔ2A)1[(I+Δ2A)hk+Δ2B(uk+uk+1)].\statevec_{k+1} = \left(I - \tfrac{\stepsize}{2}\statemat\right)^{-1}\left[\left(I + \tfrac{\stepsize}{2}\statemat\right) \statevec_k + \tfrac{\stepsize}{2} \inputmat (u_k + u_{k+1})\right].

Reading off Aˉ\discA and Bˉ\discB,

Aˉ=(IΔ2A)1 ⁣(I+Δ2A),Bˉ=(IΔ2A)1ΔB.\discA = \left(I - \tfrac{\stepsize}{2}\statemat\right)^{-1}\!\left(I + \tfrac{\stepsize}{2}\statemat\right), \qquad \discB = \left(I - \tfrac{\stepsize}{2}\statemat\right)^{-1} \stepsize \, \inputmat.

This is the bilinear transform, also called the Tustin transform after Arnold Tustin’s 1947 paper introducing it in control theory Tustin (1947) . The S4D paper uses bilinear discretization; so do many signal-processing toolboxes.

The bilinear formula has a striking geometric interpretation. Restricted to a single eigenvalue λ\lambda of A\statemat (commuting through the simultaneous diagonalization), the map λμ\lambda \mapsto \mu that sends a continuous eigenvalue to its discrete image is

μ=1+Δ2λ1Δ2λ.\mu = \frac{1 + \tfrac{\stepsize}{2}\lambda}{1 - \tfrac{\stepsize}{2}\lambda}.

This is a Möbius transformation (a.k.a. fractional linear transformation) of the complex plane: z(az+b)/(cz+d)z \mapsto (az+b)/(cz+d) with adbc0ad - bc \ne 0. Möbius transformations are conformal — angle-preserving — and they map circles-or-lines to circles-or-lines. The specific Möbius map above sends the imaginary axis exactly to the unit circle and sends the open left half-plane exactly to the open unit disk.

.

(Bilinear preserves Lyapunov stability.) If every eigenvalue λi\lambda_i of A\statemat has Re(λi)<0\operatorname{Re}(\lambda_i) < 0 and Δ>0\stepsize > 0, then every eigenvalue μi\mu_i of Aˉ=(IΔ2A)1(I+Δ2A)\discA = (I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat) satisfies μi<1\abs{\mu_i} < 1. Bilinear is therefore A-stable.

The proof is geometric: z:=Δ2λiz := \tfrac{\stepsize}{2}\lambda_i has Re(z)<0\operatorname{Re}(z) < 0, and 1+z<1z\abs{1+z} < \abs{1-z} iff Re(z)<0\operatorname{Re}(z) < 0 (square both sides and cancel). Algebraically, (1+z)/(1z)2=1+z2/1z2<1\abs{(1+z)/(1-z)}^2 = \abs{1+z}^2/\abs{1-z}^2 < 1, which is exactly μi<1\abs{\mu_i} < 1. The full proof, with the Möbius geometry spelled out, is Exercise 4.5 in §4.9.

Bilinear is second-order accurate for smooth forcing — the trapezoidal-rule local truncation error is O(Δ3)O(\stepsize^3), giving global error O(Δ2)O(\stepsize^2). Unlike ZOH, it is not exact even on autonomous systems: the (IΔ2A)1(I+Δ2A)(I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat) Padé approximation of eAΔe^{\statemat \stepsize} agrees with the true exponential only through second order. The trade is one of structure: bilinear maps the imaginary axis exactly to the unit circle, so marginally stable continuous modes (purely imaginary eigenvalues) become marginally stable discrete modes (eigenvalues exactly on the unit circle). ZOH crushes them to the unit-disk interior at rate e0Δ=1e^{0 \cdot \stepsize} = 1 (marginal in the limit but never exactly on the circle for finite Δ\stepsize). This makes bilinear the natural choice when the system has oscillatory modes you want to preserve — a property that becomes load-bearing for the symplectic methods of Chapter 6.

4.5 Exponential-family discretizations

The third family takes a different approach: rather than approximating eAΔe^{\statemat \stepsize} by a low-order Padé form (as bilinear does) or by the identity-plus-linear truncation I+ΔAI + \stepsize \statemat (as forward Euler does), it computes eAΔe^{\statemat \stepsize} exactly and approximates only the forcing integral. This is the philosophy of exponential integrators Hochbruck & Ostermann (2010) .

The exact one-step formula from variation of parameters (Chapter 1, §1.6) is

h(tk+1)=eAΔhk+0ΔeA(Δs)Bu(tk+s)ds.\statevec(t_{k+1}) = e^{\statemat \stepsize} \statevec_k + \int_0^{\stepsize} e^{\statemat (\stepsize - s)} \inputmat u(t_k + s) \, ds.

ZOH approximates u(tk+s)uku(t_k + s) \approx u_k on [0,Δ][0, \stepsize]; bilinear approximates the integral by the trapezoidal rule with a Padé-approximated exponential. The exponential-trapezoidal scheme keeps the exact exponential and uses a linear interpolation of the input: u(tk+s)uk+(s/Δ)(uk+1uk)u(t_k + s) \approx u_k + (s/\stepsize)(u_{k+1} - u_k). Substituting and evaluating the resulting two integrals gives

hk+1=eAΔhk+Δφ1(AΔ)Buk+Δφ2(AΔ)B(uk+1uk),\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat \, u_k + \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat \, (u_{k+1} - u_k),

where φ1\varphi_1 and φ2\varphi_2 are the first two φ\varphi-functions of the exponential family:

φ1(z)=ez1z,φ2(z)=ez1zz2.\varphi_1(z) = \frac{e^z - 1}{z}, \qquad \varphi_2(z) = \frac{e^z - 1 - z}{z^2}.

Both are entire functions (the apparent singularities at z=0z = 0 are removable: φ1(0)=1\varphi_1(0) = 1, φ2(0)=1/2\varphi_2(0) = 1/2). Applied to the matrix AΔ\statemat \stepsize, they produce matrix-valued objects φ1(AΔ),φ2(AΔ)\varphi_1(\statemat \stepsize), \varphi_2(\statemat \stepsize) that can be computed via the same augmented-matrix-exponential trick as ZOH.

Reading off the discrete matrices in the form hk+1=Aˉhk+Bˉ0uk+Bˉ1(uk+1uk)\statevec_{k+1} = \discA \statevec_k + \discB_0 u_k + \discB_1 (u_{k+1} - u_k):

Aˉ=eAΔ,Bˉ0=Δφ1(AΔ)B,Bˉ1=Δφ2(AΔ)B.\discA = e^{\statemat \stepsize}, \qquad \discB_0 = \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat, \qquad \discB_1 = \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat.

Notice that Aˉ\discA is the same as ZOH, so exp-trapezoidal preserves stability identically to ZOH: A-stable, exact for autonomous systems, eigenvalues in the unit disk for any Δ>0\stepsize > 0. The improvement is in the order of accuracy: by interpolating uu linearly rather than holding it constant, exp-trapezoidal becomes second-order accurate (provided uu is C1C^1).

Exp-trapezoidal is the discretization of choice when the continuous dynamics are stiff — when A\statemat has eigenvalues with very different magnitudes, so the linear part is the hard part of the problem. By treating eAΔe^{\statemat \stepsize} exactly the scheme sidesteps the step-size restriction that explicit methods inherit from the stiff eigenvalues, and the matrix exponential is computed once per layer in practice. Chapter 10 returns to this story for Mamba-3, which adopts exp-trapezoidal precisely because the input-dependent A(ut)\statemat(u_t) in selective SSMs produces stiff dynamics that ZOH and bilinear handle poorly.

A subtle implementation point: φ2(AΔ)\varphi_2(\statemat \stepsize) is not numerically equal to (eAΔIAΔ)/(AΔ)2(e^{\statemat \stepsize} - I - \statemat \stepsize)/(\statemat \stepsize)^2 when this formula is computed naively, because catastrophic cancellation in the numerator destroys precision for small Δ\stepsize. The standard remedy is to compute all φ\varphi-functions simultaneously from one augmented matrix exponential, again using the trick that produced Bˉ\discB for ZOH. Both companions (exp_trapezoidal.py in JAX and discretization_atlas.jl in Julia) implement the augmented form.

Log-log error vs step size for exp-trapezoidal, ZOH, and bilinear on a forced damped oscillator.
Exp-trapezoidal (rose) achieves the same slope-2 convergence as bilinear (gold) while inheriting ZOH's autonomous-exactness — the best of both. ZOH (navy) lags at slope 1. Empirical fit on the two finest step sizes confirms the theoretical orders. Produced by companions/ch04/jax/exp_trapezoidal.py.

4.6 Order of accuracy and the discretization hierarchy

The three schemes are summarized in the table below. “Autonomous-exact” means the scheme produces zero error for the homogeneous problem (u0u \equiv 0). “A-stable” means ρ(Aˉ)1\rho(\discA) \le 1 for every Δ>0\stepsize > 0 on every stable LTI test problem.

SchemeAˉ\discAOrderAutonomous-exactA-stable
Forward EulerI+ΔAI + \stepsize \statemat1NoNo
ZOHeAΔe^{\statemat \stepsize}1YesYes
Bilinear (Tustin)(IΔ2A)1(I+Δ2A)(I - \tfrac{\stepsize}{2}\statemat)^{-1}(I + \tfrac{\stepsize}{2}\statemat)2NoYes
Exp-trapezoidaleAΔe^{\statemat \stepsize}2YesYes

The empirical order can be read directly off a log-log plot of error against step size: a pp-th-order scheme has error Δp\propto \stepsize^p, so on a log-log plot the data lie on a line of slope pp. The Julia companion discretization_atlas.jl runs this sweep against a high-accuracy Tsit5 reference solution from DifferentialEquations.jl; the JAX companion discretization_comparison.py performs the same sweep using scipy.integrate.solve_ivp (Radau) as the reference.

Log-log plot of max error versus step size for ZOH (slope ~1), bilinear and exp-trapezoidal (slope ~2) on a forced damped oscillator.
Empirical order of accuracy of the three schemes on the forced damped oscillator $\\ddot q + 0.5 \\dot q + 2 q = \\sin(2t)$. ZOH (navy): slope 1 on the log-log plot, confirming first-order convergence. Bilinear (gold) and exp-trapezoidal (rose): slope 2, confirming second-order convergence. Produced by companions/ch04/jax/discretization_comparison.py.

The hierarchy is not strict: ZOH’s autonomous-exactness can outweigh its first-order accuracy when the forcing is mild relative to the homogeneous decay. Bilinear’s exact imaginary-axis preservation can outweigh its non-exactness on autonomous problems when the system has long-lived oscillations. Exp-trapezoidal combines the best of both — exact on autonomous, second-order on forced — at the cost of computing more φ\varphi-functions per step. The trade-offs are recurring themes throughout the SSM literature: S4 uses ZOH for its simplicity and exactness on the (autonomous) HiPPO drift; S4D and S5 follow suit; Mamba-3 switches to exp-trapezoidal once the input-dependence makes the dynamics stiff enough that the second-order error compounding matters.

4.7 What’s next

This chapter introduced three schemes and proved each preserves Lyapunov stability. Chapter 5 zooms in on the stability region of each scheme — the set of AΔ\statemat \stepsize values for which ρ(Aˉ)1\rho(\discA) \le 1 — and develops the Butcher-tableau machinery for systematically constructing higher-order Runge–Kutta methods. Chapter 6 then asks what to do when even A-stability is not enough — when the system is stiff or Hamiltonian and you need L-stable implicit methods, or symplectic methods that preserve geometric invariants. The exp-trapezoidal scheme of §4.5 turns out to be a member of a much larger family of exponential integrators, all of which deserve a place in your numerical-analysis toolkit.

Forward-looking SSM connections: ZOH is the default in Chapters 7–9 (HiPPO, S4, S4D, S5, Mamba-1/2). Bilinear shows up in S4D when the HiPPO matrix is diagonalized. Exp-trapezoidal is the Mamba-3 scheme, treated in Chapter 10.

4.8 Exercises

Six problems mixing computation and theory. Short/numerical (4.1–4.3) have inline collapsible solutions; long/proof exercises (4.4–4.6) have full worked solutions in §4.9.

Exercise 4.1 (computation)

Compute Aˉ\discA for ZOH on the 1-D test problem A=2\statemat = -2, B=1\inputmat = 1, Δ=0.1\stepsize = 0.1. Then verify by hand that Aˉ<1\abs{\discA} < 1.

Solution

Aˉ=e20.1=e0.20.8187\discA = e^{-2 \cdot 0.1} = e^{-0.2} \approx 0.8187. The discrete eigenvalue is 0.81870.8187, strictly inside the unit disk. The continuous decay rate is 22 per unit time; the discrete decay factor per step is e0.20.8187e^{-0.2} \approx 0.8187, so after 10 steps (one continuous-time unit) the state is reduced by a factor 0.8187100.135=e20.8187^{10} \approx 0.135 = e^{-2} — matching the continuous decay exactly. ZOH is autonomous-exact.

Exercise 4.2 (computation)

Compute Aˉ\discA for the bilinear transform on the same 1-D problem (A=2\statemat = -2, Δ=0.1\stepsize = 0.1). Compare Aˉbilinear\abs{\discA}_{\text{bilinear}} against AˉZOH=e0.2\abs{\discA}_{\text{ZOH}} = e^{-0.2} — which is smaller, and why?

Solution

Aˉbilinear=(1+(0.1/2)(2))/(1(0.1/2)(2))=0.9/1.10.8182\discA_{\text{bilinear}} = (1 + (0.1/2)(-2))/(1 - (0.1/2)(-2)) = 0.9/1.1 \approx 0.8182. ZOH gives e0.20.8187e^{-0.2} \approx 0.8187. The bilinear value is slightly smaller, meaning bilinear overdamps slightly relative to the true exponential. This is the second-order Padé approximation underestimating e0.2e^{-0.2}: the truncation in φ1\varphi_1 removes positive higher-order terms. For small Δλ\stepsize \cdot \abs{\lambda} the gap is O(Δ3)O(\stepsize^3); here the difference is about 51045 \cdot 10^{-4}.

Exercise 4.3 (computation + code)

Run companions/ch04/jax/discretization_comparison.py and verify empirically that ZOH has slope 1\approx 1 on the log-log error-vs-Δ\stepsize plot, bilinear and exp-trapezoidal have slope 2\approx 2. What happens to the ZOH slope if you set the forcing u(t)0u(t) \equiv 0 (autonomous case)?

Solution

The companion’s error_sweep function prints slopes near 1.00, 2.00, 2.00. Setting u0u \equiv 0 in the autonomous case makes ZOH exact — the error drops to roundoff (1014\sim 10^{-14} in float64) and the slope is meaningless. This is consistent with the autonomous-exactness property of §4.3: ZOH’s only error source for forced systems is the piecewise-constant approximation of uu.

Exercise 4.4 (theory) — solution in §4.9

Prove the augmented matrix exponential trick used in §4.3: that

exp ⁣(M1M200)=(eM1M11(eM1I)M20I)\exp\!\begin{pmatrix} M_1 & M_2 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} e^{M_1} & M_1^{-1}(e^{M_1} - I)\, M_2 \\ 0 & I \end{pmatrix}

for invertible M1M_1, by computing the series term-by-term.

Exercise 4.5 (theory) — solution in §4.9

Prove that the Möbius transformation z(1+z/2)/(1z/2)z \mapsto (1 + z/2)/(1 - z/2) sends the open left half-plane {z:Re(z)<0}\set{z : \operatorname{Re}(z) < 0} bijectively to the open unit disk {w:w<1}\set{w : \abs{w} < 1}, and sends the imaginary axis bijectively to the unit circle (minus the point w=1w = -1).

Exercise 4.6 (theory) — solution in §4.9

Derive the exponential-trapezoidal scheme of §4.5 from the variation-of-parameters formula by substituting the linear interpolation u(tk+s)=uk+(s/Δ)(uk+1uk)u(t_k + s) = u_k + (s/\stepsize)(u_{k+1} - u_k) into the forcing integral and evaluating term-by-term using the φ\varphi-function identities.

4.9 Full solutions to theory exercises

Solution to Exercise 4.4

The matrix M:=(M1M200)M := \begin{pmatrix} M_1 & M_2 \\ 0 & 0 \end{pmatrix} satisfies, for k1k \ge 1,

Mk=(M1kM1k1M200).M^k = \begin{pmatrix} M_1^k & M_1^{k-1} M_2 \\ 0 & 0 \end{pmatrix}.

This is verified by induction: M1M^1 matches by construction, and Mk+1=MMk=(M1M200)(M1kM1k1M200)=(M1k+1M1kM200)M^{k+1} = M \cdot M^k = \begin{pmatrix} M_1 & M_2 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} M_1^k & M_1^{k-1} M_2 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} M_1^{k+1} & M_1^k M_2 \\ 0 & 0 \end{pmatrix} as required. Now sum the matrix-exponential series:

eM=I+k=1Mkk!=(I00I)+(k1M1kk!k1M1k1k!M200)=(eM1SM20I),e^M = I + \sum_{k=1}^{\infty} \frac{M^k}{k!} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} + \begin{pmatrix} \sum_{k \ge 1} \tfrac{M_1^k}{k!} & \sum_{k \ge 1} \tfrac{M_1^{k-1}}{k!} M_2 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} e^{M_1} & S \, M_2 \\ 0 & I \end{pmatrix},

where S:=k1M1k1/k!=M11k1M1k/k!=M11(eM1I)S := \sum_{k \ge 1} M_1^{k-1}/k! = M_1^{-1} \sum_{k \ge 1} M_1^k / k! = M_1^{-1}(e^{M_1} - I), using invertibility of M1M_1 to extract the leading M11M_1^{-1}. Plugging back gives the claimed result. ∎

This is exactly the identity used to compute Aˉ\discA and Bˉ\discB jointly from one expm call in §4.3: take M1=AΔM_1 = \statemat \stepsize and M2=BΔM_2 = \inputmat \stepsize.

Solution to Exercise 4.5

Let T(z):=(1+z/2)/(1z/2)T(z) := (1 + z/2)/(1 - z/2). Imaginary axis to unit circle: for z=iωz = i\omega with ωR\omega \in \R,

T(iω)2=1+iω/221iω/22=1+ω2/41+ω2/4=1,\abs{T(i\omega)}^2 = \frac{\abs{1 + i\omega/2}^2}{\abs{1 - i\omega/2}^2} = \frac{1 + \omega^2/4}{1 + \omega^2/4} = 1,

so T(iω)T(i\omega) lies on the unit circle. The map ωT(iω)\omega \mapsto T(i\omega) traces the unit circle once as ω\omega ranges over R\R, missing only the limit point T()=1T(\infty) = -1.

Left half-plane to unit disk: write z=x+iyz = x + iy with x<0x < 0. Then

T(z)2=(1+x/2)2+(y/2)2(1x/2)2+(y/2)2.\abs{T(z)}^2 = \frac{(1 + x/2)^2 + (y/2)^2}{(1 - x/2)^2 + (y/2)^2}.

Since x<0x < 0, 1+x/2<1x/2\abs{1 + x/2} < \abs{1 - x/2} (both for x<2\abs{x} < 2 where the signs match the absolute values, and for x>2\abs{x} > 2 where they reverse). The yy-terms are identical in numerator and denominator. Therefore the numerator is strictly less than the denominator, so T(z)<1\abs{T(z)} < 1.

Bijectivity: Möbius transformations z(az+b)/(cz+d)z \mapsto (az + b)/(cz + d) with adbc0ad - bc \ne 0 are bijections of the Riemann sphere C{}\C \cup \set{\infty} to itself. Restriction to the half-plane gives a bijection to the disk. The inverse is w(2(w1))/(w+1)w \mapsto (2(w-1))/(w+1), which can be verified by direct computation. ∎

This Möbius geometry is the algebraic content of A-stability for bilinear: a scheme is A-stable iff its eigenvalue map sends the open left half-plane into the closed unit disk. Bilinear achieves this bijectively; ZOH does so as well but non-bijectively (it folds an infinite strip of width 2π/Δ2\pi/\stepsize onto the unit disk, aliasing high-frequency continuous modes onto low-frequency discrete ones).

Solution to Exercise 4.6

Start from variation of parameters on [tk,tk+1][t_k, t_{k+1}]:

hk+1=eAΔhk+0ΔeA(Δs)Bu(tk+s)ds.\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \int_0^{\stepsize} e^{\statemat(\stepsize - s)} \inputmat \, u(t_k + s) \, ds.

Substitute the linear interpolant u(tk+s)=uk+(s/Δ)(uk+1uk)u(t_k + s) = u_k + (s/\stepsize)(u_{k+1} - u_k) for s[0,Δ]s \in [0, \stepsize]:

0ΔeA(Δs)B[uk+sΔ(uk+1uk)]ds=I0uk+I1(uk+1uk),\int_0^{\stepsize} e^{\statemat(\stepsize - s)} \inputmat \left[u_k + \frac{s}{\stepsize}(u_{k+1} - u_k)\right] ds = I_0 \, u_k + I_1 \, (u_{k+1} - u_k),

with I0:=0ΔeA(Δs)BdsI_0 := \int_0^{\stepsize} e^{\statemat (\stepsize - s)} \inputmat \, ds and I1:=0ΔsΔeA(Δs)BdsI_1 := \int_0^{\stepsize} \tfrac{s}{\stepsize} e^{\statemat (\stepsize - s)} \inputmat \, ds. Change variables σ=Δs\sigma = \stepsize - s in I0I_0:

I0=0ΔeAσdσB=A1(eAΔI)B=Δφ1(AΔ)B,I_0 = \int_0^{\stepsize} e^{\statemat \sigma} \, d\sigma \cdot \inputmat = \statemat^{-1}(e^{\statemat \stepsize} - I) \, \inputmat = \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat,

where in the last step we substituted the definition φ1(z)=(ez1)/z\varphi_1(z) = (e^z - 1)/z applied with argument AΔ\statemat \stepsize. For I1I_1, change variables again σ=Δs\sigma = \stepsize - s, so s=Δσs = \stepsize - \sigma and s/Δ=1σ/Δs/\stepsize = 1 - \sigma/\stepsize:

I1=0Δ(1σΔ)eAσdσB=1Δ[ΔI00ΔσeAσdσB].I_1 = \int_0^{\stepsize} \left(1 - \frac{\sigma}{\stepsize}\right) e^{\statemat \sigma} d\sigma \cdot \inputmat = \frac{1}{\stepsize}\left[\stepsize \, I_0 - \int_0^{\stepsize} \sigma \, e^{\statemat \sigma} d\sigma \cdot \inputmat\right].

The remaining σ\sigma-weighted integral evaluates via integration by parts to A2(eAΔ(AΔI)+I)B=Δ2φ2(AΔ)B\statemat^{-2}(e^{\statemat \stepsize}(\statemat \stepsize - I) + I) \inputmat = \stepsize^2 \varphi_2(\statemat \stepsize) \inputmat, using the φ2(z)=(ez1z)/z2\varphi_2(z) = (e^z - 1 - z)/z^2 identity. Carefully tracking the algebra gives I1=Δφ2(AΔ)BI_1 = \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat as claimed. Combining,

hk+1=eAΔhk+Δφ1(AΔ)Buk+Δφ2(AΔ)B(uk+1uk).\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \stepsize \, \varphi_1(\statemat \stepsize) \, \inputmat \, u_k + \stepsize \, \varphi_2(\statemat \stepsize) \, \inputmat \, (u_{k+1} - u_k). \qquad \square

The local truncation error is O(Δ3)O(\stepsize^3) for C1C^1 inputs (the linear interpolation has error O(Δ2)O(\stepsize^2) over an interval of length Δ\stepsize, and the integral picks up an additional Δ\stepsize factor). Global error is therefore O(Δ2)O(\stepsize^2) — second-order, as advertised.

4.10 Companion code

Three runnable companions for Chapter 4:

JAX (companions/ch04/jax/):

  • discretization_comparison.py — implements ZOH, bilinear, and forward Euler in JAX; runs the error-vs-step-size sweep on a forced damped oscillator; emits eigenvalue_migration.png and order_convergence.png.
  • exp_trapezoidal.py — implements the exp-trapezoidal scheme via the augmented matrix exponential; verifies second-order convergence against the same forced oscillator; emits exp_trap_convergence.png.

Julia (companions/ch04/julia/):

  • discretization_atlas.jl — ports the post_transformers Week-9 reference implementation to ssm-foundations; uses DifferentialEquations.jl Tsit5 as the ground-truth reference solver and reports the empirical slope per scheme.
  • Project.toml / Manifest.toml — companion-local Julia environment pinning DifferentialEquations and LinearAlgebra.

To run from the repo root:

# JAX (uses post_transformers .venv with scipy, numpy, matplotlib, jax pre-installed)
PYTHONPATH=. python companions/ch04/jax/discretization_comparison.py
PYTHONPATH=. python companions/ch04/jax/exp_trapezoidal.py

# Julia (first run will precompile DifferentialEquations.jl; ~2–5 minutes)
julia --project=companions/ch04/julia companions/ch04/julia/discretization_atlas.jl

All figures emit to public/figures/ch04/.

Part I · Foundations Week 5 Published

Stability regions, Butcher tableau, and order conditions

Runge–Kutta theory, the Butcher tableau formalism, order conditions, stability regions in the complex plane, A-stability and L-stability, and the Dahlquist barrier — applied to the Chapter 4 discretization atlas.

Stability regions, Butcher tableau, and order conditions

5.1 Runge–Kutta methods and the Butcher tableau

Chapter 4 introduced ZOH, bilinear, and exp-trapezoidal as three specific one-step schemes for the linear state-space system. They are members of a much larger family: one-step methods that compute hk+1\statevec_{k+1} from hk\statevec_k using only one previous state and possibly intermediate stage evaluations of the right-hand side. The Runge–Kutta (RK) family is the most studied subfamily.

For the autonomous initial-value problem h˙=f(h)\dot \statevec = f(\statevec) — we drop the explicit input dependence for this chapter; everything generalizes — an ss-stage Runge–Kutta method computes

ki=f ⁣(hk+Δj=1saijkj),i=1,,s,hk+1=hk+Δi=1sbiki.\begin{aligned} k_i &= f\!\left(\statevec_k + \stepsize \sum_{j=1}^{s} a_{ij} k_j\right), \qquad i = 1, \ldots, s, \\ \statevec_{k+1} &= \statevec_k + \stepsize \sum_{i=1}^{s} b_i k_i. \end{aligned}

The kik_i are the stage derivatives — evaluations of ff at intermediate states. The constants (aij,bi,ci)(a_{ij}, b_i, c_i) define the method, where ci:=jaijc_i := \sum_j a_{ij} are the stage times. The Butcher tableau is the standard tabular notation

cAbwith A=(aij)Rs×s, bRs, cRs.\begin{array}{c|c} c & A \\ \hline & b^\top \end{array} \qquad \text{with } A = (a_{ij}) \in \R^{s \times s}, \ b \in \R^s, \ c \in \R^s.

The method is explicit if AA is strictly lower triangular (aij=0a_{ij} = 0 for jij \ge i), in which case each kik_i depends only on previously computed k1,,ki1k_1, \ldots, k_{i-1}. It is implicit otherwise, requiring a nonlinear system solve at each step.

Three classical examples make the formalism concrete.

Forward Euler. The simplest explicit method, s=1s = 1:

001hk+1=hk+Δf(hk).\begin{array}{c|c} 0 & 0 \\ \hline & 1 \end{array} \qquad \Longrightarrow \qquad \statevec_{k+1} = \statevec_k + \stepsize f(\statevec_k).

The midpoint rule (RK2). A two-stage explicit method:

0001/21/2001k1=f(hk),k2=f(hk+Δ2k1),hk+1=hk+Δk2.\begin{array}{c|cc} 0 & 0 & 0 \\ 1/2 & 1/2 & 0 \\ \hline & 0 & 1 \end{array} \qquad \Longrightarrow \qquad \begin{aligned} k_1 &= f(\statevec_k), \\ k_2 &= f(\statevec_k + \tfrac{\stepsize}{2} k_1), \\ \statevec_{k+1} &= \statevec_k + \stepsize k_2. \end{aligned}

Classical RK4. The most-quoted explicit method, four stages:

000001/21/20001/201/200100101/61/31/31/6\begin{array}{c|cccc} 0 & 0 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ \hline & 1/6 & 1/3 & 1/3 & 1/6 \end{array}

Fourth-order globally, the workhorse of physics simulations for half a century.

The tableau abstraction was introduced by John Butcher in the 1960s and is now the lingua franca of one-step methods: every reference paper on numerical integration presents its method as a Butcher tableau. The constraint jaij=ci\sum_j a_{ij} = c_i (consistency of stage times) and ibi=1\sum_i b_i = 1 (consistency of the weighted sum) are imposed by every tableau worth its name. Without them the method does not even reproduce the trivial constant trajectory h(t)=h0\statevec(t) = \statevec_0 when f0f \equiv 0.

5.2 Order conditions

How do we know that classical RK4 is fourth-order globally? The convergence order of an RK method is determined by order conditions: polynomial identities on the entries of (A,b,c)(A, b, c) that must hold for the local truncation error to be O(Δp+1)O(\stepsize^{p+1}).

The derivation expands h(tk+Δ)\statevec(t_k + \stepsize) as a Taylor series and matches term-by-term against the RK output hk+1\statevec_{k+1} produced from h(tk)\statevec(t_k). For a scalar autonomous ODE the leading conditions are:

  • Order 1: ibi=1\sum_i b_i = 1
  • Order 2: also ibici=1/2\sum_i b_i c_i = 1/2
  • Order 3: also ibici2=1/3\sum_i b_i c_i^2 = 1/3 and i,jbiaijcj=1/6\sum_{i,j} b_i a_{ij} c_j = 1/6
  • Order 4: four more conditions involving products of bi,ci,aijb_i, c_i, a_{ij}

For systems (the case we actually care about), the number of order conditions grows combinatorially: 5 conditions for order 4, 17 for order 5, 37 for order 6 Hairer et al. (1993) . Butcher organized these conditions using a bijection between order-conditions and rooted trees (each tree corresponds to an integer partition of ff-evaluations and gives one polynomial identity). The combinatorial explosion is one reason 8th-order RK methods are rarely used: there are over 200 order conditions to satisfy simultaneously, and the resulting tableaux have dozens of stages.

.

(Butcher order conditions.) Suppose ff is sufficiently smooth. An ss-stage Runge–Kutta method with tableau (A,b,c)(A, b, c) is of order pp — meaning hk+1h(tk+1)=O(Δp+1)\statevec_{k+1} - \statevec(t_{k+1}) = O(\stepsize^{p+1}) — if and only if a specific finite set of polynomial conditions on (A,b,c)(A, b, c) holds (one per Butcher tree of order p\le p). The Lax equivalence theorem (Chapter 4) then guarantees global convergence at order pp.

The proof is non-trivial and we will not reproduce it; the canonical reference is Hairer–Nørsett–Wanner Chapter II.2 Hairer et al. (1993) . For our purposes the takeaway is that the order of an RK method is a property of its tableau entries, derivable by algebraic manipulation alone. You can verify (Exercise 5.1) that RK4’s tableau satisfies all 4 order-4 conditions; you can also verify (Exercise 5.2) that an arbitrary two-stage explicit tableau cannot achieve order higher than 2, regardless of how you choose the free parameters.

Butcher tableau matrix visualizations for forward Euler, midpoint RK2, classical RK4, and Runge-Kutta-Fehlberg 4(5).
Butcher tableau matrices for four canonical methods, visualized as heatmaps of the (A | b, c) augmented matrix. Forward Euler (1×1, trivial). Midpoint RK2 (2×2, single off-diagonal). Classical RK4 (4×4, characteristic strictly-lower-triangular A with the $b = (1/6, 1/3, 1/3, 1/6)$ weights). RKF45 (6×6, dense lower-triangular). The sparsity pattern of A is what distinguishes explicit from implicit methods. Produced by companions/ch05/jax/stability_regions.py.
Log-log error vs step size for forward Euler (slope 1), midpoint RK2 (slope 2), and classical RK4 (slope 4) on a forced damped oscillator.
Empirical verification of Theorem 5.2 (Butcher order conditions). Each method's local truncation error decreases as $\\Delta^{p+1}$, giving global error $\\Delta^p$ — a $p$-th order method has slope $p$ on the log-log plot. RK1, RK2, and RK4 all match their theoretical orders to within ~5% on this forced linear oscillator. Produced by companions/ch05/jax/order_verification.py.

5.3 The stability function and stability regions

Order conditions characterize how a method behaves as Δ0\stepsize \to 0 on a smooth problem. Stability characterizes how it behaves at fixed Δ\stepsize on a stiff problem — when the time scales in the dynamics are well-separated and at least one is much faster than Δ\stepsize.

The standard probe is the Dahlquist test equation

y˙(t)=λy(t),y(0)=1,λC.\dot y(t) = \lambda y(t), \qquad y(0) = 1, \qquad \lambda \in \C.

The exact solution is y(t)=eλty(t) = e^{\lambda t}. Applying an ss-stage RK method to this scalar problem produces, after one step,

yk+1=\stabfn(λΔ)yk,y_{k+1} = \stabfn(\lambda \stepsize) \, y_k,

where \stabfn:CC\stabfn : \C \to \C is the stability function of the method. For an explicit RK method, \stabfn(z)\stabfn(z) is a polynomial in zz of degree s\le s:

\stabfnexplicit(z)=1+zb(IzA)11,\stabfn_{\text{explicit}}(z) = 1 + z b^\top (I - z A)^{-1} \mathbf{1},

where 1\mathbf{1} is the all-ones ss-vector and the inverse expands as a finite geometric series because AA is nilpotent (strictly lower-triangular). For an implicit RK method, \stabfn(z)\stabfn(z) is a rational function P(z)/Q(z)P(z)/Q(z).

Three examples make this concrete.

Forward Euler. With A=0A = 0, b=1b = 1: \stabfn(z)=1+z\stabfn(z) = 1 + z.

Midpoint RK2. Working through the formula gives \stabfn(z)=1+z+z2/2\stabfn(z) = 1 + z + z^2/2 — the second-order Taylor polynomial of eze^z.

Classical RK4. \stabfn(z)=1+z+z2/2+z3/6+z4/24\stabfn(z) = 1 + z + z^2/2 + z^3/6 + z^4/24 — the fourth-order Taylor polynomial.

A pattern emerges: for a pp-th order explicit RK method, \stabfn(z)\stabfn(z) agrees with eze^z through terms of degree pp. This is no coincidence — order pp requires the method to reproduce eze^z correctly on the test problem through that order in Δ\stepsize. The error \stabfn(z)ez\stabfn(z) - e^z is O(zp+1)O(z^{p+1}) near z=0z = 0.

The stability region of the method is the set

\stabregion = \set{z \in \C : \abs{\stabfn(z)} \le 1}.

On the test problem with λ\lambda in the left half-plane (so the exact solution decays), the method’s iterates yk=\stabfn(λΔ)ky_k = \stabfn(\lambda \stepsize)^k stay bounded iff λΔ\stabregion\lambda \stepsize \in \stabregion. The stability region is the constraint on λΔ\lambda \stepsize that keeps the method well-behaved on stiff problems — equivalently, the set of Δ\stepsize values you can take for a given λ\lambda.

Stability regions of forward Euler (disk), midpoint RK2 (lobe), and classical RK4 (kidney) in the complex plane.
Stability regions of three explicit RK methods. Forward Euler: disk of radius 1 centered at $z = -1$. Midpoint RK2: oval extending further into the left half-plane. Classical RK4: kidney-shaped region that includes a portion of the imaginary axis (relevant for oscillatory problems). None of these regions contains the entire left half-plane — explicit RK methods are never A-stable. Produced by companions/ch05/jax/stability_regions.py.

The stability region of forward Euler is the disk 1+z1\abs{1 + z} \le 1 — a disk of radius 1 centered at z=1z = -1. For a real-eigenvalue λ<0\lambda < 0, this requires Δ2/λ\stepsize \le 2/\abs{\lambda}. The tighter this constraint becomes (large λ\abs{\lambda} = stiff problem), the smaller the maximum step size — this is the step-size restriction that motivates implicit methods.

5.4 A-stability, L-stability, and the Dahlquist barrier

A method is A-stable if its stability region contains the entire closed left half-plane: \stabregion{zC:Re(z)0}\stabregion \supseteq \set{z \in \C : \operatorname{Re}(z) \le 0}. On a stable linear problem with Re(λ)0\operatorname{Re}(\lambda) \le 0, an A-stable method preserves stability for every positive step size — no step-size restriction at all. This is the gold standard for stiff problems.

A method is L-stable if it is A-stable and \stabfn(z)0\stabfn(z) \to 0 as Re(z)\operatorname{Re}(z) \to -\infty. L-stability damps fast eigenmodes aggressively, which is desirable when you want the discrete method to not preserve fast oscillations the continuous solution decays. Backward Euler is L-stable; the trapezoidal rule (= bilinear from Chapter 4) is A-stable but not L-stable (its stability function \stabfn(z)=(1+z/2)/(1z/2)\stabfn(z) = (1 + z/2)/(1 - z/2) tends to 1-1, not 0, as Re(z)\operatorname{Re}(z) \to -\infty). The distinction matters when the system has eigenvalues with very large negative real part: bilinear will preserve high-frequency oscillation modes that should physically be damped.

The fundamental theorem on A-stability of explicit methods is the Dahlquist barrier:

.

(Dahlquist barrier for explicit RK.) No explicit Runge–Kutta method is A-stable. The maximum order of an A-stable Runge–Kutta method (explicit or implicit) is unbounded, but explicit methods cannot achieve A-stability at any order.

The proof for the explicit case is direct: an explicit RK method has \stabfn(z)=\stabfn(z) = polynomial in zz, and any polynomial diverges as z\abs{z} \to \infty. So \stabregion\stabregion is bounded, and cannot contain the entire (unbounded) left half-plane. ∎

The implicit case is more subtle: there exists an A-stable Runge–Kutta method of every order, but the constructions are increasingly elaborate. The Gauss–Legendre family — Chapter 6’s main attraction — produces A-stable implicit methods of every even order. The ss-stage Gauss–Legendre method is order 2s2s, optimally accurate among ss-stage implicit RK methods, and A-stable, and (a bonus for Hamiltonian systems) symplectic.

The practical consequence for SSM discretization: you cannot have an explicit, A-stable, high-order one-step method. You must pick at most two of the three properties (explicit, A-stable, order ≥ 3). The Chapter 4 schemes resolve this trilemma differently:

  • Forward Euler: explicit, order 1, not A-stable.
  • ZOH and exp-trapezoidal: A-stable and of arbitrary order on autonomous problems, by virtue of using the matrix exponential — but they sit outside the explicit-RK family (they have non-polynomial stability functions).
  • Bilinear: A-stable, order 2, implicit in form (requires inverting IΔA/2I - \stepsize \statemat / 2).

The exponential integrators of §4.5 are a way to evade the Dahlquist barrier: by computing eAΔe^{\statemat \stepsize} exactly, the stability function inherits the full LHP-stability of the exponential, and the polynomial-order restriction is moved from \stabfn\stabfn to a different object (the approximation of the forcing by φ\varphi-functions). This is why exp-trapezoidal can be both “explicit in the input” (no nonlinear solve in uu) and A-stable.

5.5 Stability regions of the Chapter 4 schemes

Reading the three Chapter 4 schemes through the lens of stability functions:

ZOH. From §4.3, Aˉ=eAΔ\discA = e^{\statemat \stepsize}, so for the scalar test problem \stabfnZOH(z)=ez\stabfn_{\text{ZOH}}(z) = e^z. The stability region is {z:ez1}={z:Re(z)0}\set{z : \abs{e^z} \le 1} = \set{z : \operatorname{Re}(z) \le 0} — exactly the closed left half-plane. ZOH is A-stable, in fact also L-stable because ez0\abs{e^z} \to 0 as Re(z)\operatorname{Re}(z) \to -\infty.

Bilinear (Tustin). From §4.4, \stabfnbilinear(z)=(1+z/2)/(1z/2)\stabfn_{\text{bilinear}}(z) = (1 + z/2)/(1 - z/2). The Möbius geometry from Chapter 4 §4.4 / Exercise 4.5 says the LHP maps bijectively to the open unit disk, so the stability region is exactly the closed left half-plane — bilinear is A-stable. But \stabfnbilinear(z)1\stabfn_{\text{bilinear}}(z) \to -1 as Re(z)\operatorname{Re}(z) \to -\infty, so bilinear is not L-stable. This is the trapezoidal-rule failure mode: oscillatory artifacts on very stiff problems.

Exp-trapezoidal. The homogeneous stability function is identical to ZOH’s: \stabfn(z)=ez\stabfn(z) = e^z on the test problem (the difference between ZOH and exp-trap shows up only in the forcing integrals; the homogeneous behavior is the same). So exp-trapezoidal is A-stable and L-stable, just like ZOH.

Stability regions of forward Euler, ZOH, bilinear, and exp-trapezoidal overlaid in the complex plane.
Stability regions of the schemes from Chapter 4 §4.6, with classical RK1 and RK4 included for reference. Forward Euler (RK1, navy) and classical RK4 (gold): bounded regions in the LHP. ZOH / exp-trapezoidal (rose) and bilinear (gray): the entire closed left half-plane — A-stable. Among the explicit-RK family, no method achieves A-stability (Dahlquist barrier); the schemes that do are the ones using $e^{A \\Delta}$ as the dynamics factor. Produced by companions/ch05/jax/stability_regions.py.

5.6 What’s next

This chapter has laid out the systematic framework for analyzing one-step methods: Butcher tableau, order conditions, stability functions, stability regions, A-stability and L-stability, and the Dahlquist barrier. Chapter 6 uses this framework to motivate the implicit family — backward Euler, BDF, and the Gauss–Legendre IRK schemes that achieve A-stability at every order — and the symplectic family, which trades order or A-stability for the preservation of geometric invariants. Chapter 6 is where the C1 niche pilot lives: symplectic integration of selective SSMs viewed as Hamiltonian flows on the hidden-state manifold.

A natural side path is embedded methods — pairs of RK tableaux of consecutive orders sharing stages, used for adaptive step-size control. The DP54 method (Dormand–Prince 5(4)) used by DifferentialEquations.jl’s default solver Tsit5 is a famous example. Embedded methods are excellent practical tools but tangential to the SSM pedagogy of this book; the companion discretization_atlas.jl references the RKF45 / DP54 tableaux for completeness.

5.7 Exercises

Six problems. Short/numerical (5.1–5.3) have inline collapsible solutions; long/proof exercises (5.4–5.6) have full worked solutions in §5.8.

Exercise 5.1 (computation)

Verify that classical RK4 satisfies the four order-conditions of order 4 — at least the simpler ones: bi=1\sum b_i = 1, bici=1/2\sum b_i c_i = 1/2, bici2=1/3\sum b_i c_i^2 = 1/3. Use the tableau in §5.1.

Solution

From the tableau, b=(1/6,1/3,1/3,1/6)b = (1/6, 1/3, 1/3, 1/6) and c=(0,1/2,1/2,1)c = (0, 1/2, 1/2, 1).

  • bi=1/6+1/3+1/3+1/6=2/6+4/6=1\sum b_i = 1/6 + 1/3 + 1/3 + 1/6 = 2/6 + 4/6 = 1. ✓
  • bici=(1/6)(0)+(1/3)(1/2)+(1/3)(1/2)+(1/6)(1)=0+1/6+1/6+1/6=1/2\sum b_i c_i = (1/6)(0) + (1/3)(1/2) + (1/3)(1/2) + (1/6)(1) = 0 + 1/6 + 1/6 + 1/6 = 1/2. ✓
  • bici2=(1/6)(0)+(1/3)(1/4)+(1/3)(1/4)+(1/6)(1)=0+1/12+1/12+1/6=2/12+2/12=1/3\sum b_i c_i^2 = (1/6)(0) + (1/3)(1/4) + (1/3)(1/4) + (1/6)(1) = 0 + 1/12 + 1/12 + 1/6 = 2/12 + 2/12 = 1/3. ✓

Exercise 5.2 (computation)

Show that an arbitrary two-stage explicit RK tableau (one free parameter c2(0,1]c_2 \in (0, 1], with a21=c2a_{21} = c_2, b2b_2 to be determined, b1=1b2b_1 = 1 - b_2) cannot achieve order 3 no matter how c2,b2c_2, b_2 are chosen.

Solution

For order 2 we need b1+b2=1b_1 + b_2 = 1 (trivially b1=1b2b_1 = 1 - b_2) and b2c2=1/2b_2 c_2 = 1/2, giving b2=1/(2c2)b_2 = 1/(2c_2), b1=11/(2c2)b_1 = 1 - 1/(2c_2). So the two-parameter family (c2,b2)(c_2, b_2) collapses to a one-parameter family of order-2 methods (midpoint c2=1/2c_2 = 1/2, Heun c2=1c_2 = 1, etc.).

For order 3 we additionally need b2c22=1/3b_2 c_2^2 = 1/3 and b2a21c1=1/6b_2 a_{21} c_1 = 1/6. The second condition uses c1=0c_1 = 0, giving b2c20=01/6b_2 \cdot c_2 \cdot 0 = 0 \ne 1/6. Contradiction. So no two-stage explicit method achieves order 3.

This is a small instance of the general fact: order pp requires at least pp stages for explicit methods (Butcher’s order barriers).

Exercise 5.3 (computation + code)

Run companions/ch05/jax/order_verification.py. Confirm the empirical slopes match Forward Euler (1), midpoint RK2 (2), classical RK4 (4). What happens to the RK4 slope at very small Δ\stepsize (say Δ=104\stepsize = 10^{-4})?

Solution

The companion’s order_verification function prints slopes 1.00\approx 1.00, 2.00\approx 2.00, 4.00\approx 4.00 for the three methods in turn. At very small Δ\stepsize, the RK4 error reaches the roundoff floor — typically 101310^{-13} in float64 — and the slope flattens out as machine precision starts to dominate the method error. This is why empirical order is always estimated from the two finest dt where the method error still exceeds roundoff, not from arbitrarily fine dt.

Exercise 5.4 (theory) — solution in §5.8

Prove that for an explicit Runge–Kutta method with tableau (A,b,c)(A, b, c), the stability function is

\stabfn(z)=1+zb(IzA)11,\stabfn(z) = 1 + z b^\top (I - z A)^{-1} \mathbf{1},

where 1=(1,1,,1)\mathbf{1} = (1, 1, \ldots, 1)^\top. Verify that for classical RK4 this evaluates to 1+z+z2/2+z3/6+z4/241 + z + z^2/2 + z^3/6 + z^4/24.

Exercise 5.5 (theory) — solution in §5.8

Prove the Dahlquist-barrier statement for explicit RK methods: that no explicit RK method is A-stable. (The proof was sketched in §5.4; flesh it out, using the fact that explicit methods have polynomial stability functions.)

Exercise 5.6 (theory) — solution in §5.8

For the bilinear/Tustin scheme, show that the stability function \stabfn(z)=(1+z/2)/(1z/2)\stabfn(z) = (1 + z/2)/(1 - z/2) satisfies \stabfn(z)1\stabfn(z) \to -1 as Re(z)\operatorname{Re}(z) \to -\infty along any direction with Im(z)\operatorname{Im}(z) bounded, and explain why this rules out L-stability. Discuss what this means for the discrete dynamics of a bilinear-discretized stable LTI system with very fast-decaying eigenmodes.

5.8 Full solutions to theory exercises

Solution to Exercise 5.4

Apply the ss-stage explicit RK method to the test ODE y˙=λy\dot y = \lambda y, y(0)=yk=1y(0) = y_k = 1. The stage equations become

ki=λ(1+Δj<iaijkj)=λ+λΔj<iaijkj.k_i = \lambda \left(1 + \stepsize \sum_{j<i} a_{ij} k_j\right) = \lambda + \lambda \stepsize \sum_{j<i} a_{ij} k_j.

Set z:=λΔz := \lambda \stepsize and let κ=(k1,k2,,ks)\kappa = (k_1, k_2, \ldots, k_s)^\top. The vector form is

κ=λ1+zAκ(IzA)κ=λ1.\kappa = \lambda \mathbf{1} + z A \kappa \qquad \Longrightarrow \qquad (I - z A) \kappa = \lambda \mathbf{1}.

For an explicit method AA is strictly lower-triangular, hence nilpotent (As=0A^s = 0), and the geometric series (IzA)1=I+zA+z2A2++zs1As1(I - z A)^{-1} = I + z A + z^2 A^2 + \cdots + z^{s-1} A^{s-1} truncates. So κ=λ(IzA)11\kappa = \lambda (I - z A)^{-1} \mathbf{1}, and

yk+1=1+Δbκ=1+zb(IzA)11=\stabfn(z).y_{k+1} = 1 + \stepsize b^\top \kappa = 1 + z \cdot b^\top (I - z A)^{-1} \mathbf{1} = \stabfn(z). \quad \square

For classical RK4, direct computation of b(IzA)11b^\top (I - z A)^{-1} \mathbf{1} — expanding the geometric series and multiplying out — gives, after some algebra,

\stabfnRK4(z)=1+z+z22+z36+z424.\stabfn_{\text{RK4}}(z) = 1 + z + \frac{z^2}{2} + \frac{z^3}{6} + \frac{z^4}{24}.

This is the fourth-order Taylor polynomial of eze^z, as expected for a fourth-order method.

Solution to Exercise 5.5

For an explicit RK method, AA is strictly lower-triangular, so As=0A^s = 0 and the geometric series for (IzA)1(I - z A)^{-1} truncates at zs1z^{s-1}. Therefore

\stabfn(z)=1+zb(IzA)11\stabfn(z) = 1 + z b^\top (I - z A)^{-1} \mathbf{1}

is a polynomial in zz of degree at most ss. Polynomials are unbounded: \stabfn(z)\abs{\stabfn(z)} \to \infty as z\abs{z} \to \infty along any ray (assuming the leading coefficient is nonzero, which holds for any ss-stage method of order s\ge s, the only interesting case).

The closed left half-plane {z:Re(z)0}\set{z : \operatorname{Re}(z) \le 0} is unbounded — it extends to -\infty along the real axis. The stability region \stabregion = \set{z : \abs{\stabfn(z)} \le 1} is therefore bounded. So \stabregion\stabregion cannot contain the entire LHP, and the method is not A-stable. ∎

The proof relies essentially on \stabfn\stabfn being a polynomial. Implicit methods have rational \stabfn(z)=P(z)/Q(z)\stabfn(z) = P(z)/Q(z), which can be bounded at infinity if deg(Q)deg(P)\deg(Q) \ge \deg(P). This is the precise sense in which implicit methods evade the Dahlquist barrier.

Solution to Exercise 5.6

Write z=r+iωz = -r + i\omega with r+r \to +\infty and ω\omega bounded. Then

\stabfn(z)=1+z/21z/2=1+(r/2+iω/2)1(r/2+iω/2)=1r/2+iω/21+r/2iω/2.\stabfn(z) = \frac{1 + z/2}{1 - z/2} = \frac{1 + (-r/2 + i\omega/2)}{1 - (-r/2 + i\omega/2)} = \frac{1 - r/2 + i \omega/2}{1 + r/2 - i \omega/2}.

For large rr the dominant terms are \stabfn(z)(r/2)/(r/2)=1\stabfn(z) \approx (-r/2)/(r/2) = -1. More precisely,

\stabfn(z)=r/2+(1+iω/2)r/2+(1iω/2)=1+O(1/r)1as r.\stabfn(z) = \frac{-r/2 + (1 + i\omega/2)}{r/2 + (1 - i\omega/2)} = -1 + O(1/r) \to -1 \quad \text{as } r \to \infty.

This rules out L-stability: L-stability requires \stabfn0\stabfn \to 0, not 1\to -1. ∎

Practical implication. For a stable LTI system h˙=Ah\dot \statevec = \statemat \statevec with A\statemat having an eigenvalue λ\lambda with very large Re(λ)\abs{\operatorname{Re}(\lambda)} — a fast-decaying mode — bilinear discretization at step size Δ\stepsize produces a discrete eigenvalue μ=(1+λΔ/2)/(1λΔ/2)1\mu = (1 + \lambda \stepsize / 2)/(1 - \lambda \stepsize / 2) \approx -1. The corresponding mode in the discrete dynamics therefore alternates sign at every step before decaying, rather than smoothly decaying. This trapezoidal-rule oscillation is the classic L-stability failure mode: the discrete solution exhibits spurious oscillations not present in the continuous solution. For SSMs this manifests as oscillatory artifacts in the hidden state when the system has eigenvalues with very negative real parts — exactly the regime where stiff implicit methods like BDF (Chapter 6) are preferred over bilinear.

5.9 Companion code

Two JAX companions and one Julia companion for Chapter 5.

JAX (companions/ch05/jax/):

  • stability_regions.py — computes and plots stability regions for forward Euler, midpoint RK2, classical RK4, ZOH, bilinear, and exp-trapezoidal in the complex z=λΔz = \lambda \stepsize plane; emits rk_stability_regions.png, atlas_stability.png, and butcher_tableaux.png.
  • order_verification.py — implements RK1/RK2/RK4 from their tableaux; verifies empirical orders 1, 2, 4 on the forced damped oscillator from Chapter 4; emits order_verification.png.

Julia (companions/ch05/julia/):

  • butcher_tableau_zoo.jl — uses DifferentialEquations.jl solver tables (OrdinaryDiffEqTsit5, etc.) to extract Butcher coefficients programmatically and print them side-by-side with the analytic forms used in §5.1.
  • Project.toml / Manifest.toml — companion-local Julia environment.

To run from the repo root:

# JAX
PYTHONPATH=. python companions/ch05/jax/stability_regions.py
PYTHONPATH=. python companions/ch05/jax/order_verification.py

# Julia (first run will precompile DifferentialEquations.jl)
julia --project=companions/ch05/julia companions/ch05/julia/butcher_tableau_zoo.jl

All figures emit to public/figures/ch05/.

Part I · Foundations Week 6 Published

Implicit methods, stiff systems, symplectic integration

When explicit methods fail — backward Euler, BDF, DIRK as A-stable alternatives; exponential integrators as the family of choice for selective SSMs; Hamiltonian mechanics and symplectic integration as the natural geometric language for hidden-state dynamics.

Implicit methods, stiff systems, symplectic integration

6.1 Why explicit methods fail: stiffness

Chapter 5 ended with the Dahlquist barrier: no explicit Runge–Kutta method is A-stable. The barrier matters because of a phenomenon called stiffness — when the dynamics matrix A\statemat has eigenvalues with very different magnitudes.

A linear ODE h˙=Ah\dot \statevec = \statemat \statevec is stiff if A\statemat has at least one eigenvalue with Re(λi)1/T\operatorname{Re}(\lambda_i) \le -1/T for a time-scale TT much smaller than the simulation horizon, and the eigenvalue is paired with a long-lived (small-λ\abs{\lambda}) mode that one actually wants to track. Stiffness is therefore relative: a system is stiff with respect to a given horizon and given accuracy goal, not in absolute terms. The Robertson chemical-kinetics problem (λ\lambda‘s spanning ten orders of magnitude) is the canonical textbook example; van der Pol at parameter μ1\mu \gg 1 is a popular ML-adjacent stand-in because it generates a nonlinear oscillation whose slow and fast phases differ by μ\mu.

For an explicit method, the step size is bounded by the fastest eigenvalue: Δ2/λmax\stepsize \lesssim 2/\abs{\lambda_{\max}} (forward Euler) or some other small constant times 1/λmax1/\abs{\lambda_{\max}} (RK4). If the slow eigenvalue we want to track is λmin\abs{\lambda_{\min}}, the integrator takes λmax/λmin\abs{\lambda_{\max}}/\abs{\lambda_{\min}} times more steps than the slow time scale would suggest. For Robertson this ratio is 101010^{10}. The simulation never finishes.

For an SSM, stiffness arises naturally for input-dependent dynamics. Mamba’s selective scan generates an effective A(ut)\statemat(u_t) at each token; for inputs that produce eigenvalues with very different magnitudes, the time-discretized recurrence is solving a stiff problem at each step. If the discretization is not A-stable, the recurrence either blows up or requires a step size Δ1\stepsize \ll 1 — at which point you are running the model in an absurd regime. This is the structural reason Mamba-3 switched to exponential-trapezoidal (§4.5), which is A-stable in A\statemat but explicit in uu.

6.2 Implicit methods: backward Euler, BDF, DIRK

The fix for stiffness is to abandon explicit methods. Backward Euler is the simplest implicit method:

hk+1=hk+Δf(hk+1).\statevec_{k+1} = \statevec_k + \stepsize f(\statevec_{k+1}).

Solving for hk+1\statevec_{k+1} requires either a linear solve (for linear ff) or a Newton iteration (for nonlinear ff). The pay-off is the stability function

\stabfnBE(z)=11z,\stabfn_{\text{BE}}(z) = \frac{1}{1 - z},

which has \stabfnBE(z)1\abs{\stabfn_{\text{BE}}(z)} \le 1 for every zz with Re(z)0\operatorname{Re}(z) \le 0 — backward Euler is A-stable. It is also L-stable: \stabfnBE(z)0\stabfn_{\text{BE}}(z) \to 0 as Re(z)\operatorname{Re}(z) \to -\infty, so fast modes get damped aggressively as desired on stiff problems.

.

(Backward Euler is L-stable.) Backward Euler’s stability function \stabfn(z)=1/(1z)\stabfn(z) = 1/(1-z) satisfies \stabfn(z)1\abs{\stabfn(z)} \le 1 for every zCz \in \C with Re(z)0\operatorname{Re}(z) \le 0, and \stabfn(z)0\stabfn(z) \to 0 as Re(z)\operatorname{Re}(z) \to -\infty.

The proof is one line each: 1z2=(1Re(z))2+Im(z)21\abs{1 - z}^2 = (1 - \operatorname{Re}(z))^2 + \operatorname{Im}(z)^2 \ge 1 when Re(z)0\operatorname{Re}(z) \le 0, giving \stabfn(z)1\abs{\stabfn(z)} \le 1. The limit is immediate.

Backward Euler is first-order accurate. Higher-order implicit methods come in several families.

BDFkk (Backward Differentiation Formula, kkth-order) is a multi-step method that approximates h˙(tk+1)\dot \statevec(t_{k+1}) by a kk-point backward difference and equates it to f(hk+1)f(\statevec_{k+1}). BDF1 = backward Euler; BDF2 is second-order and L-stable; BDFkk for k=3,,6k = 3, \ldots, 6 are conditionally stable but widely used in stiff ODE codes. BDF7+ is unstable. The order limit is the second Dahlquist barrier, a structural fact about multi-step methods that we will state without proof: no AA-stable linear multi-step method has order higher than 2.

Diagonally implicit RK (DIRK) sits between explicit RK and fully implicit RK. The AA matrix is lower triangular with non-zero diagonal, so each stage requires solving an independent nonlinear system (cheaper than the coupled system of fully implicit RK, but more expensive than the trivial explicit evaluations). Singly-DIRK (SDIRK) further constrains all diagonal entries to be equal, so the Newton Jacobian factorizes once per step. The Crank–Nicolson scheme is a 2-stage SDIRK of order 2.

Fully implicit RK — the most expensive family — includes Gauss–Legendre methods, which we will return to in §6.5 as the optimal symplectic methods.

The computational trade-off for implicit methods: each step costs O(N3)O(N^3) (linear solve for the Jacobian factor) or O(iterationsN2)O(\text{iterations} \cdot N^2) (Newton iteration for nonlinear ff). For SSMs with N=16,64N = 16, 64 this is acceptable; for NN in the thousands it begins to dominate. The Chapter 4 exponential-trapezoidal scheme has the same A-stability as backward Euler but only costs one matrix-exponential per timestep (vs one Newton solve), and the matrix exponential of a low-rank or diagonalized A\statemat is cheap. Mamba-3’s specific tactical advantage over a backward-Euler implementation comes down to this cost asymmetry.

6.3 Exponential integrators revisited

The exp-trapezoidal scheme of §4.5 is the simplest member of the exponential integrator family. Recall the general idea: given h˙=Ah+Bu(t)\dot \statevec = \statemat \statevec + \inputmat u(t), treat the linear part exactly via eAΔe^{\statemat \stepsize} and approximate only the forcing integral. The φ\varphi-function family φk(z)=(ezj=0k1zj/j!)/zk\varphi_k(z) = (e^z - \sum_{j=0}^{k-1} z^j/j!)/z^k provides building blocks for arbitrarily high order.

Three commonly used schemes:

  • exp-Euler: hk+1=eAΔhk+Δφ1(AΔ)Buk\statevec_{k+1} = e^{\statemat \stepsize} \statevec_k + \stepsize \varphi_1(\statemat \stepsize) \inputmat u_k. Treats the input as piecewise constant (like ZOH); first-order for forced systems, exact on autonomous problems.
  • exp-midpoint: uses the input midpoint instead of uku_k; second-order for symmetric input perturbations.
  • exp-trapezoidal (§4.5): linear interpolation of the input; second-order for C1C^1 inputs.
  • ETDRK4 Hochbruck & Ostermann (2010) : a four-stage exponential RK method that achieves fourth order. The Mamba-3 paper does not (yet) use ETDRK4; its second-order exp-trapezoidal is the empirical sweet spot.

All exponential integrators inherit A-stability from eAΔe^{\statemat \stepsize}. They sidestep the Dahlquist barrier because their stability function is not a polynomial — it is eze^z times a rational-in-zz correction factor — and the polynomial constraint of explicit RK methods does not apply.

The implementation cost: one matrix exponential plus one or two φ\varphi-function evaluations per step. For a low-rank A\statemat (which selective SSMs typically have, since the structured-A\statemat pattern of Chapter 7 carries forward), the matrix exponential is cheap. For dense general A\statemat exponentials cost O(N3)O(N^3) via Padé / scaling-and-squaring, the same as a Newton solve — so the choice between exp-trap and backward Euler often comes down to whether you want autonomous-exactness (exp-trap wins) or L-stability damping of high-frequency modes (backward Euler wins).

6.4 Hamiltonian mechanics, energy conservation, and the symplectic structure

The remainder of this chapter introduces the language of Hamiltonian systems and symplectic integration. This is where the direct-transfer hook from continuum mechanics applies most cleanly. The exposition is brief and standalone; for the full theory see Hairer–Lubich–Wanner Hairer & Wanner (1996) Volume II Chapter VI.

A Hamiltonian system on phase space (q,p)Rn×Rn(q, p) \in \R^n \times \R^n is governed by a scalar function \hamilton(q,p)\hamilton(q, p) — the Hamiltonian, intended to represent total energy — through Hamilton’s equations:

q˙=\hamiltonp,p˙=\hamiltonq.\dot q = \frac{\partial \hamilton}{\partial p}, \qquad \dot p = -\frac{\partial \hamilton}{\partial q}.

The canonical example is the harmonic oscillator \hamilton(q,p)=12(p2+ω2q2)\hamilton(q, p) = \tfrac{1}{2}(p^2 + \omega^2 q^2), giving q˙=p\dot q = p, p˙=ω2q\dot p = -\omega^2 q. The pendulum is \hamilton(q,p)=12p2cos(q)\hamilton(q, p) = \tfrac{1}{2} p^2 - \cos(q); vortex-dynamics flows are higher-dimensional but structurally identical.

Three properties make Hamiltonian systems geometrically special.

Energy conservation. Along any trajectory (q(t),p(t))(q(t), p(t)), ddt\hamilton(q(t),p(t))=\hamiltonqq˙+\hamiltonpp˙=\hamiltonq\hamiltonp\hamiltonp\hamiltonq=0\frac{d}{dt} \hamilton(q(t), p(t)) = \frac{\partial \hamilton}{\partial q} \dot q + \frac{\partial \hamilton}{\partial p} \dot p = \frac{\partial \hamilton}{\partial q} \frac{\partial \hamilton}{\partial p} - \frac{\partial \hamilton}{\partial p} \frac{\partial \hamilton}{\partial q} = 0. The Hamiltonian is a constant of motion.

Phase-space volume preservation (Liouville’s theorem). The flow Φt:(q0,p0)(q(t),p(t))\Phi_t : (q_0, p_0) \mapsto (q(t), p(t)) has Jacobian determinant 1 everywhere: detΦt1\det \nabla \Phi_t \equiv 1. Volumes in phase space are preserved exactly under the dynamics.

Symplectic structure. Define the symplectic 2-form \symform=idqidpi\symform = \sum_i dq_i \wedge dp_i. The flow Φt\Phi_t is a symplectic transformation: it preserves \symform\symform in the sense that Φt\symform=\symform\Phi_t^* \symform = \symform where Φt\Phi_t^* denotes the pullback. Symplecticity is a strict refinement of volume preservation — every symplectic transformation is volume-preserving, but not vice versa.

The reason these properties matter for numerics is that standard order-optimized integrators destroy them. A fourth-order RK method simulates the harmonic oscillator with O(Δ4)O(\stepsize^4) local error per step — but it does not exactly preserve energy. Over T/ΔT/\stepsize steps, the energy error accumulates as O(Δ3)O(\stepsize^3) per period times the number of periods — leading to linear-in-time energy drift that eventually destroys the trajectory’s qualitative behavior. RK4 on a pendulum will, over 1000 periods, gradually inflate the orbit and unbound the system. RK4 on a planetary orbit will (eventually) eject the planet. These are not numerical instabilities in the Chapter 5 sense — the method is A-stable for the small-amplitude regime — they are geometric errors: the wrong qualitative dynamics.

6.5 Symplectic integrators

A symplectic integrator is a numerical map ΨΔ:(q,p)(q,p)\Psi_\stepsize : (q, p) \mapsto (q', p') that preserves the symplectic 2-form: ΨΔ\symform=\symform\Psi_\stepsize^* \symform = \symform. Symplectic integrators do not, in general, preserve \hamilton\hamilton exactly — but they preserve a nearby modified Hamiltonian \hamilton~=\hamilton+O(Δp)\widetilde \hamilton = \hamilton + O(\stepsize^p) for a pp-th order symplectic method. As a consequence, the energy \hamilton(qk,pk)\hamilton(q_k, p_k) along the discrete trajectory oscillates within a bounded band of width O(Δp)O(\stepsize^p) around its initial value — not drifting linearly with tt, but staying near a level set of \hamilton~\widetilde \hamilton. This is the backward error analysis of geometric integrators, and it is the practical reason symplectic methods are mandatory for long-horizon Hamiltonian simulations.

Three canonical schemes.

Symplectic Euler (1st order). For a separable Hamiltonian \hamilton(q,p)=T(p)+V(q)\hamilton(q, p) = T(p) + V(q):

pk+1=pkΔV(qk),qk+1=qk+ΔT(pk+1).p_{k+1} = p_k - \stepsize \, V'(q_k), \qquad q_{k+1} = q_k + \stepsize \, T'(p_{k+1}).

Update momentum first using position-derived force; then update position using the updated momentum. The asymmetry — using pk+1p_{k+1} in the second step rather than pkp_k — is what makes the scheme symplectic. A symmetric version updates qq first, then pp; both are valid.

Störmer–Verlet (2nd order, symmetric). A symmetric version of symplectic Euler that does a half-step in momentum, full step in position, half-step in momentum:

pk+1/2=pkΔ2V(qk),qk+1=qk+ΔT(pk+1/2),pk+1=pk+1/2Δ2V(qk+1).\begin{aligned} p_{k+1/2} &= p_k - \tfrac{\stepsize}{2} V'(q_k), \\ q_{k+1} &= q_k + \stepsize \, T'(p_{k+1/2}), \\ p_{k+1} &= p_{k+1/2} - \tfrac{\stepsize}{2} V'(q_{k+1}). \end{aligned}

This is the algorithm used in essentially every molecular-dynamics simulation. Second-order accurate, symplectic, time-reversible (the same algorithm run with Δ-\stepsize exactly reverses the trajectory). The energy error is bounded by O(Δ2)O(\stepsize^2) over arbitrarily long horizons.

Gauss–Legendre IRK (order 2s2s, A-stable, symplectic). The ss-stage Gauss–Legendre Runge–Kutta method has nodes at the Gauss–Legendre quadrature points and is the unique ss-stage RK method of order 2s2s. The 1-stage method is the implicit midpoint rule (order 2, symplectic, A-stable). The 2-stage method is order 4. Gauss–Legendre IRK is the unique family that is simultaneously A-stable, symplectic, and of optimal order — making it the gold standard for high-accuracy long-horizon Hamiltonian simulation.

Energy of a harmonic oscillator over 100 periods, integrated with classical RK4 (linear drift) versus Störmer-Verlet (bounded oscillation).
Long-horizon energy comparison on the harmonic oscillator $\\dot q = p$, $\\dot p = -q$ over 100 periods at $\\Delta = 0.05$. Classical RK4 (gold) accumulates linear-in-time energy drift (~$10^{-3}$ after 100 periods despite local error $O(\\Delta^4)$). Störmer-Verlet (navy) oscillates within a band of width $O(\\Delta^2)$ but the running average stays at the initial energy. The qualitative difference between linear drift and bounded oscillation is the defining property of symplectic vs non-symplectic integrators. Produced by companions/ch06/jax/symplectic_demo.py.
.

(Modified Hamiltonian for symplectic methods.) For a symplectic integrator ΨΔ\Psi_\stepsize of order pp applied to a Hamiltonian system, there exists a modified Hamiltonian \hamilton~=\hamilton+ΔpHp+Δp+1Hp+1+\widetilde \hamilton = \hamilton + \stepsize^p H_p + \stepsize^{p+1} H_{p+1} + \cdots such that the discrete trajectory exactly preserves \hamilton~\widetilde \hamilton to within exponentially small terms over exponentially long times. The original \hamilton\hamilton therefore oscillates within a bounded O(Δp)O(\stepsize^p) band along the discrete trajectory.

The proof — Hairer–Lubich–Wanner Chapter IX — uses backward error analysis to expand the modified Hamiltonian as an asymptotic series in Δ\stepsize and shows the series can be truncated with controlled remainder. The takeaway for practice: a pp-th order symplectic integrator is exponentially better than a pp-th order non-symplectic one on Hamiltonian problems, even though the local truncation error of both is O(Δp+1)O(\stepsize^{p+1}). Local error is misleading for long-horizon problems; geometric structure is what matters.

6.6 Toward selective SSMs as Hamiltonian flows

The C1 niche pilot proposes the following research program. The hidden-state dynamics of a selective SSM can sometimes be written as a Hamiltonian flow on the hidden-state manifold; when they can, applying a symplectic integrator should reduce long-horizon error in a way that order-optimized integrators cannot match. The first step is to characterize when this Hamiltonian structure exists.

For a linear-time-invariant SSM h˙=Ah\dot \statevec = \statemat \statevec, the dynamics are Hamiltonian iff there exists a symmetric positive-definite PP such that AP+PA=0\statemat^\top P + P \statemat = 0. Equivalently, A\statemat is similar (via P1/2P^{1/2}) to a skew-symmetric matrix. The quadratic Hamiltonian is then \hamilton(h)=12hPh\hamilton(\statevec) = \tfrac{1}{2} \statevec^\top P \statevec, and the symplectic 2-form is \symform=P1\symform = P^{-1} (in the standard symplectic-flat coordinates). For HiPPO-LegS the eigenvalues are all real-negative, not purely imaginary, so HiPPO-LegS dynamics are dissipative, not Hamiltonian. Symplectic methods do not apply. This is consistent with the empirical observation that ZOH and bilinear discretizations of HiPPO-LegS perform well — dissipation handles the long-horizon stability story.

For a selective SSM with input-dependent A(u)\statemat(u), the relevant question becomes: does A(u)\statemat(u) have a Hamiltonian decomposition for some range of inputs, even if not for all? The C1 pilot’s empirical hypothesis is that during certain training phases the learned A(u)\statemat(u) matrices drift toward eigenvalue spectra that are closer to purely imaginary than to the LHP, and that this drift is associated with the long-horizon recall artifacts observed in Mamba’s failure modes Anonymous (2025) . If this hypothesis holds, then symplectic discretization of selective SSMs should reduce those artifacts. The Chapter 17 pilot integration develops this thread; this chapter has supplied the necessary integration-theory vocabulary.

Phase portrait of the pendulum: closed orbits preserved by symplectic Euler, spirals (energy drift) under RK4.
Pendulum $\\hamilton = p^2/2 - \\cos q$ phase portrait over 50 periods at $\\Delta = 0.1$ (large step size, chosen to expose qualitative differences). Symplectic Euler (navy): trajectory stays on a closed orbit (constant modified energy). RK4 (gold): orbit slowly spirals outward (linear energy drift). Both methods are formally stable; the qualitative difference is geometric. Produced by companions/ch06/jax/symplectic_demo.py.

6.7 What’s next

This chapter completes the foundations layer of the book (Part I). The next part — Chapters 7–10 — introduces structured SSMs (S4, S4D, S5, Mamba-1/2, Mamba-3) as discretizations of the continuous systems developed here. Chapter 7 sets up HiPPO theory; Chapter 8 derives the S4 / S4D / S5 architectures and shows where bilinear discretization shines; Chapter 9 introduces selective scans (Mamba-1/2); Chapter 10 brings the exp-trapezoidal scheme of §4.5 to its home in Mamba-3.

The C1 pilot integration in Chapter 17 returns to this chapter’s symplectic story, applying the modified-Hamiltonian framework to learned selective dynamics. Readers focused on the SSM applications can skip to Chapter 7 now and consult §6.5–§6.6 when the symplectic discussion comes back.

6.8 Exercises

Six problems. Short/numerical (6.1–6.3) have inline collapsible solutions; long/proof exercises (6.4–6.6) have full worked solutions in §6.9.

Exercise 6.1 (computation)

Verify that backward Euler’s stability function is \stabfnBE(z)=1/(1z)\stabfn_{\text{BE}}(z) = 1/(1-z) by directly applying the scheme to the test problem y˙=λy\dot y = \lambda y and solving for yk+1y_{k+1}.

Solution

Backward Euler: yk+1=yk+Δλyk+1y_{k+1} = y_k + \stepsize \lambda y_{k+1}. Rearranging, (1Δλ)yk+1=yk(1 - \stepsize \lambda) y_{k+1} = y_k, so yk+1=yk/(1Δλ)y_{k+1} = y_k / (1 - \stepsize \lambda). Setting z=Δλz = \stepsize \lambda, the stability function is \stabfnBE(z)=1/(1z)\stabfn_{\text{BE}}(z) = 1/(1 - z). ∎

Exercise 6.2 (computation)

For the harmonic oscillator q˙=p\dot q = p, p˙=q\dot p = -q (so \hamilton=(p2+q2)/2\hamilton = (p^2 + q^2)/2), compute one step of symplectic Euler (“p first” variant) starting from (q0,p0)=(1,0)(q_0, p_0) = (1, 0) at Δ=0.1\stepsize = 0.1. What are (q1,p1)(q_1, p_1), and what is \hamilton(q1,p1)\hamilton(q_1, p_1)?

Solution

Symplectic Euler (pp first): pk+1=pkΔqkp_{k+1} = p_k - \stepsize \cdot q_k, then qk+1=qk+Δpk+1q_{k+1} = q_k + \stepsize \cdot p_{k+1}.

Starting from (1,0)(1, 0): p1=00.11=0.1p_1 = 0 - 0.1 \cdot 1 = -0.1. Then q1=1+0.1(0.1)=0.99q_1 = 1 + 0.1 \cdot (-0.1) = 0.99. So (q1,p1)=(0.99,0.1)(q_1, p_1) = (0.99, -0.1).

Energy: \hamilton(q1,p1)=(0.992+0.12)/2=(0.9801+0.01)/2=0.49505\hamilton(q_1, p_1) = (0.99^2 + 0.1^2)/2 = (0.9801 + 0.01)/2 = 0.49505. Initial energy was 0.50.5. The energy is not exactly preserved (it changed by 0.00495Δ2/2-0.00495 \approx -\stepsize^2/2), but it is bounded — over many steps it will oscillate, not drift.

Exercise 6.3 (computation + code)

Run companions/ch06/jax/symplectic_demo.py. Verify that classical RK4 produces linear-in-time energy drift on the harmonic oscillator while Störmer–Verlet’s energy stays in a bounded band. What is the drift rate (energy per period) of RK4 at Δ=0.05\stepsize = 0.05?

Solution

The companion’s main output shows two energy-vs-time traces over 100 periods. RK4 drifts approximately linearly; the drift rate is roughly 10510^{-5} per period at Δ=0.05\stepsize = 0.05 — consistent with O(Δ5)O(\stepsize^5) local error compounded over 125\sim 125 steps per period. Störmer–Verlet’s energy stays within a band of width 103=O(Δ2)\sim 10^{-3} = O(\stepsize^2), with no secular drift. The contrast — linear drift vs bounded oscillation — is the key takeaway of §6.5.

Exercise 6.4 (theory) — solution in §6.9

Prove the L-stability of backward Euler (Theorem 6.1): \stabfnBE(z)1\abs{\stabfn_{\text{BE}}(z)} \le 1 for Re(z)0\operatorname{Re}(z) \le 0, and \stabfnBE(z)0\stabfn_{\text{BE}}(z) \to 0 as Re(z)\operatorname{Re}(z) \to -\infty.

Exercise 6.5 (theory) — solution in §6.9

Show that symplectic Euler is symplectic. That is, prove that the Jacobian (qk+1,pk+1)/(qk,pk)\partial(q_{k+1}, p_{k+1})/\partial(q_k, p_k) has determinant exactly 1 for the harmonic oscillator. (Then note: for separable Hamiltonians, the same argument applies via the chain rule.)

Exercise 6.6 (theory) — solution in §6.9

Show that for a linear-time-invariant ODE h˙=Ah\dot \statevec = \statemat \statevec with A\statemat skew-symmetric (A=A\statemat^\top = -\statemat), the function \hamilton(h)=12hh\hamilton(\statevec) = \tfrac{1}{2} \statevec^\top \statevec is a constant of motion. Use this to give a sufficient condition (in terms of A\statemat only) for an SSM to be Hamiltonian.

6.9 Full solutions to theory exercises

Solution to Exercise 6.4

Write z=r+iωz = -r + i\omega with r0r \ge 0. Then 1z2=(1+r)2+ω2\abs{1 - z}^2 = (1 + r)^2 + \omega^2. Since r0r \ge 0, (1+r)21(1 + r)^2 \ge 1, so 1z21\abs{1 - z}^2 \ge 1. Therefore \stabfnBE(z)=1/1z1\abs{\stabfn_{\text{BE}}(z)} = 1/\abs{1 - z} \le 1. ∎

For the L-stability limit: as rr \to \infty with ω\omega bounded, 1z2=(1+r)2+ω2\abs{1 - z}^2 = (1 + r)^2 + \omega^2 \to \infty, so \stabfnBE(z)0\stabfn_{\text{BE}}(z) \to 0. ∎

Solution to Exercise 6.5

For the harmonic oscillator q˙=p\dot q = p, p˙=q\dot p = -q, symplectic Euler (pp first) gives

pk+1=pkΔqk,qk+1=qk+Δpk+1=qk+Δ(pkΔqk)=(1Δ2)qk+Δpk.\begin{aligned} p_{k+1} &= p_k - \stepsize \, q_k, \\ q_{k+1} &= q_k + \stepsize \, p_{k+1} = q_k + \stepsize (p_k - \stepsize \, q_k) = (1 - \stepsize^2) q_k + \stepsize p_k. \end{aligned}

The Jacobian is

J=(qk+1,pk+1)(qk,pk)=(1Δ2ΔΔ1).J = \frac{\partial(q_{k+1}, p_{k+1})}{\partial(q_k, p_k)} = \begin{pmatrix} 1 - \stepsize^2 & \stepsize \\ -\stepsize & 1 \end{pmatrix}.

The determinant: detJ=(1Δ2)(1)Δ(Δ)=1Δ2+Δ2=1\det J = (1 - \stepsize^2)(1) - \stepsize \cdot (-\stepsize) = 1 - \stepsize^2 + \stepsize^2 = 1. ∎

So the discrete map preserves the symplectic 2-form dqdpdq \wedge dp exactly. This is the algebraic content of “symplectic Euler is symplectic” — the determinant being exactly 11, not 1+O(Δp)1 + O(\stepsize^p).

For a general separable Hamiltonian \hamilton=T(p)+V(q)\hamilton = T(p) + V(q), symplectic Euler gives pk+1=pkΔV(qk)p_{k+1} = p_k - \stepsize V'(q_k) and qk+1=qk+ΔT(pk+1)q_{k+1} = q_k + \stepsize T'(p_{k+1}). The Jacobian factorizes as

J=(10ΔV(qk)1)(1ΔT(pk+1)01),J = \begin{pmatrix} 1 & 0 \\ -\stepsize V''(q_k) & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 & \stepsize T''(p_{k+1}) \\ 0 & 1 \end{pmatrix},

each of which has determinant 1 (lower- and upper-triangular with 1s on the diagonal). The product determinant is therefore 1. ∎

This “shear-decomposition” argument generalizes to all symplectic integrators that update qq and pp in alternating steps: each individual sub-step is a shear, with unit determinant. Verlet, leapfrog, Yoshida composition methods — all are products of unit-determinant shears.

Solution to Exercise 6.6

For h˙=Ah\dot \statevec = \statemat \statevec with A=A\statemat^\top = -\statemat, compute

ddt\hamilton(h)=ddt12hh=12(h˙h+hh˙)=12(hAh+hAh)=12h(A+A)h=0,\frac{d}{dt} \hamilton(\statevec) = \frac{d}{dt} \tfrac{1}{2} \statevec^\top \statevec = \tfrac{1}{2} (\dot \statevec^\top \statevec + \statevec^\top \dot \statevec) = \tfrac{1}{2} (\statevec^\top \statemat^\top \statevec + \statevec^\top \statemat \statevec) = \tfrac{1}{2} \statevec^\top (\statemat^\top + \statemat) \statevec = 0,

since A+A=0\statemat^\top + \statemat = 0 by skew-symmetry. So \hamilton(h(t))\hamilton(h0)\hamilton(\statevec(t)) \equiv \hamilton(\statevec_0) — the squared norm is exactly preserved. ∎

Sufficient condition for an SSM to be Hamiltonian. Any SSM whose continuous dynamics matrix A\statemat is skew-symmetric is Hamiltonian with Hamiltonian \hamilton(h)=12hh\hamilton(\statevec) = \tfrac{1}{2} \statevec^\top \statevec. The eigenvalues of a real skew-symmetric matrix are purely imaginary (or zero), so this regime is exactly the purely oscillatory corner of the SSM landscape — the opposite of the LHP-stable corner where HiPPO and S4 live. A more general sufficient condition: A\statemat has a Hamiltonian structure (in the matrix-Lie-algebra sense) — equivalently, there exists a symmetric positive-definite PP with AP+PA=0\statemat^\top P + P \statemat = 0. The C1 pilot’s empirical project is to identify selective-SSM regimes where this condition holds approximately, and to apply symplectic discretization there.

6.10 Companion code

Two JAX companions and two Julia companions for Chapter 6.

JAX (companions/ch06/jax/):

  • stiff_demo.py — van der Pol oscillator at μ=30\mu = 30 (mildly stiff); compares explicit RK4 (blows up at large Δ\stepsize) against backward Euler (stable at every Δ\stepsize); emits stiff_blowup.png.
  • symplectic_demo.py — harmonic oscillator + pendulum; runs classical RK4 vs Störmer–Verlet over 100 periods; emits energy_drift.png and phase_portrait.png.

Julia (companions/ch06/julia/):

  • implicit_methods.jl — backward Euler, BDF2, and DIRK from-scratch implementations on a stiff test problem; uses only LinearAlgebra + Printf.
  • symplectic_methods.jl — symplectic Euler, Störmer–Verlet, and the 2-stage Gauss–Legendre IRK method on a Hamiltonian test; emits empirical order-of-accuracy table.
  • Project.toml / Manifest.toml.

To run from the repo root:

# JAX
PYTHONPATH=. python companions/ch06/jax/stiff_demo.py
PYTHONPATH=. python companions/ch06/jax/symplectic_demo.py

# Julia (lightweight — no DifferentialEquations.jl dependency)
julia --project=companions/ch06/julia companions/ch06/julia/implicit_methods.jl
julia --project=companions/ch06/julia companions/ch06/julia/symplectic_methods.jl

All figures emit to public/figures/ch06/.

Part II · SSM Core Week 7 Coming Soon

HiPPO theory: orthogonal-basis projection operators

Legendre/Laguerre/Fourier bases; optimal polynomial approximation of input history.

HiPPO theory: orthogonal-basis projection operators

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part II · SSM Core Week 8 Coming Soon

LTI SSMs: S4, S4D, S5 as discretized continuous systems

Three discrete realizations of the same continuous SSM; convolutional vs recurrent views.

LTI SSMs: S4, S4D, S5 as discretized continuous systems

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part II · SSM Core Week 9 Coming Soon

Selective SSMs: Mamba-1, Mamba-2, SSD framework

Input-dependent parameters; selective scan; state-space duality with attention.

Selective SSMs: Mamba-1, Mamba-2, SSD framework

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part II · SSM Core Week 10 Coming Soon

Mamba-3 and the exponential-trapezoidal integrator

Complex-valued state, second-order discretization, MIMO formulation — ICLR 2026 Oral; C1 empirical anchor.

Mamba-3 and the exponential-trapezoidal integrator

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part III · Beyond SSMs Week 11 Coming Soon

Linear attention and Hyena: long convolution and gated recurrence

Katharopoulos lineage; Hyena operator; gated linear attention (GLA/RetNet/HGRN2).

Linear attention and Hyena: long convolution and gated recurrence

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part III · Beyond SSMs Week 12 Coming Soon

Delta-rule lineage: DeltaNet, Gated DeltaNet, Kimi Linear

Online-learning lens on SSMs; Longhorn (implicit) vs DeltaNet (explicit); chunkwise parallelization.

Delta-rule lineage: DeltaNet, Gated DeltaNet, Kimi Linear

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part III · Beyond SSMs Week 13 Coming Soon

Exponential gates and matrix memory: xLSTM and RWKV-7

Exponential gating vs sigmoid; matrix-valued memory; vector-valued gating; in-context learning rates.

Exponential gates and matrix memory: xLSTM and RWKV-7

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part IV · Integration Week 14 Coming Soon

Hybrid architectures and gating mechanisms

MAD methodology; production hybrids (Jamba/Bamba/Nemotron-H/SambaY); gating design space — B pilot home chapter.

Hybrid architectures and gating mechanisms

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part IV · Integration Week 15 Coming Soon

Counter-evidence and diagnostic tools: where SSMs fail

TC^0 bounds; copying limitations; Gather-and-Aggregate mechanism; mechanistic evaluation.

Counter-evidence and diagnostic tools: where SSMs fail

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part IV · Integration Week 16 Coming Soon

Empirical methodology: benchmark protocols and evaluation

Synthetic probes; regime-aware evaluation; the 5-axis decomposition B uses.

Empirical methodology: benchmark protocols and evaluation

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.

Part V · Synthesis Week 17 Coming Soon

Niche-pilot integration: from curriculum to research output

How the 17-chapter foundations map to the C1 + B pilots and the broader 13-niche portfolio.

Niche-pilot integration: from curriculum to research output

This chapter is planned. See the chapter-at-a-glance (foundations §3) for the full design; the crosswalk (foundations §22) maps this chapter to the relevant post_transformers/roadmap.md weeks.